July 16th dChain Incident Post-Mortem

Deri Protocol
Jul 24, 2024

What Happened

On July 16th, 2024, at 5 AM UTC, the team discovered that both the validator and the batch poster of Deri Chain (dChain) had been non-operational for several hours. These failures did not affect the dChain’s own blocks, but they disrupted validation of the dChain on its settlement layer (Arbitrum), compromising the dChain’s security. Investigation traced the issue to a malfunctioning RPC service used by both the validator and the batch poster. The team promptly switched to an alternative RPC service, which restored both components.
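The manual switch to an alternative RPC service can in principle be automated. The following is a minimal sketch of that idea, not the team's actual tooling: it assumes the endpoints speak standard Ethereum JSON-RPC (`eth_blockNumber`), and the function names, the `Transport` type and the endpoint URLs in the usage note are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Sequence

# Hypothetical transport signature: (url, payload_bytes) -> response_bytes.
# Injecting it keeps the failover logic testable without a network.
Transport = Callable[[str, bytes], bytes]


def http_post(url: str, payload: bytes, timeout: float = 5.0) -> bytes:
    """Default transport: POST a JSON-RPC payload and return the raw body."""
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def block_number_with_failover(
    endpoints: Sequence[str], transport: Transport = http_post
) -> tuple[str, int]:
    """Return (endpoint, latest block number) from the first endpoint that answers."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": "eth_blockNumber", "params": []}
    ).encode()
    errors = []
    for url in endpoints:
        try:
            body = json.loads(transport(url, payload))
            # The result is a hex string such as "0x10"; a JSON-RPC error
            # response has no "result" key and falls through to the next endpoint.
            return url, int(body["result"], 16)
        except Exception as exc:  # network error, bad JSON, RPC error
            errors.append(f"{url}: {exc!r}")
    raise RuntimeError("all RPC endpoints failed: " + "; ".join(errors))
```

A caller would pass its endpoints in priority order, for example `block_number_with_failover(["https://primary.example", "https://backup.example"])`, and use the returned URL for subsequent requests. A real deployment would also want to compare block heights across endpoints, since an RPC service can respond while serving stale data.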

However, by design, the Orbit stack (the framework on which the dChain is built) triggers a reorganisation (reorg) if the validator’s downtime exceeds a certain threshold. Because the downtime had exceeded that threshold, the dChain reorged, rolling the blockchain back by several hours. The reorg caused significant disruption across the entire Deri Protocol and left the iChains and the dChain out of sync.

The Immediate Fix

Upon diagnosing the issue, the team immediately suspended the dChain and then used a backup node to restore the blockchain on the main node to its state immediately before the reorg. The recovery took 14 hours, and the main node resumed operation on July 16th, 2024, at 7:30 PM UTC.

Long-Term Strengthening Measures

  1. Optimise RPC Service Robustness: Enhance the reliability of the RPC services used by the validator and the batch poster.
  2. Improve the Monitoring System: Develop a more advanced monitoring system that detects validator and batch poster failures immediately so they can be addressed at once.
  3. Reduce Reorg Likelihood: Optimise the internal setup of the dChain to minimise the chance of a reorg.
  4. Enhance the Backup Node: Improve the backup node to ensure faster recovery from incorrect reorgs, with the aim of cutting recovery time well below the more than 10 hours it took this time.
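As an illustration of the monitoring measure, the sketch below classifies a component by how much of its downtime budget has been used, so that an alert fires well before the reorg threshold is reached. It is an assumption-laden example rather than the planned system: the `Heartbeat` type, the `classify` function, the alert fractions and the 8-hour threshold in the usage note are all hypothetical, and the real threshold depends on the dChain's Orbit configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Heartbeat:
    component: str          # e.g. "validator" or "batch-poster"
    last_success: datetime  # last time the component completed useful work


def classify(
    hb: Heartbeat,
    now: datetime,
    reorg_threshold: timedelta,   # hypothetical: downtime that would trigger a reorg
    warn_fraction: float = 0.25,  # hypothetical alert levels
    page_fraction: float = 0.5,
) -> str:
    """Return "ok", "warn" or "page" from the share of the downtime budget used."""
    used = (now - hb.last_success) / reorg_threshold  # timedelta / timedelta -> float
    if used >= page_fraction:
        return "page"
    if used >= warn_fraction:
        return "warn"
    return "ok"
```

With an assumed threshold of `timedelta(hours=8)`, a batch poster whose last success was three hours ago would be classified as `"warn"`, and one silent for five hours as `"page"`. Expressing the alert levels as fractions of the threshold keeps them meaningful if the chain's configuration changes.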

We appreciate your patience and understanding during this incident and are committed to implementing these improvements to prevent future occurrences.

Deri Protocol = (Perpetual Futures + Everlasting Options) x Decentralized.