Why ‘Consensus Fails’ Every Time Bob Crashes

by Jule

In a recent deep dive into Pathfinder's integration tests, a persistent pattern emerged: consensus_3_nodes_with_failures::case_11_fail_on_proposal_committed fails repeatedly - not because of timing, but because of a race in state handling. Bob crashes and respawns, then pulls ahead while the other nodes stall, and consensus grinds to a halt.

Here's the real issue: the system assumes consistent block progression, but Bob's early exit leaves inconsistent block states across the nodes. Alice and Charlie hold their positions, Bob resets, and the chain breaks. This isn't just a timeout - it's a design blind spot in failure recovery. The test hangs until Bob's Result::unwrap() on an Err triggers a panic after 240 seconds.

But here's the deal: consensus logic must handle a node reset like this as a normal fault, not a crash. Test artifacts vanish in /tmp/consensus-integration-tests/.tmpTyiJGQ - no trace, no rollback. This isn't flakiness; it's a fragile state model.

Do you trust your consensus test to survive a simple restart? The bottom line: if your test panics on failure recovery, you're not testing resilience - you're hiding a fault. Next time Bob crashes, ask: is the system ready to heal, or just collapse?
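To make the "normal fault, not a crash" point concrete, here is a minimal Rust sketch of the alternative to unwrap-and-panic. Everything in it is hypothetical - the `SyncAction` enum and `handle_peer_height` function are illustrations, not Pathfinder's actual API - but it shows the shape of the fix: when a respawned peer like Bob reports a stale or errored height, match on the Result and schedule a resync instead of calling unwrap().

```rust
/// Hypothetical outcome of checking a peer's reported height.
#[derive(Debug, PartialEq)]
enum SyncAction {
    /// Peer is behind (e.g. it crashed and respawned): send it the missing blocks.
    Resync { from: u64, to: u64 },
    /// Peer matches our height: proceed with consensus as usual.
    Proceed,
}

/// Treat a stale or errored peer height as a recoverable fault.
/// The fragile version of this is `peer_height.unwrap()`, which turns
/// a routine restart into a panic and a hung test.
fn handle_peer_height(local_height: u64, peer_height: Result<u64, String>) -> SyncAction {
    match peer_height {
        // A respawned peer reports an older height: normal fault, resync it.
        Ok(h) if h < local_height => SyncAction::Resync { from: h, to: local_height },
        Ok(_) => SyncAction::Proceed,
        // The peer is mid-restart and can't answer: also recoverable,
        // fall back to a resync from the last known-good point (here, 0).
        Err(_) => SyncAction::Resync { from: 0, to: local_height },
    }
}

fn main() {
    // Bob respawned and reports height 3 while Alice and Charlie sit at 7.
    let action = handle_peer_height(7, Ok(3));
    assert_eq!(action, SyncAction::Resync { from: 3, to: 7 });
    println!("{:?}", action);
}
```

The design choice is the whole argument of the post in one line: the Err arm is reachable by design, so a restart produces a resync decision rather than a panic that the test harness only surfaces 240 seconds later.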