# The Bug Behind the Bug: Anatomy of a Three-Layer Consensus Halt

> Source: <https://dev.to/0xdevc/the-bug-behind-the-bug-anatomy-of-a-three-layer-consensus-halt-32n5>
> Published: 2026-06-17 14:19:41+00:00

A dev-log from building an AI-native layer one, mostly alone.

I am 18, I build NOVAI, and you can find me as 0x-devc.

NOVAI is a layer-one blockchain written from scratch in Rust, with a chained-BFT consensus in the HotStuff family and AI entities as first-class protocol primitives. This post is about the week my testnet wedged itself at a single block height and would not recover, and about the three-layer bug I eventually found sitting underneath a one-line error message. It is also, honestly, a post about the discipline of not fixing the first thing you see.

I want to be precise up front about what this is and is not. I fixed a deep consensus halt, and the chain is now stable under sustained load. I did not prove my consensus formally safe, and the hardening is not finished. If you skip to the end, read the "What is still open" section, because it is the most important part.

**The symptom**

The chain stopped committing. Four validators, all of which had been running continuously with zero restarts, sat at a committed height around 535,003 and refused to advance. The view-change counter, which on a healthy chained-BFT chain should barely move, was climbing on a steady cadence: a fresh round every sixty seconds, forever, with nothing committing in any of them.

Three of the four validators were emitting the same error on every single leader turn:

Leader self-vote add failed: InvalidVote("Equivocation: voter [...] voted for different blocks at same height")

Each node's error named only its own voter key. The fourth validator was different: it was stuck three blocks further back, never advancing its round, never proposing, and spinning a completely different error, a codec failure complaining about a duplicate voter, at roughly 195 times per second. Over the capture window that one node had written 64 MB of logs; the other three had written about 28 KB each.

So I had two distinct failure signatures on the same wedged chain, no restarts to blame, and a view-change loop that would clearly run until the heat death of the validators. The question I started from was narrow and sharp: with zero restarts, how does a continuously running validator come to hold recorded votes for two different blocks at the same height, such that its own equivocation guard rejects its own leader self-vote?

**The first read, which was correct and incomplete**

The loud error pointed at one structure, and the first read of it was right as far as it went.

My consensus tracked an in-memory map, voted_at_height, from a voter's address to the block hash they had voted for at the current height. Its stated purpose was to catch equivocation: a validator voting for two different blocks at one consensus height. It was cleared when a height committed, and when the view height advanced on a dominating QC. It was not cleared on a plain round advance.

Here is why that one omission is fatal. In chained BFT, a block's hash commits to its round. When a height fails to certify in its first round and the view changes, the next leader proposes a block for the same height in a higher round, and that block necessarily has a different hash. This is not equivocation. It is the entire point of view change: a higher view supersedes a failed lower one.

But voted_at_height did not know that. When a leader self-voted at height H in round R, it recorded voted_at_height[self] = hash(block at H, round R). The proposal went nowhere. Rounds advanced. The leader rotation cycled back around, and the same validator became leader again at the same height H, now in some round R plus k. It built a new block, with a new hash because the round differed, and tried to self-vote. Its own guard read the stale entry from round R, saw a different hash, and rejected the vote as equivocation. From that point on, every leader turn that validator ever took at that height produced exactly the captured error. All three healthy validators had burned their one clean self-vote in their first leader turn after the height wedged, which is why they were failing symmetrically, each only against its own key.

The structural mistake was a scope mismatch I would later recognize as a recurring theme: voted_at_height was a height-scoped data structure being used to enforce a property that is actually round-scoped (within-view), and worse, the property it enforced is not one HotStuff requires at all. HotStuff forbids equivocation within a view; cross-view safety is supposed to come from a locked-QC rule, not from a blanket ban on ever re-voting at a height. The guard was a broken substitute for a safety mechanism that, it turned out, did not exist anywhere else in my code either. More on that later.

This was a real bug. If I had stopped here, killed the guard, and redeployed, I would have learned a painful lesson, because the chain would have stayed wedged. The equivocation error was the loudest layer, not the deepest one.

**Pulling the thread: the root cause was layered**

I treat consensus bugs the way I wish more people did: diagnosis first, and diagnosis is not allowed to authorize a patch. So before touching any code I produced two full read-only diagnoses of the same evidence, from cold, independently. The interesting part is that they agreed on the symptom and disagreed on the primary cause. One read the equivocation guard as the root; the other argued the guard was secondary and something in the sync path was primary. Reconciling that disagreement against the source, line by line, is what actually found the bug. I did not average the two opinions. I went back to the code and let it break the tie.

Here is what the three layers turned out to be.

**Layer three (visible): the equivocation guard misfire**

This is the voted_at_height story above. It is real, it is deterministic, and it is the error you see in the logs. It is also the least important of the three, because fixing it alone does not unwedge the chain.

**Layer one (primary, and restart-proof): the commit cursor outran the QC chain**

This is the one that actually held the chain hostage, and it is the one the second diagnosis was right about.

A node's view of progress lives in two numbers: highest_qc, the highest quorum certificate it has seen, and committed_height, the cursor marking what it considers finalized. On the three healthy validators these two numbers had diverged in a way that should be impossible: the highest QC was at one height, but the commit cursor was sitting a step above it, at the stuck height, with no QC certifying that height anywhere in the cluster.

How does a cursor get ahead of the certificates that are supposed to justify it? Through the catch-up sync path. When a node falls behind, it requests blocks from peers and replays them. My sync handler advanced committed_height and executed the synced blocks gated only on two checks: parent-hash contiguity, and a state-root match against local execution. It did not require that a valid certifying QC for the block exist. The comment in that code even stated the false assumption out loud: these blocks "were already committed by network consensus." They were not necessarily certified by anything. On the serving side, the block responder happily served uncommitted, proposal-time blocks straight from its database and in-memory cache, with no committed-height filter of any kind.

Compose those two and you get the trap. An uncertified block at the stuck height had been distributed and persisted at proposal time. The sync path then "committed" it on the healthy nodes without any certificate. Now the cursor sat above the highest QC, and here is the kill: every follower silently drops a proposal at a height it already considers committed. So no validator would vote on any new proposal at the stuck height. With no votes, no QC could form there. With no QC, the highest QC could never advance. With the highest QC frozen, the cursor divergence could never resolve. The wedge was self-sealing.

And it was restart-proof, which is the property that makes this kind of bug so nasty. The commit cursor is persisted to disk. Restarting a wedged node just reloads the wedge. There is no operational escape that does not touch persisted state.

To see why nothing could move, it helps to know that essentially every view computation in the consensus crate fuses these numbers through one formula:

expected_height = max(self.height, highest_qc.height) + 1

Proposals, votes, and timeouts are all gated on that expected height. With the cursor already past the highest QC and the highest QC frozen, expected height was pinned exactly one step behind where the cursor already was, so every leader kept trying to re-propose a height the followers had already buried. Layer three (the equivocation guard) was just the specific way that doomed re-proposal failed on the leader's own self-vote. Even if I had made the self-vote succeed, the proposal would have died silently at every follower.

**Layer two (the origin, and safety-adjacent): a duplicate-voter QC**

That left the question of how the cluster got into a single failed round in the first place, and why the fourth validator was bricked in a totally different way. The origin event was a malformed quorum certificate: a QC containing the same voter twice.

The QC builder took the first quorum-many votes for a block hash with no per-voter deduplication. It trusted that the vote pool did not contain duplicates. But the network-vote ingestion path had no equivocation or per-hash duplicate guard at all; its only dedup was the round-scoped voted_in_round set, which is cleared on every round advance, while the vote pool itself was deliberately retained across round advances so that late QCs could still form. So the same voter's vote, arriving once before a round advance and once after, landed twice in the pool for that block hash. The builder then formed a "quorum" certificate that actually had two distinct signers where it needed three. That is not just a liveness nuisance; a QC with fewer distinct signers than quorum is a BFT counting violation, a safety-adjacent defect in its own right.

What contained the blast radius was, of all things, my canonical codec. The QC encoder rejects duplicate voters on serialization. So the malformed QC could never be encoded, which meant it could never cross the wire, be persisted, or be embedded in a timeout. It poisoned exactly one node: the one that formed it. That node could no longer encode its own highest QC, so it could not create timeouts (195 times per second of failing to, hence the 64 MB log), could not persist sync commits, and fell out of the voting quorum entirely. Its dropout is what reduced the live set to a fragile three, which is what let a single height fail to certify in one round, which is what armed layers one and three. The codec saved the network from a bad certificate and simultaneously bricked the node that produced it.

**The structural gap underneath all three**

While reconciling the diagnoses I ran one grep that reframed everything: a case-insensitive search for "locked" across the consensus crate returned zero matches. No locked QC, no locked round, no lock-step field anywhere in the consensus state. Standard chained HotStuff leans on the locked-QC rule for cross-view safety. Its absence is the deeper reason all three layers were even possible: the equivocation guard was a broken stand-in for it, the duplicate-voter QC slipped through because nothing tracked per-height certificate uniqueness, and the sync path felt free to commit on contiguity alone because there was no notion of a binding lock to violate. I did not try to introduce a locked-QC rule in this round of work. I flagged it, hard, and you will see it again in the open-questions section.

**The fix**

The reconciled fix had four parts, landed as a sequence of small commits with a stop-and-review gate between each, because this is the most safety-critical code in the system and I did not want a five-fix mega-diff that nobody could review.

**Fix A, the primary one: certify before you commit**. No block may advance the commit cursor unless a valid certifying QC for that block has been verified locally first. This sounds obvious and it was the whole ballgame for layer one. It also had a storage prerequisite I had to clear first: per-height certifying QCs were not durably stored or retrievable. The commit path only wrote the single triggering QC, the sync path wrote no QCs at all, and the block-response wire format had no field to carry them. So Fix A was two stages. Stage one was pure plumbing: dense per-height QC persistence on both the commit and sync paths, a reader to load a QC at a given height, and a QC field added to the block-response message. Stage two was the actual check: verify each synced block's certifying QC, binding it to the block's height and hash and running full quorum-and-signature verification, and advance the cursor only across the contiguous prefix of properly certified blocks, stopping at the first one that fails. This is the change that makes the original wedge unreachable.

**Fix B, the safety one: make QCs duplicate-proof on every path.** Deduplicate votes by voter in the QC builder before counting quorum, so two copies of one voter can never inflate a count. Add a per-voter dedup to the network-vote ingestion path so a duplicate never enters the pool in the first place. And, crucially, route every untrusted external QC-acceptance path , a gossiped QC, a peer's timeout, a synced block , through a single helper, verify_qc_well_formed, that requires a quorum of distinct in-set voters, verifies every signature, and validates canonical encoding. Locally formed QCs and proposal-embedded QCs keep their own per-voter dedup and canonical-encode gate rather than the helper, but every path a QC can arrive from across the network now hits a verification chokepoint. While writing this I found those external-acceptance sites were asymmetric: some verified fully, some verified nothing. The helper closed that gap.

**Fix C, the liveness one: delete the equivocation guard.** I cleared voted_at_height on round advance, and then deleted the structure entirely (the only mentions that survive are comments in the regression test), because once it is round-scoped it does nothing the existing round-scoped voted_in_round check does not already do. Within-round equivocation is still caught, on both vote paths. What I intentionally gave up is the cross-round same-height "equivocation" detection, which was never a real safety property to begin with. Its safety now rests on Fix B's per-voter dedup plus the standard quorum-intersection argument. I want to be honest that this is the one place I reduced local Byzantine detection on purpose, and it is exactly the place a later audit told me to look hardest.

**Fix D, the resilience one.** Two changes. First, a backoff on the failing-timeout path, because the 195-per-second error loop on the bricked node had overrun log retention and erased the onset logs, which is why the original onset is unrecoverable. A consensus node that fails should fail quietly enough to preserve its own forensics. Second, move the timeout's embedded-QC adoption ahead of its height gate, so a node stuck at a wrong view can still learn a dominating, fully verified QC from a peer's timeout and self-heal instead of discarding exactly the message that would rescue it.

The ordering mattered. Fix B lands before Fix C, because B's dedup is the safety net that replaces the guard C removes. All four assume a fresh genesis at deploy, for a reason I will come back to.

**The cold audit that reopened the question**

Before I trusted any of this, I ran a from-scratch adversarial audit of the consensus: a cold, multi-pass review whose governing rule was that nothing is assumed correct because it is committed or because it has a passing test. I assigned each safety and liveness property to an independent reviewer tasked with breaking it, then had separate refuters attack every finding, then reconciled every verdict by hand against the source. It was the most productive thing I did all week, and it did two things I did not expect.

First, it found a critical hole my fix had not touched. A gossiped standalone QC, arriving as its own network message, was installed and persisted with no signature, quorum, or membership verification whatsoever. A single unauthenticated message carrying a QC with an enormous height and an empty vote set would install as the highest QC, pin expected height astronomically high so that every real proposal, vote, and timeout was rejected, and survive restart because it got persisted. That is strictly worse than the halt I was fixing: a permanent, restart-durable wedge from one forged message. It was a clean deploy blocker, and my four fixes sailed right past it because they were focused on the paths the incident had exercised, and the incident had never exercised this one. I designed and landed a guard that routes every gossiped QC through the same verify_qc_well_formed helper before it can touch any state, and I convinced myself with a chokepoint argument, and tests, that this was the only unverified path into the highest QC. A combined verification pass over the whole stack came back clean for that hole.

Second, and more soberingly, the audit reopened the deepest question and refused to close it. With no locked-QC rule, two conflicting QCs at the same height can each form and each independently pass well-formedness, because nothing enforces per-height uniqueness. The audit confirmed that single-node divergence is blocked: once a node installs a QC at a height, its expected height moves on and it will drop a later conflicting vote at that height. But it could not rule out cross-set divergence, where disjoint quorums each extend a different branch under adversarial message scheduling. It marked the escalation to a conflicting commit as unproven but plausible, and it classified the resolution as a design question, introducing a real locked-QC safety rule, not a one-line patch. I am not going to pretend that verdict is something it is not. It is open.

The audit also produced a written backlog of lower-tier hardening items, each diagnosed read-only and each gated on its own future fix-design session. And it taught me a smaller but real lesson about trustworthy monitoring: one residual it logged was that my new certify-before-commit sync path advanced the cursor without updating the highest QC, which would have made the cursor legitimately appear to outrun the QC during soak and blinded the very invariant I most wanted to watch. I closed that specifically so the soak's signal would mean what I thought it meant.

**How I verified it: wipe, fresh genesis, soak**

Recovery was not a restart. It could not be, because the wedge lived in persisted state; reloading it just reloads the wedge. I considered rolling the cursor back and replaying, and certifying the stuck block after the fact, and rejected both: replay has to reconcile already-executed uncertified blocks and inherits a sparse history of QC rows that the new verification cannot check, and retroactive certification is simply impossible because no quorum ever certified that block. So I did a true wipe and fresh genesis. That also happened to satisfy a precondition of Fix A: the certify-before-commit invariant only holds without awkward historical backfill if the chain carries a dense per-height QC from block zero, which a fresh genesis gives you for free.

Then I redeployed onto the binary carrying all of it, re-bootstrapped an oracle AI entity, and let it soak.

The headline number is a view-change counter. On the wedged chain, that since-start counter had climbed to 4,872 as failed rounds churned endlessly at the stuck height. On the fresh chain, it has barely moved: a few dozen view changes across more than 600,000 committed blocks, a rate so low it rounds to zero against the wedged chain's churn.

That contrast is the cleanest evidence I have, and it is worth saying why it is such good evidence. In chained BFT a view change happens only when a round fails to certify before its timer fires. A healthy pipeline certifies in the first round almost every time, so the view-change counter should be nearly stationary over the life of the chain. Thousands of view changes is not noise; it is the precise fingerprint of the wedge, a chain trying and failing to make progress on a loop. A few dozen view changes across more than 600,000 committed blocks means the chain is certifying in the first round essentially always. The wedge mechanism is gone, not masked. It is a black-box, behavioral confirmation that complements the unit tests rather than restating them.

Alongside that, the targeted invariants from the fix design all held through the soak: no equivocation errors, no duplicate-voter codec failures, the commit cursor never exceeded the highest QC, and every committed height had a retrievable valid QC. The oracle entity published continuously the whole time, which is the application-layer way of saying the thing on top of consensus kept working because consensus kept working. And the regression tests pin the mechanisms directly: the view-change re-proposal that used to throw equivocation now passes, an uncertified synced block no longer advances the cursor, a duplicate-voter set no longer forms a quorum QC, and the forged gossiped QC is rejected before it touches any state.

**What is still open**

This is the section I most want you to read, because the temptation after a soak like that is to declare victory, and that would be dishonest.

I fixed a liveness halt. I did not prove safety. Concretely, the consensus still has no locked-QC rule, and the cold audit left the conflicting-commit question unproven. Single-node divergence is blocked; cross-set divergence under adversarial scheduling is not ruled out. Introducing a proper HotStuff-style lock, or proving rigorously that the commit depth alone suffices, is design-level work, and it is the gate I have set for myself before I would ever let an external, untrusted validator into the set. Until that question is closed, "the chain is stable under sustained load with the validators I run" is a true statement, and "the consensus is safe" is not one I am willing to make.

Beyond that, the audit left a written, prioritized backlog of consensus-hardening items that I am working through one fix-design session at a time: making round adoption evidence-gated rather than trusting a single peer, adding an explicit equivocating-leader guard, adding a commit-time fork guard so that if a conflicting commit ever did occur something would detect and halt rather than proceed, and confirming the crash-atomicity of the executed state root against the commit cursor. I am deliberately not writing exploit recipes for these here. The point is the honest shape of the work: these are known, diagnosed, and scheduled, not surprises.

So, plainly: this is not formally verified, it is not proven safe against a Byzantine validator across all message schedules, and it is not mainnet-ready. What it is, is a chain whose halt is genuinely fixed, whose recovery and sync paths are meaningfully hardened, and which now soaks clean under continuous load with an AI agent riding on top of it.

**Closing**

The lesson I keep relearning is that the loudest error is rarely the deepest one. The equivocation message in my logs was real, and chasing only it would have shipped a fix that left the chain just as wedged, because the cursor had quietly outrun its certificates underneath. Two independent diagnoses, reconciled against the source rather than against each other, found the layer that mattered. A cold audit that assumed nothing surfaced a worse bug than the one I started with, and then had the integrity to leave the hardest question open instead of waving it through.

The halt is fixed. The hardening continues. The next stage is ahead, and I will write it up when it is real.

0x-devc