# Fail loudly: a plea to stop hiding bugs

> Source: <https://alejo.ch/3he>
> Published: 2026-06-12 17:42:10+00:00

# AI juries

*This is part of a series I’m writing on
generative AI.*

**State: Withdrawn.** This text is predicated on an
unjustified assumption: that multiple evaluations of the same prompt by
the same generative AI are equivalent to independent jurors. That may
not be true, in which case the conclusion doesn’t follow.

## Condorcet’s Jury Theorem

Imagine a jury of *N* experts
trying to decide if a binary (true/false) fact is true. Each expert has
an independent probability *p*
of being right. Since they are experts, assume *p* > 0.5: they’re at least better
than a coin flip.

We can ask for a vote: have each expert state their binary answer and
take the most popular answer. Let’s call the probability that the
majority of the jury is right *j*. This “jury accuracy” probability
increases dramatically as *N*
grows.

For example, with a jury of *N* = 25 experts, each likely to be
right with a probability *p* = 0.75, the probability *j* that the *majority* is
right is already 99.663%!
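That figure can be checked directly. The following is a minimal sketch, assuming an odd *N* (so ties cannot happen) and 0 < *p* < 1. The function name `jury_accuracy` is my own, and the binomial sum is evaluated in log space so that large juries don't overflow:

```python
import math


def jury_accuracy(n: int, p: float) -> float:
    """Probability that a strict majority of n independent jurors is right.

    Assumes n is odd (no ties) and 0 < p < 1.
    """
    if n % 2 == 0 or not 0.0 < p < 1.0:
        raise ValueError("expected odd n and 0 < p < 1")
    log_p, log_q = math.log(p), math.log1p(-p)
    terms = []
    for k in range(n // 2 + 1, n + 1):  # k = number of jurors who are right
        # log of C(n, k) * p^k * (1 - p)^(n - k)
        log_term = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q
        )
        terms.append(math.exp(log_term))
    return math.fsum(terms)


print(f"{jury_accuracy(25, 0.75):.5f}")    # the N = 25, p = 0.75 example
print(f"{jury_accuracy(5001, 0.51):.3f}")  # a very large, barely-expert jury
```

Summing exact terms (rather than using a normal approximation) matters here because the binomial distribution is noticeably skewed for small juries.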

The table below shows how powerful this effect is. It shows the jury
accuracy *j* for different jury
sizes *N* (rows) and individual
expert accuracies *p*
(columns):

| N (Experts) | p = 0.51 | p = 0.55 | p = 0.60 | p = 0.70 | p = 0.75 | p = 0.80 | p = 0.90 | p = 0.95 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.510 | 0.550 | 0.600 | 0.700 | 0.750 | 0.800 | 0.900 | 0.950 |
| 3 | 0.515 | 0.575 | 0.648 | 0.784 | 0.844 | 0.896 | 0.972 | 0.993 |
| 5 | 0.519 | 0.593 | 0.683 | 0.837 | 0.896 | 0.942 | 0.991 | 0.999 |
| 11 | 0.527 | 0.633 | 0.753 | 0.922 | 0.966 | 0.988 | 1.000 | 1.000 |
| 15 | 0.531 | 0.654 | 0.787 | 0.950 | 0.983 | 0.996 | 1.000 | 1.000 |
| 25 | 0.540 | 0.694 | 0.846 | 0.983 | 0.997 | 1.000 | 1.000 | 1.000 |
| 35 | 0.547 | 0.725 | 0.886 | 0.994 | 0.999 | 1.000 | 1.000 | 1.000 |
| 45 | 0.554 | 0.751 | 0.914 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 |
| 55 | 0.559 | 0.772 | 0.934 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| 75 | 0.569 | 0.808 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 101 | 0.580 | 0.844 | 0.979 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 201 | 0.612 | 0.923 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 501 | 0.673 | 0.988 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1001 | 0.737 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 5001 | 0.921 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

The following is roughly the same data in a plot (x axis in log scale):

This “wisdom of the crowds” effect works, at least in this idealized scenario where all experts are independent.

There’s an interesting flip side: if the “experts” are more likely to
be *wrong* (*p* < 0.5), the jury’s performance
plummets. The majority vote just amplifies the error, converging towards
0% accuracy (i.e., being reliably wrong). You can see this by swapping
the definitions of “right” and “wrong” in the description above.
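The swap can also be written out. For odd *N* (an assumption, so that ties cannot occur), the jury accuracy is a binomial tail, and re-indexing the sum with *m* = *N* - *k* shows that a jury whose members have accuracy 1 - *p* is right exactly as often as a jury with accuracy *p* is wrong:

$$
\begin{aligned}
j(N, p) &= \sum_{k=(N+1)/2}^{N} \binom{N}{k} p^{k} (1-p)^{N-k}, \\
j(N, 1-p) &= \sum_{k=(N+1)/2}^{N} \binom{N}{k} (1-p)^{k} p^{N-k}
           = \sum_{m=0}^{(N-1)/2} \binom{N}{m} p^{m} (1-p)^{N-m}
           = 1 - j(N, p).
\end{aligned}
$$

For instance, the table’s 0.983 at *N* = 25, *p* = 0.70 becomes 0.017 at *p* = 0.30.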

This model isn’t new; it goes back to the dawn of the French Revolution. It was first proposed in 1785 by the Marquis de Condorcet and is known as his Jury Theorem.

## Scaling AI juries

In [Dumb AI and the software revolution](3h2) I mentioned
a key advantage of generative AI: in addition to working at light speed
(answering questions in seconds, not days), it can be **scaled
instantly**:

> The infinite monkey is no longer hitting keys at random, just somewhat stupidly. But at this speed and with instant scaling, the difference is monumental.

If you can get an AI juror to answer a binary question with an
accuracy *p* > 0.5, you can
achieve arbitrarily high performance simply by scaling resources:
running bigger juries.

Even if your AI is barely better than a coin flip (say, *p* = 0.51), you can *still*
get a jury with *j* > 0.99
accuracy, though it can be quite expensive (with *p* = 0.51, you’d need
a jury of *N* = 5001 just to hit
*j* = 0.92).

But, as the table above shows, if you improve your agent’s accuracy
*p* to, say, 0.70, with *N* = 25 you’re already above 98%
accuracy.
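To put a number on "how big a jury do I need?", here is a rough sketch that searches for the smallest odd *N* reaching a target accuracy. The names `jury_accuracy` and `min_jury_size` are hypothetical, and the search assumes independent jurors with 0.5 < *p* < 1, in which case accuracy grows with each larger odd jury:

```python
import math


def jury_accuracy(n: int, p: float) -> float:
    """Exact majority accuracy for odd n and 0 < p < 1 (binomial tail)."""
    log_p, log_q = math.log(p), math.log1p(-p)
    return math.fsum(
        math.exp(
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q
        )
        for k in range(n // 2 + 1, n + 1)
    )


def min_jury_size(p: float, target: float) -> int:
    """Smallest odd n with jury_accuracy(n, p) >= target, for 0.5 < p < 1.

    Relies on accuracy growing with odd n when p > 0.5, so a doubling
    search followed by bisection over odd sizes is enough.
    """
    if not (0.5 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("expected 0.5 < p < 1 and 0 < target < 1")
    if jury_accuracy(1, p) >= target:
        return 1
    lo, hi = 1, 3  # invariant: a jury of size lo misses the target
    while jury_accuracy(hi, p) < target:
        lo, hi = hi, 2 * hi + 1  # 2 * odd + 1 stays odd
    while hi - lo > 2:  # hi meets the target; bisect over odd sizes
        mid = (lo + hi) // 2
        if mid % 2 == 0:
            mid += 1
        if jury_accuracy(mid, p) >= target:
            hi = mid
        else:
            lo = mid
    return hi


for p in (0.51, 0.60, 0.70, 0.75):
    print(p, min_jury_size(p, 0.99))
```

Targets extremely close to 1 may never be reached in floating point, so this is only meant for targets like 0.98 or 0.99.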

And, interestingly, scaling the jury doesn’t impact latency. The
questions can be executed in parallel, so the total time is determined
by the slowest agent. I suspect strategies like hedging (e.g., run *N* + *M* jurors, take the
answer from the *N* first
responders) may be applicable.
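The hedging idea could look roughly like this with `asyncio`. This is only a sketch: `ask_juror` is a hypothetical stand-in that simulates a model call with random latency and a noisy vote, and `hedged_verdict` and its parameters are names I made up for illustration.

```python
import asyncio
import random
from collections import Counter
from typing import List


async def ask_juror(rng: random.Random, accuracy: float, truth: bool) -> bool:
    """Hypothetical juror: simulated latency, right with probability `accuracy`."""
    delay = rng.uniform(0.01, 0.05)
    is_right = rng.random() < accuracy
    await asyncio.sleep(delay)
    return truth if is_right else not truth


async def hedged_verdict(
    n: int, extra: int, accuracy: float, truth: bool, seed: int = 0
) -> bool:
    """Start n + extra jurors, keep the first n votes, return the majority."""
    rng = random.Random(seed)
    tasks = [
        asyncio.create_task(ask_juror(rng, accuracy, truth))
        for _ in range(n + extra)
    ]
    votes: List[bool] = []
    for finished in asyncio.as_completed(tasks):
        votes.append(await finished)
        if len(votes) == n:
            break  # enough votes; the stragglers are no longer needed
    for task in tasks:
        task.cancel()  # no-op for tasks that already finished
    await asyncio.gather(*tasks, return_exceptions=True)
    return Counter(votes).most_common(1)[0][0]


print(asyncio.run(hedged_verdict(n=5, extra=2, accuracy=0.9, truth=True)))
```

One thing to watch: if response time correlates with the answer (say, "yes" answers come back faster), the first *N* responders are not a random sample of the jury.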

## What types of questions?

So what kinds of questions can this be useful for? I have a few practical, everyday examples in mind:

- This unit test fails. Is the implementation correct (regarding the tested property)? I intend to use this to determine the next step: ask an agent to fix the implementation, or ask it to fix the test.
- A variation of the above: Does this code implement a specific property (described in natural language)?
- I’ve received a code change (possibly from AI, possibly from a human; it doesn’t matter). Does it adhere to a specific coding style guideline? The jury’s answer determines if I run another agent to fix the style issue.
- Is this function’s documentation (docstring) still accurate for the code it describes?
- I have written a new [recipe](1f8). Does it uphold one of [my principles for my recipes](3dq)?

## The practical trade-off

I’m excited to have a tool where I can create my own AI juries. Based
on their observed performance (false positive and false negative rates), I can
tweak the contexts (increase *p*) or adjust the jury size (increase
*N*), until I find the right
balance between final accuracy and cost.
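Measuring that observed performance can be as simple as logging each jury verdict next to the answer I reached when I checked manually. A small sketch, where `error_rates` is a hypothetical helper and the sample log is made up purely for illustration:

```python
import math
from typing import Iterable, Tuple


def error_rates(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    Each record is (jury_verdict, ground_truth); the ground truth comes
    from a manual check. A rate is NaN when its denominator is empty.
    """
    false_pos = false_neg = positives = negatives = 0
    for verdict, truth in records:
        if truth:
            positives += 1
            false_neg += int(not verdict)  # jury said "no" to a true fact
        else:
            negatives += 1
            false_pos += int(verdict)  # jury said "yes" to a false fact
    fpr = false_pos / negatives if negatives else math.nan
    fnr = false_neg / positives if positives else math.nan
    return fpr, fnr


# Made-up log of (jury_verdict, ground_truth) pairs, for illustration only.
log = [(True, True), (True, True), (False, True),
       (True, False), (False, False), (False, False),
       (False, False), (False, False)]
fpr, fnr = error_rates(log)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

Tracking the two rates separately is useful because they may call for different fixes: one kind of error might be cheap to tolerate while the other sends an agent off to "fix" correct code.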

Every time the jury gets things wrong, I get involved and steer the process. Every time it gets things right, it saves me from analyzing things myself.

This creates a very concrete trade-off: **spend money (bigger
juries) to save time (less manual intervention)**.

## Related

- Up: [Essays on AI](3h7)
