{"slug": "fail-loudly-a-plea-to-stop-hiding-bugs", "title": "Fail loudly: a plea to stop hiding bugs", "summary": "A new essay argues that software teams should stop hiding bugs and instead \"fail loudly\" by making errors visible and obvious. The author contends that concealing failures prevents learning and allows small issues to compound into larger problems. The piece calls for a cultural shift toward transparency in software development, where bugs are treated as opportunities for improvement rather than sources of shame.", "body_md": "# AI juries\n\n*This is part of a series I’m writing on\ngenerative AI.*\n\n**State: Withdrawn.** This text is predicated on an\nunjustified assumption: that multiple evaluations of the same prompt by\nthe same generative AI are equivalent to independent jurors. That may\nnot be true, in which case the conclusion doesn’t follow.\n\n## Condorcet’s Jury Theorem\n\nImagine a jury of *N* experts\ntrying to decide if a binary (true/false) fact is true. Each expert has\nan independent probability *p*\nof being right. Since they are experts, assume *p* > 0.5 –they’re at least better\nthan a coin flip.\n\nWe can ask for a vote: have each expert state their binary answer and\ntake the most popular answer. Let’s call the probability that the\nmajority of the jury is right *j*. This “jury accuracy” probability\nincreases dramatically as *N*\ngrows.\n\nFor example, with a jury of *N* = 25 experts, each likely to be\nright with a probability *p* = 0.75, the probability *j* that the *majority* is\nright is already 99.663%!\n\nThe table below shows how powerful this effect is. 
It shows the jury\naccuracy *j* for different jury\nsizes *N* (rows) and individual\nexpert accuracies *p*\n(columns):\n\n| N (Experts) | p = 0.51 | p = 0.55 | p = 0.60 | p = 0.70 | p = 0.75 | p = 0.80 | p = 0.90 | p = 0.95 |\n|---|---|---|---|---|---|---|---|---|\n| 1 | 0.510 | 0.550 | 0.600 | 0.700 | 0.750 | 0.800 | 0.900 | 0.950 |\n| 3 | 0.515 | 0.575 | 0.648 | 0.784 | 0.844 | 0.896 | 0.972 | 0.993 |\n| 5 | 0.519 | 0.593 | 0.683 | 0.837 | 0.896 | 0.942 | 0.991 | 0.999 |\n| 11 | 0.527 | 0.633 | 0.753 | 0.922 | 0.966 | 0.988 | 1.000 | 1.000 |\n| 15 | 0.531 | 0.654 | 0.787 | 0.950 | 0.983 | 0.996 | 1.000 | 1.000 |\n| 25 | 0.540 | 0.694 | 0.846 | 0.983 | 0.997 | 1.000 | 1.000 | 1.000 |\n| 35 | 0.547 | 0.725 | 0.886 | 0.994 | 0.999 | 1.000 | 1.000 | 1.000 |\n| 45 | 0.554 | 0.751 | 0.914 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 |\n| 55 | 0.559 | 0.772 | 0.934 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |\n| 75 | 0.569 | 0.808 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |\n| 101 | 0.580 | 0.844 | 0.979 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |\n| 201 | 0.612 | 0.923 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |\n| 501 | 0.673 | 0.988 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |\n| 1001 | 0.737 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |\n| 5001 | 0.921 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |\n\nThe following is roughly the same data in a plot (x axis in log scale):\n\nThis “wisdom of the crowds” effect works, at least in this idealized scenario where all experts are independent.\n\nThere’s an interesting flip side: if the “experts” are more likely to\nbe *wrong* (*p* < 0.5), the jury’s performance\nplummets. The majority vote just amplifies the error, converging towards\n0% accuracy (i.e., being reliably wrong). You can see this by swapping\nthe definitions of “right” and “wrong” in the description above.\n\nThis model isn’t new; it goes back to the dawn of the French revolution. 
It was first proposed in 1785 by the Marquis of Condorcet and is known as his Jury Theorem.\n\n## Scaling AI juries\n\nIn [Dumb AI and the software revolution](3h2) I mentioned\na key advantage of generative AI: in addition to working at light speed\n(answering questions in seconds, not days), it can be **scaled\ninstantly**:\n\nThe [infinite monkey] is no longer hitting keys at random, just somewhat stupidly. But at this speed and with instant scaling, the difference is monumental.\n\nIf you can get an AI juror to answer a binary question with an\naccuracy *p* > 0.5, you can\nachieve arbitrarily high performance simply by scaling resources:\nrunning bigger juries.\n\nEven if your AI is barely better than a coin flip (say, *p* = 0.51), you can *still*\nget a jury with *j* > 0.99\naccuracy, though it can be quite expensive (with *p* = 0.51, you’d need\na jury of *N* = 5001 just to hit\n*j* = 0.92).\n\nBut, as the table above shows, if you improve your agent’s accuracy\n*p* to, say, 0.70, with *N* = 25 you’re already above 98%\naccuracy.\n\nAnd, interestingly, scaling the jury doesn’t impact latency. The\nquestions can be executed in parallel, so the total time is determined\nby the slowest agent. I suspect strategies like hedging (e.g., run *N* + *M* jurors, take the\nanswers from the first *N*\nresponders) may be applicable.\n\n## What types of questions?\n\nSo what kinds of questions can this be useful for? I have a few practical, everyday examples in mind:\n\nThis unit test fails. Is the implementation correct (regarding the tested property)? I intend to use this to determine the next step: ask an agent to fix the implementation, or ask it to fix the test.\n\nA variation of the above: Does this code implement a specific property (described in natural language)?\n\nI’ve received a code change (possibly from AI, possibly from a human; it doesn’t matter). Does it adhere to a specific coding style guideline? 
The jury’s answer determines whether I run another agent to fix the style issue.\n\nIs this function’s documentation (docstring) still accurate for the code it describes?\n\nI have written a new [recipe](1f8). Does it uphold one of [my principles for my recipes](3dq)?\n\n## The practical trade-off\n\nI’m excited to have a tool where I can create my own AI juries. Based\non their observed performance (false positive and false negative rates), I can\ntweak the contexts (increase *p*) or adjust the jury size (increase\n*N*), until I find the right\nbalance between final accuracy and cost.\n\nEvery time the jury gets things wrong, I get involved and steer the process. Every time it gets things right, it saves me from analyzing things myself.\n\nThis creates a very concrete trade-off: **spend money (bigger\njuries) to save time (less manual intervention)**.\n\n## Related\n\n- Up: [Essays on AI](3h7)", "url": "https://wpnews.pro/news/fail-loudly-a-plea-to-stop-hiding-bugs", "canonical_source": "https://alejo.ch/3he", "published_at": "2026-06-12 17:42:10+00:00", "updated_at": "2026-06-12 17:50:21.834766+00:00", "lang": "en", "topics": ["generative-ai", "ai-research", "ai-ethics"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/fail-loudly-a-plea-to-stop-hiding-bugs", "markdown": "https://wpnews.pro/news/fail-loudly-a-plea-to-stop-hiding-bugs.md", "text": "https://wpnews.pro/news/fail-loudly-a-plea-to-stop-hiding-bugs.txt", "jsonld": "https://wpnews.pro/news/fail-loudly-a-plea-to-stop-hiding-bugs.jsonld"}}