# AI juries

This is part of a series I’m writing on generative AI.

State: Withdrawn. This text is predicated on an unjustified assumption: that multiple evaluations of the same prompt by the same generative AI are equivalent to independent jurors. That may not be true, in which case the conclusion doesn’t follow.

## Condorcet’s Jury Theorem

Imagine a jury of N experts trying to decide if a binary (true/false) fact is true. Each expert is right with probability p, independently of the others. Since they are experts, assume p > 0.5: they’re at least better than a coin flip.

We can ask for a vote: have each expert state their binary answer and take the most popular answer. Let’s call the probability that the majority of the jury is right j. This “jury accuracy” probability increases dramatically as N grows. For example, with a jury of N = 25 experts, each likely to be right with a probability p = 0.75, the probability j that the majority is right is already 99.663%.

The table below shows how powerful this effect is.
It shows the jury accuracy j for different jury sizes N (rows) and individual expert accuracies p (columns):

| N (experts) | p = 0.51 | p = 0.55 | p = 0.60 | p = 0.70 | p = 0.75 | p = 0.80 | p = 0.90 | p = 0.95 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.510 | 0.550 | 0.600 | 0.700 | 0.750 | 0.800 | 0.900 | 0.950 |
| 3 | 0.515 | 0.575 | 0.648 | 0.784 | 0.844 | 0.896 | 0.972 | 0.993 |
| 5 | 0.519 | 0.593 | 0.683 | 0.837 | 0.896 | 0.942 | 0.991 | 0.999 |
| 11 | 0.527 | 0.633 | 0.753 | 0.922 | 0.966 | 0.988 | 1.000 | 1.000 |
| 15 | 0.531 | 0.654 | 0.787 | 0.950 | 0.983 | 0.996 | 1.000 | 1.000 |
| 25 | 0.540 | 0.694 | 0.846 | 0.983 | 0.997 | 1.000 | 1.000 | 1.000 |
| 35 | 0.547 | 0.725 | 0.886 | 0.994 | 0.999 | 1.000 | 1.000 | 1.000 |
| 45 | 0.554 | 0.751 | 0.914 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 |
| 55 | 0.559 | 0.772 | 0.934 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| 75 | 0.569 | 0.808 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 101 | 0.580 | 0.844 | 0.979 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 201 | 0.612 | 0.923 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 501 | 0.673 | 0.988 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1001 | 0.737 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 5001 | 0.921 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

The following is roughly the same data in a plot (x axis in log scale):

[Figure: jury accuracy j as a function of jury size N, for the values of p in the table; x axis in log scale.]

This “wisdom of the crowds” effect works, at least in this idealized scenario where all experts are independent.

There’s an interesting flip side: if the “experts” are more likely to be wrong (p < 0.5), the jury’s performance plummets. The majority vote just amplifies the error, converging towards 0% accuracy (i.e., being reliably wrong). You can see this by swapping the definitions of “right” and “wrong” in the description above.

This model isn’t new; it goes back to the eve of the French Revolution. It was first proposed in 1785 by the Marquis de Condorcet and is known as his Jury Theorem.
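For readers who want to reproduce the numbers, here is a minimal sketch of how the jury accuracy can be computed. It assumes an odd N (so there are no ties) and fully independent jurors, in which case j is the binomial tail: the sum over k from (N + 1)/2 to N of C(N, k) · p^k · (1 − p)^(N − k). The function name `jury_accuracy` is only illustrative; it is not taken from any tool mentioned in this essay.

```python
import math


def jury_accuracy(n: int, p: float) -> float:
    """Probability that a strict majority of n independent jurors is right."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer (no ties)")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    log_p, log_q = math.log(p), math.log1p(-p)
    terms = []
    for k in range(n // 2 + 1, n + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), kept in log space so that
        # large juries (e.g. n = 5001) neither overflow nor underflow.
        log_term = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q
        )
        terms.append(math.exp(log_term))
    return math.fsum(terms)


if __name__ == "__main__":
    for n in (1, 3, 25, 5001):
        print(n, round(jury_accuracy(n, 0.51), 3), round(jury_accuracy(n, 0.75), 3))
```

Calling `jury_accuracy(25, 0.25)` illustrates the flip side: under the same assumptions it is exactly one minus `jury_accuracy(25, 0.75)`, so a jury of worse-than-chance jurors is reliably wrong.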
## Scaling AI juries

In Dumb AI and the software revolution (3h2) I mentioned a key advantage of generative AI: in addition to working at light speed (answering questions in seconds, not days), it can be scaled instantly:

> The infinite monkey is no longer hitting keys at random – just somewhat stupidly. But at this speed and with instant scaling, the difference is monumental.

If you can get an AI juror to answer a binary question with an accuracy p > 0.5, you can achieve arbitrarily high performance simply by scaling resources: running bigger juries.

Even if your AI is barely better than a coin flip (say, p = 0.51), you can still get a jury with j > 0.99 accuracy, though it can be quite expensive (with p = 0.51, you’d need a jury of N = 5001 just to hit j = 0.92). But, as the table above shows, if you improve your agent’s accuracy p to, say, 0.70, with N = 25 you’re already above 98% accuracy.

And, interestingly, scaling the jury doesn’t impact latency. The questions can be executed in parallel, so the total time is determined by the slowest agent. I suspect strategies like hedging (e.g., run N + M jurors, take the answer from the N first responders) may be applicable.

## What types of questions?

So what kinds of questions can this be useful for? I have a few practical, everyday examples in mind:

- This unit test fails. Is the implementation correct (regarding the tested property)? I intend to use this to determine the next step: ask an agent to fix the implementation, or ask it to fix the test.
- A variation of the above: Does this code implement a specific property (described in natural language)?
- I’ve received a code change (possibly from AI, possibly from a human; it doesn’t matter). Does it adhere to a specific coding style guideline? The jury’s answer determines if I run another agent to fix the style issue.
- Is this function’s documentation (docstring) still accurate for the code it describes?
- I have written a new recipe (1f8). Does it uphold one of my principles for my recipes (3dq)?
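The parallel, hedged vote described earlier could look roughly like the sketch below. Everything in it is an assumption made for illustration: `simulated_juror` is a hypothetical stand-in for a real model call on one of the yes/no questions above, the correct answer is taken to be True, and each juror is right with probability p, independently of the others and of how fast it responds. Real repeated calls to the same model may violate both kinds of independence.

```python
import asyncio
import random


async def simulated_juror(delay: float, vote: bool) -> bool:
    # Hypothetical stand-in for one model call: waits, then returns its verdict.
    await asyncio.sleep(delay)
    return vote


async def hedged_jury(n: int, extra: int, p: float, seed: int = 0) -> bool:
    """Start n + extra jurors in parallel; majority vote over the first n to answer."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer (no ties)")
    rng = random.Random(seed)
    # Assumption: the correct answer is True and each juror is right with
    # probability p, independently of the others and of its response time.
    tasks = [
        asyncio.create_task(
            simulated_juror(rng.uniform(0.0, 0.05), rng.random() < p)
        )
        for _ in range(n + extra)
    ]
    votes = []
    for next_done in asyncio.as_completed(tasks):
        votes.append(await next_done)
        if len(votes) == n:
            break
    # Cancel the stragglers so they stop consuming resources.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return 2 * sum(votes) > n


if __name__ == "__main__":
    print(asyncio.run(hedged_jury(n=25, extra=5, p=0.7)))
```

Because all jurors start at once, the wall-clock time is roughly that of the N-th fastest response rather than the slowest one, which is the point of the extra M jurors.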
## The practical trade-off

I’m excited to have a tool where I can create my own AI juries. Based on their observed performance (false positive and false negative ratios), I can tweak the contexts (increase p) or adjust the jury size (increase N), until I find the right balance between final accuracy and cost.

Every time the jury gets things wrong, I get involved and steer the process. Every time it gets things right, it saves me from analyzing things myself. This creates a very concrete trade-off: spend money (bigger juries) to save time (less manual intervention).

## Related

- Up: Essays on AI (3h7)