Lab Leaks, Black Holes, and Eggs: Epistemic Case Study Competition

The Future of Life Foundation launched a competition with approximately $200,000 in prizes for developing AI workflows and methodologies that produce reliable, trustworthy knowledge bases. Participants must create generalizable tooling to investigate three challenging case studies: COVID-19 origins, the risk of synthetic black holes from the Large Hadron Collider, and the health impact of eggs. The competition aims to advance the state of the art in epistemic investigations and create reusable, compounding knowledge artifacts that can withstand adversarial scrutiny.

FLF is running a competition to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases, grounded in real-world cases. We’re open-minded on the types of submissions we receive and on how they address the problem.

We’ve set aside approximately $200k for prizes. Winning submissions may receive a prize of $5k-$50k, and if submissions warrant it, multiple $50k prizes are possible. Winners may be offered opportunities for further funded work.

You can express interest (https://docs.google.com/forms/d/e/1FAIpQLSeBqNCI4Klaq6FO8CbhYCxr6cYAUMjeosExOjatfCHYfEvNVQ/viewform?usp=header) right away to receive commentary, information, and updates, whether you’d like to participate or are just interested in the outcomes of the competition.

The heights of human epistemic investigation are impressive and valuable, but rare and difficult to reach; see our abridged collection of strong examples (https://docs.google.com/document/d/1wtKAjpvEiMWn-RpFDi 2Vqcvt5i3sCFPmUt3MtsKOjo/edit?tab=t.ik0s2kqs0a0s).[1] The limiting factor is rarely exquisite insight (though this helps), and more often diligence, a curious and open mindset, and the time and effort needed to do the thorough work of investigating background on a topic: activities AI is well placed to assist with.

Existing AI-assisted knowledge base work demonstrates real pieces of this: agent memory (e.g., Claude Code's memory and skills), LLM-curated personal wikis (Karpathy's, https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f, perhaps the highest-profile), and deep-research tools. But these mostly produce single-user artifacts tuned to one investigator's context, not the kind that travel, combine, or survive especially adversarial scrutiny.
We’re particularly excited by the compounding potential: if structured analyses[2] become reusable, refinable artifacts, every serious investigation enables future work, on the same or related topics, and by the same or different people, to reach further from a more solid epistemic foundation. Who knows, you might even

This competition provides three challenging case studies, with deliberately varied challenge profiles, and invites you to produce tooling and techniques to help people navigate them. First, the debated and impactful question of COVID-19 origins. Second, the risk that the Large Hadron Collider (LHC) creates synthetic black holes (perhaps destroying the Earth). Third, the health impact of eggs (as a human food source). The tooling should be general: we’ll judge against these and also other difficult case studies.

We want to see workflows and methodologies using AI that advance the state of the art in carrying out epistemic investigations and producing compounding knowledge bases. We aren’t asking you to build an entire, robust, fully-featured system. Instead, we’re excited by any submission that advances the state of the art on a component.[3]

We’ve found it useful to think of these investigations as being split into several different layers: ingestion, structure, and assessment (more here: https://www.lesswrong.com/posts/DMswzhPQqkqx2XAma/a-full-epistemic-stack-knowledge-commons-for-the-21st-1). When stacked together and operating in concert, they’d create useful trusted artifacts: something like a superior deep research, generating and interacting with a structured knowledge base, aimed at the truly epistemically discerning consumer.

Below is a set of ideas for potential desiderata for a workflow.
We’d expect most submissions not to focus solely on a single layer, as we’re guessing that for something to be useful it needs to work across the layers. But some discipline in separating these responsibilities may be useful for producing interoperable, shareable, compounding benefits.

How do you take a messy, multi-source evidence base and turn it into something structured enough to reason over?

How do you document the relationships between claims so that the full shape of the argument becomes navigable?

How do you evaluate what to actually believe, or what to look at next, given everything above?

We’ll offer a minimum of $5k to entries which we judge to meaningfully improve on the state of the art in faithful, scalable AI-assisted investigations, and up to $50k for entries which are truly inspiring to us. This might be by, for example, reliably producing accessible, thorough, highly interoperable knowledge-enabling content across diverse domains which is readily shared and expanded on by others.

We aren’t prescribing a single, specific type of submission.[4] A couple of shapes we'd be excited to see:

A submission might be of a different shape, look like one of these, or combine these (for example, a spec including protocol discussion and a reference prototype).

Some stepping-stone alternatives (which could contribute to putting a team in a great position to achieve the biggest wins, but which we expect are unlikely to win the biggest prizes without follow-up work):

Optionally, submit a description of your plan or a briefer, less complete implementation of it by Jun 21, 2026, and we will weigh in on whether the work seems on track for a prize (and potentially provide feedback). Use the main submission form (https://docs.google.com/forms/d/e/1FAIpQLScHGLJRH5ex27i0hpL0wPqyZFqp1ykYqFmJxgvg zYKD6g1mw/viewform) and check the early feedback box.
What we care about most: Would this actually help someone reason better about this case? Does it generalize? Does it scale with improvements to AI or more compute? Does it compound, with multiple people or teams building on each other’s work?

We’ll ask judges to use the following criteria when assessing submissions: Epistemic Case Study Competition - Judging Criteria (https://docs.google.com/document/d/1wtKAjpvEiMWn-RpFDi 2Vqcvt5i3sCFPmUt3MtsKOjo/edit?tab=t.v8o9nnadfvtm).

In addition to the potential prizes, strong entries that demonstrate real promise may also lead to an offer for further funded work with us (we estimate a 75% chance that a $50k-winning entry receives an offer like this).[6]

FLF’s general contest rules (https://docs.google.com/document/d/1rTX-Z23HuR68E9CKn87FmrMin9w rOFjStr3HgzwlXA/edit?tab=t.0 heading=h.9kv9jflko3ab) apply.

We’ve allocated roughly $200k for this competition, with the size of any individual award reflecting how much an entry moves us. We'd rather award fewer, larger prizes for entries that genuinely impress us than spread the pool out. If a wave of strong work arrives, we'll happily expand the total prize pool. Concretely, we expect to award up to:

Want to compete, follow along, or join the conversation? Express interest (https://docs.google.com/forms/d/e/1FAIpQLSeBqNCI4Klaq6FO8CbhYCxr6cYAUMjeosExOjatfCHYfEvNVQ/viewform?usp=header) to receive updates and commentary, and to see how you can participate as the competition unfolds.

We're building toward what we call a full epistemic stack (https://flf.org/projects/epistack/): layered infrastructure for making the provenance, structure, and assessment of knowledge transparent and traversable at scale. We think recent AI advances make this newly tractable, but the hard problems are in methodology and workflow design, as well as usability, not just capability.
Not only do we expect these tools to be of widespread benefit, but we expect some organizations like ours to be eager early adopters (https://flf.org/timelines/). FLF hopes to meaningfully inform its strategy and prioritisation based on insights from these tools, meaning that great work here could move millions of dollars per year and help us and others be more effective.

Much gratitude to Ben Goldhaber (formerly FLF), Joel Chan, Saif Haobsh, Austin Chen, Andreas Stuhlmüller, and Dustin Kimmel for contributions and feedback.

In early 2024, a $100,000 judged debate took place between Saar Wilf (founder of Rootclaim) and Peter Miller on the origins of COVID-19. Over 15 hours of structured argument, two smart people marshalled epidemiological data, viral genetics, Bayesian inference, and institutional analysis to reach opposite conclusions. Two expert judges ruled decisively for zoonosis. Six independent Bayesian analyses of the same evidence spanned 23 orders of magnitude. For more, read Scott Alexander’s detailed writeup (https://www.astralcodexten.com/p/practically-a-book-review-rootclaim).

We feel that the debate videos, judge decisions, and comment threads it links to form one of the richest publicly available records of a complex real-world epistemic dispute on an important issue. And yet all this information is still incredibly difficult to navigate, interrogate, and use to inform one’s beliefs. Further, this intense epistemic effort represents a point in time in a conversation which continues to evolve. We feel this makes it a strong stress test for tools and methods that aim to make reasoning more transparent, traversable, updateable, and trustworthy.

Your job: craft the AI-assisted methodologies that build a structure to help people navigate this topic successfully.
[8]

CERN, home of the world’s largest particle accelerator, the Large Hadron Collider (LHC), has a frequently asked question: Will CERN generate a black hole? (https://home.cern/resources/faqs/will-cern-generate-black-hole) What?? As in some previous science experiments (https://blog.nuclearsecrecy.com/wp-content/uploads/2018/06/1946-LA-602-Konopinski-Marvin-Teller-Ignition-fo-the-Atmsophere.pdf), noting that novel circumstances might produce unprecedented outcomes, some participants had apocalyptic concerns. How were these put to rest? Were they truly? What does that hinge on?

Unlike COVID, this is (we hope) essentially a closed case, and uncontested. It nevertheless rests on a huge body of accumulated and interacting knowledge which enabled scientists (and the officials and public supporting them) to move forward with confidence. The key challenge here may be in probing this argument for its dependencies and key considerations, and perhaps noting the weakest or most speculative points, all in an accessible way.

Are eggs good to eat? Bad to eat? Great in moderation? How can we tell? Does it vary across people, and what predicts this? What else should we be paying attention to here?

This vague and open-ended topic, though mundane, is representative of a huge number of everyday questions, and hopefully also a microcosm of many more impactful debates. Sometimes getting resolution on what are the important things to answer and what are the appropriate ways of knowing is more than half of the challenge.

[1] Forecasting the shape and capability of future AI is difficult, but we are excited to imagine a world where epistemic investigations of this and greater quality are commonplace. We’re aiming to catalyse that path through activities like this competition.

[2] By structure, we mean capturing the relations between different sources, claims, authors, and so on. Who said what and when? What evidence or reasons support that?
What counterarguments or reasons for doubt exist? We give further ideas below. Keeping this structure alive means less loss by compression, and preserving space for nuance (https://aiprospects.substack.com/p/when-ideas-round-to-false), even if we don’t consume it right away.

[3] The human urge to apply technology to knowledge-provision isn’t new: consider libraries, citations, indexes, encyclopedias (including Wikipedia), databases, web search, all of which push the frontier in this space.

[4] Written discussions should aim not to exceed 10 pages, not including appendix-like material and worked examples. Worked examples and fully-fledged example knowledge bases can be arbitrarily sized, within reason, but should be navigable. Consider including curated pointers to particularly effective regions of worked examples. Code should either be brief, legible pseudocode or well-documented and ready to install and run with close to a single click. See here (https://docs.google.com/document/d/1wtKAjpvEiMWn-RpFDi 2Vqcvt5i3sCFPmUt3MtsKOjo/edit?tab=t.s5p8ga2p1drq) for more detail.

[5] Ideally such that judges can easily reimplement on a new case.

[6] One type of further work might be incorporating workflows into forecasting and prediction, perhaps grounded in forecasting bot competitions.

[7] Rootclaim thinks that one reason they lost the debate was that the “structure provided a major advantage to the debater with more memorized knowledge of the issue”.

[8] We envision these as eventually becoming living knowledge bases, not merely snapshots in time.
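
As one possible illustration of the "structure" note above (who said what and when, what supports a claim, what counters it), here is a minimal Python sketch. It is not a format the competition prescribes: the class names `Source`, `Claim`, and `ClaimGraph`, the two relation kinds, and the method names are all assumptions made for this example, and the claim texts are invented placeholders rather than findings about any case study.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical relation vocabulary; a real workflow would likely need more kinds.
RELATION_KINDS = {"supports", "counters"}


@dataclass(frozen=True)
class Source:
    id: str
    author: str
    date: str  # ISO 8601 date string, e.g. "2021-06-01"


@dataclass(frozen=True)
class Claim:
    id: str
    text: str
    source_id: str  # links the claim to who said it and when


class ClaimGraph:
    """Sources, claims, and typed relations between claims."""

    def __init__(self):
        self.sources = {}
        self.claims = {}
        # target claim id -> list of (relation kind, originating claim id)
        self._incoming = defaultdict(list)

    def add_source(self, source):
        self.sources[source.id] = source

    def add_claim(self, claim):
        if claim.source_id not in self.sources:
            raise KeyError(f"unknown source: {claim.source_id}")
        self.claims[claim.id] = claim

    def relate(self, from_id, kind, to_id):
        if kind not in RELATION_KINDS:
            raise ValueError(f"unknown relation kind: {kind}")
        for claim_id in (from_id, to_id):
            if claim_id not in self.claims:
                raise KeyError(f"unknown claim: {claim_id}")
        self._incoming[to_id].append((kind, from_id))

    def provenance(self, claim_id):
        """Return (author, date) for the source of a claim."""
        source = self.sources[self.claims[claim_id].source_id]
        return source.author, source.date

    def related(self, claim_id, kind):
        """Return the claims that stand in relation `kind` to `claim_id`."""
        incoming = self._incoming.get(claim_id, [])
        return [self.claims[origin] for k, origin in incoming if k == kind]

    def unanswered_counters(self):
        """Return ids of counterarguments that nothing in the graph counters."""
        countering = [
            origin
            for relations in self._incoming.values()
            for k, origin in relations
            if k == "counters"
        ]
        # dict.fromkeys removes duplicates while keeping first-seen order.
        return [cid for cid in dict.fromkeys(countering)
                if not self.related(cid, "counters")]


# Placeholder data only: authors, dates, and texts are invented for illustration.
graph = ClaimGraph()
graph.add_source(Source("s1", "Author A", "2020-01-15"))
graph.add_source(Source("s2", "Author B", "2021-06-01"))
graph.add_claim(Claim("c1", "Placeholder main claim", "s1"))
graph.add_claim(Claim("c2", "Placeholder evidence for c1", "s2"))
graph.add_claim(Claim("c3", "Placeholder objection to c1", "s2"))
graph.add_claim(Claim("c4", "Placeholder reply to c3", "s1"))
graph.relate("c2", "supports", "c1")
graph.relate("c3", "counters", "c1")
graph.relate("c4", "counters", "c3")

for claim in graph.related("c1", "counters"):
    author, date = graph.provenance(claim.id)
    print(f"{claim.id} ({author}, {date}): {claim.text}")
```

Keeping relations as explicit, typed edges (rather than folding them into a summary) is one way to preserve the nuance that note describes: a later investigator can ask which objections remain unanswered, here via `unanswered_counters()`, without rereading every source.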