{"slug": "human-judgment-as-a-specification", "title": "Human Judgment as a Specification", "summary": "Researchers at Brown University propose a human-in-the-loop approach to formal specification generation using LLMs, introducing PICK to help programmers select correct formal specifications from multiple LLM-generated options without requiring deep formal methods expertise.", "body_md": "## [Human Judgment as a Specification](https://blog.brownplt.org/2026/06/09/pick.html)\n\nTags: [Large Language Models](/tags/LLMs/), [Semantics](/tags/semantics/), [Tools](/tags/tools/), [Linear Temporal Logic](/tags/LTL/)\n\n*Posted on 09 June 2026.*\n\nThe rise of GenAI in programming clearly requires an accompanying rise\nin formal methods, to confirm that AI systems running wild are producing the\nsolutions we actually want. That in turn requires that we specify what we\n*want*. This specification is necessarily mathematical, to take advantage of the\nformal methods tools. But most programmers know far less about formal specification\nthan they do about programming. What can they do?\n\n### Getting Specifications the Bad Way\n\nThe key problem we’re tackling is: how do we go from the informal (usually prose)\nto the formal?\nA natural solution is: use LLMs to translate prose into formal\nspecifications. On the one hand, this is not absurd: LLMs can do a fairly good\njob at generating terms in many contemporary formal notations. Here’s\n[Ron Minsky](https://signalsandthreads.com/future-of-programming/#5515-1),\ntongue-in-cheek:\n\n> I wonder if a more plausible model is, you go to your large language model and say, ‘Please write me a specification for a function that sorts a list.’ And then it, like, spits something out. And then you look at it and think, yeah, that seems about right.\n\nRichard Eisenberg’s response to this gets to the heart of the matter: How can we be\nsure that the generated specification is the *right* one? The human may have\njust plain been wrong. 
They may have been obviously wrong, or they could have\nbeen wrong in subtle ways. They may have been ambiguous, and the LLM may\nhave taken the wrong interpretation. There may also be\n[common](/2022/11/05/little-tricky-logics.html)\n[misconceptions](/2024/07/07/little-tricky-logics-2.html)\nabout the language (which may then also be embedded in the language models).\nOr they may have been referring to things for which there is no clear ground\ntruth, or the only truth is in their head: what they mean depends on *their*\ncontext. None of these problems is fixable by an LLM alone.\n\n### Humans in the Loop\n\nWe therefore think it’s important for humans to be in the loop while formalizing\nspecifications. A true vibe-coder, by definition, isn’t going to care. Instead,\nwe want to target the **responsible** programmer: they care about their work\nquality, but they are also human, i.e., busy, lazy, and so on. What can we do to\nhelp *them*?\n\nWe believe any solution should have two key characteristics:\n\n- It must be *meaningful*. Asking humans to pass judgment on complex and abstract statements is unlikely to be effective. Laziness, automation bias, inability to form good judgments, and a desire to get things done will all lead to meaningless confirmation.\n- It must be *moderate*. Asking lots of questions, no matter how simple, can be exhausting and will also lead to errors as the number of questions grows. We should try to make every human action be highly impactful and not ask users to perform too many actions.\n\nObserve that it’s easy to have one and not the other. Telling people “you must read all the code generated by an LLM” is definitely meaningful, but it is not at all moderate (so most people won’t do it). Classical security alerts, which ask just one yes/no question, are moderate in effort, but not meaningful (because the alternative is to not get the job done). 
The challenge is to push along both dimensions just far enough to get real value but not so far as to lose people.\n\n### Our Solution\n\nA few months ago, [we posted about PICK](/2025/12/11/pick-regex.html),\na tool we built to help us make better use of LLMs to generate regular\nexpressions. You give the LLM a prompt and get back not one regex but rather\nseveral plausible ones. Your usual options are to take what the model gives you,\nor to read the regexes yourself and try to figure out which is right. PICK does\nsomething else: it shows you *concrete strings* chosen to *distinguish* the\ncandidates from one another, and asks you to upvote or downvote each. The regex\nthat survives your votes wins. It’s worth reading that post to get a sense of\nthe workflow.\n\nThat post doesn’t mention two important things.\n\nFirst, we now have experimental results showing that this workflow works very well.\n\nSecond, this isn’t tied to regular expressions alone. We have built\nPICK for three illustrative domains so far, intentionally chosen to be unlike\none another: **regular expressions**, **linear temporal logic (LTL)**,\nand **attribute-based access control (ABAC)**. In all three the algorithm is the\nsame: generate candidates, sample from set differences, present a pair of\nscenarios, update scores, converge or admit defeat. The workflow does not have\nto be redesigned per domain.\n(Readers of our older post on [differential analysis](/2024/06/27/differential-analysis.html) will see the family resemblance. 
PICK is the in-the-loop version, where the semantic differences between still-viable candidates drive the next question.)\n\nWhat lets the same algorithm work across all three domains is that they share two key properties:\n\n- **Closure under negation and intersection**, so the difference between two candidates is itself expressible.\n- **Sampling from that difference**, so the system can show the user concrete cases where the candidates disagree.\n\nThe machinery here is not exotic. It is the stuff of a sophomore theory-of-computation course: closure properties, set differences, and witness generation. Many of the formalisms programmers use every day already have it, either inherently (Boolean logic, network routing rules, package-version constraints) or by the standard trick of bounding the universe of discourse (almost anything at which you would point a SAT/SMT solver). The properties everyone is told in class are important and then never quite sees applied are, in 2026, what stand between you and a confidently wrong access-control policy. And their use here is motivated by, of all things, cognitive science principles. So at the very least, maybe we can improve how we teach the theory of computation!\n\n### How Synthesis Also Subtly Fails\n\nSo yes, PICK is a *validation* workflow: you have some intent, the model proposes candidates, and PICK helps you check those candidates against what you meant. But that framing undersells the idea. What PICK also does is recover something synthesis tends to erase: an independent witness to user intent.\n\nTo see why that witness matters, it helps to remember what verification was for.\n\nVerification is famously written *P* ⊧ ɸ: a program *P* implements a property ɸ.\nThe check is informative precisely because *P* and ɸ are written\n*independently*. If both encode the same misconception, agreement rules out\nnothing; the redundancy disappears. (And this is the danger of having both sides\nof the verification coin generated by an LLM. 
PICK intervenes to make sure the\nLLM is not the *only* source of ɸ.)\n\nNow consider synthesis: ɸ ⟹ *P*. The program is correct-by-construction.\nHowever, that means it is also\n**in**correct-by-construction. When ɸ is wrong, the resulting *P* is wrong in\nprecisely the same way, and no cross-check of *P* against ɸ can catch that. This was\nalready true of classical deductive synthesis and of programming-by-example. It\nis wildly more true of synthesis-by-LLM. The\nLLM only sees the user’s natural language, which is a lossy hint of what they want.\n\nPiling on more LLMs does not necessarily fix this. They share training data,\nshare priors, and often share misconceptions. More models give you more\n*agreement*, faster. They do not necessarily give you more *redundancy*, which\nis what verification has always been about. And none of them can know the user’s true\nintent, which is exactly what PICK is about.\n\n### Human Judgment as Specification\n\nIn PICK,\nthat independent witness is not a separately-written spec. It lives in the user’s classifications: each accept or reject is a commitment to a concrete behavior, and the candidates that survive must be consistent with all of them together. Taken as a whole, those commitments expose what the original prompt left unstated. Suppose the prompt was “a regex for dates”, and the model came back with several candidates. PICK puts strings in front of you: yes to `1/15/2025` and no to `13/01/2025` declares a position on day-month-year versus month-day-year, a question the prompt left implicit and the user may never have answered formally, even to themselves. The user arrives with a vague intent; PICK helps sharpen it (call it spec *elucidation*), not by interrogating them about formulae but by forcing them to commit on questions the prompt leaves implicit.\n\nThis is also why PICK can usefully *fail*. Sometimes none of the model’s candidates is right, and PICK ends with zero survivors. 
Under the spec-elucidation reading, that outcome means: the commitments you made through classification could not be satisfied by anything the model produced. Better to know than to ship the regex anyway.\n\nThis is also why we do not believe PICK becomes less useful as models improve. Better models do not make user intent more articulate — asked for “a regex matching countries of North America”, a more capable model still cannot tell you whether *you* want the Caribbean included, or where *you* want to stop heading south. Better models produce better candidates, faster — which shifts user effort precisely *toward* the work PICK is built to support.\n\nTo learn more, read our [ECOOP 2026 paper](https://www.siddharthaprasad.com/papers/pafk-pick.pdf), or try out the [PICK:Regex tool in VS Code](https://marketplace.visualstudio.com/items?itemName=SiddharthaPrasad.pick-regex).\n\nIf you have a formal language with the closure properties above — we suspect you would be surprised how many do — we would very much like to hear from you.", "url": "https://wpnews.pro/news/human-judgment-as-a-specification", "canonical_source": "https://blog.brownplt.org/2026/06/09/pick.html", "published_at": "2026-06-17 12:45:10+00:00", "updated_at": "2026-06-17 12:52:28.794121+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "developer-tools"], "entities": ["Brown University", "Ron Minsky", "Richard Eisenberg", "PICK"], "alternates": {"html": "https://wpnews.pro/news/human-judgment-as-a-specification", "markdown": "https://wpnews.pro/news/human-judgment-as-a-specification.md", "text": "https://wpnews.pro/news/human-judgment-as-a-specification.txt", "jsonld": "https://wpnews.pro/news/human-judgment-as-a-specification.jsonld"}}