Don't let the LLM speak, just probe it

Researchers have developed a method to extract classification answers from large language models (LLMs) by reading their internal hidden states instead of generating text, achieving faster and cheaper results. The technique uses a frozen LLM and a small MLP probe to act as any classifier described in English, returning calibrated probabilities in milliseconds without per-criterion training. This approach bypasses the slow, expensive generation step by capturing the model's already-formed decision from its residual stream at the final prompt token.

2026-06-10

TL;DR: When an LLM reads "here's some text, here's a criterion: does it satisfy it?", the answer often already exists in its hidden state before it generates a single token. So skip generation entirely: grab the hidden state at the last prompt token (~70% of the way up the model's layers), feed it to a tiny MLP, and calibrate the output. Because the training data varies the criterion, you get one frozen model that acts as any classifier you can write in English.

The problem: As part of my work at NOPE (https://nope.net) I need to ask lots of questions about lots of text. Not "what topic is this" questions (embedding classifiers with vanilla cosine distances handle those fine) but structural ones. Given a transcript, I want to know: Is the speaker themselves the one struggling, or are they describing someone else? Is this sarcasm? Does "I used to hate this, but now I love it" express current dislike? Embeddings are mostly blind to that sort of thing; they see hate-words and love-words and a topic.

The usual escalation is an LLM judge: send the text to a big model with a rubric, get prose back, parse it. Judges work, but they're slow, they're pricey if you're running them on everything, and the confidence they report is vibes: a judge's "7/10" isn't a probability of anything.

The thing I eventually internalized is that when an LLM reads a prompt like this:

    <content_to_be_judged>
    I used to hate this product, but honestly now I love it.
    </content_to_be_judged>
    <judgment_criteria>
    The writer currently likes the product.
    </judgment_criteria>
    Criteria met?

…it has already decided the answer before it generates anything (not accounting for CoT, but allow me this small grace). The comparison between criterion and content has been done, inside the forward pass, and the result is sitting there as geometry in the residual stream. Generation (the slow, expensive, parse-the-prose part) is just the model translating a decision it has already made into words.
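To make the prompt shape concrete, here is a minimal sketch of assembling it and of turning "~70% of the way up" into a layer index. The function names `build_prompt` and `probe_layer_index` are hypothetical, the underscore spelling of the tags is an assumption about the template, and the rounding rule is just one reasonable choice; treat the 0.7 depth as a starting point rather than a constant.

```python
def build_prompt(content: str, criterion: str, seed: str = "Criteria met?") -> str:
    """Assemble the judging prompt.

    The probe would read the hidden state at the last token of `seed`,
    so the seed must be the very end of the string (no trailing newline).
    Tag names are an assumption; match them to whatever template you train on.
    """
    return (
        "<content_to_be_judged>\n"
        f"{content.strip()}\n"
        "</content_to_be_judged>\n"
        "<judgment_criteria>\n"
        f"{criterion.strip()}\n"
        "</judgment_criteria>\n"
        f"{seed}"
    )


def probe_layer_index(num_layers: int, depth: float = 0.7) -> int:
    """Map a fractional depth to a 1-based transformer block index.

    depth=0.7 mirrors the "~70% of the way up" rule of thumb; the best
    layer depends on the model and the task, so sweep it in practice.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0.0 < depth <= 1.0:
        raise ValueError("depth must be in (0, 1]")
    return max(1, min(num_layers, round(num_layers * depth)))


if __name__ == "__main__":
    prompt = build_prompt(
        "I used to hate this product, but honestly now I love it.",
        "The writer currently likes the product.",
    )
    print(prompt)
    print("probe layer for a 32-block model:", probe_layer_index(32))
```

Keeping the seed as the final characters of the prompt matters because the probe is trained on the hidden state at exactly that position.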
So: don't let it speak. Take the hidden state at that final token, at some middle-ish layer (where rich representations of meaning tend to live), and train a small MLP (https://en.wikipedia.org/wiki/Multilayer_perceptron), or a simpler linear probe, on it that outputs one number. That's the whole trick.

None of the ingredients are new. Linear probes are old (Alain & Bengio, 2016, https://arxiv.org/abs/1610.01644); the logit lens (https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens) and its tuned descendants (Belrose et al., 2023, https://arxiv.org/abs/2303.08112) read forward-pass internals as in-flight decisions. What's a bit new is using the probe as a general zero-shot classifier, i.e. having the criterion supplied in English at inference time. To be fair, even that is a solved problem at the BERT scale of encoding: see NLI cross-encoders, e.g. GLiClass (https://github.com/Knowledgator/GLiClass). But crucially these will never reach the deeper understanding and causality/reasoning savvy of modern LLMs across 100k+ tokens. They just don't have the parameter count or context size.

My recipe, if you want to do this yourself: "Assessment:". The seed is designed as a prefix to channel geometries; it's not arbitrary.

What you end up with is strange and lovely: one frozen model and one tiny head that together form any classifier. You write the criterion at request time, in English, and get back a calibrated probability in a few tens of milliseconds, for roughly embedding-classifier money. No per-criterion training, ever. A LoRA on the backbone sharpens it, but honestly the base model alone gets you most of the way: the capability is in the model already, and you're just reading it out instead of asking for it. You're getting the non-generative part of generative models.

A note on that optional LoRA, because how it's trained is my favorite part of the recipe. You don't train it to classify anything. You train it to write.
For each training triple, a frontier model is handed the label and writes a one-sentence verdict ("ASSESSMENT: The content does not satisfy the criterion, because…"), justifying a known answer, not re-judging. The LoRA learns, with plain next-token loss, to produce that text from the exact prompt the probe will see at inference. And then at inference none of it is ever generated: we stop at the seed and read the hidden state. The text is scaffolding: its only job is to reshape the geometry at the seed position so the decision is laid out more legibly for the MLP. You teach the model to say the answer so it crystallizes before it speaks... then never let it speak.

A small aside on the seed token, because I find it quietly fascinating. The last token of your prompt isn't just where the prompt stops; it's the address where the answer gets written. Causal attention funnels everything the model has worked out into the position it's about to speak from; end the prompt with a judgment cue and the judgment crystallizes right there, one token wide. The whole industry already relies on this without saying so: nearly every reward model in RLHF is a scalar head on the final token of a templated prompt. Same trick.

The knobs to play with are the seed token itself (e.g. "Assessment: " or "Criteria met? "), its exact position relative to the criterion and content, and the precise layer(s) of extraction. Often it is not the final layer that holds the most value, but somewhere after the middle. This needs lots of experimentation to figure out, and depends on the underlying model and problem domain.

The expensive part of a forward pass is reading the content. So if you want to score one piece of content against twenty criteria (and once you have this hammer, you do), there's an optimization begging to be made: prefill the content once, cache the KV, then run each criterion as a cheap thirty-token continuation against the cache. I guess this can be called KV-popping?
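To see why reusing the content's KV cache is safe when the content comes first, here is a toy sketch: a two-layer, single-head causal attention stack in plain Python with random weights. It is an illustration under stated assumptions, not the production setup; `forward`, `project`, `attend` and the sizes are all hypothetical, and a real model would use its framework's own KV cache rather than lists.

```python
import math
import random


def project(x, w):
    """Multiply row vector x (length d) by matrix w (d x d)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Softmax attention of one query over all keys/values seen so far."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(q))]


def forward(tokens, layers, cache=None):
    """Run tokens through the stack, extending a copy of the per-layer K/V cache.

    Returns (new_cache, hidden state of the last token after the final layer).
    The input cache is copied, so one content prefill can be reused many times.
    """
    if cache is None:
        cache = [([], []) for _ in layers]
    cache = [(list(k), list(v)) for k, v in cache]
    last = None
    for x in tokens:
        h = x
        for (wq, wk, wv), (keys, values) in zip(layers, cache):
            keys.append(project(h, wk))
            values.append(project(h, wv))
            # Causal: this token only sees itself and earlier positions.
            a = attend(project(h, wq), keys, values)
            h = [hi + ai for hi, ai in zip(h, a)]  # residual connection
        last = h
    return cache, last


random.seed(0)
D = 4  # toy hidden size


def rand_matrix():
    return [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(D)]


def rand_tokens(n):
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(n)]


layers = [(rand_matrix(), rand_matrix(), rand_matrix()) for _ in range(2)]
content = rand_tokens(6)                     # stands in for the long content
criteria = [rand_tokens(3) for _ in range(2)]  # short criterion + seed suffixes

content_cache, _ = forward(content, layers)  # prefill the content once
cached = [forward(c, layers, content_cache)[1] for c in criteria]
full = [forward(content + c, layers)[1] for c in criteria]

print("cached and full passes agree:", cached == full)
```

Each criterion works on its own copy of the cache, so the content prefill is never mutated and can be reused for as many criteria as you like.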
It works, and the scores are bit-identical to doing it the long way with the same content-first prompt. Except, of course, the cached content gets encoded before the model has seen any criterion. Whereas if we go the slower route of putting the criterion before the content, then every content token attends to the criterion at every layer. The model reads the content already knowing the question (very beneficial for some sorts of questions). In the cached version, however, it reads blind, and the criterion only gets one look back at the end. On most content this costs nothing: on our real-world eval sets the two paths are statistically identical. But point it at more realistic, longer-form content where the criterion you're trying to extract is not straightforward (e.g. containing counterfactuals or complex phrasing), and the problem needs criterion and content to interact at every layer, which the cache forecloses.

EDIT: Apparently this is a typical cross-encoder versus late-interaction trade-off (ColBERT, https://arxiv.org/abs/2004.12832).

This technique now powers a thing called Predicate (https://labs.nope.net/predicate) that I run inside NOPE's (https://nope.net) safety stack, where "run a structural question against every message of every conversation" is the whole job and judge economics simply don't work, on either cost or latency. But the approach is general and the parts are commodity: a small open model, a prompt template, a few thousand generated triples, an MLP, isotonic regression. An afternoon of plumbing, honestly. If you build one, I'd love to hear where it breaks for you.

By James (https://j11y.io). Thanks for reading :)