Don't let the LLM speak, just probe it

Researchers have developed a method to extract classification answers from large language models (LLMs) by reading their internal hidden states instead of generating text, achieving faster and cheaper results. The technique uses a frozen LLM and a small MLP probe to act as any classifier described in English, returning calibrated probabilities in milliseconds without per-criterion training. This approach bypasses the slow, expensive generation step by capturing the model's already-formed decision from its residual stream at the final prompt token.

2026-06-10

TL;DR: When an LLM reads "here's some text, here's a criterion: does it satisfy it?", the answer often already exists in its hidden state before it generates a single token. So skip generation entirely: grab the hidden state at the last prompt token, ~70% of the way up the model's layers, feed it to a tiny MLP, and calibrate the output. Because the training data varies the criterion, you get one frozen model that acts as any classifier you can write in English.

The problem

As part of my work at NOPE (https://nope.net) I need to ask lots of questions about lots of text. Not "what topic is this" questions (embedding classifiers with vanilla cosine distances handle those fine), but structural ones. So, given a transcript, I want to know:

- Is the speaker themselves the one struggling, or are they describing someone else?
- Is this sarcasm?
- Does "I used to hate this, but now I love it" express current dislike?

Embeddings are mostly blind to that sort of thing; they see hate-words and love-words and a topic.

The usual escalation is an LLM judge: send the text to a big model with a rubric, get prose back, parse it. Judges work, but they're slow, they're pricey if you're running them on everything, and the confidence they report is vibes: a judge's "7/10" isn't a probability of anything.

The thing I eventually internalized is that when an LLM reads a prompt like this: