Research Suggests Probing LLM Hidden States Could Replace Some Full Generations

Research points to faster way to use LLMs for classification

A new approach to large language model inference suggests that some yes-no judgments may already be encoded inside a model before it produces any text. Instead of asking the model to generate a full answer, the method reads its hidden state at the final prompt token and uses a small classifier head to return a score.

The idea is aimed at tasks where users want an LLM to judge whether a piece of text satisfies a criterion written in English. Examples include identifying sarcasm, determining whether a speaker is describing their own feelings or someone else’s, or checking whether a statement reflects current sentiment rather than past sentiment. The proposal argues that these kinds of structural judgments are often too subtle for standard embedding-based classifiers, but too expensive to handle with repeated full LLM generations.

In the approach described by James Padolsey, the model is prompted with content and a criterion, then stopped before it generates a response. The hidden state at the final token is extracted, often from a middle layer where semantic information is thought to be especially rich. That vector is then passed to a small multilayer perceptron, or even a linear probe, which outputs the classification result.
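The read-out step described above can be sketched in a few lines. This is a minimal illustration, not Padolsey's implementation: the nested-list `hidden_states` layout, the helper names `last_token_state` and `linear_probe`, and the weights are all hypothetical stand-ins for values that would come from a real model and a trained probe.

```python
import math
from typing import List, Sequence


def last_token_state(
    hidden_states: Sequence[Sequence[Sequence[float]]], layer: int
) -> List[float]:
    """Return the final token's vector at the chosen layer.

    Assumed layout: hidden_states[layer][token] is one vector.
    """
    return list(hidden_states[layer][-1])


def linear_probe(h: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic linear probe: sigmoid(w . h + b)."""
    if len(h) != len(weights):
        raise ValueError("hidden state and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, h)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy activations: 3 layers, 2 tokens, 4 dimensions (made-up numbers).
hidden_states = [
    [[0.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.3, 0.4]],
    [[0.5, -0.5, 0.0, 1.0], [1.0, -1.0, 0.5, 0.0]],
    [[0.9, 0.9, 0.9, 0.9], [0.2, 0.2, 0.2, 0.2]],
]

h = last_token_state(hidden_states, layer=1)  # middle layer, final token
score = linear_probe(h, weights=[0.5, -0.5, 1.0, 0.0], bias=-1.0)
```

A small multilayer perceptron would replace `linear_probe` with one or more hidden layers; the input vector and the scalar output stay the same.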

The method builds on earlier research into probes and interpretability, but generalizes it. Rather than training a separate probe for each question, the system trains a single probe on many examples whose criteria are written in natural language. The training data consists of triples containing the criterion, the content, and a label indicating whether the criterion is met. Because the criteria vary across examples, the probe is meant to learn the general task of deciding whether a criterion matches a passage, not a single fixed rule.
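The training setup can be illustrated with a toy example. This sketch makes one loud assumption: `toy_features` is a hypothetical word-overlap stand-in for the frozen model's hidden state, used only so the code runs without a model. The `Example` triple and the logistic probe trained by gradient descent follow the description above; the names and hyperparameters are illustrative.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    criterion: str  # natural-language criterion
    content: str    # text being judged
    label: int      # 1 if the criterion is met, else 0


def toy_features(criterion: str, content: str) -> List[float]:
    """Hypothetical stand-in for a hidden state: count of shared words."""
    shared = set(criterion.lower().split()) & set(content.lower().split())
    return [float(len(shared))]


def probe_score(x: Sequence[float], w: Sequence[float], b: float) -> float:
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(
    examples: Sequence[Example],
    featurize: Callable[[str, str], List[float]],
    dim: int,
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit one logistic probe across all criteria with plain SGD."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for ex in examples:
            x = featurize(ex.criterion, ex.content)
            grad = probe_score(x, w, b) - ex.label  # d(log-loss)/dz
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


train = [
    Example("mentions rain", "rain fell all day", 1),
    Example("mentions rain", "the sun was out", 0),
    Example("mentions cats", "two cats slept", 1),
    Example("mentions cats", "a dog barked", 0),
]
w, b = train_probe(train, toy_features, dim=1)

# A criterion that never appeared in training.
unseen_hit = probe_score(toy_features("mentions snow", "snow covered the road"), w, b)
unseen_miss = probe_score(toy_features("mentions snow", "the road was dry"), w, b)
```

Because the probe sees the criterion only through the features, the same weights apply to a criterion it was never trained on, which is the property the varied training criteria are meant to encourage.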

Padolsey says the technique is designed for smaller open models, such as IBM’s Granite 4.0 micro, often with a LoRA adapter to improve performance. At inference time, the prompt ends with a seed token such as “Assessment:” or “Criteria met?” The model is not allowed to continue generating. Instead, the hidden representation at that point is used to make the decision.
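A prompt template of this kind might look like the sketch below. The section labels and layout are assumptions made for illustration; the source specifies only that the prompt contains the content and the criterion and ends with a seed token such as "Assessment:".

```python
def build_probe_prompt(content: str, criterion: str, seed: str = "Assessment:") -> str:
    """Assemble a prompt that stops at the seed token.

    The "Content:" and "Criterion:" labels are hypothetical; only the
    trailing seed token comes from the description above.
    """
    return (
        f"Content:\n{content.strip()}\n\n"
        f"Criterion:\n{criterion.strip()}\n\n"
        f"{seed}"
    )


prompt = build_probe_prompt(
    content="Oh great, another Monday.",
    criterion="The speaker is being sarcastic.",
)
```

The prompt is run through the model once as input, and the hidden state at the last position (the seed token) is what the probe reads.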

A calibration step is also part of the pipeline. Using isotonic regression, the output scores are adjusted so that they behave more like probabilities. The goal is for a score such as 0.7 to mean that roughly 70 percent of items receiving that score actually meet the criterion, in contrast to the loosely defined confidence scores that often come from LLM judges.
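To make the calibration step concrete, here is a minimal sketch of isotonic regression using the pool-adjacent-violators algorithm. It is a simplified step-function version written for illustration (library implementations typically interpolate between points), it assumes distinct raw scores, and the function names and sample data are made up.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def fit_isotonic(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Pool adjacent violators: fit a non-decreasing map from score to frequency.

    Returns (thresholds, values): raw scores up to thresholds[i] map to values[i].
    """
    blocks: List[List[float]] = []  # each block: [label_sum, count, max_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1.0, s])
        # Merge while the previous block's mean is not below the new one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            total, count, hi = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = hi
    thresholds = [blk[2] for blk in blocks]
    values = [blk[0] / blk[1] for blk in blocks]
    return thresholds, values


def calibrate(score: float, thresholds: Sequence[float], values: Sequence[float]) -> float:
    """Look up the calibrated value for a raw probe score."""
    i = bisect_left(thresholds, score)
    return values[min(i, len(values) - 1)]


# Held-out raw probe scores and true labels (illustrative data).
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
truth = [0, 0, 1, 0, 1, 1, 0, 1]
thresholds, values = fit_isotonic(raw, truth)
calibrated = calibrate(0.35, thresholds, values)
```

The fitted values are the observed positive rates within each pooled block, so a calibrated output can be read directly as an empirical frequency on the held-out data.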

The technique is presented as a way to reduce both latency and cost. A single frozen model and a small head can, in principle, act as a flexible classifier across many English-language criteria. That could make it practical to run structural checks on large volumes of text without paying for full-generation responses each time.

The article also describes an optimization for repeated scoring. If the same content must be checked against many criteria, the content can be processed once and cached. Then each criterion can be applied as a short continuation against the cache, rather than re-running the entire prompt from scratch. The trade-off, however, is that the criterion no longer influences every layer of the content encoding in the way it does in a standard cross-encoder setup.
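The saving from caching can be shown with simple token accounting. This is a toy sketch under stated assumptions: whitespace-split words stand in for model tokens, and the hypothetical `ContentPrefixCache` class stands in for reusing a transformer's cached state for the content prefix. It counts work done and does not run a model.

```python
from typing import Dict, List, Tuple


class ContentPrefixCache:
    """Toy stand-in for prefix caching: encode content once, append criteria."""

    def __init__(self) -> None:
        self._prefixes: Dict[str, Tuple[str, ...]] = {}
        self.tokens_processed = 0  # how many tokens were actually "encoded"

    def _encode_content(self, content: str) -> Tuple[str, ...]:
        if content not in self._prefixes:
            tokens = tuple(content.split())
            self.tokens_processed += len(tokens)  # paid only on first sight
            self._prefixes[content] = tokens
        return self._prefixes[content]

    def extend(self, content: str, criterion: str) -> Tuple[str, ...]:
        """Reuse the cached content prefix and process only the criterion."""
        prefix = self._encode_content(content)
        suffix = tuple(criterion.split())
        self.tokens_processed += len(suffix)
        return prefix + suffix


content = "I used to love this place"
criteria: List[str] = ["is sarcastic", "past sentiment", "first person"]

cache = ContentPrefixCache()
sequences = [cache.extend(content, c) for c in criteria]

# Cost of re-running the whole prompt for every criterion instead.
uncached_tokens = sum(len(content.split()) + len(c.split()) for c in criteria)
```

In this toy case the cached path processes the six content tokens once plus two tokens per criterion, while the uncached path pays for the content three times. The gap widens as the content gets longer or the number of criteria grows.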

According to the article, the method is already being used inside a safety product called Predicate at NOPE. The broader claim is that the necessary pieces are ordinary: a prompt template, a small open model, a few thousand generated training examples, a probe, and calibration. The article frames the approach as a practical way to extract decision-making signals from LLMs without asking them to verbalize those decisions.