Fragile Correctness: Cases of reasoning harming performance


Sometimes a reasoning model appears to pass through the correct answer before ending up wrong

Figure 1: From the Opus 4.8 system card (page 196)

Figure 1 shows that Opus 4.8 on max thinking has a lower pass rate on SWE-Bench Pro than Opus 4.8 on x-high thinking. There are further examples of this in the Fable and Mythos system card, shown in the appendix (Figures A1 and A2). In these cases, using more tokens reduced accuracy, which is counter-intuitive. Inference-time scaling helps on average, but not always. Ghosal et al. explore this in more detail, using a different approach.

In this post, we track a model's answer over its chain of thought, look for cases where the answer switches from correct to incorrect by the end of the trace, and examine how this behaviour could be prevented.

Looking at these cases in more detail could be useful for understanding sandbagging better. For example, there could be interesting insights gained by comparing the cases where more reasoning leads to worse performance to cases of sandbagging.

“Fragile correctness” is introduced as a hypothesised state where the model’s current answer distribution favors the correct answer, but continued reasoning can move it away from that answer.

I use the reasoning trajectory probe technique outlined in Ballon et al.

Figure 2: From Ballon et al, outlining the process of probing reasoning traces

To probe the reasoning trajectory of a model, we follow the process shown in Figure 2.

The reasoning trajectory probing was run on 1198 questions from MMLU-Pro (1000 questions split evenly across question categories) and GPQA Diamond (all 198 questions).

The answer outcomes were split into categories for analysis (Figure 3).

Figure 3: Reasoning trace answer categories. The normalised loss category is a subset of answer loss.
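As a rough illustration of how a trace could be assigned to one of these categories, here is a minimal sketch. The category definitions in the docstring are my reading of the text rather than definitions taken from Figure 3, and the function name and argument names are hypothetical. The normalised loss category is left out because its definition is not given in the text.

```python
from typing import List, Optional


def classify_trajectory(
    checkpoint_answers: List[Optional[str]],
    final_answer: Optional[str],
    correct: str,
) -> str:
    """Label one reasoning trace from its probed answers.

    Assumed definitions (not taken verbatim from the post):
      - "final_correct": the final answer is correct
      - "answer_loss":   a pre-final checkpoint was correct, but the final answer is wrong
      - "stable_wrong":  neither any checkpoint nor the final answer was correct
    """
    if final_answer == correct:
        return "final_correct"
    if any(answer == correct for answer in checkpoint_answers):
        return "answer_loss"
    return "stable_wrong"


# Toy traces: the probed answer at each checkpoint, then the final answer.
print(classify_trajectory(["B", "C", "C"], "A", correct="C"))  # answer_loss
print(classify_trajectory(["B", "B", "C"], "C", correct="C"))  # final_correct
print(classify_trajectory(["B", "A", "A"], "A", correct="C"))  # stable_wrong
```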

| Dataset | Answered | Answer loss | Normalised loss |
| --- | --- | --- | --- |
| MMLU-Pro | 1000 | 122, 12.2% | 37, 3.7% |
| GPQA Diamond | 198 | 57, 28.8% | 15, 7.6% |
| Combined | 1198 | 179, 14.9% | 52, 4.3% |

Table 1: 14.9% answer loss across all questions
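The percentages in Table 1 can be recomputed from the raw counts. The short check below uses only the counts reported in the table; the variable names are my own.

```python
# (answered, answer loss, normalised loss) counts from Table 1
counts = {
    "MMLU-Pro": (1000, 122, 37),
    "GPQA Diamond": (198, 57, 15),
}


def rate(n: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * n / total, 1)


# Sum each column across the two datasets to get the combined row.
combined = tuple(sum(column) for column in zip(*counts.values()))

for name, (answered, loss, norm_loss) in {**counts, "Combined": combined}.items():
    print(f"{name}: answer loss {rate(loss, answered)}%, "
          f"normalised loss {rate(norm_loss, answered)}%")
```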

Table 1 shows 14.9% answer loss. This is not direct evidence for the model entering a fragile correctness state: it could just be due to the model guessing the answer and happening to land on the correct answer.

Figure 4 shows the results of rerunning the questions that displayed an answer loss. Of the cases that originally showed an answer loss, 46.8% display an answer loss again. This is much higher than in the final-correct and stable-wrong control cases. However, we could still argue that the answer loss cases are questions that the model struggles on (the questions are hard, so the model is more likely to guess and land on the correct answer).

Figure 5 shows that 25.2%/24% of cases again show an answer loss with answer confidence of >0.9, suggesting that these cases are less likely to be random guessing.

Figure 4: On repeated loss cases, on both datasets 46.8% experienced a loss again. This is much higher than the other question categories.

Figure 5: On the same cases as Figure 4, 25.2%/24% of the original answer loss cases experienced a loss again where the correct answer occurred with probability > 0.9.

If we could detect when the model had entered this fragile correctness state, we could terminate the reasoning trace to preserve the answer. We train four linear probes (Figure 6) on 1500 attempts: 75 questions per dataset (25 answer loss, 25 final correct and 25 stable wrong) across two datasets, with 10 seeds per question. Probes are trained on every 4th layer of Gemma 4-12b-it. For each attempt we use the nine pre-final checkpoints, 10% through 90%, giving 13,500 supervised examples.
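The dataset size follows directly from these numbers, and the checkpoints can be thought of as truncated prefixes of the reasoning trace. The sketch below assumes the 10% to 90% checkpoints are measured in reasoning tokens, which the text does not state; `reasoning_prefixes` is a hypothetical helper, not the code used in the experiments.

```python
from typing import List

CHECKPOINT_PERCENTS = list(range(10, 100, 10))  # 10%, 20%, ..., 90%


def reasoning_prefixes(tokens: List[str]) -> List[List[str]]:
    """Return the nine pre-final prefixes of a reasoning trace.

    Assumption: checkpoints are taken at fractions of the token count.
    """
    n = len(tokens)
    return [tokens[: max(1, n * p // 100)] for p in CHECKPOINT_PERCENTS]


# Dataset size, using only the numbers given in the text.
questions_per_dataset = 25 + 25 + 25   # answer loss, final correct, stable wrong
attempts = questions_per_dataset * 2 * 10   # two datasets, 10 seeds per question
examples = attempts * len(CHECKPOINT_PERCENTS)
print(attempts, examples)  # 1500 13500

# Example on a toy 20-token trace.
trace = [f"tok{i}" for i in range(20)]
print([len(prefix) for prefix in reasoning_prefixes(trace)])
```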

For each example, we reconstruct the prompt plus reasoning prefix, run Gemma 4-12b-it and extract the final-token hidden state at layers 0, 4, 8, ..., 48. We train a linear classifier at each layer for three labels (Figure 6).

Figure 6: The four types of probe trained. Future loss and Future change to wrong are subsets of Future answer flip, and Future loss is a subset of Future change to wrong.
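To make the per-layer probe setup concrete, here is a minimal, dependency-free sketch of training one linear (logistic regression) probe per layer. It does not load Gemma or use real activations: `synthetic_hidden_states` is a hypothetical stand-in that draws two shifted Gaussian clouds, and the dimension, learning rate and epoch count are arbitrary choices of mine. Only the layer indices (0, 4, ..., 48) come from the text.

```python
import math
import random

LAYERS = list(range(0, 49, 4))  # layers 0, 4, ..., 48 as described in the text


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def synthetic_hidden_states(n, dim, rng):
    """Hypothetical stand-in for final-token activations (not real model data)."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = 1.0 if label else -1.0
        X.append([rng.gauss(shift, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


def train_linear_probe(X, y, lr=0.1, epochs=100):
    """Fit logistic regression with plain batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def probe_accuracy(w, b, X, y):
    hits = sum(
        (sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5) == bool(label)
        for x, label in zip(X, y)
    )
    return hits / len(X)


rng = random.Random(0)
layer_accuracy = {}
for layer in LAYERS:
    X, y = synthetic_hidden_states(120, 8, rng)   # one synthetic set per layer
    w, b = train_linear_probe(X[:80], y[:80])     # train split
    layer_accuracy[layer] = probe_accuracy(w, b, X[80:], y[80:])  # held-out split
```

In a real run the synthetic data would be replaced by the extracted hidden states and one of the labels from Figure 6, and a library classifier would likely be a better choice than hand-written gradient descent.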

Figure 7 shows the answer flip detection probe was the most accurate. This makes sense, as it does not also have to detect whether the current answer is correct and the final answer will be wrong. All probes have peak accuracy higher than 0.5 (just guessing), but the accuracy of the correctness probe at this threshold is particularly low.

However, even though the answer flip detection probe is the most accurate, it is the least effective for interventions. Figure 8 shows the future loss probe achieves the highest accuracy increase. It detects when the current prediction is correct and the final prediction is wrong, so it aligns most closely with the intervention objective (terminating at a checkpoint where the answer is correct and the final answer is wrong).

Figure 7: Accuracy of probes by model layer

Figure 8: Accuracy of halting intervention by model layer

The maximum accuracy increase on only answer loss questions was 5.2% and the maximum accuracy increase on only normalised loss questions was 8.3%.
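A halting intervention of this kind can be sketched as follows. This is an illustration under my own assumptions, not the exact procedure used for Figure 8: I assume the trace is stopped at the first checkpoint where the probe score exceeds a threshold, and that the answer at that checkpoint is kept. The trace dictionaries and the 0.5 threshold are hypothetical.

```python
def halt_with_probe(scores, answers, final_answer, threshold=0.5):
    """Return the answer kept after an early-halt intervention.

    Assumed rule: stop at the first checkpoint whose probe score exceeds
    `threshold` and keep that checkpoint's answer; otherwise keep the final answer.
    """
    for score, answer in zip(scores, answers):
        if score > threshold:
            return answer
    return final_answer


def accuracy_change(traces, threshold=0.5):
    """Accuracy with the intervention minus accuracy without it."""
    base = sum(t["final"] == t["correct"] for t in traces)
    halted = sum(
        halt_with_probe(t["scores"], t["answers"], t["final"], threshold) == t["correct"]
        for t in traces
    )
    return (halted - base) / len(traces)


# Toy traces: per-checkpoint probe scores and answers, plus the final answer.
traces = [
    # Answer loss, caught by the probe: halting preserves the correct answer.
    {"scores": [0.1, 0.8, 0.2], "answers": ["B", "C", "C"], "final": "A", "correct": "C"},
    # Final correct, probe stays quiet: nothing changes.
    {"scores": [0.1, 0.1, 0.1], "answers": ["A", "A", "A"], "final": "A", "correct": "A"},
    # False positive: halting early throws away an answer that would have been correct.
    {"scores": [0.9, 0.2, 0.1], "answers": ["B", "B", "D"], "final": "D", "correct": "D"},
    # Another answer loss that the probe catches.
    {"scores": [0.2, 0.3, 0.7], "answers": ["A", "B", "B"], "final": "C", "correct": "B"},
]
print(accuracy_change(traces))  # 0.25
```

The third toy trace shows why a more accurate probe is not automatically a better intervention: a probe that fires on any future answer change will also halt traces that were about to move from a wrong answer to the correct one.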

Previous work by Zhang et al has used correctness probes to truncate reasoning traces to decrease the number of tokens while maintaining accuracy (shown in Figure 9). However, this work doesn’t focus on whether the probe could be used for increasing accuracy in fragile correctness cases. We train a similar correctness probe and test its halting intervention accuracy. Figure 8 shows that the correctness probe intervention doesn’t quite beat the accuracy increase of the future loss probe.

Figure 9: From Zhang et al, using a correctness probe to reduce number of tokens while maintaining high accuracy

We are not certain that this pattern will generalise between models, as the experiments were only run on Gemma 4-12b-it. There is some evidence that it could, as the Ballon et al paper includes a figure showing that a range of models and sizes lose some amount of answers (although the proportion of answer losses is smaller than in my experiments).

Figure 10: Answer losses from Ballon et al (page 19)

The accuracy of the current probes is disappointing. Part of the reason for this could be the relatively low number of positive cases to train the probes on. For example, the future_loss probe had only 1085 positive cases out of 13,500 (8%). Furthermore, for the majority of these cases, the model will have given the correct answer with low confidence, adding noise to the signal. More clean cases would help to train a more accurate probe, which would help to validate the fragile correctness hypothesis.

There is evidence to suggest that models sometimes enter a “fragile correctness” state where the model obtains the correct answer but further reasoning leads to it getting the answer wrong. The current probe intervention technique had limited success in preventing the loss of answers from this state. Further work will focus on improving the probe outcomes, primarily by finding more clean cases to train probes on, and extending the range of models tested. This will determine if the concept of “fragile correctness” is a helpful assessment of model behaviour and has utility in improving model accuracy.

Figure A1: FrontierCode Main accuracy decreases from xhigh to max on both Mythos 5 and Opus 4.8

Figure A2: DeepSearchQA accuracy decreases from medium to high on Mythos 5 and from high to max on Mythos Preview

Source: lesswrong.com (original article)
