# Direct Preference Optimization Beyond Chatbots

> Source: <https://huggingface.co/blog/Dharma-AI/direct-preference-optimization-beyond-chatbots>
> Published: 2026-06-03 12:55:11+00:00

Image-Text-to-Text • 4B • Updated • 2.14k • 12

# Direct Preference Optimization Beyond Chatbots

[Team Article](/blog)Published June 3, 2026

Using Rejection Pairs From Your Model's Own Failures

In April, we released DharmaOCR, our specialized structured OCR model ([available on Hugging Face](https://huggingface.co/Dharma-AI/Dharma-OCR-LITE)) along with a [paper](https://arxiv.org/abs/2604.14314) detailing the methodology behind it and a benchmark demonstrating its superior quality and cost efficiency.
The paper benchmarked leading vision-language model families - both open-source and commercial - on a structured document extraction task: OCR on Brazilian Portuguese text. Among the reported metrics was text degeneration rate: the frequency with which a model produces a repetition loop instead of a transcription.

Across the tested open-source families, vanilla degeneration rates ranged from below 1% to above 33%. Supervised fine-tuning reduced those rates for most models - but rarely to production-acceptable levels. The pattern points to a structural limitation: SFT optimizes for correct outputs, but does not explicitly penalize degeneration. There appears to be a ceiling on how much task-focused fine-tuning alone can reduce this failure mode ([Text Degeneration Article](https://huggingface.co/blog/Dharma-AI/text-degeneration-a-production-failure-mode)).

A second training stage - applied after supervised fine-tuning (SFT), on the same documents, using the same model - reduced text degeneration in every family tested. No exceptions. Average reduction: 59.4%. Best case: 87.6%.

[
](https://cdn-uploads.huggingface.co/production/uploads/69d815b52c6db28cfdfdd422/o-TBg6d-3_PbbSouY5tGM.png)`Figure 1: DPO reduced degeneration relative to SFT in every family tested - average reduction of 59.4%, peak of 87.6% (Nanonets-OCR2–3B: 1.61% to 0.20%). The direction is invariant; only the magnitude varies.`

That second stage was Direct Preference Optimization (DPO). Almost all published DPO applications target chat alignment - models trained on human judgments about helpfulness or harmlessness (example: Rafailov et al., 2023). OCR carries none of that subjectivity: the task is objective, and there is no conversational context. There is, however, a clear preference signal. A correct transcription is chosen; a degeneration loop is rejected. DharmaOCR used that binary to construct a DPO training set, testing the technique not for alignment, but as a direct mitigation tool for a specific failure mode.

The training signal came from the model itself - specifically from the outputs it produced when it failed. How a failure mode becomes a training signal is a structural question about the failure, not the model.

### The Loop Survives Fine-Tuning

Why SFT has a ceiling on degeneration is still an open question - but the leading conjecture points to loss granularity. SFT trains token by token: each prediction is evaluated in isolation, and a repetition loop is never penalized as a completion-level failure. DPO inverts that logic. The training signal is the full output - chosen or rejected - which means a degenerated completion can be explicitly labeled as the wrong outcome, not just a sequence of locally probable tokens.

When a training objective maximizes the likelihood of observed sequences, it concentrates probability mass in the regions of distribution space those sequences occupy. A model that enters one of those high-probability attractor regions during inference assigns elevated probability to the same token at the next step - which increases the probability further, which sustains the loop until the sequence hits the maximum token limit. Text degeneration is the output of this geometry: a self-reinforcing repetition loop that an autoregressive model cannot exit without external intervention (Holtzman et al., 2020). It is not purely a decoding artifact. The attractor involves the training objective, the learned distribution, and how probability mass concentrates during inference - a systems-level failure rather than a failure localized to any single component.

The geometry of this failure is visible at the token level.

[
](https://cdn-uploads.huggingface.co/production/uploads/69d815b52c6db28cfdfdd422/kJIgJGSsa8HJqLNkm4Ho8.png)`Figure 2: When a token dominates its own conditional distribution, every sampling step deepens the attractor. The decoder samples from this geometry; it does not determine it.`

Inference-layer interventions - repetition penalties, temperature adjustments, early-abort logic - operate on the sampling step. They contain the symptom without touching the distribution that produces it. The attractor persists.

Supervised fine-tuning moves the distribution closer to the task domain. For a structured generation pipeline, this means training on domain-specific documents, in the target language, with the required output format. The model gains fluency with longer sequences, constrained syntax, domain vocabulary. What SFT does not do is attack degeneration directly. Its objective - maximizing the likelihood of observed sequences - has no term that penalizes repetition loops. The failure mode is simply outside the scope of what the training signal optimizes for.

One model family in the DharmaOCR benchmark showed an unexpected pattern: vanilla degeneration rate of 0.60%, rising to 3.23% after SFT, before a subsequent DPO stage brought it to 1.41%. It is a single data point - an exception, not a rule - and it would be overstating the evidence to treat it as proof of a mechanism. What it does illustrate is that SFT does not reliably reduce degeneration. Capability and degeneration resistance can move independently.

The distinction matters structurally. SFT and DPO are not interchangeable training stages performing the same operation at different intensities. SFT closes the distance between the model's prior distribution and the task domain. What it does not do is target degeneration as an objective - its effect on the failure mode is incidental, and the benchmark results show it is not consistent. The attractor that produces degeneration is not a problem with the model's proximity to the task - it is a problem with the shape of the distribution space the model now occupies.

Addressing that geometry requires a training signal built specifically to point the model away from its own failure modes. For a structured, non-conversational task with no human preference labels and no conventional "helpful versus harmful" distinction, constructing that signal is a design decision.

### The Design Decision: Degenerate Outputs as Rejection Pairs

The DharmaOCR pipeline's contribution to DPO methodology is specific: it used the SFT model's own degenerate outputs as the rejected examples - not as noise to remove, but as the negative training signal the optimization needed.

DPO requires preference pairs: a chosen output and a rejected output for the same input, with a quality difference clear enough for the optimization to learn from. In chat alignment, human annotators produce those judgments - rating responses as more or less helpful, accurate, or safe. Structured generation tasks have no equivalent annotation source. An OCR pipeline either produces a correct transcription or it does not. Quality differences exist, but they are not produced by human preference rankings - they are produced by the task's own criteria for correctness.

The DharmaOCR pipeline identified a preference signal that structured generation tasks already produce: the range of outputs the SFT model generates in inference. A model capable of performing a structured task is also capable of failing at it in characteristic ways. Those failures - outputs that enter the degeneration attractor - are not noise to filter. They are the most informative negative signal available.

The paper implemented this on 23,726 training documents, generating multiple candidate responses per document with the SFT model and scoring each with an automated LLM judge. The pipeline is shown below.

[
](https://cdn-uploads.huggingface.co/production/uploads/69d815b52c6db28cfdfdd422/o6z7skgMtLq22aFGkeYvw.png)`Figure 3: The critical design decision is not in the pipeline's structure - it is in what the pipeline preserved: outputs displaying text degeneration were deliberately labeled as rejected examples, not filtered out as low-quality noise.`

The conventional response when degenerate outputs appear in training data is to remove them. They are low-quality signal; filtering produces a cleaner dataset. The DharmaOCR approach inverted this logic. Degenerate outputs were deliberately retained as the rejected examples in each (chosen, rejected) pair, because they represent exactly the failure mode the DPO stage was designed to suppress. Removing them would have discarded the clearest target available.

The paper describes this as "preference-guided implicit unlikelihood" - the model is trained not only toward better outputs but away from a specific class of failure. Where SFT maximizes the likelihood of high-quality outputs, the DPO stage simultaneously penalizes outputs displaying the degeneration attractor geometry. The direction of the optimization is explicit in a way SFT alone cannot achieve.

Degenerate outputs are particularly well-suited as rejection examples because they represent a consistent failure mode rather than varied low-quality outputs. A transcription that misses words is low quality, but its failure is case-specific. Repetition loops, by contrast, appeared persistently across documents and model families even after SFT - a pattern consistent with a failure mode that likelihood-based optimization does not reliably correct. DPO applies its loss differently: at the completion level, with explicit rejection signals. The post-hoc analysis cannot establish causality, but the evidence suggests that what SFT's objective leaves unresolved, DPO's may address.

This approach requires no specialized annotation infrastructure - only a model capable of producing both acceptable and identifiable-failure outputs, and a scoring model to label preference pairs. A rule-based mechanism could detect repetition loops mechanically - but it could not identify which outputs represented high-quality transcriptions worth preserving as chosen examples.

The scoring model does both: it flags degeneration as the rejected output and validates clean extractions as the chosen one, keeping the model's extraction capability intact while the DPO signal penalizes the failure mode. Whether the resulting training signal successfully moves the distribution in the intended direction - and whether it does so consistently across architectures - is the evidence question.

### Consistent Across Five Model Families

The DPO stage reduced text degeneration in every model family tested - with reductions ranging from 37% to 88% and an average of 59.4% relative to SFT alone. The result held across architectures, parameter scales, and starting degeneration profiles that differed by more than one order of magnitude. One case in the dataset saw degeneration increase after the SFT stage before DPO corrected it. That case does not complicate the consistency. It confirms the mechanism more directly than any of the others.

Figure 1 shows the three-stage degeneration rate for each of the five model families tested: Vanilla, SFT, and SFT+DPO. In four of the five families, degeneration falls at each stage. The fifth family's bars move differently - and that difference is the most analytically important data point in the study.

The Qwen2.5-VL-3B result, read carefully, is not a complication. It is a confirmation. The model's vanilla degeneration rate was 0.60% - not because it was stable, but because it was too generic to produce long structured outputs at all. The model was not entering the degeneration attractor because it was not attempting the task seriously enough to find it.

SFT changed that. After domain adaptation, Qwen2.5-VL-3B became capable of the task - producing longer, more structured outputs with the domain vocabulary and format the pipeline required. That capability brought it into proximity with the degeneration attractor for the first time. Its degeneration rate rose to 3.23%.

This is the mechanism made empirically visible: SFT moved the model toward the task and toward the task's failure geometry simultaneously. These are not necessarily the same operation. A training stage that increases task capability can increase failure-mode exposure as a side effect - particularly when the failure mode lives at the edge of the capability frontier. Treated as the same operation, the Qwen2.5-VL-3B result looks like an error. Treated as distinct operations - which is what the SFT + DPO pipeline formally does - the result is consistent with the hypothesis that SFT and DPO address different failure dimensions.

The DPO stage then brought the degeneration rate to 1.41%. It did not restore the vanilla baseline because it was not designed to: the model after SFT was more capable than it had been, and a return to 0.60% would have required undoing that capability. What the DPO stage did was address the failure geometry the SFT stage had introduced.

The remaining four model families add quantitative weight to the same conclusion. Figure 1 shows the SFT-to-SFT+DPO comparison for all five.

[
](https://cdn-uploads.huggingface.co/production/uploads/69d815b52c6db28cfdfdd422/zhNY8YL4WieHlJ1JV0Rd-.png)`Figure 1: DPO reduced degeneration relative to SFT in every family tested - average reduction of 59.4%, peak of 87.6% (Nanonets-OCR2–3B: 1.61% to 0.20%). The direction is invariant; only the magnitude varies.`

No model family showed degeneration increasing after DPO. No family was immune to its effect. The consistency extends to gemma-3–4b-it, which entered the benchmark with the highest vanilla degeneration rate by an order of magnitude - 33.96%, compared to the next highest at 2.62% - and still reached a 75% reduction after the DPO stage. The reduction range - 37.3% to 87.6% - reflects differences in starting configuration and architecture, not inconsistency in the intervention's direction.

This is not a proof of universal applicability. DPO may not transfer to every domain, failure mode, or model family. What the DharmaOCR benchmark provides is evidence across five OCR architectures that the core hypothesis holds: optimizing over complete preference pairs - rather than maximizing token-level likelihood - addresses a failure mode that SFT structurally cannot target. The result was consistent in direction across every model family tested. That consistency, within the scope of this benchmark, is what the evidence supports.

### The Pattern Beyond OCR

The DharmaOCR approach was possible because this pipeline satisfied a set of structural conditions that allowed a DPO training stage to function as designed - conditions whose presence or absence determines whether the same methodology applies elsewhere ([Dharma OCR Paper on ArXiv](https://arxiv.org/abs/2604.14314)). It was not possible because OCR is a unique domain.

The first condition is that the failure mode be identifiable as a distinct class of output, not just a point on a quality continuum. Text degeneration qualifies because a repetition loop is categorically different from a transcription that misses words or misreads a character. The output is not merely suboptimal - it is broken in a specific, behaviorally recognizable way. That categorical distinctness is what allowed the pipeline to construct preference pairs where the rejected examples represented a coherent failure geometry, not noise. A task whose failure modes blend into its range of acceptable variation lacks this property.

The second condition is that a scoring mechanism can reliably distinguish acceptable outputs from failure-mode outputs without requiring human annotation. In the DharmaOCR pipeline, an automated LLM judge scored candidate responses against four task-specific criteria. The scoring did not need to be perfect - it needed to be consistent enough to produce preference pairs with a meaningful quality gap between chosen and rejected. Pairs with ambiguous quality differences contribute noise to DPO training, not signal. The judge's consistency was a design requirement, not an incidental feature.

The third condition is sufficient volume - enough inference outputs to generate a preference dataset with meaningful variance in quality. This is not an extraordinary requirement by fine-tuning standards, but it is a real one.

When all three conditions are present, the methodological move is structurally available. The design decision at the center of the DharmaOCR pipeline - treating the model's own failure outputs as the rejected examples rather than filtering them - applies wherever a model's failures are categorically identifiable, scoreable, and sufficiently numerous.

The practical implication for ML engineers building structured generation pipelines is direct. SFT is necessary - it closes the distance between a generalist model and a task-capable one. It is not sufficient for structured output reliability, because task capability and degeneration resistance are different properties of the distribution. A DPO stage after SFT is a one-time training investment. In the DharmaOCR results, the degeneration reduction did not come at the cost of extraction quality - the paper's benchmark results show both moving together ([Specialization Beats Scale article](https://huggingface.co/blog/Dharma-AI/specialization-beats-scale)).

What makes a failure mode usable as training signal is not the domain - it is whether the failures are consistent enough, identifiable enough, and numerous enough to constitute a legible signal. In the DharmaOCR pipeline, they were. Whether the same holds in another context is a structural question about the task's failure mode, not a question about the model family or the domain.

The DharmaOCR result does not depend on the domain being special. It depends on the failures being useful.

Text degeneration qualifies as useful because it is categorically distinct from acceptable outputs, consistently produced across inference runs, and reliably scoreable without human annotation. Those three properties - not the OCR context, not the model family, not the language - determined whether the preference dataset was tractable. A failure mode that satisfies them is not noise to remove. It is the most direct evidence available of where the distribution should not go.

The DPO stage used that evidence. Degeneration fell in every model family tested - in models that entered the benchmark with vanilla rates below 1% and in models that entered with rates above 33%. The direction held. The pipeline did not discard its failures. It trained on them.

### Sources

- Cardoso, Gabriel Pimenta de Freitas, et al. "DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines."
[arXiv preprint arXiv:2604.14314](https://arxiv.org/abs/2604.14314)(2026). - Dharma AI. "
[Text Degeneration: The Production Failure Mode That LLM Benchmarks Ignore](https://huggingface.co/blog/Dharma-AI/text-degeneration-a-production-failure-mode)." Medium (2026). - Dharma AI. "
[Specialization Beats Scale: A Strategic Variable Most AI Procurement Decisions Overlook](https://huggingface.co/blog/Dharma-AI/specialization-beats-scale)." Medium (2026). - Holtzman, Ari, et al. "The Curious Case of Neural Text Degeneration."
[arXiv preprint arXiv:1904.09751](https://arxiv.org/abs/1904.09751)(2020). - Rafailov, Rafael, et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model."
[arXiv preprint arXiv:2305.18290](https://arxiv.org/abs/2305.18290)(2023).