OCR Just Got Better: DPO Slashes Text Degeneration by 59%

In April, we introduced DharmaOCR, our structured Optical Character Recognition (OCR) model, which is now available on Hugging Face. Its release was accompanied by a paper describing the methodology and a benchmark in which the model showed strong transcription quality and cost efficiency.

Our research benchmarked leading vision-language model families, both open-source and commercial, on structured document extraction from Brazilian Portuguese text. One of the key metrics was the text degeneration rate: how often a model falls into a repetitive loop instead of delivering an accurate transcription.
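To make the metric concrete, here is a minimal sketch of how a degeneration rate could be measured over a batch of model outputs. The detection rule (a repeated word n-gram), the function names, and the thresholds are illustrative assumptions, not the criterion used in the DharmaOCR benchmark.

```python
from collections import Counter


def looks_degenerate(text: str, ngram: int = 4, max_repeats: int = 8) -> bool:
    """Hypothetical heuristic: flag text in which one word n-gram recurs
    at least `max_repeats` times. The thresholds are illustrative only."""
    words = text.split()
    if len(words) < ngram:
        return False
    counts = Counter(
        tuple(words[i : i + ngram]) for i in range(len(words) - ngram + 1)
    )
    return max(counts.values()) >= max_repeats


def degeneration_rate(outputs: list[str]) -> float:
    """Fraction of outputs flagged as degenerate (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    return sum(looks_degenerate(o) for o in outputs) / len(outputs)


if __name__ == "__main__":
    batch = [
        "Total: R$ 10,00 " * 20,          # stuck in a loop
        "Nome: Maria Silva CPF: 123",     # normal transcription
    ]
    print(degeneration_rate(batch))
```

A real pipeline would need a rule tuned to the documents at hand, since tables and forms can legitimately repeat short phrases.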

Unmasking the Problem: Text Degeneration

Our findings revealed a wide range in vanilla (pre-fine-tuning) degeneration rates among open-source model families, spanning from under 1% to over 33%. Supervised fine-tuning (SFT) reduced these rates for most models, but it rarely brought them down to levels acceptable for production. This points to a fundamental limitation: SFT optimizes for correct outputs but does not explicitly penalize degeneration.

There appears to be a ceiling on how much task-focused fine-tuning alone can mitigate this failure mode. SFT trains models token by token: each next-token prediction is scored against the reference transcription, given the reference tokens that precede it. As a result, a repetition loop is never penalized as a complete, unacceptable output, and the underlying structural issue goes unaddressed.

Direct Preference Optimization to the Rescue

The good news is that a second training stage, applied after supervised fine-tuning and using the same documents and model, reduced text degeneration in every model family we tested, without exception. The average reduction was 59.4%, and the best case saw an 87.6% drop in degeneration.

This second stage is called Direct Preference Optimization (DPO). DPO is commonly used for chatbot alignment, where it teaches models to be more helpful or harmless based on human judgments. DharmaOCR applies it differently: not for subjective alignment, but as a direct mitigation tool for a very objective problem, text degeneration in OCR.

Unlike conversational AI, OCR has no subjective preferences or chat context. However, it offers a clear binary signal: a correct transcription is chosen, and a degeneration loop is rejected. DharmaOCR used this objective distinction to construct a DPO training set, turning the model’s own failures into a learning signal.

The Power of Rejection Pairs: Learning from Failures

So, how does DPO accomplish what SFT cannot? The answer lies in how it processes the training signal. DPO compares complete outputs: for each input, it scores the full chosen completion against the full rejected one. A degenerated completion can therefore be explicitly labeled as the wrong outcome as a whole, rather than treated as just a sequence of locally probable tokens.
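The sequence-level comparison can be sketched with the standard DPO loss for a single preference pair. This is the textbook formulation, not code from DharmaOCR; the function names and the value of beta are illustrative assumptions, and the reference model is usually the frozen SFT checkpoint.

```python
import math


def sequence_logprob(token_logprobs: list[float]) -> float:
    """Log-probability of a whole completion: the sum over its tokens."""
    return sum(token_logprobs)


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # illustrative value; the source does not state one
) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy being trained and under a frozen reference model."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy has moved toward the chosen output and away from the loop.
    print(dpo_loss(-8.0, -30.0, -10.0, -10.0))
```

Because the inputs are whole-sequence log-probabilities, the loss falls only when the policy makes the entire rejected completion less likely relative to the chosen one, which is the property a per-token SFT loss lacks.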

When an autoregressive model falls into a degeneration loop, it is essentially entering a high-probability “attractor region” of its output distribution. Each repeated token raises the probability of repeating again at the next step, which sustains the loop. Inference-layer interventions, like repetition penalties, only contain the symptom; they do not fix the underlying distribution. DPO tackles the root cause by explicitly shifting probability mass away from these failure patterns.

Our approach for DharmaOCR used the SFT model’s own degenerate outputs as the rejected examples. Instead of filtering out these “low-quality” signals, we purposefully retained them as the negative training signal DPO needed. This seemingly counterintuitive choice proved effective.

DPO requires preference pairs: a desired output and an undesired output for the same input.
In structured generation tasks like OCR, there are no human preference ratings.
DharmaOCR identified the model’s own consistent failures — degeneration loops — as the most informative negative signal available.
These degenerate outputs were deliberately preserved and labeled as rejected examples, serving as a clear target for DPO to suppress.
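The pair construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `PreferencePair` type, the `build_pairs` helper, and the crude loop detector are hypothetical, and using the reference transcription as the chosen output is an assumption about where the correct transcription comes from.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # identifier of the input document or page
    chosen: str    # correct transcription (assumed here: the reference text)
    rejected: str  # the SFT model's degenerate output for the same input


def has_repeat_run(text: str, min_run: int = 10) -> bool:
    """Crude stand-in detector: True if one word repeats `min_run` times in a row."""
    words = text.split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_run:
            return True
    return False


def build_pairs(samples: list[tuple[str, str, str]]) -> list[PreferencePair]:
    """Keep (prompt, reference, sft_output) triples whose SFT output degenerated,
    and turn each one into a chosen/rejected pair."""
    return [
        PreferencePair(prompt, reference, sft_output)
        for prompt, reference, sft_output in samples
        if has_repeat_run(sft_output)
    ]


if __name__ == "__main__":
    samples = [
        ("page-1", "Valor: 10", "Valor: " + "10 " * 30),  # loop, becomes a pair
        ("page-2", "Nome: Maria", "Nome: Maria"),          # fine, is skipped
    ]
    for pair in build_pairs(samples):
        print(pair.prompt, "->", pair.rejected[:20])
```

The prompt/chosen/rejected layout mirrors the shape preference-tuning libraries commonly expect, so pairs built this way could be exported with little extra work.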

Consistent Results Across Diverse Models

The DPO stage delivered consistent results across all five model families tested, regardless of their architecture, parameter scale, or initial degeneration profile. Reductions ranged from 37% to 88%, averaging 59.4% relative to SFT alone. This consistency suggests that the approach is robust rather than specific to one model.

One particular case, Qwen2.5-VL-3B, provided a useful confirmation of this mechanism. Its vanilla degeneration rate was a low 0.60%, not because the model was stable, but because it was not capable enough to produce long, structured outputs. After SFT, its capabilities increased, but so did its degeneration rate, rising to 3.23% as the model began producing the long outputs in which loops occur.

The subsequent DPO stage then brought its degeneration rate down to 1.41%. This example illustrates that SFT improves capability, while DPO specifically corrects the distribution issues that lead to degeneration. The distinction between SFT adapting the model to the task and DPO reshaping its output distribution to avoid specific failure modes is crucial, and it is supported by our findings.
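As a quick arithmetic check, the relative reduction for this model can be derived from the two rates quoted above. The per-model percentage is our own calculation from those rates, not a figure stated in the source, and the helper name is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`, relative to `before`."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Qwen2.5-VL-3B: 3.23% after SFT, 1.41% after DPO.
    print(round(relative_reduction(3.23, 1.41), 1))  # 56.3
```

A drop of roughly 56% sits inside the 37% to 88% range reported across the five families.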

Source: Hugging Face Blog

Kristine Vior

With a deep passion for the intersection of technology and digital media, Kristine leads the editorial vision of HubNextera News. Her expertise lies in deciphering technical roadmaps and translating them into comprehensive news reports for a global audience. Every article is reviewed by Kristine to ensure it meets our standards for original perspective and technical depth.
