PaddleOCR-VL-1.5 Achieves 94.5% on OmniDocBench v1.5 with 0.9B VLM
BEIJING — Baidu’s PaddlePaddle team has released PaddleOCR-VL-1.5, a 0.9-billion-parameter vision-language model that sets a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5, the company announced Thursday.
The upgraded multimodal model is designed for robust in-the-wild document parsing, delivering significant gains in table, formula, and text recognition while adding native support for automatic cross-page table merging and cross-page paragraph heading recognition. The release marks the latest advancement in Baidu’s open-source PaddleOCR-VL series, which targets complex, real-world document understanding tasks where traditional OCR systems often struggle with layout fragmentation and diverse visual formats.
According to the arXiv preprint (2601.21957) and the official Hugging Face repository, PaddleOCR-VL-1.5-0.9B builds on the previous version by adopting a progressive three-stage training paradigm using PaddleFormers, Baidu’s high-performance training toolkit built on the PaddlePaddle deep learning framework. The model retains effective post-adaptation strategies from its predecessor while substantially expanding data scale, task diversity, and overall robustness.
Technical Improvements and Capabilities
The 0.9B-parameter model demonstrates clear gains over the prior SOTA model, PaddleOCR-VL, particularly on challenging document elements such as complex tables, mathematical formulas, and dense text layouts. According to the model's developers, the new version mitigates the content fragmentation common in long-document parsing through automatic cross-page table merging and cross-page paragraph heading recognition.
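To make the idea of cross-page table merging concrete, the toy sketch below joins a table fragment that opens one page with the table that closed the previous page. It is an illustration of the concept only, not the mechanism PaddleOCR-VL-1.5 uses: the `Table` structure, the `merge_cross_page_tables` function, and the column-count heuristic are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass
class Table:
    """A parsed table fragment (hypothetical structure for illustration)."""
    header: list[str]
    rows: list[list[str]] = field(default_factory=list)


Block = Union[Table, str]  # a page is a list of tables and plain-text blocks


def merge_cross_page_tables(pages: list[list[Block]]) -> list[Block]:
    """Flatten pages into one block list, joining tables split across pages.

    Toy heuristic (an assumption, not the model's method): a table that is
    the first block on a page continues the previous block when that block
    is also a table with the same number of columns.
    """
    merged: list[Block] = []
    for page in pages:
        for position, block in enumerate(page):
            previous = merged[-1] if merged else None
            is_continuation = (
                position == 0
                and isinstance(block, Table)
                and isinstance(previous, Table)
                and len(block.header) == len(previous.header)
            )
            if is_continuation:
                # A repeated header is dropped; otherwise the fragment's
                # first row was really data, so it is kept as a row.
                if block.header != previous.header:
                    previous.rows.append(block.header)
                previous.rows.extend(block.rows)
            elif isinstance(block, Table):
                # Copy so the caller's input tables are not modified.
                merged.append(Table(block.header, list(block.rows)))
            else:
                merged.append(block)
    return merged
```

Real documents need far more evidence than a matching column count (for example, captions, ruling lines, or learned layout cues), which is the kind of judgment the model is reported to handle natively.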
The model is available on Hugging Face under the PaddlePaddle organization. Recent updates to the repository include support for llama.cpp inference of the vision-language model component, improving accessibility for local deployment and edge-device use cases.
PaddleOCR-VL-1.5 is trained progressively across three distinct stages, as detailed in the technical report. This methodology allows the model to first build foundational vision-language alignment before specializing in document-specific tasks, resulting in stronger generalization across the varied document types and quality levels encountered in real-world scenarios.
Competitive Context
The release positions Baidu’s open-source efforts as a strong contender in the lightweight vision-language model space for document intelligence. At under 1 billion parameters, PaddleOCR-VL-1.5 offers a favorable balance between performance and computational efficiency compared to significantly larger proprietary or open models targeting similar document parsing tasks.
The 94.5% score on OmniDocBench v1.5 is a meaningful improvement on a standardized document understanding benchmark, where even small gains can translate into substantial relative reductions in error rates for enterprise document processing pipelines.
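The arithmetic behind that claim can be sketched as follows. The example treats the benchmark score as a simple accuracy, which is a simplification, and the 92.0% baseline is a hypothetical figure chosen for illustration, not a reported result.

```python
def relative_error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Return the fraction of the old error rate removed by the new accuracy.

    Both arguments are fractions in [0, 1], e.g. 0.945 for 94.5%.
    """
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    if old_error <= 0.0:
        raise ValueError("old_accuracy must be below 1.0")
    return (old_error - new_error) / old_error


# Hypothetical baseline of 92.0% versus the reported 94.5%:
# the error rate falls from 8.0% to 5.5%, roughly a 31% relative reduction.
reduction = relative_error_reduction(0.92, 0.945)
print(f"Relative error reduction: {reduction:.1%}")
```

Under these assumptions, a 2.5-point gain in score removes nearly a third of the remaining errors, which is why gains near the top of a benchmark matter more than their size suggests.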
Impact for Developers and Industry
For developers and enterprises working with document automation, the model’s open release provides an immediately accessible solution for high-accuracy parsing of invoices, research papers, financial reports, and other complex documents. The addition of cross-page understanding capabilities addresses a common pain point in processing multi-page PDFs and scanned materials.
The model’s relatively small 0.9B parameter count makes it suitable for deployment in resource-constrained environments while still delivering state-of-the-art results on document benchmarks. Support for llama.cpp inference further lowers the barrier to local and privacy-sensitive deployments.
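A rough sense of what 0.9B parameters means for memory can be had with a back-of-the-envelope estimate. The sketch below counts weight storage only, at several common numeric precisions; it ignores activations, caches, and runtime overhead, and the precisions listed are generic examples rather than formats the release is confirmed to ship in.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Estimate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 1e9


PARAMETERS = 0.9e9  # 0.9 billion parameters, as stated in the article

for label, bits in [("32-bit float", 32), ("16-bit float", 16),
                    ("8-bit integer", 8), ("4-bit quantized", 4)]:
    print(f"{label:>16}: {weight_memory_gb(PARAMETERS, bits):.2f} GB")
```

At 16-bit precision the weights alone come to about 1.8 GB, which is within reach of many consumer GPUs and some edge devices; actual usage will be higher once inference overhead is included.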
What's Next
The PaddlePaddle team has not yet disclosed a specific timeline for additional updates or larger model variants. The current release focuses on the 0.9B model, with the technical report emphasizing continued improvements in training methodology and data diversity.
Developers can access the model weights, code, and documentation through the official PaddlePaddle/PaddleOCR-VL-1.5 repository on Hugging Face and the accompanying arXiv paper. The release continues Baidu’s open-source contributions to OCR and document AI through the broader PaddlePaddle ecosystem.
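One way to fetch the repository contents is the `snapshot_download` function from the `huggingface_hub` Python package. The sketch below assumes that package is installed and uses the repository name exactly as given above; the `download_model` helper and the local directory name are hypothetical, and the repository's own documentation should be consulted for the recommended inference setup.

```python
REPO_ID = "PaddlePaddle/PaddleOCR-VL-1.5"  # repository name as given in the article


def download_model(local_dir: str = "./PaddleOCR-VL-1.5") -> str:
    """Download the repository files and return the local directory path.

    Hypothetical helper for illustration; requires network access and the
    huggingface_hub package (pip install huggingface_hub).
    """
    # Imported here so this module still loads where the package is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Usage (performs a network download when called):
#     path = download_model()
#     print("Model files saved to", path)
```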
This update arrives as demand grows for efficient, accurate document parsing solutions that can handle the messy reality of real-world documents rather than clean, synthetic data.

