ERNIE 5.0: A 2.4 Trillion-Parameter Unified Multimodal Foundation Model
Breaking News · Mar 8, 2026 · 3 min read

ERNIE 5.0: Baidu Unveils 2.4 Trillion-Parameter Unified Multimodal Model

BEIJING — Baidu on Thursday introduced ERNIE 5.0, a 2.4 trillion-parameter unified multimodal foundation model trained from scratch that jointly processes text, images, video, and audio within a single autoregressive framework. The release, announced at Baidu World 2025, marks the Chinese tech giant’s latest push to advance native multimodal AI capabilities as its ERNIE-powered AI assistant reached 200 million monthly active users.

The model is designed to overcome limitations of traditional late-fusion architectures by integrating multiple modalities into a single unified autoregressive system. According to Baidu’s official announcement, this approach enables seamless cross-modal understanding and generation, allowing more cohesive optimization across data types than models that process each modality separately before combining the results.
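To make the architectural distinction concrete, here is a minimal, self-contained sketch of the two approaches. The token IDs, modality tags, and toy "fusion" arithmetic are purely illustrative assumptions, not ERNIE 5.0's actual implementation:

```python
# Illustrative contrast: late fusion vs. a unified autoregressive stream.
# All values and tags below are hypothetical, not ERNIE's internals.

def late_fusion(text_feats, image_feats):
    """Late fusion: each modality is encoded separately, and the
    results are only merged at the very end."""
    text_summary = sum(text_feats) / len(text_feats)
    image_summary = sum(image_feats) / len(image_feats)
    return (text_summary + image_summary) / 2  # fusion happens only here

def unified_stream(text_tokens, image_tokens, audio_tokens=()):
    """Unified approach: all modalities are interleaved into one token
    sequence, so a single autoregressive model can attend across
    modalities at every prediction step."""
    stream = []
    for tok in text_tokens:
        stream.append(("text", tok))
    for tok in image_tokens:
        stream.append(("image", tok))
    for tok in audio_tokens:
        stream.append(("audio", tok))
    return stream  # one sequence -> one next-token objective

seq = unified_stream([101, 102], [7, 8], [3])
print(len(seq))  # 5 tokens in one shared sequence
```

The point of the sketch is the training signal: in the unified case a single next-token objective spans every modality, which is what enables the joint optimization the announcement describes.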

Technical Architecture and Training Challenges

ERNIE 5.0 was developed to address specific bottlenecks in scaling reinforcement learning to trillion-parameter multimodal systems. A technical report on arXiv details the ERNIE 5.0 RL Infrastructure, a disaggregated system designed to handle computational consistency, data distribution bias, and heterogeneous resource utilization through large-scale asynchronous training.
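The shape of such a disaggregated, asynchronous system can be sketched with standard library primitives: generation (rollout) workers produce samples while a separate trainer consumes them, so neither stage blocks on the other. The worker/trainer names and toy rewards are assumptions for illustration; Baidu's actual infrastructure is far more elaborate:

```python
# Toy sketch of disaggregated asynchronous RL training: rollout workers
# and a trainer run concurrently, coupled only through a queue.
# Names and the fake reward are illustrative assumptions, not Baidu's code.
import queue
import threading

sample_q = queue.Queue(maxsize=8)

def rollout_worker(worker_id, n_samples):
    # Generation stage: emit (worker, step, reward) samples asynchronously.
    for step in range(n_samples):
        sample_q.put((worker_id, step, step * 0.1))  # toy reward

def trainer(total):
    # Training stage: consume samples as they arrive rather than
    # waiting for a full synchronous batch from every worker.
    updates = 0
    for _ in range(total):
        sample_q.get()
        updates += 1
    return updates

workers = [threading.Thread(target=rollout_worker, args=(i, 4)) for i in range(2)]
for w in workers:
    w.start()
done = trainer(total=8)
for w in workers:
    w.join()
print(done)  # 8 samples consumed
```

Decoupling the two stages like this is what lets heterogeneous hardware (inference-optimized vs. training-optimized accelerators) stay busy independently, the resource-utilization problem the arXiv report says the infrastructure targets.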

The model jointly models text, images, audio, and videos for comprehensive multimodal understanding and generation. Baidu positioned ERNIE 5.0 as a foundational advancement in unified modeling that supports integrated data optimization across all four modalities.

User Growth and Strategic Context

The launch coincides with significant adoption of Baidu’s AI assistant, which has grown to 200 million monthly active users. This user milestone underscores the company’s progress in deploying large-scale AI models in consumer and enterprise applications within China’s competitive AI landscape.

Baidu has been iteratively developing its ERNIE series, with ERNIE 5.0 representing a substantial leap in both scale and architectural unification. The company is simultaneously ramping up its global push with a series of AI applications built on the new foundation model.

Impact on Developers and Industry

For developers, ERNIE 5.0 offers a unified multimodal framework that could simplify building applications requiring cross-modal reasoning and generation. The single autoregressive architecture potentially reduces the complexity of managing separate models for different data types and their interactions.
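As a hedged sketch of what that simplification could look like, a unified model exposes one entry point for mixed-modality requests instead of a registry of per-modality models. The class and method names below are hypothetical, not Baidu's published API:

```python
# Hypothetical developer-facing interface for a unified multimodal model.
# Class and method names are illustrative assumptions, not ERNIE's API.

class UnifiedMultimodalModel:
    def generate(self, parts):
        """Accept a list of (modality, payload) parts in one request,
        instead of routing each modality to a separate model."""
        modalities = sorted({m for m, _ in parts})
        return f"response conditioned on: {', '.join(modalities)}"

model = UnifiedMultimodalModel()
reply = model.generate([("text", "describe this"), ("image", b"\x89PNG")])
print(reply)  # response conditioned on: image, text
```

Under a single-model interface like this, cross-modal interactions are the model's job rather than glue code the developer writes between specialized models.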

The 2.4 trillion-parameter scale places ERNIE 5.0 among the largest publicly disclosed multimodal models, highlighting Baidu’s continued heavy investment in foundational AI research despite growing international competition from U.S. and other Chinese labs.

What's Next

Baidu has not yet disclosed specific availability details, pricing, or benchmark comparisons for ERNIE 5.0. The company is expected to provide additional technical documentation and application showcases in the coming weeks following the Baidu World 2025 event.

Industry observers will watch whether Baidu open-sources components of the model or makes it available through its cloud platform, as well as how the unified architecture performs against specialized late-fusion competitors on standard multimodal benchmarks. Further details on the training infrastructure and specific capabilities are likely to emerge as Baidu publishes more comprehensive reports.

The release reinforces China’s position in the global race toward larger and more integrated foundation models, with Baidu betting that native multimodal unification will deliver superior performance across understanding and generation tasks.

Original Source

yiyan.baidu.com
