voyage-multimodal-3.5: Voyage AI Adds Video Support to Leading Multimodal Embeddings
SAN FRANCISCO — Voyage AI on Wednesday released voyage-multimodal-3.5, a next-generation multimodal embedding model that extends its predecessor’s interleaved text-and-image capabilities with native video support, achieving higher retrieval accuracy than offerings from Cohere, Google and Amazon.
The model builds directly on voyage-multimodal-3, which launched more than a year ago as the industry’s first production-grade multimodal embedding system capable of handling documents containing both text and visuals. voyage-multimodal-3.5 maintains the unified transformer architecture that eliminated the “modality gap” common in CLIP-style models while adding explicit video-frame processing and Matryoshka representation learning for flexible embedding dimensions.
According to Voyage AI’s official blog post, the new model delivers 4.56% higher retrieval accuracy than Cohere Embed v4 across 15 visual document retrieval datasets and 4.65% higher than Google’s Multimodal Embedding 001 across three video retrieval benchmarks. It also matches state-of-the-art text-only models on pure-text retrieval tasks.
Unified Architecture Eliminates Modality Gap
voyage-multimodal-3.5 processes both visual and textual inputs through a single transformer encoder rather than routing them through separate towers, as CLIP-based models do. This design choice, first introduced in voyage-multimodal-3, ensures that similarity in the embedding space reflects semantic meaning rather than input modality.
The modality gap problem in earlier multimodal models often caused text queries to preferentially retrieve other text documents over highly relevant images or screenshots simply because text embeddings clustered together in vector space. By using a shared backbone, Voyage AI’s models embed text, document screenshots, PDFs, figures, tables and now video frames into a common semantic space.
The company positions this unified approach as a key differentiator against competing multimodal embeddings from Cohere, Amazon and Google, which still rely on separate vision and language encoders in many cases.
Native Video Support With Practical Guidelines
A major addition in voyage-multimodal-3.5 is explicit support for video retrieval. Videos are handled as ordered sequences of frames and fed to the model as images. The model accepts up to 32k tokens, with every 1,120 pixels of video content counting as one token.
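The stated 1,120-pixels-per-token rate makes frame budgeting a simple arithmetic exercise. The sketch below (the helper names are ours, and it assumes a round 32,000-token window and a uniform frame resolution) estimates how many frames of a given size fit in one request:

```python
import math


def tokens_per_frame(width: int, height: int, pixels_per_token: int = 1_120) -> int:
    """Approximate token cost of one frame at the stated 1,120-pixels-per-token rate."""
    return math.ceil(width * height / pixels_per_token)


def max_frames(width: int, height: int, context_tokens: int = 32_000) -> int:
    """Number of frames of this resolution that fit in the model's context window."""
    return context_tokens // tokens_per_frame(width, height)
```

For example, at 640×360 each frame costs 206 tokens, so roughly 155 frames fit in a single request under these assumptions.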
Voyage AI provides detailed best practices for effective video embedding:
- Split long videos into semantically coherent scenes rather than arbitrary chunks.
- When transcripts are available from speech-to-text systems, align scene boundaries with natural breaks in spoken content.
- For scenes that still exceed the context window, reduce image resolution or frames-per-second to stay within token limits.
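The last guideline, reducing frames-per-second until a scene fits, can be sketched as a small loop. `fit_scene` is a hypothetical helper; the halving strategy and the 0.25 fps floor are our illustrative assumptions, not Voyage AI's recommendation:

```python
import math


def fit_scene(duration_s: float, fps: float, tokens_per_frame: int,
              budget: int = 32_000, min_fps: float = 0.25) -> float:
    """Halve the sampling rate until the scene's frames fit the token budget.

    Returns the fps to sample at; assumes every frame costs the same number
    of tokens (i.e., resolution is held constant).
    """
    while fps > min_fps and math.ceil(duration_s * fps) * tokens_per_frame > budget:
        fps /= 2
    return fps
```

A 10-minute scene sampled at 1 fps with 206-token frames would exceed the budget, so this sketch drops it to 0.25 fps; a 1-minute scene fits at the original rate. The same trade-off could instead be made by lowering resolution, per the guideline above.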
The company also released a code snippet demonstrating these techniques using its official voyageai Python package and a sample notebook with more advanced examples.
Matryoshka Embeddings and Quantization Options
voyage-multimodal-3.5 introduces Matryoshka representation learning, allowing users to request embeddings at 2048, 1024, 512 or 256 dimensions without retraining or maintaining multiple models. This flexibility enables significant storage and compute savings while preserving retrieval quality at lower dimensions.
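In practice you would request the desired dimension directly from the API, but the Matryoshka convention means a shorter embedding is simply a prefix of the full vector. A minimal client-side sketch (assuming, as is usual for Matryoshka embeddings, that truncated prefixes remain meaningful after renormalization):

```python
import math


def truncate_and_renormalize(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components of a Matryoshka embedding and rescale to unit norm."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Truncating a vector beginning `[3.0, 4.0, ...]` to 2 dimensions yields the unit vector `[0.6, 0.8]`, so cosine similarity can still be computed as a plain dot product at the reduced size.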
The model further supports multiple quantization formats — 32-bit floating point, signed and unsigned 8-bit integers, and binary — giving developers additional control over the trade-off between precision, storage requirements and latency.
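The API returns these formats directly, but the underlying idea can be shown in a few lines. This is an illustrative sketch of standard int8 and binary quantization, not Voyage AI's exact scheme; the `[-1, 1]` input range is an assumption:

```python
def to_int8(vec: list[float], scale: float = 127.0) -> list[int]:
    """Map components (assumed to lie roughly in [-1, 1]) to signed 8-bit integers."""
    return [max(-128, min(127, round(x * scale))) for x in vec]


def to_binary(vec: list[float]) -> list[int]:
    """Binary quantization: keep only the sign bit of each component."""
    return [1 if x >= 0.0 else 0 for x in vec]
```

Binary vectors cut storage by 32× relative to float32 and allow fast Hamming-distance comparison, at some cost in retrieval precision.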
Comprehensive Benchmark Results
Voyage AI evaluated the model across 18 multimodal datasets covering two primary tasks: visual document retrieval and video retrieval. The visual document retrieval benchmarks included ViDoRe, ViDoRe v2 and MIRACL-VISION. Video retrieval testing used MSR-VTT, YouCook2 and DiDeMo.
Additional evaluations on 38 standard text retrieval datasets across law, finance, conversation, code, web and technology domains confirmed that the multimodal model does not sacrifice text-only performance.
Performance highlights (as reported by Voyage AI):
- Visual document retrieval: Outperforms Google Multimodal Embedding 001 by 30.57%, Cohere Embed v4 by 2.26%, Amazon Nova 2 Multimodal by 8.38%, and the previous voyage-multimodal-3 by 3.03%.
- Video retrieval: 4.65% higher accuracy than Google Multimodal Embedding 001.
- Text retrieval: Matches leading specialized text embedding models.
These gains were measured with text queries against mixed-modality documents containing figures, photos, screenshots and video content.
Integration With MongoDB Atlas Vector Search
The release coincides with expanded availability through MongoDB. Voyage AI’s models, including the new multimodal offering, are being integrated into MongoDB Atlas Vector Search. Developers can sign up for Voyage AI directly or register interest in the Atlas Vector Search private preview to access voyage-multimodal-3.5.
MongoDB’s announcement emphasizes the model’s ability to enable semantic search over complex enterprise content including documents, images and video using natural language queries.
Impact on Developers and Enterprise Retrieval Systems
For developers building retrieval-augmented generation (RAG) applications, the new model simplifies pipelines that previously required maintaining separate embedding models for text, images and video. A single model can now handle mixed corpora containing PDFs with figures, presentation slides, screenshots and video recordings.
The Matryoshka and quantization features are particularly significant for production deployments. Organizations can start with full 2048-dimensional embeddings during development and reduce dimensionality in production to lower vector database storage costs and improve query latency without retraining retrieval systems.
Enterprise use cases expected to benefit include:
- Internal knowledge bases containing technical documentation with diagrams and embedded videos
- Customer support systems searching across video tutorials, PDFs and text articles
- Legal and financial document review involving both text contracts and supporting visual materials
- E-commerce product search combining description text, images and demonstration videos
The competitive improvements over Google, Cohere and Amazon embeddings may accelerate adoption among companies seeking best-in-class multimodal retrieval performance.
What’s Next
Voyage AI has not announced a specific timeline for additional multimodal capabilities or larger model variants. The company continues to iterate on its embedding family, having released both voyage-3.5 and voyage-3.5-lite text models alongside the multimodal update.
As video content proliferates across enterprise systems, the ability to perform accurate text-to-video semantic search without modality-specific workarounds represents a meaningful step toward unified multimodal retrieval infrastructure.
Developers can access voyage-multimodal-3.5 immediately through Voyage AI’s platform. Documentation, code samples and the evaluation methodology are available in the company’s blog post and accompanying technical notebook.
Sources
- voyage-multimodal-3.5: a new multimodal retrieval frontier with video support – Voyage AI
- Announcing New Models and Expanded Availability – Voyage AI
- Voyage AI by MongoDB on X
- Introducing Voyage-3.5 And Voyage-3.5-lite: Improved Quality For A New Retrieval Frontier | MongoDB
- Now Available: voyage-multimodal-3.5 | MongoDB

