Building Production RAG Systems with Voyage 4’s Shared Embedding Space
Why this matters for builders
Voyage AI just released the Voyage 4 family, the first embedding models with a truly shared embedding space across four different sizes and capability levels. This dramatically changes the economics and flexibility of retrieval-augmented generation (RAG) and semantic search.
You can now index your entire document corpus once with the highest-accuracy model (voyage-4-large, a Mixture-of-Experts model that beats voyage-3-large while costing 40% less to serve) and then query with any lighter model in the family without re-embedding documents. This is called asymmetric retrieval.
The family includes:
- voyage-4-large: MoE flagship, new SOTA, 40% lower serving cost than comparable dense models
- voyage-4: approaches voyage-3-large quality at mid-size efficiency
- voyage-4-lite: approaches voyage-3.5 quality with far fewer parameters
- voyage-4-nano: open-weight (Apache 2.0 on Hugging Face), perfect for local dev and prototyping
All four models produce compatible embeddings in the same vector space. They also support Matryoshka Representation Learning (MRL) so you can choose 256, 512, 1024, or 2048 dimensions, plus multiple quantization levels (fp32, int8, uint8, binary, ubinary) with minimal quality loss.
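In practice, MRL means the leading dimensions of a larger vector already form a usable lower-dimensional embedding. Here is a minimal numpy sketch of client-side truncation (the renormalization step is a common convention for cosine similarity, not something the announcement specifies; in the API you would typically request the dimension directly instead):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and renormalize
    so cosine similarity stays meaningful."""
    truncated = vec[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Example: a 2048-dim embedding reduced to 512 dims
full = np.random.default_rng(0).normal(size=2048).astype(np.float32)
small = truncate_embedding(full, 512)
print(small.shape)  # (512,)
```

This is mainly useful for local experiments; the voyageai client exposes an `output_dimension` parameter on `embed` for models that support it, which does the same thing server-side.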
For builders shipping high-volume RAG agents, context-engineered memory systems, or semantic search at scale, this combination of accuracy, cost control, and deployment flexibility is a major unlock.
When to use Voyage 4
Use the Voyage 4 family when any of these conditions apply:
- High query volume where per-query embedding cost and latency matter
- You want maximum retrieval quality on documents but need cheap/fast queries
- You are iterating on accuracy vs cost and don’t want to re-embed the corpus every time
- You need to run embeddings locally during development then move to production with the same vector space
- You are building domain-specific retrieval (medical, code, finance, legal, technical docs, long-context, conversations)
The full process — from idea to shipped RAG system
Here’s a reliable, AI-assisted workflow that experienced vibe coders can follow to ship real systems using Voyage 4.
1. Define the goal and success metrics
Start by writing a one-paragraph spec. Example:
“Build a technical documentation semantic search system for our 12,000-page internal knowledge base. Documents should be indexed with maximum accuracy using voyage-4-large at 1024 dimensions. Queries should default to voyage-4-lite for cost/latency, with the ability to upgrade to voyage-4 or voyage-4-large via a feature flag. Target nDCG@10 > 0.78 on our internal eval set. Support Matryoshka dimension switching and int8 quantization. Use Pinecone or Qdrant as the vector store. Include hybrid search fallback.”
Capture:
- Expected query volume
- Acceptable latency (p95)
- Budget per 1k queries
- Quality target (use RTEB-style metrics or your own eval set)
- Domain (this affects which Voyage 4 asymmetric eval datasets are most relevant)
2. Shape the spec and write strong prompts for your coding assistant
Give your AI coding tool (Cursor, Claude, Windsurf, etc.) clear context:
```
We are building a RAG system using Voyage AI's new Voyage 4 embedding family.

Key property: all models share the same embedding space. This allows asymmetric retrieval:
- Index documents once with voyage-4-large (highest accuracy, MoE architecture)
- Query with voyage-4-lite or voyage-4-nano during development and early production
- Upgrade query model later without re-indexing

Available models:
- voyage-4-large (MoE, SOTA, 40% cheaper serving than dense equivalents)
- voyage-4
- voyage-4-lite
- voyage-4-nano (open weights, Apache 2.0)

All support 2048/1024/512/256 dimensions via Matryoshka learning and fp32/int8/uint8/binary quantization.

Task: Generate a complete Python service using LangChain + Pinecone that:
1. Embeds documents with voyage-4-large at 1024 dim, int8
2. Embeds queries with voyage-4-lite at 1024 dim, int8
3. Supports easy switching of query model via environment variable
4. Includes evaluation script using a small held-out query-document set
5. Has a FastAPI endpoint for search
```
3. Scaffold the project
Create this structure:
```
voyage4-rag/
├── src/
│   ├── embedder.py
│   ├── indexer.py
│   ├── retriever.py
│   ├── evaluator.py
│   └── config.py
├── api/
│   └── main.py
├── eval/
│   └── test_queries.jsonl
├── requirements.txt
├── .env.example
└── README.md
```
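The structure above can be created in one shot (assuming a bash-compatible shell for the brace expansion):

```shell
mkdir -p voyage4-rag/{src,api,eval}
touch voyage4-rag/src/{embedder,indexer,retriever,evaluator,config}.py
touch voyage4-rag/api/main.py voyage4-rag/eval/test_queries.jsonl
touch voyage4-rag/requirements.txt voyage4-rag/.env.example voyage4-rag/README.md
```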
Starter template for config.py (copy-paste ready)
```python
from pydantic_settings import BaseSettings
from typing import Literal


class Settings(BaseSettings):
    VOYAGE_API_KEY: str
    PINECONE_API_KEY: str
    INDEX_NAME: str = "voyage4-docs"

    # Asymmetric retrieval configuration
    DOCUMENT_MODEL: str = "voyage-4-large"
    QUERY_MODEL: str = "voyage-4-lite"  # change to voyage-4 or voyage-4-large when needed
    EMBEDDING_DIM: int = 1024
    QUANTIZATION: Literal["fp32", "int8", "binary"] = "int8"

    class Config:
        env_file = ".env"


settings = Settings()
```
4. Implement the embedder with proper error handling and model switching
Key file: src/embedder.py
```python
from voyageai import Client
from typing import List
import numpy as np


class VoyageEmbedder:
    def __init__(self, api_key: str):
        self.client = Client(api_key=api_key)

    def embed_documents(self, texts: List[str], model: str = "voyage-4-large",
                        input_type: str = "document") -> List[List[float]]:
        """Embed documents with the strongest model."""
        response = self.client.embed(
            texts=texts,
            model=model,
            input_type=input_type,
            truncation=True,
        )
        return response.embeddings

    def embed_query(self, text: str, model: str = "voyage-4-lite",
                    input_type: str = "query") -> List[float]:
        """Embed queries with a lighter model for latency/cost."""
        response = self.client.embed(
            texts=[text],
            model=model,
            input_type=input_type,
            truncation=True,
        )
        return response.embeddings[0]

    def embed_batch(self, texts: List[str], model: str, input_type: str) -> np.ndarray:
        embeddings = self.embed_documents(texts, model, input_type)
        return np.array(embeddings).astype(np.float32)
```
5. Index and retrieve with validation
Create indexer.py that chunks documents, embeds the chunks with the document model, and stores the embedding model name in each vector's metadata so you can always trace which model produced a given vector.
In retriever.py, always read the QUERY_MODEL from config and use the embed_query method. This gives you the asymmetric benefit.
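To see the asymmetric pattern end to end without standing up a vector store, here is a minimal in-memory sketch. Cosine similarity over numpy arrays stands in for Pinecone, and the embed functions are injected so the retrieval logic itself has no API dependency (in production you would pass the `VoyageEmbedder` methods):

```python
import numpy as np
from typing import Callable, List, Optional, Tuple


class InMemoryRetriever:
    """Asymmetric retrieval: documents and queries may be embedded by
    different models, as long as those models share one vector space."""

    def __init__(self,
                 embed_docs: Callable[[List[str]], np.ndarray],
                 embed_query: Callable[[str], np.ndarray]):
        self.embed_docs = embed_docs
        self.embed_query = embed_query
        self.docs: List[str] = []
        self.matrix: Optional[np.ndarray] = None

    def index(self, docs: List[str]) -> None:
        self.docs = docs
        vecs = self.embed_docs(docs)
        # Normalize rows once so a plain dot product equals cosine similarity
        self.matrix = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        q = self.embed_query(query)
        q = q / np.linalg.norm(q)
        scores = self.matrix @ q
        top = np.argsort(scores)[::-1][:k]
        return [(self.docs[i], float(scores[i])) for i in top]
```

Because the document and query embedders are separate arguments, swapping the query model never touches the indexed matrix, which is exactly the asymmetric benefit.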
6. Validate rigorously
Write an evaluator that:
- Uses a held-out set of query → relevant document pairs
- Computes recall@10, nDCG@10, MRR
- Runs the same evaluation with different query models (voyage-4-nano, voyage-4-lite, voyage-4, voyage-4-large)
- Compares quality vs cost/latency
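The three metrics are short enough to implement directly. A sketch, where `ranked_ids` is the retriever's output order for one query and `relevant` is that query's gold set (binary relevance assumed):

```python
import math
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of relevant documents found in the top k."""
    hits = sum(1 for d in ranked_ids[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked_ids: List[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: log-discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


def mrr(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant hit."""
    for i, d in enumerate(ranked_ids):
        if d in relevant:
            return 1.0 / (i + 1)
    return 0.0
```

Average each metric over all queries in the held-out set to get the corpus-level numbers for a given query model.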
Run this before and after switching query models. You should see that document quality from voyage-4-large carries over even when queries come from much smaller models.
Test quantization impact: compare 1024-dim int8 vs fp32 and 512-dim int8. The announcement states quality loss is minimal.
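A quick way to measure quantization impact on your own vectors is to round-trip them locally. This is a sketch of symmetric int8 quantization; the exact scheme Voyage applies server-side (via the `output_dtype` option) may differ, so treat it as a lower-bound sanity check:

```python
import numpy as np
from typing import Tuple


def quantize_int8(vec: np.ndarray) -> Tuple[np.ndarray, float]:
    """Symmetric int8 quantization: scale into [-127, 127] and round."""
    scale = float(np.max(np.abs(vec))) / 127.0 or 1.0  # avoid div-by-zero
    q = np.clip(np.round(vec / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


rng = np.random.default_rng(42)
v = rng.normal(size=1024).astype(np.float32)
q, s = quantize_int8(v)
v2 = dequantize(q, s)
cos = float(v @ v2 / (np.linalg.norm(v) * np.linalg.norm(v2)))
print(f"cosine(original, roundtrip) = {cos:.5f}")
```

If the round-trip cosine on your real embeddings stays close to 1, retrieval rankings will barely move; re-run your nDCG@10 evaluation to confirm.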
7. Ship safely
Production checklist:
- Start with voyage-4-lite for queries in production
- Monitor latency, cost, and nDCG in production (log a sample of queries and relevance)
- Set up a feature flag to switch the query model to voyage-4 or voyage-4-large instantly (no re-indexing required)
- Use the open-weight voyage-4-nano for local development and CI
- Store the document embedding model name as index metadata
- Add a fallback to hybrid search (BM25 + vector) for high-stakes queries
- Monitor vector database storage cost after applying quantization and Matryoshka dimension reduction
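The feature-flag item can be as simple as an environment variable read at request time. A sketch, reusing the QUERY_MODEL variable name from config.py (any real flag system works the same way):

```python
import os


def current_query_model(default: str = "voyage-4-lite") -> str:
    """Resolve the query-side model at request time.

    Because all Voyage 4 models share one embedding space, flipping this
    flag upgrades retrieval quality instantly, with no re-indexing."""
    return os.environ.get("QUERY_MODEL", default)
```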
Pitfalls and guardrails
- Do not use different Voyage 3 and Voyage 4 models together — they are not in the same embedding space.
- Always specify input_type="document" for docs and input_type="query" for queries. This matters even within the Voyage 4 family.
- Test asymmetric performance on your domain. Voyage evaluated asymmetric retrieval on medical, code, finance, legal, and technical-docs datasets, among others.
- Be careful with very short queries — smaller models may lose more quality here.
- When using voyage-4-nano locally, make sure your hardware can handle it before assuming production parity.
- Quantization to binary or ubinary cuts vector DB cost dramatically, but test quality carefully on your data.
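For the binary option, here is a sketch of how binary quantization and Hamming-distance scoring work. Sign-based binarization is the standard scheme; whether it matches Voyage's binary/ubinary packing exactly is an assumption, so validate against embeddings fetched with the binary output dtype:

```python
import numpy as np


def binarize(vecs: np.ndarray) -> np.ndarray:
    """1 bit per dimension: sign of each component, packed 8 dims per byte."""
    return np.packbits(vecs > 0, axis=-1)


def hamming_scores(query_bits: np.ndarray, doc_bits: np.ndarray) -> np.ndarray:
    """Lower Hamming distance = more similar."""
    xor = np.bitwise_xor(doc_bits, query_bits)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)


rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 1024)).astype(np.float32)
q = docs[2] + rng.normal(scale=0.1, size=1024)  # near-duplicate of doc 2
scores = hamming_scores(binarize(q[None, :]), binarize(docs))
print(int(np.argmin(scores)))  # doc 2 is the closest
```

At 1024 dimensions this stores 128 bytes per vector instead of 4 KB for fp32, a 32x reduction, which is why it is worth the extra quality testing.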
What to do next
After shipping the first version:
- Measure real production quality and cost
- Run an A/B test between voyage-4-lite and voyage-4 as query models
- Experiment with 512-dim int8 embeddings to further reduce storage cost
- Add automatic query model upgrading based on query difficulty detection (future work)
- Explore the MoE scaling paper Voyage published for a deeper understanding of voyage-4-large
The shared embedding space is the real game changer. It finally lets builders optimize the two sides of retrieval independently — accuracy on the (mostly static) document side, and speed/cost on the (high-volume) query side — without painful re-embedding cycles.
Sources
- Original announcement: https://blog.voyageai.com/2026/01/15/voyage-4/
- MoE technical deep-dive: https://blog.voyageai.com/2026/03/03/moe-voyage-4-large/
- MongoDB Voyage 4 integration announcement
- Voyage AI Hugging Face repository for voyage-4-nano

