Building Production RAG Systems with Voyage 4’s Shared Embedding Space
Why this matters for builders
Voyage AI just released the Voyage 4 family, the first embedding models with a truly shared embedding space across four different sizes and capability levels. This dramatically changes the economics and flexibility of retrieval-augmented generation (RAG) and semantic search.
You can now index your entire document corpus once with the highest-accuracy model (voyage-4-large, a Mixture-of-Experts model that beats voyage-3-large while costing 40% less to serve) and then query with any lighter model in the family without re-embedding documents. This is called asymmetric retrieval.
The family includes:
- voyage-4-large: MoE flagship, new SOTA, 40% lower serving cost than comparable dense models
- voyage-4: approaches voyage-3-large quality at mid-size efficiency
- voyage-4-lite: approaches voyage-3.5 quality with far fewer parameters
- voyage-4-nano: open-weight (Apache 2.0 on Hugging Face), perfect for local dev and prototyping
All four models produce compatible embeddings in the same vector space. They also support Matryoshka Representation Learning (MRL) so you can choose 256, 512, 1024, or 2048 dimensions, plus multiple quantization levels (fp32, int8, uint8, binary, ubinary) with minimal quality loss.
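In practice, MRL means the leading dimensions of a larger vector already form a usable lower-dimensional embedding. Here is a minimal numpy sketch of client-side truncation (the renormalization step is a common convention for cosine similarity, not something the announcement specifies; in the API you would typically request the dimension directly instead):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and renormalize
    so cosine similarity stays meaningful."""
    truncated = vec[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Example: a 2048-dim embedding reduced to 512 dims
full = np.random.default_rng(0).normal(size=2048).astype(np.float32)
small = truncate_embedding(full, 512)
print(small.shape)  # (512,)
```

This is mainly useful for local experiments; the voyageai client exposes an `output_dimension` parameter on `embed` for models that support it, which does the same thing server-side.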
For builders shipping high-volume RAG agents, context-engineered memory systems, or semantic search at scale, this combination of accuracy, cost control, and deployment flexibility is a major unlock.
When to use Voyage 4
Use the Voyage 4 family when any of these conditions apply:
- High query volume where per-query embedding cost and latency matter
- You want maximum retrieval quality on documents but need cheap/fast queries
- You are iterating on accuracy vs cost and don’t want to re-embed the corpus every time
- You need to run embeddings locally during development then move to production with the same vector space
- You are building domain-specific retrieval (medical, code, finance, legal, technical docs, long-context, conversations)
The full process — from idea to shipped RAG system
Here’s a reliable, AI-assisted workflow that experienced vibe coders can follow to ship real systems using Voyage 4.
1. Define the goal and success metrics
Start by writing a one-paragraph spec. Example:
“Build a technical documentation semantic search system for our 12,000-page internal knowledge base. Documents should be indexed with maximum accuracy using voyage-4-large at 1024 dimensions. Queries should default to voyage-4-lite for cost/latency, with the ability to upgrade to voyage-4 or voyage-4-large via a feature flag. Target nDCG@10 > 0.78 on our internal eval set. Support Matryoshka dimension switching and int8 quantization. Use Pinecone or Qdrant as the vector store. Include hybrid search fallback.”
Capture:
- Expected query volume
- Acceptable latency (p95)
- Budget per 1k queries
- Quality target (use RTEB-style metrics or your own eval set)
- Domain (this affects which Voyage 4 asymmetric eval datasets are most relevant)
2. Shape the spec and write strong prompts for your coding assistant
Give your AI coding tool (Cursor, Claude, Windsurf, etc.) clear context:
```
We are building a RAG system using Voyage AI's new Voyage 4 embedding family.

Key property: all models share the same embedding space. This allows asymmetric retrieval:
- Index documents once with voyage-4-large (highest accuracy, MoE architecture)
- Query with voyage-4-lite or voyage-4-nano during development and early production
- Upgrade query model later without re-indexing

Available models:
- voyage-4-large (MoE, SOTA, 40% cheaper serving than dense equivalents)
- voyage-4
- voyage-4-lite
- voyage-4-nano (open weights, Apache 2.0)

All support 2048/1024/512/256 dimensions via Matryoshka learning and fp32/int8/uint8/binary quantization.

Task: Generate a complete Python service using LangChain + Pinecone that:
1. Embeds documents with voyage-4-large at 1024 dim, int8
2. Embeds queries with voyage-4-lite at 1024 dim, int8
3. Supports easy switching of query model via environment variable
4. Includes evaluation script using a small held-out query-document set
5. Has a FastAPI endpoint for search
```
3. Scaffold the project
Create this structure:
```
voyage4-rag/
├── src/
│   ├── embedder.py
│   ├── indexer.py
│   ├── retriever.py
│   ├── evaluator.py
│   └── config.py
├── api/
│   └── main.py
├── eval/
│   └── test_queries.jsonl
├── requirements.txt
├── .env.example
└── README.md
```
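The structure above can be created in one shot (assuming a bash-compatible shell for the brace expansion):

```shell
mkdir -p voyage4-rag/{src,api,eval}
touch voyage4-rag/src/{embedder,indexer,retriever,evaluator,config}.py
touch voyage4-rag/api/main.py voyage4-rag/eval/test_queries.jsonl
touch voyage4-rag/requirements.txt voyage4-rag/.env.example voyage4-rag/README.md
```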
Starter template for config.py (copy-paste ready)
```python
from pydantic_settings import BaseSettings
from typing import Literal


class Settings(BaseSettings):
    VOYAGE_API_KEY: str
    PINECONE_API_KEY: str
    INDEX_NAME: str = "voyage4-docs"

    # Asymmetric retrieval configuration
    DOCUMENT_MODEL: str = "voyage-4-large"
    QUERY_MODEL: str = "voyage-4-lite"  # change to voyage-4 or voyage-4-large when needed
    EMBEDDING_DIM: int = 1024
    QUANTIZATION: Literal["fp32", "int8", "binary"] = "int8"

    class Config:
        env_file = ".env"


settings = Settings()
```
4. Implement the embedder with proper error handling and model switching
Key file: src/embedder.py
```python
from voyageai import Client
from typing import List
import numpy as np


class VoyageEmbedder:
    def __init__(self, api_key: str):
        self.client = Client(api_key=api_key)

    def embed_documents(self, texts: List[str], model: str = "voyage-4-large",
                        input_type: str = "document") -> List[List[float]]:
        """Embed documents with the strongest model."""
        response = self.client.embed(
            texts=texts,
            model=model,
            input_type=input_type,
            truncation=True,
        )
        return response.embeddings

    def embed_query(self, text: str, model: str = "voyage-4-lite",
                    input_type: str = "query") -> List[float]:
        """Embed queries with a lighter model for latency/cost."""
        response = self.client.embed(
            texts=[text],
            model=model,
            input_type=input_type,
            truncation=True,
        )
        return response.embeddings[0]

    def embed_batch(self, texts: List[str], model: str, input_type: str) -> np.ndarray:
        embeddings = self.embed_documents(texts, model, input_type)
        return np.array(embeddings).astype(np.float32)
```
5. Index and retrieve with validation
Create indexer.py that chunks documents, embeds the chunks with the document model, and stores the embedding model name in each vector's metadata so you can always trace which model produced a given vector.
In retriever.py, always read the QUERY_MODEL from config and use the embed_query method. This gives you the asymmetric benefit.
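To see the asymmetric pattern end to end without standing up a vector store, here is a minimal in-memory sketch. Cosine similarity over numpy arrays stands in for Pinecone, and the embed functions are injected so the retrieval logic itself has no API dependency (in production you would pass the `VoyageEmbedder` methods):

```python
import numpy as np
from typing import Callable, List, Optional, Tuple


class InMemoryRetriever:
    """Asymmetric retrieval: documents and queries may be embedded by
    different models, as long as those models share one vector space."""

    def __init__(self,
                 embed_docs: Callable[[List[str]], np.ndarray],
                 embed_query: Callable[[str], np.ndarray]):
        self.embed_docs = embed_docs
        self.embed_query = embed_query
        self.docs: List[str] = []
        self.matrix: Optional[np.ndarray] = None

    def index(self, docs: List[str]) -> None:
        self.docs = docs
        vecs = self.embed_docs(docs)
        # Normalize rows once so a plain dot product equals cosine similarity
        self.matrix = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        q = self.embed_query(query)
        q = q / np.linalg.norm(q)
        scores = self.matrix @ q
        top = np.argsort(scores)[::-1][:k]
        return [(self.docs[i], float(scores[i])) for i in top]
```

Because the document and query embedders are separate arguments, swapping the query model never touches the indexed matrix, which is exactly the asymmetric benefit.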
6. Validate rigorously
Write an evaluator that:
- Uses a held-out set of query → relevant document pairs
- Computes recall@10, nDCG@10, MRR
- Runs the same evaluation with different query models (voyage-4-nano, voyage-4-lite, voyage-4, voyage-4-large)
- Compares quality vs cost/latency
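The three metrics are short enough to implement directly. A sketch, where `ranked_ids` is the retriever's output order for one query and `relevant` is that query's gold set (binary relevance assumed):

```python
import math
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of relevant documents found in the top k."""
    hits = sum(1 for d in ranked_ids[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked_ids: List[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: log-discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


def mrr(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant hit."""
    for i, d in enumerate(ranked_ids):
        if d in relevant:
            return 1.0 / (i + 1)
    return 0.0
```

Average each metric over all queries in the held-out set to get the corpus-level numbers for a given query model.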
Run this before and after switching query models. You should see that document quality from voyage-4-large carries over even when queries come from much smaller models.
Test quantization impact: compare 1024-dim int8 vs fp32 and 512-dim int8. The announcement states quality loss is minimal.
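A quick way to measure quantization impact on your own vectors is to round-trip them locally. This is a sketch of symmetric int8 quantization; the exact scheme Voyage applies server-side (via the `output_dtype` option) may differ, so treat it as a lower-bound sanity check:

```python
import numpy as np
from typing import Tuple


def quantize_int8(vec: np.ndarray) -> Tuple[np.ndarray, float]:
    """Symmetric int8 quantization: scale into [-127, 127] and round."""
    scale = float(np.max(np.abs(vec))) / 127.0 or 1.0  # avoid div-by-zero
    q = np.clip(np.round(vec / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


rng = np.random.default_rng(42)
v = rng.normal(size=1024).astype(np.float32)
q, s = quantize_int8(v)
v2 = dequantize(q, s)
cos = float(v @ v2 / (np.linalg.norm(v) * np.linalg.norm(v2)))
print(f"cosine(original, roundtrip) = {cos:.5f}")
```

If the round-trip cosine on your real embeddings stays close to 1, retrieval rankings will barely move; re-run your nDCG@10 evaluation to confirm.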
7. Ship safely
Production checklist:
- Start with voyage-4-lite for queries in production
- Monitor latency, cost, and nDCG in production (log a sample of queries and relevance)
- Set up a feature flag to switch the query model to voyage-4 or voyage-4-large instantly (no re-indexing required)
- Use the open-weight voyage-4-nano for local development and CI
- Store the document embedding model name as index metadata
- Add a fallback to hybrid search (BM25 + vector) for high-stakes queries
- Monitor vector database storage cost after applying quantization and Matryoshka dimension reduction
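The feature-flag item can be as simple as an environment variable read at request time. A sketch, reusing the QUERY_MODEL variable name from config.py (any real flag system works the same way):

```python
import os


def current_query_model(default: str = "voyage-4-lite") -> str:
    """Resolve the query-side model at request time.

    Because all Voyage 4 models share one embedding space, flipping this
    flag upgrades retrieval quality instantly, with no re-indexing."""
    return os.environ.get("QUERY_MODEL", default)
```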
Pitfalls and guardrails
- Do not use different Voyage 3 and Voyage 4 models together — they are not in the same embedding space.
- Always specify input_type="document" for docs and input_type="query" for queries. This matters even within the Voyage 4 family.
- Test asymmetric performance on your domain. Voyage evaluated asymmetric retrieval on medical, code, finance, legal, and technical-docs datasets, among others.
- Be careful with very short queries — smaller models may lose more quality here.
- When using voyage-4-nano locally, make sure your hardware can handle it before assuming production parity.
- Quantization to binary or ubinary cuts vector DB cost dramatically, but test quality carefully on your data.
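For the binary option, here is a sketch of how binary quantization and Hamming-distance scoring work. Sign-based binarization is the standard scheme; whether it matches Voyage's binary/ubinary packing exactly is an assumption, so validate against embeddings fetched with the binary output dtype:

```python
import numpy as np


def binarize(vecs: np.ndarray) -> np.ndarray:
    """1 bit per dimension: sign of each component, packed 8 dims per byte."""
    return np.packbits(vecs > 0, axis=-1)


def hamming_scores(query_bits: np.ndarray, doc_bits: np.ndarray) -> np.ndarray:
    """Lower Hamming distance = more similar."""
    xor = np.bitwise_xor(doc_bits, query_bits)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)


rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 1024)).astype(np.float32)
q = docs[2] + rng.normal(scale=0.1, size=1024)  # near-duplicate of doc 2
scores = hamming_scores(binarize(q[None, :]), binarize(docs))
print(int(np.argmin(scores)))  # doc 2 is the closest
```

At 1024 dimensions this stores 128 bytes per vector instead of 4 KB for fp32, a 32x reduction, which is why it is worth the extra quality testing.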
What to do next
After shipping the first version:
- Measure real production quality and cost
- Run an A/B test between voyage-4-lite and voyage-4 as query models
- Experiment with 512-dim int8 embeddings to further reduce storage cost
- Add automatic query model upgrading based on query difficulty detection (future work)
- Explore the MoE scaling paper Voyage published for a deeper understanding of voyage-4-large
The shared embedding space is the real game changer. It finally lets builders optimize the two sides of retrieval independently — accuracy on the (mostly static) document side, and speed/cost on the (high-volume) query side — without painful re-embedding cycles.
Sources
- Original announcement: https://blog.voyageai.com/2026/01/15/voyage-4/
- MoE technical deep-dive: https://blog.voyageai.com/2026/03/03/moe-voyage-4-large/
- MongoDB Voyage 4 integration announcement
- Voyage AI Hugging Face repository for voyage-4-nano

