What Embedding Models Actually Do
An embedding model maps text to a dense vector in a high-dimensional space where semantically similar texts are geometrically close. The quality of that mapping determines retrieval quality: a model that maps 'termination clause' and 'contract termination provision' close together will retrieve the right passage; one that does not will miss it.
The key insight: embedding quality is task-specific. A model trained on general web text excels at general semantic similarity. A model trained on legal documents understands that 'indemnification' and 'hold harmless' are synonymous. If your corpus and queries have domain-specific vocabulary, general-purpose embeddings will underperform.
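To make "geometrically close" concrete, here is a minimal sketch of cosine similarity, the measure most vector databases use to compare embeddings. The three-dimensional vectors are made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", hand-picked so the two legal phrases point
# in a similar direction. These are NOT the output of any real model.
toy_vectors = {
    "termination clause": [0.9, 0.1, 0.2],
    "contract termination provision": [0.8, 0.2, 0.3],
    "quarterly revenue forecast": [0.1, 0.9, 0.4],
}

def rank_by_similarity(
    query: str, vectors: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Score every other text against the query, most similar first."""
    q = vectors[query]
    scored = [
        (text, cosine_similarity(q, v))
        for text, v in vectors.items()
        if text != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_by_similarity("termination clause", toy_vectors)
for text, score in ranking:
    print(f"{score:.3f}  {text}")
```

With these toy values the related legal phrase ranks first by a wide margin. A model that places the two phrases far apart would produce the opposite ordering, which is exactly the retrieval miss described above.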
The Embedding Model Landscape
- OpenAI text-embedding-3-small / large: Strong general-purpose baseline, easy API integration, 1536 and 3072 dimensions respectively. The large model is competitive with open-source alternatives. Managed, no infrastructure required.
- BGE-M3 (BAAI): State-of-the-art open-source model. Supports dense, sparse, and multi-vector retrieval in one model. Multilingual (100+ languages). Runs locally — essential for data residency requirements.
- E5-mistral-7b-instruct: Instruction-tuned embedding model. Outperforms most models on MTEB benchmarks, especially for asymmetric retrieval (short query, long document). Higher compute cost.
- Cohere Embed v3: Strong multilingual support, native input type distinction (query vs document), integrated reranking ecosystem. Managed API.
- Domain fine-tuned models: Models fine-tuned on domain-specific data consistently outperform general models in that domain. Worth considering when retrieval quality directly affects business outcomes.
Evaluating Retrieval Quality
The only way to know which embedding model is right for your use case is to measure retrieval quality on your actual data. The evaluation process:
1. Build an evaluation set: 50–200 query/relevant-document pairs. The query is what a real user would ask; the relevant document is the correct passage to retrieve. Use production queries if available; write synthetic ones otherwise.
2. Measure recall@k: for each query, retrieve the top-k chunks and check whether the relevant document appears. Recall@5 (is the right answer in the top 5?) is the most useful metric for RAG.
3. Measure MRR (Mean Reciprocal Rank): the average of 1/rank for the first relevant result. Unlike recall@k, it penalizes a model that ranks the correct answer at position 5 rather than position 1.
4. Compare models: run every candidate model through the same evaluation set and compare recall@5 and MRR. The difference between models on your domain data is often surprising.
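def evaluate_retrieval(
    eval_set: list[dict],  # [{"query": str, "relevant_doc_id": str}]
    retriever,             # any object with search(query, top_k) returning items that have .id
    k: int = 5,
) -> dict:
    """Compute recall@k and MRR (cut off at rank k) over an evaluation set."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one query")
    recall_hits = 0
    reciprocal_ranks = []
    for item in eval_set:
        results = retriever.search(item["query"], top_k=k)
        result_ids = [r.id for r in results]
        # Recall@k: did the relevant document appear anywhere in the top k?
        if item["relevant_doc_id"] in result_ids:
            recall_hits += 1
        # MRR: reciprocal of the 1-based rank of the relevant document,
        # or 0 when it is missing from the top k.
        try:
            rank = result_ids.index(item["relevant_doc_id"]) + 1
            reciprocal_ranks.append(1.0 / rank)
        except ValueError:
            reciprocal_ranks.append(0.0)
    return {
        f"recall@{k}": recall_hits / len(eval_set),
        "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks),
    }
Dimensionality: More Is Not Always Better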
Higher-dimensional embeddings can capture more nuance but have real costs: more storage, slower similarity search at scale, and higher memory usage in your vector database. The tradeoffs:
- 768 dimensions: typical for mid-size open-source models (BGE-base, E5-base). Good quality, efficient storage. Right for most production RAG systems.
- 1536 dimensions: OpenAI text-embedding-3-small. Higher quality ceiling, 2x storage vs 768. (BERT-large derived models typically sit in between, at 1024 dimensions.)
- 3072 dimensions: OpenAI text-embedding-3-large, some frontier models. Best quality for general text, but 4x storage cost vs 768. Diminishing returns for most domain-specific use cases.
- Matryoshka Representation Learning (MRL): models like text-embedding-3 support truncating to lower dimensions with graceful quality degradation. You can use 256 dimensions for coarse retrieval and 1536 for re-ranking without two separate models.
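The storage tradeoff is simple arithmetic, and MRL truncation is a slice followed by re-normalization. The sketch below shows both under stated assumptions: vectors are stored as float32 with no index overhead, and the embedding comes from a model that was actually trained with MRL (truncating an ordinary embedding this way is not expected to preserve quality). The function names are illustrative, not part of any library.

```python
import math

def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes).

    Assumes float32 values and ignores index structures and metadata,
    which add overhead in a real vector database.
    """
    return num_vectors * dims * bytes_per_value / 1e9

def truncate_and_normalize(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values, then rescale to unit length.

    Re-normalizing matters because the truncated vector is no longer unit
    length, which would distort dot-product similarity scores.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Storage for a hypothetical corpus of 10 million chunks.
for dims in (768, 1536, 3072):
    print(f"{dims} dims: {index_size_gb(10_000_000, dims):.2f} GB")

# Toy 3-d vector truncated to 2 dimensions.
short = truncate_and_normalize([3.0, 4.0, 12.0], 2)
```

For 10 million chunks this works out to roughly 31, 61, and 123 GB of raw vectors, before any index overhead. Managed APIs that support MRL may also accept a target dimension directly, in which case the truncation happens server-side.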
When to Fine-Tune Embeddings
Fine-tuning an embedding model on domain-specific data consistently delivers 5–20% retrieval improvement on that domain. The signal to pursue it: you have run the evaluation above, a general-purpose model is underperforming on your eval set, and the domain has genuinely specialized vocabulary that general training data does not cover well.
The training data you need: query/relevant-document pairs, ideally 1,000–10,000 examples. You can generate synthetic training pairs by prompting an LLM to write questions that would be answered by each document in your corpus — a technique called synthetic query generation.
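Synthetic query generation can be sketched as a loop over the corpus. The prompt wording, the `ask_llm` callable, and the output field names are all assumptions made for illustration; swap in whichever LLM client and schema you actually use. A stand-in function replaces the LLM so the sketch runs offline.

```python
from typing import Callable

# Hypothetical prompt; tune the wording for your domain.
PROMPT_TEMPLATE = (
    "Write one question that a user might ask and that is answered "
    "by the following passage.\n\nPassage:\n{passage}\n\nQuestion:"
)

def build_training_pairs(
    corpus: dict[str, str],         # doc_id -> passage text
    ask_llm: Callable[[str], str],  # hypothetical: prompt in, completion out
    queries_per_doc: int = 1,
) -> list[dict]:
    """Generate (query, positive passage) pairs, one LLM call per query."""
    pairs = []
    for doc_id, passage in corpus.items():
        prompt = PROMPT_TEMPLATE.format(passage=passage)
        for _ in range(queries_per_doc):
            query = ask_llm(prompt).strip()
            if query:  # skip empty completions
                pairs.append(
                    {"query": query, "relevant_doc_id": doc_id, "positive": passage}
                )
    return pairs

# Stand-in for a real LLM client, used only to make the example runnable.
def fake_llm(prompt: str) -> str:
    return "What does this passage say about termination?"

pairs = build_training_pairs(
    {"doc-1": "Either party may terminate with 30 days notice."},
    fake_llm,
)
```

Spot-check a sample of the generated questions by hand before training on them: generated queries tend to echo the passage wording more closely than real users do.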
- Use the sentence-transformers library for fine-tuning: it is well-documented, efficient, and supports all major base models.
- Fine-tune with MultipleNegativesRankingLoss on (query, positive_doc) pairs, where the other documents in the batch serve as negatives and explicit hard negatives are optional, or with TripletLoss on (query, positive_doc, negative_doc) triples.
- Start from BGE-M3 or E5-base, not from scratch, because the general semantic understanding transfers.
- Evaluate on a held-out eval set before and after fine-tuning to confirm improvement.
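To show what MultipleNegativesRankingLoss optimizes, here is a pure-Python sketch of the underlying idea: for each query, its paired document is the positive and every other document in the batch is treated as a negative. This is an illustration of the technique, not the sentence-transformers implementation; the scale factor of 20 is an assumption chosen to resemble that library's default, and the 2-d vectors are toy values.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_negatives_loss(
    query_vecs: list[list[float]],
    doc_vecs: list[list[float]],
    scale: float = 20.0,
) -> float:
    """Mean cross-entropy over the batch.

    doc_vecs[i] is the positive for query_vecs[i]; all other documents
    in the batch act as negatives for that query.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-probability assigned to the correct document.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]     # each query matches its own doc
misaligned_docs = [[0.0, 1.0], [1.0, 0.0]]  # each query matches the wrong doc

good = in_batch_negatives_loss(queries, aligned_docs)
bad = in_batch_negatives_loss(queries, misaligned_docs)
print(f"aligned: {good:.6f}  misaligned: {bad:.2f}")
```

The loss is near zero when each query is closest to its own document and large when it is closest to another one. This is why larger batches generally help with this loss: each query sees more negatives per step.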
Asymmetric Retrieval: Query vs Document Embeddings
In most RAG systems, queries are short (one sentence) and documents are long (multiple paragraphs). Some embedding models handle this asymmetry explicitly by using different representations for queries and documents. BGE v1.5 English models prepend 'Represent this sentence for searching relevant passages:' to queries and embed documents without any prefix; BGE-M3 needs no instruction at all. E5 models use 'query:' and 'passage:' prefixes. Using the wrong prefix, or omitting a required one, can silently degrade retrieval quality by 5–15%, so check the model card for the exact strings before indexing.
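One way to avoid prefix mistakes is to route every string through a small helper at both indexing and query time. The sketch below uses the E5-style 'query: ' and 'passage: ' prefixes; the helper names are illustrative, and other model families use different strings (or none), so treat the constants as something to confirm against the model card.

```python
# E5-style prefixes; other models may require different strings or none.
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "

def prepare_query(text: str, prefix: str = E5_QUERY_PREFIX) -> str:
    """Format a user query exactly as the embedding model expects."""
    return prefix + text.strip()

def prepare_passage(text: str, prefix: str = E5_PASSAGE_PREFIX) -> str:
    """Format a document chunk exactly as the embedding model expects."""
    return prefix + text.strip()

def prepare_corpus(passages: list[str]) -> list[str]:
    """Apply the passage prefix to every chunk before embedding."""
    return [prepare_passage(p) for p in passages]

query_text = prepare_query("  what is the notice period? ")
corpus_texts = prepare_corpus(
    ["Either party may terminate with 30 days notice.", "Fees are due quarterly."]
)
```

Keeping the prefixes in one place means a model swap only changes two constants, and the evaluation set can catch a wrong prefix before it reaches production.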