Embedding Types


Sparse, Dense, Quantized, Binary, Variable Dimensions, and Multi-Vector embeddings

Overview

Embeddings are numerical representations of data. The choice of embedding type strongly affects retrieval quality, storage requirements, and computational cost.

1. Sparse Embeddings

Sparse embeddings represent text by creating high-dimensional vectors where most values are exactly zero.

Mechanism: Often based on term frequency (like TF-IDF or BM25) or learned sparse representations (like SPLADE).
Pros: Excellent for exact keyword matching and domain-specific terminology (e.g., product codes, rare medical terms).
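Cons: Term-frequency methods struggle with semantic understanding (synonyms, paraphrasing). Learned sparse models like SPLADE partly mitigate this through term expansion.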
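
To make the idea concrete, here is a minimal TF-IDF sketch in plain Python. It stores each vector as a {term: weight} dictionary, so only non-zero entries take up space. The function names, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration; production systems usually rely on BM25 or a learned sparse model instead.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse vector per document as a {term: weight} dict."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption; many variants exist).
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors

def sparse_dot(a, b):
    """Dot product that only touches terms present in both vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[term] for term, weight in a.items() if term in b)

docs = [
    "error code E1234 on pump controller",
    "pump controller manual",
    "fast car",
    "quick automobile",
]
vecs = tfidf_vectors(docs)

keyword_overlap = sparse_dot(vecs[0], vecs[1])  # shares "pump" and "controller"
synonym_overlap = sparse_dot(vecs[2], vecs[3])  # no shared terms
```

The first pair scores above zero because the documents share literal terms, while "fast car" and "quick automobile" score zero despite meaning the same thing. That is the synonym weakness listed under Cons.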

2. Dense Embeddings

Dense embeddings represent text in lower-dimensional, continuous vector spaces (e.g., 768 or 1536 dimensions) where most values are non-zero.

Mechanism: Generated by transformer models (e.g., OpenAI text-embedding-3, BGE, E5).
Pros: Captures deep semantic meaning and context. Maps synonymous phrases closely together.
Cons: Cannot reliably perform exact keyword lookups. Requires more intensive vector distance calculations (Cosine/Dot Product).
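
The distance calculation mentioned above can be sketched as follows. The 4-dimensional vectors are made-up values standing in for real model output (which would have hundreds or thousands of dimensions), and the variable names are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real model would produce these.
car = [0.8, 0.1, 0.3, -0.2]
automobile = [0.7, 0.2, 0.35, -0.1]
banana = [-0.3, 0.9, -0.1, 0.4]

similar = cosine_similarity(car, automobile)
dissimilar = cosine_similarity(car, banana)
```

With these toy values, the "car" and "automobile" vectors point in nearly the same direction even though the words share no characters, while "banana" points elsewhere. Every dimension participates in every comparison, which is where the extra compute cost comes from.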

3. Quantized Embeddings

Quantization reduces the precision of the floating-point numbers in an embedding.

Mechanism: Converts standard FP32 (32-bit floating point) embeddings to FP16, INT8, or even INT4.
Pros: Drastically reduces memory footprint (e.g., INT8 is 4x smaller than FP32) and speeds up retrieval, usually with only a small loss in accuracy.
Cons: Slight degradation in retrieval precision. Requires a vector database that supports quantized indexing.
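
As an illustration, here is a sketch of symmetric scalar quantization to the INT8 range. This is one common scheme among several, not necessarily what any particular vector database does; the function names and the per-vector scale are assumptions.

```python
def quantize_int8(vector):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vector]
    return codes, scale

def dequantize_int8(codes, scale):
    """Approximate the original floats from the integer codes."""
    return [c * scale for c in codes]

vector = [0.12, -0.53, 0.98, 0.0, -0.07]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)

# Rounding error per dimension is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

Each code fits in one byte instead of four, and the reconstruction error per dimension stays within half of `scale`. That bounded error is the "slight degradation" noted under Cons.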

4. Binary Embeddings

The extreme form of quantization, where each dimension is reduced to a single bit (0 or 1).

Mechanism: Often created by taking the sign of dense embedding dimensions (positive -> 1, negative -> 0). Distance is computed using highly efficient Hamming Distance.
Pros: Massive storage reduction (32x smaller than FP32). Blazing fast similarity search using bitwise operations.
Cons: Noticeable drop in semantic nuance, often requiring a rescoring step using full-precision vectors for the top results.
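
The sign-based binarization and the rescoring step can be sketched together. This example packs bits into a Python integer for simplicity; real systems pack them into byte arrays and use hardware popcount instructions. Treating a value of exactly zero as bit 0 is an assumption, and all names and vectors are hypothetical.

```python
def binarize(vector):
    """Pack the sign of each dimension into the bits of one integer."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits

def hamming_distance(a, b):
    """Count the bit positions where two packed vectors differ."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, docs, shortlist_size=2):
    """Shortlist by Hamming distance, then rescore with full-precision vectors."""
    q_bits = binarize(query)
    shortlist = sorted(
        range(len(docs)),
        key=lambda i: hamming_distance(q_bits, binarize(docs[i])),
    )[:shortlist_size]
    return sorted(shortlist, key=lambda i: dot(query, docs[i]), reverse=True)

query = [0.9, -0.2, 0.4, -0.7]
docs = [
    [0.8, -0.1, 0.5, -0.6],   # close to the query
    [0.1, -0.9, 0.1, -0.1],   # same signs as the query, different magnitudes
    [-0.5, 0.3, -0.4, 0.6],   # opposite signs
]
ranking = search(query, docs)
```

Documents 0 and 1 have identical bit patterns, so Hamming distance cannot tell them apart. Only the full-precision rescoring ranks document 0 ahead, which shows why the rescoring pass matters.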

5. Variable Dimension (Matryoshka) Embeddings

Embeddings trained such that their early dimensions contain the most critical information, similar to nested Matryoshka dolls.

Mechanism: Models trained with Matryoshka Representation Learning (MRL) explicitly optimize a nested set of dimensions during training. For an embedding of size $d$, the loss is computed not just on the full $d$ dimensions but also on prefixes of size $d/2$, $d/4$, $d/8$, and so on. This forces the model to pack the most important semantic information into the earliest dimensions. As a result, you can truncate a 1536-dimensional vector to 256 dimensions with simple array slicing (vector[:256]) and keep a usable representation. Re-normalize the truncated vector to unit length before using cosine or dot-product similarity.
Pros: Highly flexible. It enables a “two-pass” retrieval strategy within the same embedding space: use truncated dimensions (e.g., 256) for a fast, cheap first-pass search to get top-K results, and then use the full dimensions (1536) to rerank those top-K results for high precision.
Cons: The truncated vector is slightly less accurate than a model that was natively and exclusively trained for that specific smaller dimension.
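
The two-pass strategy can be sketched in a few lines. The vectors below are made-up 4-dimensional values, not output from an MRL-trained model, and the function names are hypothetical. The sketch shows only the mechanics: slice, re-normalize, shortlist, then rerank with the full vectors.

```python
import math

def truncate_and_normalize(vector, dims):
    """Keep the first `dims` values and rescale them to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_pass_search(query, docs, short_dims, top_k):
    """Cheap first pass on truncated vectors, then rerank on full vectors."""
    q_short = truncate_and_normalize(query, short_dims)
    coarse = sorted(
        range(len(docs)),
        key=lambda i: dot(q_short, truncate_and_normalize(docs[i], short_dims)),
        reverse=True,
    )[:top_k]
    q_full = truncate_and_normalize(query, len(query))
    return sorted(
        coarse,
        key=lambda i: dot(q_full, truncate_and_normalize(docs[i], len(docs[i]))),
        reverse=True,
    )

query = [0.6, 0.8, 0.1, -0.1]
docs = [
    [0.6, 0.7, -0.5, 0.5],   # matches the query only in the leading dimensions
    [0.5, 0.8, 0.1, -0.1],   # matches the query in all dimensions
    [-0.7, 0.1, 0.2, 0.3],   # poor match
]
ranking = two_pass_search(query, docs, short_dims=2, top_k=2)
```

In this toy data, the first pass narrowly prefers document 0, and the full-dimension rerank moves document 1 to the top. In practice the first pass would run against an index of stored truncated vectors rather than truncating on the fly.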

6. Multi-Vector (Late Interaction) Embeddings

Instead of compressing an entire document into a single vector (which acts as an information bottleneck), multi-vector models generate an embedding for every single token in the text.

Mechanism: Models like ColBERT (Contextualized Late Interaction over BERT) generate a sequence of vectors for the query and a sequence of vectors for the document.

During search, the model scores a document with a MaxSim (Maximum Similarity) operation: for each query token, it finds the most similar document token, then sums these maximum similarities to get the final score.

Mathematically: $S(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} (q_i \cdot d_j)$, where $|q|$ and $|d|$ are the numbers of token vectors in the query and the document.

Pros: Extremely high accuracy, particularly for long-context retrieval, out-of-vocabulary terms, and complex queries, as token-level nuance and structure are preserved until the very end (“late interaction”).
Cons: Massive storage overhead (a 500-token document requires 500 vectors in the database instead of 1). Computationally expensive at scale, though some vector databases (such as Vespa and Qdrant) now support multi-vector MaxSim scoring natively.
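
The MaxSim scoring described above translates almost directly into code. This is a minimal sketch with made-up 2-dimensional token vectors; a real ColBERT-style model would produce one contextualized vector per token, and the names used here are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vectors, doc_vectors):
    """For each query token, take its best-matching document token, then sum."""
    return sum(
        max(dot(q, d) for d in doc_vectors)
        for q in query_vectors
    )

# Hypothetical token embeddings: two query tokens, three and two document tokens.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.9, 0.1], [0.7, 0.3]]

score_a = maxsim_score(query_vectors, doc_a)  # 0.9 + 0.8
score_b = maxsim_score(query_vectors, doc_b)  # 0.9 + 0.3
```

Document A scores higher because it has a strong match for both query tokens, while document B only covers the first. The nested loop also makes the cost visible: scoring takes one dot product per (query token, document token) pair.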