Vision & Late Interaction RAG

ColBERT, ColPali, MUVERA, and Vision-based RAG (VisRAG) for multi-modal document retrieval

Late Interaction Models (ColBERT)

Traditional embeddings (Bi-Encoders) squash an entire document into a single vector. While fast, this loses fine-grained token-level detail. ColBERT introduces Late Interaction:

  • Mechanism: Both the query and the document are encoded into multiple vectors (one per token). Instead of computing a single dot product, ColBERT takes, for each query token, its maximum similarity over all document tokens (MaxSim), then sums these maxima over the query tokens to get the document's score.
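  • Benefit: More accurate than single-vector retrieval on complex queries, yet much faster than heavy Cross-Encoders, because document token vectors can be computed offline and only the lightweight MaxSim step runs at query time.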
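
The MaxSim scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than ColBERT's actual implementation: the function name maxsim_score is hypothetical, the 2-D "token embeddings" are hand-picked toy values, and rows are assumed to be L2-normalized so that dot products are cosine similarities.

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score between one query and one document.

    query_vecs: (n_query_tokens, dim)
    doc_vecs:   (n_doc_tokens, dim)
    Rows are assumed L2-normalized (an assumption of this sketch).
    """
    # Similarity of every query token to every document token.
    sim = query_vecs @ doc_vecs.T            # (n_query_tokens, n_doc_tokens)
    # For each query token, keep only its best-matching document token.
    best_per_query_token = sim.max(axis=1)   # (n_query_tokens,)
    # The document score is the sum of those per-token maxima.
    return float(best_per_query_token.sum())

# Toy embeddings (illustrative values, not model output).
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # matches both query tokens exactly
doc_b = np.array([[0.6, 0.8]])                           # one vector, partial match to both

print(maxsim_score(query, doc_a))
print(maxsim_score(query, doc_b))
```

doc_a scores higher because each query token finds an exact match among its token vectors, whereas doc_b's single vector can only partially match both query tokens at once. This is the detail a single pooled vector tends to blur.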

MUVERA (MUlti-VEctor Retrieval Algorithm)

While late interaction models (like ColBERT) are highly accurate, searching across many vectors per document is computationally expensive and does not map directly onto standard single-vector nearest-neighbor indexes (such as those built with FAISS).

  • Mechanism: MUVERA reduces multi-vector similarity search to single-vector similarity search. It asymmetrically generates Fixed Dimensional Encodings (FDEs) for queries and documents: one fixed-length vector per query or document, constructed so that the inner product of a query FDE and a document FDE approximates the multi-vector MaxSim score, with a provable bound on the approximation error.
  • Benefit: Allows you to use off-the-shelf Maximum Inner Product Search (MIPS) solvers for ColBERT-style retrieval.
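
To make the FDE idea concrete, here is a heavily simplified sketch. It assumes SimHash-style partitioning with random hyperplanes, sums query vectors per bucket, and averages document vectors per bucket, filling empty document buckets with the document vector whose bit pattern is nearest in Hamming distance. It uses a single repetition and no inner dimensionality-reducing projection, so it should be read as an illustration of the asymmetry, not as the reference algorithm. All function names are hypothetical.

```python
import numpy as np

def bucket_bits(x, planes):
    """SimHash sign bits for each row of x: boolean array of shape (n, k)."""
    return (x @ planes.T) > 0

def bits_to_ids(bits):
    """Pack k sign bits into an integer bucket id in [0, 2**k)."""
    weights = 1 << np.arange(bits.shape[1])
    return bits.astype(np.int64) @ weights

def query_fde(q_vecs, planes):
    """Query side: SUM the token vectors that fall into each bucket."""
    k, dim = planes.shape
    fde = np.zeros((2 ** k, dim))
    ids = bits_to_ids(bucket_bits(q_vecs, planes))
    np.add.at(fde, ids, q_vecs)              # unbuffered scatter-add per bucket
    return fde.ravel()

def doc_fde(d_vecs, planes):
    """Document side: AVERAGE per bucket; empty buckets borrow the document
    vector whose bit pattern is closest in Hamming distance."""
    k, dim = planes.shape
    n_buckets = 2 ** k
    bits = bucket_bits(d_vecs, planes)
    ids = bits_to_ids(bits)
    sums = np.zeros((n_buckets, dim))
    np.add.at(sums, ids, d_vecs)
    counts = np.bincount(ids, minlength=n_buckets)
    filled = counts > 0
    fde = np.zeros((n_buckets, dim))
    fde[filled] = sums[filled] / counts[filled, None]
    # Bit pattern of every bucket id, in the same bit order as bits_to_ids.
    all_bits = ((np.arange(n_buckets)[:, None] >> np.arange(k)) & 1).astype(bool)
    hamming = (all_bits[:, None, :] != bits[None, :, :]).sum(axis=-1)
    nearest = hamming.argmin(axis=1)         # nearest document vector per bucket
    fde[~filled] = d_vecs[nearest[~filled]]
    return fde.ravel()

rng = np.random.default_rng(0)
planes = rng.standard_normal((3, 8))         # k=3 hyperplanes -> 8 buckets, dim 8
Q = rng.standard_normal((5, 8))              # stand-in query token vectors
D = rng.standard_normal((6, 8))              # stand-in document token vectors

approx = query_fde(Q, planes) @ doc_fde(D, planes)   # single inner product
exact = (Q @ D.T).max(axis=1).sum()                  # true MaxSim score
print(approx, exact)
```

Both FDEs have the same fixed length (2**k * dim) regardless of how many token vectors went in, which is what lets an ordinary MIPS index store and search them. With random stand-in vectors and one repetition the approximation is rough; it is exact in the degenerate case where the document has a single vector.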

ColPali

ColPali applies the ColBERT late-interaction mechanism to Vision Language Models (VLMs) (specifically PaliGemma).

  • Visual Document Retrieval: Instead of relying on OCR to parse text out of PDFs/images, ColPali takes raw images of document pages and produces high-quality multi-vector embeddings of the visual patches.
  • Why it matters: It removes the brittle OCR/parsing step. The ColPali authors report that it outperforms text-based pipelines on their visual document retrieval benchmark (ViDoRe), because layouts, charts, and text are all read directly from the page image.
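
At query time, page ranking in this style reduces to a batched MaxSim between query-token vectors and each page's patch vectors. The sketch below assumes the embeddings have already been produced by the model; random unit vectors stand in for them, and rank_pages is a hypothetical helper, not part of any ColPali library.

```python
import numpy as np

def rank_pages(query_vecs, page_vecs, top_k=3):
    """Rank page images by late-interaction score.

    query_vecs: (n_query_tokens, dim)
    page_vecs:  (n_pages, n_patches, dim), one multi-vector embedding per page
    Returns (page indices, scores), best page first.
    """
    # sim[n, q, p] = similarity of query token q to patch p of page n.
    sim = np.einsum("qd,npd->nqp", query_vecs, page_vecs)
    # Best patch per query token, summed over query tokens.
    scores = sim.max(axis=2).sum(axis=1)     # (n_pages,)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

def unit(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
page_vecs = unit(rng.standard_normal((5, 32, 16)))   # 5 pages, 32 patches, dim 16 (toy sizes)
query_vecs = page_vecs[2, :4]                        # pretend the query matches 4 patches of page 2

order, scores = rank_pages(query_vecs, page_vecs)
print(order, scores)
```

Because the toy query is copied from patches of page 2, that page ranks first with a score of about 4 (one perfect match per query token). Real page embeddings contain far more patch vectors than this toy example, which is the storage and search cost that MUVERA-style encodings aim to reduce.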

VisRAG (Vision-based RAG)

VisRAG is a parsing-free RAG pipeline built entirely on VLMs.

  • The Problem: Traditional RAG converts PDFs to text, which can lose visual information such as complex layouts, figures, and charts.
  • The Solution: The VisRAG-Retriever retrieves documents as page images, using a VLM to embed both the query and the pages, and passes the retrieved images directly to a generative VLM to answer the query. This avoids the information loss introduced by text parsing.
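
The retrieve-then-generate flow can be sketched as follows. Everything here is an assumption made for illustration: the Page record, the vis_rag_answer function, and the two injected callables (embed_query, generate) are hypothetical stand-ins for a VLM retriever and a generative VLM, and each page is assumed to have one precomputed embedding vector.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple
import numpy as np

@dataclass
class Page:
    doc_id: str
    page_no: int
    image: Any               # the raw page image (opaque in this sketch)
    embedding: np.ndarray    # (dim,), assumed precomputed by the retriever VLM

def vis_rag_answer(
    question: str,
    pages: List[Page],
    embed_query: Callable[[str], np.ndarray],
    generate: Callable[[str, List[Any]], str],
    top_k: int = 2,
) -> Tuple[str, List[Page]]:
    """Retrieve the top_k page images by cosine similarity, then hand the
    images themselves (not parsed text) to the generator."""
    q = embed_query(question)
    q = q / np.linalg.norm(q)
    mat = np.stack([p.embedding / np.linalg.norm(p.embedding) for p in pages])
    scores = mat @ q
    top = np.argsort(-scores)[:top_k]
    retrieved = [pages[i] for i in top]
    answer = generate(question, [p.image for p in retrieved])
    return answer, retrieved

# Stubs standing in for real models (hypothetical, for demonstration only).
pages = [
    Page("report.pdf", 1, "<image 1>", np.array([1.0, 0.0])),
    Page("report.pdf", 2, "<image 2>", np.array([0.0, 1.0])),
    Page("slides.pdf", 1, "<image 3>", np.array([0.7, 0.7])),
]
fake_embed = lambda text: np.array([0.0, 1.0])
fake_generate = lambda question, images: f"answered from {len(images)} page image(s)"

answer, used = vis_rag_answer("What does the chart show?", pages, fake_embed, fake_generate)
print(answer, [(p.doc_id, p.page_no) for p in used])
```

The point of the design is that no OCR or layout parser appears anywhere in the path: the generator receives the same pixels the retriever indexed.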

M3DocVQA & M3DocRAG

M3DocRAG is a multi-modal RAG framework introduced together with the M3DocVQA benchmark.

  • The Problem: Previous Document VQA benchmarks asked questions based on a single document. Real enterprise RAG requires searching across thousands of multi-page documents where evidence is scattered.
  • The Benchmark: M3DocVQA is presented by its authors as the first benchmark for open-domain DocVQA, spanning 3,000+ PDF documents (40,000+ pages).
  • The Solution (M3DocRAG): The authors report that a pure visual-retrieval pipeline (e.g., ColPali for retrieval + Qwen2-VL for generation) significantly outperforms traditional text-based RAG (e.g., ColBERT + Llama 3) on document-rich corpora, which supports the shift towards Vision-based RAG architectures.

TODO: Add diagrams of the MaxSim computation in late interaction.