Image Captioning

Architectures for generating natural language descriptions of images

Overview

Image captioning pairs an image encoder (such as a CNN or ViT) with a language decoder (such as a Transformer) and fuses the two so that visual features can be translated into text. Models that follow this pattern include:

  • BLIP / BLIP-2
  • LLaVA
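
One common way to fuse the two components is cross-attention, where each decoder token state attends over the image encoder's patch features. The sketch below illustrates that idea only; it is not the exact mechanism of any model listed above. It assumes a single attention head, omits the learned query/key/value projections a real layer would have, and uses random vectors in place of real encoder and decoder outputs. The names cross_attention, image_features, and text_states are hypothetical, chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(text_states, image_features):
    """Single-head scaled dot-product cross-attention.

    Queries come from the text side; keys and values are the image
    features themselves (no learned projections, a simplification).
    Returns one fused vector and one attention row per text token.
    """
    dim = len(image_features[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in text_states:
        # Similarity of this token to every image patch.
        scores = [dot(query, key) / scale for key in image_features]
        attn = softmax(scores)
        # Weighted average of patch features, per dimension.
        mixed = [
            sum(w * feat[i] for w, feat in zip(attn, image_features))
            for i in range(dim)
        ]
        fused.append(mixed)
        weights.append(attn)
    return fused, weights


# Stand-in data: 6 image patches and 3 text tokens, 4 dimensions each.
random.seed(0)
DIM = 4
image_features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
text_states = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]

fused, attn = cross_attention(text_states, image_features)
print(len(fused), len(fused[0]), len(attn[0]))
```

Each row of attn is a probability distribution over the image patches, so it shows which regions a given token draws on. In practice this would be written with a tensor library and trained end to end with the rest of the model.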

TODO: Add details on cross-attention mechanisms and training objectives.