Overview
Image captioning requires fusing an image encoder (such as a CNN or ViT) with a language decoder (such as a Transformer), so that the decoder generates text conditioned on the encoder's visual features.
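One common way to do this fusing is cross-attention, where each text token in the decoder attends over the image patch features. The sketch below is a minimal single-head version in plain Python, written only to illustrate the idea. It assumes the text states and image features already share the same dimension and omits the learned query/key/value projections that real models use. The names `cross_attention`, `text_states`, and `image_features` are hypothetical and do not come from any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, image_features):
    """Single-head scaled dot-product cross-attention (illustrative only).

    text_states:    list of query vectors, one per text token
    image_features: list of vectors, one per image patch, used as both
                    keys and values (assumption: no learned projections)
    Returns (fused, weights): one context vector and one attention
    distribution over patches for each text token.
    """
    dim = len(image_features[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in text_states:
        # Similarity between this text token and every image patch.
        scores = [
            sum(q * k for q, k in zip(query, patch)) / scale
            for patch in image_features
        ]
        attn = softmax(scores)
        # Weighted average of patch features: the visual context for this token.
        context = [
            sum(w * patch[d] for w, patch in zip(attn, image_features))
            for d in range(dim)
        ]
        fused.append(context)
        weights.append(attn)
    return fused, weights


# Toy inputs: three image patches and two text tokens, all 2-dimensional.
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_states = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_attention(text_states, image_features)
```

Each row of `weights` is a probability distribution over the image patches, and each entry of `fused` is the visual context a decoder layer would combine with the corresponding token's hidden state. Not every architecture fuses this way: LLaVA, for example, projects visual features into the language model's token embedding space instead of adding cross-attention layers.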
Popular Architectures
- BLIP / BLIP-2
- LLaVA
TODO: Add details on cross-attention mechanisms and training objectives.