Image Captioning

WIP cv computer-vision image-captioning multimodal vision-language

Architectures for generating natural language descriptions of images

Overview

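Image captioning typically pairs an image encoder (such as a CNN or ViT) with a language decoder (such as a Transformer): the encoder turns the image into a set of visual features, and the decoder generates the caption token by token, conditioned on those features.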
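One common way to connect the encoder and decoder is cross-attention, where a decoder state queries the encoder's patch features. The sketch below is a minimal, single-head illustration in plain Python, not the mechanism of any specific model. It assumes toy 2-dimensional features and omits the learned query/key/value projections a real model would use. The names `cross_attend`, `patches`, and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, patch_features):
    """Scaled dot-product attention of one text-side query over image patches.

    Simplification (assumption): the patch features serve as both keys and
    values, and no learned projection matrices are applied.
    """
    d = len(query)
    # Similarity between the decoder query and every image patch.
    scores = [
        sum(q * k for q, k in zip(query, patch)) / math.sqrt(d)
        for patch in patch_features
    ]
    weights = softmax(scores)
    # Weighted average of the patch features: the visual context vector.
    context = [
        sum(w * patch[i] for w, patch in zip(weights, patch_features))
        for i in range(d)
    ]
    return context, weights


# Toy "encoder output": three image patches with 2-dimensional features.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy decoder state for the token currently being generated.
query = [2.0, 0.0]

context, weights = cross_attend(query, patches)
```

In a full decoder, the resulting `context` vector would be combined with the decoder's hidden state before predicting the next caption token. Here the query aligns most with the first and third patches, so they receive equal weight and the second patch receives less.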

Popular Architectures

BLIP / BLIP-2
LLaVA

TODO: Add details on cross-attention mechanisms and training objectives.