Visual Question Answering (VQA)

WIP cv computer-vision multimodal vision-language vqa

Answering natural language questions about visual content

Overview

A VQA model takes an image and a natural-language question as input and outputs a text answer. To answer correctly, the model has to align the words of the question with the relevant parts of the image and reason over both modalities jointly.
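To make the image-plus-question-to-answer flow concrete, here is a minimal toy sketch in plain Python. It is not a real VQA model: the feature vectors are hand-written stand-ins for what an image encoder and a text encoder would produce, and the function names (`attend`, `answer`), the three-dimensional feature layout, and the two-word answer vocabulary are all assumptions made for illustration. It shows one common pattern, question-guided attention over image regions followed by an elementwise fusion and a score over candidate answers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(question_vec, region_feats):
    """Weight each image region by its similarity to the question."""
    weights = softmax([dot(question_vec, r) for r in region_feats])
    dim = len(region_feats[0])
    attended = [
        sum(w * r[i] for w, r in zip(weights, region_feats))
        for i in range(dim)
    ]
    return attended, weights


def answer(question_vec, region_feats, answer_embeddings):
    """Fuse question and attended image features, then pick the best answer."""
    attended, weights = attend(question_vec, region_feats)
    fused = [q * v for q, v in zip(question_vec, attended)]  # elementwise fusion
    scores = {ans: dot(fused, emb) for ans, emb in answer_embeddings.items()}
    return max(scores, key=scores.get), weights


# Hypothetical features; dimensions stand for (ball, red, blue).
question_vec = [1.0, 0.5, 0.5]   # stand-in for "What color is the ball?"
region_feats = [
    [3.0, 2.0, 0.0],             # region 0: a red ball
    [0.0, 0.0, 2.0],             # region 1: blue sky
]
answer_embeddings = {"red": [0.0, 1.0, 0.0], "blue": [0.0, 0.0, 1.0]}

predicted, weights = answer(question_vec, region_feats, answer_embeddings)
print(predicted, [round(w, 3) for w in weights])
```

In this toy setup the question vector puts most of the attention weight on the ball region, so the fused representation favors the answer "red". A real system would replace the hand-written vectors with learned encoders and train the attention and answer scoring end to end.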

TODO: Add details on datasets (VQA v2, GQA) and reasoning mechanisms.