Overview
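Visual question answering (VQA) models take an image and a natural-language question as input and output a text answer. Doing this well requires aligning the visual and textual representations so the model can reason about the parts of the image the question refers to.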
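The input/output contract described above can be illustrated with a toy sketch. Nothing here is a real VQA model: the feature vectors, the answer vocabulary, the weights, and the function names `fuse` and `answer` are all hypothetical, and element-wise multiplication is just one simple way to combine the two modalities. In practice the features would come from learned image and text encoders.

```python
# Toy sketch of the VQA data flow: (image features, question features) -> answer.
# All values and names are hypothetical; no model is trained or loaded here.
from typing import List, Sequence

# Hypothetical fixed answer vocabulary (VQA is often framed as classification).
ANSWERS = ["yes", "no", "red", "blue"]


def fuse(image_feat: Sequence[float], question_feat: Sequence[float]) -> List[float]:
    """Combine the two modalities by element-wise multiplication."""
    if len(image_feat) != len(question_feat):
        raise ValueError("image and question features must share a dimension")
    return [i * q for i, q in zip(image_feat, question_feat)]


def answer(
    image_feat: Sequence[float],
    question_feat: Sequence[float],
    weights: Sequence[Sequence[float]],
) -> str:
    """Score each candidate answer on the fused features and return the best."""
    fused = fuse(image_feat, question_feat)
    scores = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return ANSWERS[best]


# Assumed feature layout: [redness, blueness, cat present].
IMAGE_FEAT = [0.9, 0.1, 0.0]      # stand-in for an image encoder output
QUESTION_FEAT = [1.0, 1.0, 0.0]   # stand-in for "What color is it?"
WEIGHTS = [
    [0.0, 0.0, 1.0],    # "yes"
    [0.0, 0.0, -1.0],   # "no"
    [1.0, 0.0, 0.0],    # "red"
    [0.0, 1.0, 0.0],    # "blue"
]

print(answer(IMAGE_FEAT, QUESTION_FEAT, WEIGHTS))  # prints "red"
```

The question vector acts as a gate here: it zeroes out the image dimensions that are irrelevant to the question, which is a crude stand-in for the alignment a real model has to learn.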
TODO: Add details on datasets (VQA v2, GQA) and reasoning mechanisms.
Answering natural language questions about visual content