rohit.vision

Multimodal AI

Vision-Language Models, Multimodal architectures, and Video Understanding

1. Image Captioning (WIP)
   Architectures for generating natural language descriptions of images
2. Visual Question Answering (VQA) (WIP)
   Answering natural language questions about visual content
3. Video Understanding (WIP)
   Processing, analyzing, and extracting spatio-temporal features from video data
4. Unified Latent Architectures (WIP)
   Cross-domain competence through value-aligned latent representations without full model-based planning

© 2026 Rohit Kumar. rohit.vision