Overview
Unlike static image processing, video understanding must model the temporal dimension, that is, how content changes from frame to frame. A practical concern is processing the frame sequence efficiently, without running a large encoder on every single frame.
Key Domains
- Action Recognition
- Video Object Segmentation
- Dense Video Tracking
Architectures
- 3D CNNs (C3D, I3D)
- Video Vision Transformers (ViViT)
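As an illustration of how a ViViT-style model can turn a clip into tokens, the sketch below cuts a video into non-overlapping spatio-temporal tubelets with NumPy. The function name extract_tubelets, the (T, H, W, C) layout, and the tubelet size are assumptions made for this example. A real model would follow this step with a learned linear projection and positional embeddings.

```python
import numpy as np

def extract_tubelets(video, size):
    """Split a (T, H, W, C) clip into flattened, non-overlapping tubelets.

    Hypothetical helper for illustration; `size` is (t, h, w).
    Returns an array of shape (num_tubelets, t * h * w * C).
    """
    T, H, W, C = video.shape
    t, h, w = size
    if T % t or H % h or W % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    # Split each of the three axes into (block index, offset within block).
    x = video.reshape(T // t, t, H // h, h, W // w, w, C)
    # Bring the block indices to the front, the within-block offsets to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per tubelet.
    return x.reshape(-1, t * h * w * C)

clip = np.zeros((8, 32, 32, 3), dtype=np.float32)
tokens = extract_tubelets(clip, (2, 16, 16))
print(tokens.shape)  # (16, 1536)
```

Each row covers 2 frames of a 16x16 patch, so a single token already carries short-range motion information, whereas a per-frame patch embedding would not.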
Data Efficiency
- Frame dropping and keyframe extraction
- Temporal shift modules
- Reusing optical flow or cached feature maps
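Two of the ideas above can be sketched in a few lines of NumPy. The first function keeps a frame only when it differs enough from the last kept frame, which is one simple way to do keyframe extraction. The second shifts a fraction of the channels one step along the time axis, in the spirit of a temporal shift module. The function names, the mean-absolute-difference criterion, the threshold, and the (T, C, H, W) layout are all assumptions chosen for illustration, not a reference implementation.

```python
import numpy as np

def select_keyframes(frames, threshold):
    """Return indices of frames that differ from the last kept frame.

    Hypothetical helper: `frames` has shape (T, ...) and the difference
    measure is the mean absolute pixel difference.
    """
    kept = [0]  # always keep the first frame
    last = frames[0].astype(np.float32)
    for i in range(1, len(frames)):
        cur = frames[i].astype(np.float32)
        if np.mean(np.abs(cur - last)) > threshold:
            kept.append(i)
            last = cur
    return kept

def temporal_shift(x, fold_div=8):
    """Shift 1/fold_div of the channels backward and forward in time.

    `x` has shape (T, C, H, W). Vacated positions are zero-filled and
    the remaining channels are copied unchanged.
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                  # pull features from the next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # pull features from the previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]             # untouched channels
    return out

frames = np.zeros((5, 2, 2), dtype=np.uint8)
frames[2:] = 10  # the scene changes at frame 2
print(select_keyframes(frames, threshold=1.0))  # [0, 2]
```

Running a heavy encoder only on the selected indices and reusing its output for the skipped frames is one way to combine keyframe extraction with cached feature maps. The shift itself adds no parameters or multiplications, which is why it is attractive as a cheap form of temporal mixing.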
TODO: Add notes on temporal modeling and efficient video parsing.