Video Understanding

Processing, analyzing, and extracting spatio-temporal features from video data

Overview

Unlike static image processing, video understanding must also model the temporal dimension, that is, how scene content changes from frame to frame. A central practical concern is processing long frame sequences efficiently, without running a large encoder on every single frame.
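
The idea of not running an encoder on every frame can be illustrated with a minimal sketch. It assumes a clip stored as a NumPy array of shape (T, H, W, C); `encode_sparsely`, `toy_encoder`, and the `stride` parameter are hypothetical names chosen for illustration, and the toy encoder is only a stand-in for an expensive model.

```python
import numpy as np

def encode_sparsely(frames, encoder, stride=4):
    """Run `encoder` on every `stride`-th frame and reuse its output in between."""
    features = []
    cached = None
    calls = 0
    for i, frame in enumerate(frames):
        if i % stride == 0:
            cached = encoder(frame)  # the expensive call happens only here
            calls += 1
        features.append(cached)      # other frames reuse the cached features
    return np.stack(features), calls

def toy_encoder(frame):
    # Stand-in for a large model: per-channel mean of an (H, W, C) frame.
    return frame.mean(axis=(0, 1))

video = np.random.default_rng(0).random((16, 32, 32, 3))
feats, calls = encode_sparsely(video, toy_encoder, stride=4)
print(feats.shape, calls)  # (16, 3) 4
```

A fixed stride is the simplest policy. It trades accuracy on fast-changing content for a proportional reduction in encoder calls.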

Key Domains

  • Action Recognition
  • Video Object Segmentation
  • Dense Video Tracking

Architectures

  • 3D CNNs (C3D, I3D)
  • Video Vision Transformers (ViViT)
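
As a rough illustration of how a video transformer such as ViViT turns a clip into tokens, the sketch below cuts a (T, H, W, C) array into non-overlapping spatio-temporal tubelets and flattens each one. The function name `tubelet_tokens` and the sizes `t=2`, `p=16` are assumptions for this example. The learned linear projection that normally follows is omitted.

```python
import numpy as np

def tubelet_tokens(video, t=2, p=16):
    """Split a (T, H, W, C) clip into non-overlapping t x p x p tubelets.

    Returns an array of shape (num_tokens, t * p * p * C).
    """
    T, H, W, C = video.shape
    if T % t or H % p or W % p:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    # Separate each axis into (number of blocks, block size).
    x = video.reshape(T // t, t, H // p, p, W // p, p, C)
    # Bring the block indices to the front: (nt, nh, nw, t, p, p, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t * p * p * C)

clip = np.zeros((8, 64, 64, 3), dtype=np.float32)
tokens = tubelet_tokens(clip)
print(tokens.shape)  # (64, 1536): 4 * 4 * 4 tokens, each 2 * 16 * 16 * 3 values
```

The token count grows with T * H * W, which is why the tubelet size directly controls attention cost in this family of models.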

Data Efficiency

  • Frame dropping and keyframe extraction
  • Temporal shift modules
  • Reusing optical flow or cached feature maps
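
Two of the items above can be sketched in a few lines of NumPy. The first function selects keyframes by simple frame differencing, which is one possible heuristic and not the only way to extract keyframes. The second applies a temporal shift to a (T, C, H, W) feature tensor, moving a fraction of the channels one step along the time axis with zero padding. The names `select_keyframes` and `temporal_shift` and the parameters `threshold` and `fold_div` are assumptions for illustration.

```python
import numpy as np

def select_keyframes(frames, threshold=0.1):
    """Keep frame 0, then every frame whose mean absolute difference
    from the last kept frame exceeds `threshold`."""
    keep = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i] - frames[keep[-1]]).mean()
        if diff > threshold:
            keep.append(i)
    return keep

def temporal_shift(x, fold_div=8):
    """Shift C // fold_div channels toward the past and the same number
    toward the future; x has shape (T, C, H, W)."""
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                  # receives the next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # receives the previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]             # remaining channels unchanged
    return out

# Three dark frames followed by three bright frames.
frames = np.concatenate([np.zeros((3, 8, 8)), np.ones((3, 8, 8))])
print(select_keyframes(frames))  # [0, 3]

# Feature value at (time t, channel c) is t * 8 + c.
feat = np.arange(4 * 8, dtype=np.float32).reshape(4, 8, 1, 1)
shifted = temporal_shift(feat, fold_div=8)
print(shifted[:, 0, 0, 0])  # channel 0 over time: 8, 16, 24, 0
```

The shift itself adds no parameters or multiplications. Temporal mixing happens when the following layer combines the shifted channels with the unshifted ones.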

TODO: Add notes on temporal modeling and efficient video parsing.