The Problem
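Git is designed to track changes in plain-text source code. Committing large binary files (multi-gigabyte .csv datasets, image folders, or model weights such as .pt and .safetensors) bloats the repository, because every version of every file stays in the history that each clone must download. Clones and fetches become painfully slow, and pushes are rejected once a file exceeds GitHub’s per-file limit (GitHub blocks files larger than 100 MiB and warns above 50 MiB).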
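One way to catch the problem before it reaches the history is to scan the working tree for oversized files. The sketch below is a minimal illustration, not part of Git or GitHub tooling; the helper name find_large_files and the 100 MiB default threshold are assumptions chosen to match the limit mentioned above.

```python
import os

# Assumed threshold: 100 MiB, mirroring the per-file limit discussed above.
LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(root, limit=LIMIT_BYTES):
    """Return (path, size) pairs for files under root larger than limit bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store; only working-tree files matter here.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not follow symlinks to files elsewhere
            size = os.path.getsize(path)
            if size > limit:
                large.append((path, size))
    # Largest files first, so the worst offenders are easy to spot.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling find_large_files(".") from the repository root returns the files that would be better handled by one of the tools described next.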
Git LFS (Large File Storage)
An open-source Git extension, originally developed by GitHub, built specifically to handle large binary files.
- Mechanism: When you commit a tracked file, Git LFS stores a tiny text pointer in the Git history in place of the file’s contents. The binary itself goes into a local LFS cache and is uploaded to a separate LFS server when you push.
- Usage: Run git lfs track "*.pt" to ensure all PyTorch models are stored externally.
- Pros: Native to GitHub, seamless developer experience (standard git push/git pull commands work normally).
- Cons: Storage and bandwidth costs on GitHub can scale up quickly. Does not natively understand ML pipelines.
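To make the pointer mechanism concrete, the sketch below builds the text that Git LFS would store in place of a large file: a version line, the SHA-256 of the contents, and the size in bytes. It follows the published pointer format, but it is only an illustration; the real pointer is produced by the git lfs client, and the function name lfs_pointer_for is hypothetical.

```python
import hashlib


def lfs_pointer_for(path, chunk_size=1 << 20):
    """Build Git LFS style pointer text for the file at path."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte files never sit in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {size}\n"
    )
```

Whatever the size of the input file, the resulting pointer is three short lines, which is why the Git history stays small while the binary lives on the LFS server.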
DVC (Data Version Control)
An open-source tool purpose-built for machine learning projects. It makes ML models, data sets, and intermediate files shareable and reproducible.
- Mechanism: Similar to Git LFS, it replaces large files with small metadata pointers (.dvc files) that are tracked by Git. However, DVC is storage-agnostic. You configure DVC to push the massive files to your own cloud storage bucket (AWS S3, Google Cloud Storage, Azure Blob, or even a local NAS).
- Pipelines: DVC is more than just storage; it can track the execution graph. If your pipeline is Raw Data -> Cleaning Script -> Clean Data -> Training Script -> Model, DVC tracks the hashes of the inputs and scripts. If you run the pipeline again, DVC knows exactly which steps can be skipped because the inputs haven’t changed.
- Workflow:
  1. dvc add data/dataset.csv
  2. git add data/dataset.csv.dvc
  3. git commit -m "Add dataset"
  4. dvc push (pushes the binary to S3)
  5. git push (pushes the source code and the .dvc pointer to GitHub)
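The pipeline-skipping idea can be sketched in a few lines of Python. This is not DVC's implementation: DVC itself declares stages in dvc.yaml, records hashes in dvc.lock, and re-runs changed stages with dvc repro. The names run_stage and pipeline.lock.json below are hypothetical, and SHA-256 is used here purely for illustration.

```python
import hashlib
import json
import os


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def run_stage(name, deps, outs, action, lock_path="pipeline.lock.json"):
    """Run action() unless the stage is up to date. Returns True if it ran.

    A stage is up to date when every dependency hash matches the hash
    recorded after the previous run and every output file still exists.
    """
    lock = {}
    if os.path.exists(lock_path):
        with open(lock_path) as f:
            lock = json.load(f)

    current = {dep: file_hash(dep) for dep in deps}
    outputs_exist = all(os.path.exists(out) for out in outs)
    if lock.get(name) == current and outputs_exist:
        return False  # inputs unchanged: skip the stage

    action()
    # Record the dependency hashes so the next call can detect changes.
    lock[name] = current
    with open(lock_path, "w") as f:
        json.dump(lock, f, indent=2)
    return True
```

In the example pipeline, the cleaning stage would list the raw data and the cleaning script as deps and the clean data as its only output. Editing either dependency changes its hash and forces a re-run, while an untouched stage is skipped.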