Overview

ML projects pose version control challenges that ordinary codebases rarely face: large data and model files, experiment tracking, and model versioning. This guide covers Git practices for handling each of them.

.gitignore for ML

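# Data (ignore the contents of data/ but keep DVC pointer files trackable;
# a plain "data/" rule would make Git refuse data/*.dvc as well)
data/*
!data/*.dvc
!data/.gitignore
*.csv
*.parquet
*.json
!config.json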

# Models
*.pt
*.pth
*.onnx
*.pkl
models/

# Checkpoints
checkpoints/
*.ckpt

# Logs
logs/
wandb/
mlruns/

# Environment
.venv/
__pycache__/
*.pyc

# Notebooks
.ipynb_checkpoints/

# IDE
.vscode/
.idea/

Git LFS for Large Files

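# Set up Git LFS hooks and filters (requires the git-lfs package to be installed first)
git lfs install

# Track large files
# Note: Git skips anything matched by .gitignore, so remove these patterns
# from .gitignore (or use git add -f) before adding LFS-tracked files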
git lfs track "*.pt"
git lfs track "*.onnx"
git lfs track "data/*.parquet"

# Commit .gitattributes
git add .gitattributes
git commit -m "Configure Git LFS"
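
To make the mechanism concrete: after `git lfs track`, Git stores a small text pointer in place of each matching file, and the real bytes go to LFS storage. The sketch below rebuilds that pointer text (spec version line, SHA-256 object ID, size in bytes) for a local file. The function name `lfs_pointer_text` is a hypothetical helper for illustration, not part of Git LFS.

```python
import hashlib
from pathlib import Path

# Version line used by Git LFS pointer files.
LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def lfs_pointer_text(path: Path) -> str:
    """Return the pointer text Git LFS would be expected to store for `path`.

    Hypothetical helper for illustration; Git LFS generates pointers itself.
    """
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large model files never sit fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return (
        f"version {LFS_SPEC_URL}\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {size}\n"
    )
```

Comparing this text with the output of `git show HEAD:path/to/model.pt` is one way to check that a file was committed as a pointer and not as a raw binary.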

DVC for Data Versioning

# Install DVC with S3 support (needed for the s3:// remote used below)
pip install "dvc[s3]"

# Initialize
dvc init

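# Track data
dvc add data/training.csv

# Commit the pointer file DVC generates (make sure .gitignore does not exclude data/*.dvc)
git add data/training.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"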

# Push to remote storage
dvc remote add -d storage s3://my-bucket/dvc
dvc push

# Pull data
dvc pull

Branching Strategy

main
├── develop
│   ├── feature/new-model
│   ├── feature/data-pipeline
│   └── experiment/bert-large
└── release/v1.0

Commit Messages

# Format: type(scope): description

feat(model): add BERT classifier
fix(data): handle missing values in preprocessing
exp(training): test learning rate 1e-4
docs(readme): add installation instructions
refactor(pipeline): simplify data loading
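
The format above is easy to check automatically, for example from a `commit-msg` hook. The following is a minimal sketch, assuming the allowed types are exactly the five used in the examples and that the scope may be omitted. `ALLOWED_TYPES` and `parse_commit_subject` are hypothetical names chosen for illustration.

```python
import re

# Assumption: the set of types is taken from the examples above; extend as needed.
ALLOWED_TYPES = ("feat", "fix", "exp", "docs", "refactor")

# type, optional (scope), then ": " and a non-empty description.
COMMIT_RE = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9_-]+)\))?"
    r": (?P<description>\S.*)$"
)


def parse_commit_subject(subject: str):
    """Return the parts of a commit subject as a dict, or None if it is malformed."""
    match = COMMIT_RE.match(subject)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_commit_subject("feat(model): add BERT classifier"))
    print(parse_commit_subject("added stuff"))
```

The scope is optional here so that short subjects such as `exp: lr=0.001, acc=0.92` also pass.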

Experiment Tracking with Git

# Create experiment branch
git checkout -b experiment/lr-sweep-001

# Run experiment
python train.py --lr 0.001

# Commit results (files matching .gitignore patterns such as *.csv or *.json are skipped)
git add results/
git commit -m "exp: lr=0.001, acc=0.92"

# Tag successful experiments
git tag -a exp-lr001-acc92 -m "Best LR experiment"
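
Writing the commit message by hand invites typos in the numbers. As a rough sketch, a training script could write a small summary file into `results/` and return a matching commit subject. The function name `record_run` and the file name `summary.txt` are assumptions for illustration. Plain text is used because the `.gitignore` above excludes `*.json` and `*.csv`.

```python
from pathlib import Path


def record_run(results_dir, params, metrics):
    """Write key=value lines to results_dir/summary.txt and return a commit subject.

    Hypothetical helper: `params` and `metrics` are plain dicts supplied by
    the training script.
    """
    out = Path(results_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hyperparameters first, then metrics, in insertion order.
    lines = [f"{key}={value}" for key, value in {**params, **metrics}.items()]
    (out / "summary.txt").write_text("\n".join(lines) + "\n")
    return "exp: " + ", ".join(lines)


if __name__ == "__main__":
    subject = record_run("results", {"lr": 0.001}, {"acc": 0.92})
    print(subject)  # pass this string to `git commit -m`
```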

Pre-commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-added-large-files
        args: ['--maxkb=1000']

# Install pre-commit and enable the hooks
pip install pre-commit
pre-commit install
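
To find out which files would trip the `check-added-large-files` limit before committing, the check can be approximated in a few lines. This is a sketch of the idea, not the hook's actual implementation, and `find_large_files` is a hypothetical name. The `max_kb` default mirrors the `--maxkb=1000` setting above.

```python
from pathlib import Path


def find_large_files(paths, max_kb=1000):
    """Return the paths whose size exceeds max_kb kilobytes (1 KB = 1024 bytes here)."""
    limit_bytes = max_kb * 1024
    large = []
    for raw in paths:
        path = Path(raw)
        # Skip directories and paths that do not exist.
        if path.is_file() and path.stat().st_size > limit_bytes:
            large.append(path)
    return large


if __name__ == "__main__":
    for path in find_large_files(Path(".").rglob("*")):
        print(f"{path} is over the limit; consider Git LFS or DVC")
```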

Best Practices

  1. Never commit raw data to Git: Version it with DVC or Git LFS instead
  2. Never commit secrets: Load them from environment variables
  3. Small commits: One logical change per commit
  4. Meaningful messages: Describe what changed and why
  5. Branch per experiment: Runs stay easy to compare and revert
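
For the second practice, a minimal sketch of reading a secret from the environment so it never lands in the repository. The helper name `require_env` is hypothetical, and `WANDB_API_KEY` is only an example variable; substitute whatever your tooling expects.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, failing loudly if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running")
    return value


if __name__ == "__main__":
    # Example variable name only; nothing is written to disk or committed.
    api_key = require_env("WANDB_API_KEY")
    print("API key loaded, length:", len(api_key))
```

Failing at startup with a clear message tends to be easier to debug than an authentication error deep inside a training run.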

Key Resources