Git Best Practices for ML Projects

Overview

ML projects pose version control challenges that ordinary software projects rarely face: large data and model files, experiment tracking, and model versioning. This guide covers how to handle each of them with Git, Git LFS, and DVC.

.gitignore for ML

# Data (DVC pointer files and DVC's own .gitignore stay trackable)
data/*
!data/*.dvc
!data/.gitignore
*.csv
*.parquet
*.json
!config.json

# Models
*.pt
*.pth
*.onnx
*.pkl
models/

# Checkpoints
checkpoints/
*.ckpt

# Logs
logs/
wandb/
mlruns/

# Environment
.venv/
__pycache__/
*.pyc

# Notebooks
.ipynb_checkpoints/

# IDE
.vscode/
.idea/

Git LFS for Large Files

# Enable Git LFS (the git-lfs package must already be installed)
git lfs install

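# Track large files
# Git skips ignored files, so remove these patterns from .gitignore
# (or stage the files with git add -f) if you version them with LFS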
git lfs track "*.pt"
git lfs track "*.onnx"
git lfs track "data/*.parquet"

# Commit .gitattributes
git add .gitattributes
git commit -m "Configure Git LFS"
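
After `git lfs track`, Git stores a small text pointer in place of each matching file, while the real content lives in LFS storage. As a rough illustration of what such a pointer contains, the sketch below rebuilds one in Python. The function name `lfs_pointer` is hypothetical, and this is not how Git LFS is implemented internally; it only follows the published pointer layout (version, oid, size).

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file content."""
    # LFS identifies objects by the SHA-256 of their content.
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    # Stand-in bytes for a model file such as model.pt (illustrative only).
    print(lfs_pointer(b"fake model weights"), end="")
```

The pointer stays three lines long whether the model is a few kilobytes or several gigabytes, which is why the repository remains small.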

DVC for Data Versioning

# Install DVC with S3 support (required for the s3:// remote used here)
pip install "dvc[s3]"

# Initialize
dvc init

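# Track data (DVC writes a small .dvc pointer file that Git versions)
dvc add data/training.csv
git add data/training.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"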

# Push to remote storage
dvc remote add -d storage s3://my-bucket/dvc
dvc push

# Pull data (for example, after cloning the repository on another machine)
dvc pull

Branching Strategy

main
├── develop
│   ├── feature/new-model
│   ├── feature/data-pipeline
│   └── experiment/bert-large
└── release/v1.0

Commit Messages

# Format: type(scope): description

feat(model): add BERT classifier
fix(data): handle missing values in preprocessing
exp(training): test learning rate 1e-4
docs(readme): add installation instructions
refactor(pipeline): simplify data loading
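
The format is easy to check mechanically. The following is a minimal sketch of such a check, assuming the scope is mandatory and that the allowed types are exactly the five used in the examples above; `ALLOWED_TYPES` and `is_valid_commit_message` are hypothetical names, not part of Git.

```python
import re

# Types taken from the examples above; extend to match your team's convention.
ALLOWED_TYPES = ("feat", "fix", "exp", "docs", "refactor")

COMMIT_RE = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"\((?P<scope>[a-z0-9_-]+)\): "
    r"(?P<description>\S.*)$"
)


def is_valid_commit_message(message: str) -> bool:
    """Return True if the first line follows type(scope): description."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    return COMMIT_RE.match(first_line) is not None


if __name__ == "__main__":
    for msg in ("feat(model): add BERT classifier", "added stuff"):
        print(msg, "->", is_valid_commit_message(msg))
```

A function like this could be called from a `commit-msg` hook, which Git invokes with the path to the file holding the proposed message.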

Experiment Tracking with Git

# Create experiment branch
git checkout -b experiment/lr-sweep-001

# Run experiment
python train.py --lr 0.001

# Commit results
git add results/
git commit -m "exp(training): lr=0.001, acc=0.92"

# Tag successful experiments
git tag -a exp-lr001-acc92 -m "Best LR experiment"
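
The `git add results/` step assumes that training leaves something small and text-based in `results/`. One way to do that is sketched below with hypothetical helpers `format_run` and `save_run`, which write hyperparameters and metrics as sorted `key: value` lines so that diffs between experiment branches stay readable. The file name `metrics.txt` is also an assumption; a plain `.txt` file is used because the `.gitignore` above ignores `*.json` and `*.csv`.

```python
from pathlib import Path


def format_run(params: dict, metrics: dict) -> str:
    """Render hyperparameters and metrics as sorted 'key: value' lines."""
    merged = {**params, **metrics}
    return "".join(f"{key}: {merged[key]}\n" for key in sorted(merged))


def save_run(results_dir: str, params: dict, metrics: dict) -> Path:
    """Write the formatted run summary to <results_dir>/metrics.txt."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "metrics.txt"  # hypothetical file name
    out_file.write_text(format_run(params, metrics), encoding="utf-8")
    return out_file


if __name__ == "__main__":
    # Values mirror the example experiment above.
    path = save_run("results", {"lr": 0.001}, {"acc": 0.92})
    print(path.read_text(encoding="utf-8"), end="")
```

Sorting the keys keeps the file stable across runs, so a change in one metric shows up as a one-line diff.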

Pre-commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-added-large-files
        args: ['--maxkb=1000']

# Install pre-commit and enable the hooks in this repository
pip install pre-commit
pre-commit install
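
To make the `--maxkb=1000` setting concrete, here is a simplified sketch of the kind of size check that `check-added-large-files` performs. It is an illustration only, not the hook's actual source: `find_large_files` is a hypothetical name, and the real hook also takes into account which files are staged.

```python
import os
import tempfile


def find_large_files(paths, max_kb=1000):
    """Return the paths whose size exceeds max_kb kilobytes (1 KB = 1024 bytes)."""
    limit_bytes = max_kb * 1024
    return [p for p in paths if os.path.getsize(p) > limit_bytes]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        small = os.path.join(tmp, "train.py")
        large = os.path.join(tmp, "model.pt")
        with open(small, "wb") as f:
            f.write(b"x" * 2_000)       # about 2 KB
        with open(large, "wb") as f:
            f.write(b"x" * 2_000_000)   # about 2 MB, over the 1000 KB limit
        flagged = find_large_files([small, large])
        print([os.path.basename(p) for p in flagged])
```

With the 1000 KB limit, a stray model checkpoint is blocked before it enters history, while source files pass untouched.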

Best Practices

Never commit raw data: Version it with DVC or Git LFS instead
Never commit secrets: Load them from environment variables
Small commits: One logical change per commit
Meaningful messages: Describe what changed and why
Branch per experiment: Makes results easy to compare and revert
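
For the secrets rule, a minimal sketch of reading a credential from the environment instead of hard-coding it is shown below. The helper name `get_secret` is hypothetical, and `WANDB_API_KEY` is used only as an example variable name; substitute whatever your tools expect.

```python
import os


def get_secret(name: str) -> str:
    """Read a required secret from the environment, failing loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


if __name__ == "__main__":
    try:
        api_key = get_secret("WANDB_API_KEY")
        print("API key loaded, length:", len(api_key))
    except RuntimeError as err:
        print(err)
```

Failing at startup when the variable is missing is a deliberate choice: it surfaces a configuration problem before a long training run begins.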

BlogIA Team