Git Best Practices for ML Projects

Overview

ML projects pose version control challenges that ordinary code repositories rarely face: large data and model files, experiment tracking, and model versioning. This guide covers Git practices that address them.

.gitignore for ML

    # Data (data/* rather than data/, so DVC pointer files can be re-included)
    data/*
    !data/*.dvc
    !data/.gitignore
    *.csv
    *.parquet
    *.json
    !config.json

    # Models
    *.pt
    *.pth
    *.onnx
    *.pkl
    models/

    # Checkpoints
    checkpoints/
    *.ckpt

    # Logs
    logs/
    wandb/
    mlruns/

    # Environment
    .venv/
    __pycache__/
    *.pyc

    # Notebooks
    .ipynb_checkpoints/

    # IDE
    .vscode/
    .idea/

Git LFS for Large Files

    # Install Git LFS
    git lfs install

    # Track large files
    git lfs track "*.pt"
    git lfs track "*.onnx"
    git lfs track "data/*.parquet"

    # Commit .gitattributes
    git add .gitattributes
    git commit -m "Configure Git LFS"

The .gitignore above also ignores *.pt, *.onnx, and *.parquet. Git does not stage ignored files with a plain git add, so remove the matching patterns from .gitignore for any file type you track with LFS.

DVC for Data Versioning

    # Install DVC with S3 support (needed for the s3:// remote below)
    pip install "dvc[s3]"

    # Initialize
    dvc init

    # Track data
    dvc add data/training.csv

    # Commit the pointer file that DVC generates
    git add data/training.csv.dvc data/.gitignore
    git commit -m "Track training data with DVC"

    # Push to remote storage
    dvc remote add -d storage s3://my-bucket/dvc
    dvc push

    # Pull data
    dvc pull

Branching Strategy

    main
    ├── develop
    │   ├── feature/new-model
    │   ├── feature/data-pipeline
    │   └── experiment/bert-large
    └── release/v1.0

Commit Messages

    # Format: type(scope): description
    feat(model): add BERT classifier
    fix(data): handle missing values in preprocessing
    exp(training): test learning rate 1e-4
    docs(readme): add installation instructions
    refactor(pipeline): simplify data loading

Experiment Tracking with Git

    # Create experiment branch
    git checkout -b experiment/lr-sweep-001

    # Run experiment
    python train.py --lr 0.001

    # Commit results
    git add results/
    git commit -m "exp: lr=0.001, acc=0.92"

    # Tag successful experiments
    git tag -a exp-lr001-acc92 -m "Best LR experiment"

Pre-commit Hooks

    # .pre-commit-config.yaml
    repos:
      - repo: https://github.com/astral-sh/ruff-pre-commit
        rev: v0.1.6
        hooks:
          - id: ruff
            args: [--fix]
          - id: ruff-format
      - repo: https://github.com/pre-commit/pre-commit-hooks
        rev: v4.5.0
        hooks:
          - id: check-yaml
          - id: end-of-file-fixer
          - id: trailing-whitespace
          - id: check-added-large-files
            args: ['--maxkb=1000']

Install the tool and register the hooks in your clone:

    pip install pre-commit
    pre-commit install

Best Practices

- Never commit data: use DVC or Git LFS instead.
- Never commit secrets: use environment variables.
- Small commits: one logical change per commit.
- Meaningful messages: describe what changed and why.
- Branch per experiment: easy to compare and revert.

Key Resources

- DVC Documentation
- Git LFS
- Pre-commit
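
The type(scope): description commit format described in this guide can be checked automatically. The following is a minimal sketch of such a check, not part of any tool mentioned here. The function name check_commit_header and the ALLOWED_TYPES tuple are hypothetical, the list of types is simply taken from the examples in this guide, and the scope is assumed to be optional because the experiment commit "exp: lr=0.001, acc=0.92" omits it.

```python
#!/usr/bin/env python3
import re
import sys

# Assumption: the allowed types are the ones used in this guide's examples.
ALLOWED_TYPES = ("feat", "fix", "exp", "docs", "refactor")

# type, optional "(scope)", then ": " and a non-empty description.
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[a-z0-9_-]+)\))?: (?P<description>\S.*)$"
)


def check_commit_header(header: str) -> bool:
    """Return True if the first line of a commit message follows the format."""
    match = HEADER_RE.match(header.strip())
    return bool(match) and match.group("type") in ALLOWED_TYPES


def main(argv: list[str]) -> int:
    # Git calls a commit-msg hook with the path of the message file as argv[1].
    with open(argv[1], encoding="utf-8") as fh:
        first_line = fh.readline()
    if check_commit_header(first_line):
        return 0
    print(
        "Commit message must look like 'type(scope): description' "
        f"with type in {ALLOWED_TYPES}",
        file=sys.stderr,
    )
    return 1


# Only act as a hook when a message file path was supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

One way to use this is to save the script as .git/hooks/commit-msg and make it executable. Git then runs it on every commit and aborts the commit when the script exits with a non-zero status.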