CI/CD for ML: GitHub Actions + DVC + MLflow 2.0
Table of Contents
- Introduction
- Prerequisites
- Step 1: Project Setup
- Step 2: Core Implementation
- Step 3: Configuration
- Step 4: Running the Code
- Step 5: Advanced Tips
- Results
- Going Further
- Conclusion
Introduction
In this comprehensive guide, we’ll explore how to set up a Continuous Integration and Continuous Deployment (CI/CD) pipeline for machine learning projects using GitHub Actions as the orchestrator, Data Version Control (DVC) for data management, and MLflow 2.0 for experiment tracking and model deployment. This integration streamlines your development process by automating tests, experiments, and deployments while keeping track of all changes in a reproducible manner.
Prerequisites
- Python 3.10 or higher installed.
- Git version control system.
- A GitHub account with access to repositories.
- DVC (Data Version Control) version 2.6+ installed.
- MLflow 2.0+ installed and running on your local machine or a remote server.
To install the necessary tools, run:
pip install dvc "mlflow>=2.0" boto3 scikit-learn pandas numpy
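The GitHub Actions jobs later in this guide install from a `requirements.txt`. A minimal one matching these prerequisites might look like the following (the version pins are illustrative assumptions, not tested constraints):

```text
# requirements.txt -- illustrative pins; adjust to your environment
dvc>=2.6
mlflow>=2.0
boto3
scikit-learn
pandas
numpy
pytest
```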
Step 1: Project Setup
Before diving into CI/CD setup, ensure you have an existing ML project to work with or follow this step to create a new one.
- Initialize Git and DVC: In your project directory, initialize both Git and DVC:

```bash
git init && dvc init
```

- Configure Remote Storage for DVC:
  - Choose an S3 bucket or another remote storage service supported by DVC (e.g., Google Cloud Storage).
  - Add the remote storage to your project with the `dvc remote add` command; the AWS profile is configured separately with `dvc remote modify`:

```bash
dvc remote add myremote s3://your-bucket-name
dvc remote modify myremote profile default-s3
```

- Add MLflow Tracking URI: Set up a local or cloud-based MLflow server and configure it in your project's `.env` file (or `mlflow_config.py`). Replace placeholders with actual values:

```python
import os

mlflow_tracking_uri = "http://localhost:5000"  # Local URI
os.environ["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri
```
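A slightly more defensive variant reads the URI from the environment with a fallback, so the same code works locally and in CI. This is a minimal sketch; the `localhost` default is an assumption for local development:

```python
import os

# Resolve the MLflow tracking URI from the environment, falling back to a
# local server; http://localhost:5000 is an assumed local-development default.
def resolve_tracking_uri(default: str = "http://localhost:5000") -> str:
    return os.getenv("MLFLOW_TRACKING_URI", default)
```

Pass the result to `mlflow.set_tracking_uri()` once at startup instead of mutating `os.environ` in several places.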
Step 2: Core Implementation
In this step, we'll write the core ML code and integrate DVC for data management. Our example trains a simple regression model with scikit-learn.
- Create Main Script (`main.py`):

```python
import os

import mlflow
import mlflow.sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the dataset (load_boston was removed in scikit-learn 1.2;
# the California housing dataset is a drop-in regression example)
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Initialize MLflow
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI"))

# Start MLflow run
with mlflow.start_run():
    # Train model
    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)

    # Log parameters, metrics, and the fitted model
    mlflow.log_param("n_features", data.data.shape[1])
    mlflow.log_metric("r2_test", lr_model.score(X_test, y_test))
    mlflow.sklearn.log_model(lr_model, artifact_path="model")
```
- Data Versioning with DVC: Assuming your dataset is stored in a directory named `data`:

```bash
dvc add data/
git add .dvc data.dvc .gitignore
git commit -m "Add dataset and initialize DVC"
```
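In CI, the training job would typically fetch the versioned data before running. A sketch of such a step, assuming AWS credentials are stored as repository secrets (the secret names here are assumptions):

```yaml
- name: Pull data from DVC remote
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: dvc pull
```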
Step 3: Configuration
We now configure our GitHub repository to automatically run tests, train models, and track experiments using MLflow.
- Create `.github/workflows` Directory: Inside your project directory, create a new folder named `.github/workflows`.
- Add Workflow YAML File (`ci_cd.yml`):
```yaml
name: CI/CD for Machine Learning

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: pytest

  train_model:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run training script
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: python main.py

  publish_model:
    needs: train_model
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Publish model to MLflow
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: python main.py

  push_to_dvc:
    needs: publish_model
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup DVC
        uses: iterative/setup-dvc@v1
      - name: Push to remote storage
        run: |
          dvc push
          git add .
          git commit -m "Push model and data artifacts"
          git push origin main
```
- Secrets in GitHub:
  - Go to your repository's Settings > Secrets and variables > Actions.
  - Add a new secret named `MLFLOW_TRACKING_URI` with the appropriate value.
Step 4: Running the Code
To validate the CI/CD workflow before pushing it to GitHub, you can run the commands from each job defined in `.github/workflows/ci_cd.yml` manually on your machine.
- Run Tests Locally:

```bash
pytest
```

- Train and Log Model:
  - Ensure your MLflow server is running.
  - Execute `python main.py`.
- Expected Output: After running the script, you should see the training metrics and parameters in the MLflow UI.
- Troubleshooting Tips:
  - If MLflow fails to start a run, verify that `MLFLOW_TRACKING_URI` is set correctly in your environment.
  - Ensure all dependencies specified in `requirements.txt` are up to date.
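The `pytest` step assumes you have at least one test. As a minimal sketch, a smoke test can verify that the modeling step fits cleanly on synthetic data (the file name and tolerances here are illustrative, not part of the project above):

```python
# test_training.py -- hypothetical smoke test for the training step
import numpy as np
from sklearn.linear_model import LinearRegression

def test_linear_regression_fits():
    # Generate a small synthetic regression problem with known coefficients
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    model = LinearRegression().fit(X, y)

    # With low noise, the model should explain almost all the variance
    assert model.score(X, y) > 0.9
```

Keeping the test free of MLflow and DVC dependencies lets the `test` job pass even when no tracking server is reachable from CI.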
Step 5: Advanced Tips
Optimizing Workflow
- Use matrix strategy for parallel execution of multiple jobs.
- Implement caching mechanisms to speed up workflow executions.
- Integrate automated model evaluation and deployment steps into the pipeline.
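For example, `actions/setup-python` can cache pip downloads, and a matrix can fan a job out across interpreter versions. A sketch of both, not tied to the workflow above:

```yaml
strategy:
  matrix:
    python-version: ['3.10', '3.11']
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v4
    with:
      python-version: ${{ matrix.python-version }}
      cache: 'pip'
```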
Best Practices
- Regularly update the MLflow tracking URI in your `.env` file as your server evolves.
- Store all sensitive information securely in GitHub secrets rather than hardcoding it.
Results
By following this tutorial, you’ve set up a robust CI/CD pipeline for your machine learning project. Your models are now automatically tested and logged upon each push to the main branch, ensuring reproducibility and efficient collaboration among team members.
Going Further
- Model Monitoring: Integrate Prometheus or similar monitoring tools with MLflow to monitor model performance in production.
- Documentation Generation: Use `mkdocs` or `sphinx` to document your models and deployment processes.
- Automated Notifications: Configure GitHub Actions to send notifications via email or Slack when jobs are triggered or completed.
Conclusion
This tutorial has walked you through setting up a comprehensive CI/CD pipeline using modern tools like DVC, MLflow 2.0, and GitHub Actions. By automating your machine learning project workflow, you enhance reproducibility, improve collaboration efficiency, and ensure continuous model improvement.