CI/CD for ML: GitHub Actions + DVC + MLflow 2.0 πŸš€

Introduction

In this comprehensive guide, we’ll explore how to set up a Continuous Integration and Continuous Deployment (CI/CD) pipeline for machine learning projects using GitHub Actions as the orchestrator, Data Version Control (DVC) for data management, and MLflow 2.0 for experiment tracking and model deployment. This integration streamlines your development process by automating tests, experiments, and deployments while keeping track of all changes in a reproducible manner.

Prerequisites

  • Python 3.10 or higher installed.
  • Git version control system.
  • A GitHub account with access to repositories.
  • DVC (Data Version Control) version 2.6+ installed.
  • MLflow 2.0+ installed and running on your local machine or a remote server.

To install the necessary tools, run:

pip install "dvc[s3]" "mlflow>=2.0" boto3 scikit-learn pandas numpy
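The workflow files later in this guide install dependencies from a requirements.txt. A minimal sketch is below; the exact pins are illustrative assumptions, so adjust them to your project:

```text
dvc[s3]>=2.6
mlflow>=2.0
scikit-learn
pandas
numpy
pytest
```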

Step 1: Project Setup

Before diving into CI/CD setup, ensure you have an existing ML project to work with or follow this step to create a new one.

  1. Initialize Git and DVC: In your project directory, initialize both Git and DVC.
    git init && dvc init
    
  2. Configure Remote Storage for DVC:
    • Choose an S3 bucket or another remote storage service supported by DVC (e.g., Google Cloud Storage).
    • Add it to your project with the dvc remote add command; an S3 profile is configured separately via dvc remote modify.
    dvc remote add -d myremote s3://your-bucket-name
    dvc remote modify myremote profile default
    
  3. Add MLflow Tracking URI:
    • Set up a local or cloud-based MLflow server and configure it in your project’s .env file (or mlflow_config.py). Replace placeholders with actual values.
    import os
    
    mlflow_tracking_uri = "http://localhost:5000"  # Local URI
    os.environ["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri
    
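Rather than hardcoding the URI, a small helper can resolve it from the environment with a local fallback, so the same script works both on your machine and in CI. This is a minimal sketch; the default URL and the helper name are assumptions:

```python
import os


def resolve_tracking_uri(default: str = "http://localhost:5000") -> str:
    """Return MLFLOW_TRACKING_URI from the environment, else a local default."""
    return os.environ.get("MLFLOW_TRACKING_URI", default)


# With no environment override, the local default is used
print(resolve_tracking_uri())
```

In CI, the GitHub secret set in Step 3 populates `MLFLOW_TRACKING_URI`, so the fallback only kicks in for local runs.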

Step 2: Core Implementation

In this step, we’ll write the core ML code and integrate DVC for data management. Our example trains a simple linear regression model with scikit-learn.

  1. Create Main Script (main.py):
import os

import mlflow
import mlflow.sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the dataset (load_boston was removed in scikit-learn 1.2;
# the California housing dataset is a drop-in regression example)
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Point MLflow at the tracking server configured earlier
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000"))

# Start an MLflow run and log parameters, metrics, and the model
with mlflow.start_run():
    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)

    mlflow.log_param("n_features", data.data.shape[1])
    mlflow.log_metric("r2_test", lr_model.score(X_test, y_test))
    mlflow.sklearn.log_model(lr_model, artifact_path="model")
  2. Data Versioning with DVC:
    • Assume your dataset is stored in a directory named data. Running dvc add creates a data.dvc pointer file and gitignores the data itself.
    dvc add data
    git add data.dvc .gitignore
    git commit -m "Add dataset and initialize DVC"
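Optionally, the training step can be captured as a DVC pipeline stage so that `dvc repro` re-runs it whenever the script or data changes. A minimal dvc.yaml sketch follows; the stage name is an assumption:

```yaml
stages:
  train:
    cmd: python main.py
    deps:
      - main.py
      - data
```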
    

Step 3: Configuration

We now configure our GitHub repository to automatically run tests, train models, and track experiments using MLflow.

  1. Create .github/workflows Directory: Inside your project directory, create the nested folders .github/workflows.
  2. Add Workflow YAML File (ci_cd.yml):
name: CI/CD for Machine Learning

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
  train_model:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run training script
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: python main.py

  push_to_dvc:
    needs: train_model
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup DVC
        uses: iterative/setup-dvc@v1
      - name: Push artifacts to remote storage
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc push
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add .
          git diff --cached --quiet || git commit -m "Push model and data artifacts"
          git push origin main
  3. Secrets in GitHub:
    • Go to your repository’s Settings > Secrets and variables > Actions.
    • Add a new secret named MLFLOW_TRACKING_URI with the appropriate value (plus AWS credentials if DVC pushes to S3).
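If you prefer the command line, the GitHub CLI can set the same secret. This is a sketch that assumes `gh` is installed and authenticated against your repository; the URL is a placeholder:

```shell
gh secret set MLFLOW_TRACKING_URI --body "http://your-mlflow-server:5000"
```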

Step 4: Running the Code

Before pushing to GitHub, you can exercise the pipeline locally by running the same commands each job in .github/workflows/ci_cd.yml executes.

  1. Run Tests Locally:
    pytest
    
  2. Train and Log Model:
    • Ensure your MLflow server is running.
    • Execute python main.py.
  3. Expected Output:
    • After running the script, the run should appear in the MLflow UI with its logged parameters and metrics.
  4. Troubleshooting Tips:
    • If MLflow fails to start a run, verify that your environment variables are correctly set (MLFLOW_TRACKING_URI).
    • Ensure all dependencies specified in requirements.txt are up-to-date.
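The workflow's test job runs pytest, so the repository needs at least one test to pass. A minimal smoke test might check that a linear model recovers a known slope on noise-free data; the file name test_training.py is an assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def test_linear_regression_recovers_slope():
    # y = 3x with no noise: a linear model should fit it almost exactly
    X = np.arange(10, dtype=float).reshape(-1, 1)
    y = 3.0 * X.ravel()

    model = LinearRegression().fit(X, y)

    assert abs(model.coef_[0] - 3.0) < 1e-6
    assert model.score(X, y) > 0.999
```

Save it as test_training.py in the repository root and `pytest` will discover it automatically.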

Step 5: Advanced Tips

Optimizing Workflow

  • Use matrix strategy for parallel execution of multiple jobs.
  • Implement caching mechanisms to speed up workflow executions.
  • Integrate automated model evaluation and deployment steps into the pipeline.
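The matrix and caching tips above can be sketched directly in the workflow YAML. The Python versions listed are illustrative, and `cache: 'pip'` relies on setup-python's built-in pip caching keyed on requirements.txt:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.10', '3.11']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'   # caches pip downloads keyed on requirements.txt
      - run: pip install -r requirements.txt
```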

Best Practices

  • Regularly update MLflow tracking URI in .env file as your server evolves.
  • Ensure all sensitive information is stored securely via GitHub secrets rather than hardcoding them.

Results

By following this tutorial, you’ve set up a robust CI/CD pipeline for your machine learning project. Your models are now automatically tested and logged upon each push to the main branch, ensuring reproducibility and efficient collaboration among team members.

Going Further

  • Model Monitoring: Integrate Prometheus or similar monitoring tools with MLflow to monitor model performance in production.
  • Documentation Generation: Use mkdocs or sphinx for documenting your models and deployment processes.
  • Automated Notifications: Configure GitHub Actions to send notifications via email or Slack when new jobs are triggered or completed.

Conclusion

This tutorial has walked you through setting up a comprehensive CI/CD pipeline using modern tools like DVC, MLflow 2.0, and GitHub Actions. By automating your machine learning project workflow, you enhance reproducibility, improve collaboration efficiency, and ensure continuous model improvement.

