ML experiment management with DVC

Rihab Feki
7 min read · May 8, 2023



In this project, I will use DVC to manage and automate my ML experiments and pipelines. DVC enables data and model version control as well as experiment tracking and reproducibility.

Step 1: Download the data

The dataset, downloaded from Kaggle, will be used to train a model to classify bank customer churn.

Step 2: Create a DVC project

I created a new local repository (inspired by the Cookiecutter Data Science project template) that will grow into the following structure:

├── Makefile              <- Makefile with commands like `make env` or `make requirements`.
├── README.md             <- Documentation for using this project.
├── params.yaml           <- Configuration parameters, e.g. for training.
├── app
│   └── main.py           <- Ml_API app built with FastAPI.
├── data
│   ├── processed         <- Processed dataset.
│   └── raw               <- The original dataset (immutable data).
├── notebooks             <- Experimentation notebooks.
├── models                <- Trained and serialized models.
├── reports               <- Generated analysis as HTML, PDF, etc.
│   ├── figures           <- Generated graphics and figures to be used in reporting.
│   └── train_metrics.csv <- Relevant metrics after evaluating the model.
├── requirements.txt      <- The requirements file for reproducing the environment.
├── Dockerfile            <- Used to containerize the Ml_API app.
├── src                   <- Source code for use in this project.
│   ├── stages
│   │   ├── train.py      <- Script to train the model.
│   │   └── data_x.py     <- Scripts to manage/process data.
│   └── utils.py          <- Utility functions.
├── dvc.lock              <- The version definition of each dependency, stage, and
│                            output of the DVC data pipeline.
└── dvc.yaml              <- Defines the data pipeline stages, dependencies, and outputs.

Set up a Python virtual environment to get started. I will use Git to version control my code and DVC to version control my data and models.

Install DVC by running: pip install dvc

Initialize the DVC project by running dvc init. You can see that this creates a .dvc/ directory, which holds DVC's configuration and internal cache and is how DVC keeps track of the files you instruct it to track.
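Running dvc init typically creates a few internal files that you should commit to Git (the exact set may vary slightly by DVC version):

.dvc/config     <- DVC configuration, e.g. remote storage settings.
.dvc/.gitignore <- Keeps DVC's internal cache and temp files out of Git.
.dvcignore      <- Patterns for files and folders DVC should ignore.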

Step 3: Set up S3 as a DVC remote storage

Next, I will set up S3 as a data remote. DVC remote storage enables the centralization or distribution of data storage for sharing and collaboration as well as the synchronization of large files and directories tracked by DVC.

These are the steps to go through to set up S3 as remote storage:

  • Create a new S3 Bucket to use as your DVC remote:
Create S3 Bucket (image by the author)
  • Create an IAM user, select Attach policies directly, then select the AmazonS3FullAccess policy. If you want to be more specific with the permissions to the S3 bucket, click on Create Policy and it will open a new tab in which you can customize the policy in the JSON visual editor.
Set permissions to a user (image by the author)

  • Set credentials by creating access keys to give DVC access to the bucket. Click on “Security credentials” and select the “Access keys” section.

Create access keys (image by the author)
  • Install the AWS CLI and the DVC S3 extension:
$ pip install dvc-s3 
$ pip install awscli
  • Configure the AWS CLI with your IAM credentials by running:
$ aws configure

AWS Access Key ID [****************XXXO]:
AWS Secret Access Key [****************Pprx]:
Default region name [us-east-1]:
Default output format [json]:
  • Configure the S3 storage as the DVC default remote storage:
dvc remote add --default s3store s3://demo-dvc-storage

After executing this command, you can see that the following settings are auto-generated in the .dvc/config file:

[core]
    remote = s3store
['remote "s3store"']
    url = s3://demo-dvc-storage
  • The purpose of using DVC is to avoid pushing large data files to GitHub. To do that, add the data to DVC, which brings it under DVC control:
dvc add data/raw/Churn_Modelling.csv

Once you execute this command, you will notice that a Churn_Modelling.csv.dvc file is generated. The .dvc files are small text files that DVC uses to track the data. You will also notice that Churn_Modelling.csv is added to .gitignore. Remember the rule of thumb: large data files and folders go into DVC remote storage, while the small .dvc files go into GitHub. When you come back to your work and check out the code from GitHub, you will also get the .dvc files, which you can use to fetch your large data files. Commit these metafiles to Git:

git add data/raw/.gitignore data/raw/Churn_Modelling.csv.dvc
git commit -m "Added raw data to DVC"
  • Then push the data to the AWS S3 remote with this command:
dvc push 
Push data to remote storage (image by the author)
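Once the data is in remote storage, anyone who clones the Git repository can fetch it with dvc pull. You can also read DVC-tracked data directly from Python via DVC's dvc.api module. Below is a minimal sketch; the repository URL is a hypothetical placeholder for your own repo:

import pandas as pd
import dvc.api

# Open the DVC-tracked CSV straight from the remote storage backing the repo.
# The repo URL below is a placeholder; point it at your own repository.
with dvc.api.open(
    "data/raw/Churn_Modelling.csv",
    repo="https://github.com/<user>/<repo>",
) as f:
    df = pd.read_csv(f)

print(df.shape)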

Step 4: DVC pipelines

DVC is not only a version control tool that can handle large data files; it also lets you define pipelines. DVC pipelines are version-controlled machine learning workflow steps (e.g. data loading, cleaning, feature engineering, training, etc.).

A good practice is to use a config file, e.g. a params.yaml file, to specify the different configurations needed to load the data, preprocess it, define the hyperparameters of the ML model, etc. This simplifies reusability and experimentation.

This is an example of the params.yaml file I used for the bank customer churn classification problem.

base:
  project: bank_customer_churn
  random_state: 0

data:
  raw_data_dir: data/raw
  data_file_name: Churn_Modelling.csv
  cat_cols:
    - Geography
    - Gender
  num_cols:
    - CreditScore
    - Age
    - Tenure
    - Balance
    - NumOfProducts
    - HasCrCard
    - IsActiveMember
    - EstimatedSalary
  target_col: Exited

data_split:
  processed_data_dir: data/processed
  test_size: 0.2

train:
  model_dir: models
  model_type: XGBClassifier
  train_params:
    learning_rate: 0.2
    max_depth: 5
    n_estimators: 200

eval:
  model_path: models/model.pkl
  reports_dir: reports
  metrics_fname: metrics.csv

I used a Jupyter Notebook to explore the data and train a baseline model. To ensure reproducibility, I created modules for the different steps, e.g. split_data.py, train.py, and evaluate.py. These modules are used in the definition of the ML pipeline stages with DVC.
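For illustration, a minimal train.py stage could look like the sketch below. It is a simplified version based on the params.yaml above, not necessarily the exact code in the repository:

# src/stages/train.py -- a minimal sketch of the training stage.
# It reads its configuration from params.yaml, trains an XGBoost
# classifier, and serializes the model to models/model.pkl.
import pickle
from pathlib import Path

import pandas as pd
import yaml
from xgboost import XGBClassifier

with open("params.yaml") as f:
    params = yaml.safe_load(f)

# Load the training split produced by the split_data stage.
processed_dir = Path(params["data_split"]["processed_data_dir"])
x_train = pd.read_csv(processed_dir / "x_train.csv")
y_train = pd.read_csv(processed_dir / "y_train.csv")

# Train with the hyperparameters defined under train.train_params.
model = XGBClassifier(
    random_state=params["base"]["random_state"],
    **params["train"]["train_params"],
)
model.fit(x_train, y_train.values.ravel())

# Serialize the model to the path declared as the stage output.
model_dir = Path(params["train"]["model_dir"])
model_dir.mkdir(exist_ok=True)
with open(model_dir / "model.pkl", "wb") as f:
    pickle.dump(model, f)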

Below is an example of my dvc.yaml file. For each stage, it includes the command to run (e.g. python3 src/stages/train.py), its dependencies, params, and outputs.

stages:
  split_data:
    cmd: python3 src/stages/split_data.py
    deps:
      - src/stages/split_data.py
      - data/raw/Churn_Modelling.csv
    params:
      - base
      - data
      - data_split
    outs:
      - data/processed/x_train.csv
      - data/processed/x_test.csv
      - data/processed/y_train.csv
      - data/processed/y_test.csv
  train:
    cmd: python3 src/stages/train.py
    deps:
      - src/stages/train.py
      - data/processed/x_train.csv
      - data/processed/y_train.csv
    params:
      - data
      - data_split
      - train
    outs:
      - models/model.pkl
  eval:
    cmd: python3 src/stages/evaluate.py
    deps:
      - src/stages/evaluate.py
      - models/model.pkl
      - data/processed/x_test.csv
      - data/processed/y_test.csv
    metrics:
      - reports/metrics.json:
          cache: false
    plots:
      - reports/confusion_matrix.png:
          cache: false
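For completeness, a minimal evaluate.py matching the eval stage could look roughly like this. Again, this is a sketch rather than the exact code from the repo; it produces the reports/metrics.json and reports/confusion_matrix.png outputs declared above:

# src/stages/evaluate.py -- a minimal sketch of the evaluation stage.
import json
import pickle
from pathlib import Path

import pandas as pd
import yaml
from sklearn.metrics import ConfusionMatrixDisplay, f1_score, roc_auc_score

with open("params.yaml") as f:
    params = yaml.safe_load(f)

# Load the held-out test split produced by the split_data stage.
processed_dir = Path(params["data_split"]["processed_data_dir"])
x_test = pd.read_csv(processed_dir / "x_test.csv")
y_test = pd.read_csv(processed_dir / "y_test.csv").values.ravel()

# Load the model serialized by the train stage.
with open(params["eval"]["model_path"], "rb") as f:
    model = pickle.load(f)

preds = model.predict(x_test)
probas = model.predict_proba(x_test)[:, 1]

reports_dir = Path(params["eval"]["reports_dir"])
reports_dir.mkdir(exist_ok=True)

# Write the metrics file that dvc.yaml declares under `metrics:`.
metrics = {
    "f1_score": float(f1_score(y_test, preds)),
    "roc_auc_score": float(roc_auc_score(y_test, probas)),
}
with open(reports_dir / "metrics.json", "w") as f:
    json.dump(metrics, f, indent=4)

# Save the confusion matrix plot that dvc.yaml declares under `plots:`.
disp = ConfusionMatrixDisplay.from_predictions(y_test, preds)
disp.figure_.savefig(reports_dir / "confusion_matrix.png")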

To execute the pipeline, run:

dvc repro

You will see an update in the dvc.lock file. DVC uses this metafile to track the data used and produced by each stage. Again, you need to commit the dvc.lock file to Git and then push the tracked files to S3 with dvc push.

Note: A pipeline automatically adds newly created files to DVC control, just as if you had typed dvc add.

Try making a change to the training params in the params.yaml file and run dvc repro again. DVC can help you compare the metrics of different experiments, which you can do by running:

$ dvc metrics diff

Path                  Metric         HEAD     workspace  Change
reports/metrics.json  f1_score       0.5441   0.55213    0.00803
reports/metrics.json  roc_auc_score  0.69848  0.70397    0.00549

Step 5: Experiment management & tracking with Iterative Studio

Iterative Studio is a great way to visualize experiments and their metadata and even compare them.

Another advantage is that you can collaborate with your team on the project, and having such a visual interface helps improve productivity.

You can connect Iterative Studio to a Git service, e.g. GitHub, and then add your project, as shown below:

Connect to Iterative Studio (image by the author)

Create different branches and run experiments by changing the model or the hyperparameters. Once you push the changes to GitHub, you get a synchronized overview of your experiments, as shown below:

Iterative Studio dashboard (Image by the author)

Select some experiments and click the Compare button; you should then see the differences in metrics, as shown below:

Experiment comparison overview on Iterative Studio (image by the author)

At this stage, you are able to use DVC to track your machine-learning experiments and find the best model to predict customer churn.

Find the source code of this project under this link.

I hope this article can help you and your team boost your productivity!

Thank you for reading :-)
