ML project — using YOLOv8, Roboflow, DVC, and MLflow on DagsHub

Training a YOLOv8 model on a custom dataset

Rihab Feki
7 min read · Apr 10, 2023

By following this tutorial you will learn how to train a YOLOv8 model on a custom dataset, version data and build data pipelines with DVC, and track experiments with DVC & MLflow. I will show how I go through the different steps of the project from scratch, so follow along🔥

Hopefully, it will be an inspiration in your ML learning journey ✨ ;)

Project Overview

This is a Computer Vision project for Object Detection. The target is to detect wildfire smoke. The project consists of the following parts:

  • Project setup & data management
  • Training a YOLOv8 model on a custom dataset
  • Creating a Data Pipeline with DVC
  • Setting up MLflow logging

Project setup

Step 1: Create a repository on DagsHub

I will show how I made the setup from scratch. You can create a new project on DagsHub by connecting to a project you have on GitHub. This enables both repositories to be synchronized.

DagsHub is a GitHub for Machine Learning projects. It is a platform for data scientists and machine learning engineers to version their data, models, experiments, and code.

When you create a repository on DagsHub, you get access to three remotes, Git, DVC, and MLflow, that are automatically configured for this repository.

Clone the project and set up a Python virtual environment by running the following commands:

git clone https://github.com/RihabFekii/wildfire-smoke-detector.git
make env
source env/bin/activate

Step 2: Configuring DagsHub as a DVC remote storage

I will store the data on DagsHub Storage, which I will configure as a DVC remote. This provides 10 GB of free remote storage, fully configured and managed with DVC to version control the data.

DVC is a version control tool for machine learning projects designed to manage large files, data sets, machine learning models, and metrics. It works on top of Git to easily integrate with your existing Git code repositories.

Let’s first set up DVC by installing it in the virtual environment and initializing it with the following commands:

pip install dvc 
dvc init

To configure DagsHub Storage as a DVC remote with your local machine, follow these steps:

  1. Click on the “Remote” button, then in the “Data” tab choose DVC and the HTTP protocol.
  2. Open a terminal in your project and run the following commands:
dvc remote add origin https://dagshub.com/<DagsHub-user-name>/<project>.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <DagsHub-user-name>
dvc remote modify origin --local password <Token>

Step 3: Download the dataset

The dataset of wildfire smoke is downloaded from Roboflow. You can download the dataset in different formats as shown in the figure below:

Roboflow offers a set of Computer Vision tools for data processing, training, and deployment. It also provides a universe that hosts thousands of open datasets and pre-trained models.

Note: You can create your own dataset and annotate it with Roboflow.
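
If you prefer to download the dataset programmatically, Roboflow also generates a Python snippet for each export format. It looks roughly like the following sketch, where the API key, workspace, project name, and version number are placeholders you would replace with your own values:

from roboflow import Roboflow

# Placeholders: use your own API key, workspace, project, and version
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("<workspace>").project("<project-name>")

# Export the dataset in YOLOv8 format
dataset = project.version(1).download("yolov8")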

I added the downloaded data to the data directory, which is split into “train”, “test”, and “valid” folders, each with a corresponding “labels” folder containing one *.txt file per image in the YOLOv8 label format, which is the following (a short conversion example follows this list):

  • One row per object
  • Each row is in class x_center y_center width height format.
  • Box coordinates must be in normalized xywh format (from 0–1). If your boxes are in pixels, divide x_center and width by image width, and y_center and height by image height.
  • Class numbers are zero-indexed (start from 0).
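
As an illustration, here is a minimal sketch of how a single bounding box in pixel coordinates would be converted into one YOLO label line (the image size, box corners, and class index below are made-up values for illustration):

# Convert one bounding box from pixel coordinates to a YOLO label line.
# All values below are made-up examples for illustration.
img_w, img_h = 640, 480                          # image size in pixels
x_min, y_min, x_max, y_max = 100, 150, 300, 350  # box corners in pixels
class_id = 0                                     # "smoke" is the only class, so index 0

x_center = ((x_min + x_max) / 2) / img_w
y_center = ((y_min + y_max) / 2) / img_h
width = (x_max - x_min) / img_w
height = (y_max - y_min) / img_h

# One row per object: "class x_center y_center width height"
label_line = f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
print(label_line)  # e.g. "0 0.312500 0.520833 0.312500 0.416667"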

Step 4: Track and push the data with DVC

When you track the data directory with DVC, it stores information about the added data in a special .dvc file named data.dvc, a small text file in a human-readable format. This metadata file is a placeholder for the original data and can be versioned with Git.

To track the data with DVC and Git, execute these commands:

dvc add data

git add data.dvc .gitignore
git commit -m "Added raw data"

At this point, I will push (or upload) the data tracked by DVC to the DagsHub remote storage, by running this command:

dvc push -r origin 

Make sure to re-run the “dvc push” command every time you make changes and want to track them. This will make the raw data available for sharing & collaborating on DagsHub, as shown below:

Training

Step 5: Training a YOLOv8 model on a custom dataset

Ultralytics released the YOLOv8 family of object detection models that use the YOLO (You Only Look Once) architecture.

You can experiment with the different models of YOLOv8 which are the following: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. The choice of the model depends on how big your training dataset is and the speed and accuracy requirements. YOLOv8 Nano is the fastest and smallest, while YOLOv8 Extra Large (YOLOv8x) is the most accurate yet the slowest among them.
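
Switching between these variants only changes the checkpoint you load. A minimal sketch with the ultralytics package:

from ultralytics import YOLO

# The checkpoint name selects the model size:
# yolov8n.pt (nano), yolov8s.pt (small), yolov8m.pt (medium),
# yolov8l.pt (large), yolov8x.pt (extra large)
model = YOLO("yolov8n.pt")  # smallest and fastest variant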

For training, you will need a dataset YAML file, e.g. data.yaml, that defines the paths to the images and the class names, as shown below:

path: path/to/data/directory
train: train/images # relative to 'path'
val: valid/images
test: test/images

# class names
names:
  0: smoke

To get the best performance, you need to tune the hyperparameters of the YOLO model. You can do that in the params.yaml file. Some common YOLO training configurations include batch size, learning rate, momentum, and weight decay. Other factors that may affect the training process include the choice of the optimizer, the choice of the loss function, and the size and composition of the training dataset.

This is an example of a params.yaml file that I used for the training:

model_type: yolov8s.pt 
pretrained: True
seed: 0
imgsz: 640
batch: 8
epochs: 25
optimizer: SGD # other choices=['SGD', 'Adam', 'AdamW', 'RMSProp']
lr0: 0.01 # learning rate
name: 'yolov8s_exp_v0' # experiment name

The training script looks like the following:

import yaml
from ultralytics import YOLO

# path to the dataset YAML file (data.yaml) described above
DATA_YAML_PATH = "path/to/data.yaml"

# load the training parameters
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# load a pre-trained model
pre_trained_model = YOLO(params['model_type'])

# train
results = pre_trained_model.train(
    data=DATA_YAML_PATH,
    imgsz=params['imgsz'],
    batch=params['batch'],
    epochs=params['epochs'],
    optimizer=params['optimizer'],
    lr0=params['lr0'],
    seed=params['seed'],
    pretrained=params['pretrained'],
    name=params['name']
)
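
The DVC train stage shown in the next step expects the trained weights at models/model.pt. One way to get them there is to copy the best checkpoint out of the Ultralytics run directory at the end of the script. This is a sketch that assumes the default runs/detect/<experiment-name>/ output location, which may differ depending on your Ultralytics version and settings:

import shutil
from pathlib import Path

# Ultralytics writes run artifacts under runs/detect/<experiment-name>/ by default;
# adjust this path if your version or settings put them elsewhere.
best_weights = Path("runs/detect") / params['name'] / "weights" / "best.pt"

Path("models").mkdir(exist_ok=True)
shutil.copy(best_weights, "models/model.pt")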

Step 6: Creating a data pipeline with DVC — train stage

Pipelines are key to building reproducible ML workflows. To find the best model, data scientists need to iterate and run multiple experiments with different hyperparameters and model types or architectures. Since some of the steps remain unchanged between runs, e.g. data ingestion or the training script, pipelines can optimize the ML workflow in development as well as in production.

A pipeline is composed of stages and each stage is a computation task that produces data for the next stage. With DVC you can build a DAG (Directed Acyclic Graph) that is a dependency graph of stages, as shown below:

Train stage — data pipeline with DVC

You can manage what dependencies and outputs to control with DVC using a “dvc.yaml” file. This is where your DVC data pipeline will reside.

stages:
  train:
    cmd: python src/train.py
    deps:
      - data/raw/wildfire-raw-yolov8
      - src/train.py
      - src/utils.py
      - params.yaml
    outs:
      - models/model.pt
      - reports/train_params.yaml:
          cache: false
    metrics:
      - reports/train_metrics.csv:
          cache: false

To run this pipeline, execute the following command in your terminal:

dvc repro 

If you have a DVC pipeline, use dvc exp run to both run your code pipeline and save experiment results.

Step 7: Experiment tracking with MLflow

Experiment tracking in Machine Learning is the process of saving experiment-related information and metadata that includes the following:

  • Model parameters
  • Model metrics
  • Model Artifacts

MLflow is an open-source tool for experiment tracking. It saves all your experiments' metadata in one place and enables model versioning, so you can easily reproduce and compare different experiments.

DagsHub provides a remote MLflow server with every repository. You can log experiments with MLflow to it, view its information under the experiment tab, and manage your trained models from the MLflow UI built into your DagsHub project.

Experiment Table — Source: https://dagshub.com/docs/feature_guide/discovering_experiments/

Set DagsHub as the remote MLflow server by following these steps:

  1. Install MLflow:
pip install mlflow

2. Set DagsHub as the MLflow server URI by adding this line of code:

import mlflow

mlflow.set_tracking_uri("https://dagshub.com/<DagsHub-user-name>/<repository-name>.mlflow")

MLflow integration with DagsHub

3. Authenticate to the DagsHub MLflow server by setting the following environment variables; run these commands in your terminal:

export MLFLOW_TRACKING_URI=https://dagshub.com/<DagsHub-user-name>/<repository-name>.mlflow
export MLFLOW_TRACKING_USERNAME=<your-user-name>
export MLFLOW_TRACKING_PASSWORD=<your_token>
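
If you prefer to keep everything in the training script, the same credentials can also be set from Python before any MLflow call (a sketch that reuses the placeholders above):

import os

# Same values as the exported variables above; replace the placeholders
os.environ["MLFLOW_TRACKING_USERNAME"] = "<your-user-name>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<your_token>"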

Check out MLflow logging functions to customize your experiment tracking. Once you run some experiments, you can access the logs of the params, metrics, and artifacts in the MLflow UI.

MLflow UI

with mlflow.start_run():
    # ADD YOUR TRAINING SCRIPT HERE
    mlflow.log_param('param_name', param)
    mlflow.log_artifact('artifact_path')
    mlflow.log_metric('metric_name', metric)
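
For example, here is a hedged sketch of how the hyperparameters from params.yaml and the files produced by the train stage could be logged around the training call; the artifact paths reuse the ones declared in dvc.yaml and are assumptions about your project layout:

import mlflow

with mlflow.start_run(run_name=params['name']):
    # log every hyperparameter from params.yaml in one call
    mlflow.log_params(params)

    # ... run the YOLO training shown earlier ...

    # log the files produced by the train stage
    mlflow.log_artifact("reports/train_metrics.csv")
    mlflow.log_artifact("models/model.pt")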

Track experiments

You can easily make changes in the params.yaml file to the model type, hyperparameters, etc., and run experiments with dvc exp run.

To visualize your experiments with DVC, use dvc exp show.
