Imbalanced data: best practices

A guide to getting great results from ML models trained on imbalanced datasets

Rihab Feki
Jan 3, 2022

Machine Learning models are only as good as the data you feed them. That is why data preparation is the most important step in data science. Datasets pose many challenges, e.g. feature selection, feature engineering, encoding, and dimensionality reduction, and the most common one in classification problems is imbalanced data.

Dealing with imbalanced data is the focus of this article.

After reading this article you will learn:

  • What class imbalance is and why it is important to deal with it
  • Which metrics to use to evaluate your model's performance
  • Data resampling techniques (under-sampling & over-sampling)
  • Which classification models work best with class imbalance

Without further ado, let’s get started!

What is data imbalance?

Imbalanced data refers to a dataset where the target classes are represented in unequal proportions. This is typically the case in classification problems (binary or multi-class) where the classes are not represented equally.

Why is dealing with data imbalance important?

You might have experienced this: you train a classification model on imbalanced data and get 90% accuracy. Then you dig a little deeper and find out that 90% of the data belongs to one class.

This is what this article is for!

Class imbalance is not just common but sometimes expected

There are cases where class imbalance is inevitable. Take fraud detection datasets, where the majority of observations belong to the “non-fraudulent” class and the minority belong to the “fraudulent” class. Another example is email classification, where emails are sorted into ham or spam: the number of “spam” emails is usually much lower than the number of legitimate “ham” emails.

In such cases the minority class matters the most, so action has to be taken to deal with the imbalance and train robust models.

This is what you will learn in the following tips.

In this article I use an example dataset for classifying animal activity, in which the target classes are far from equally represented.
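If you want to run such a check on your own data, here is a minimal sketch (the DataFrame, column, and class names are placeholders made up for illustration):

import pandas as pd

# Toy stand-in for the example dataset
df = pd.DataFrame({"label": ["resting"] * 900 + ["walking"] * 80 + ["running"] * 20})

# The relative frequency of each target class reveals the imbalance
print(df["label"].value_counts(normalize=True))

# Optional: visualize the distribution as a bar chart (requires matplotlib)
df["label"].value_counts().plot(kind="bar", title="Target class distribution")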

Tip 1: Choosing the right metrics

Accuracy might be the go-to metric for classification problems, but with a class imbalance it can be the wrong one. The model mostly sees the majority class and learns that predicting it is usually “right”. This explains the high accuracy you could get, but that accuracy is misleading.

The following are some other performance metrics I would recommend:

  • Confusion Matrix: A summary of predictions in a table, showing correct classifications on the diagonal and, in each row, the incorrect predictions made for that class.

This is the resulting confusion matrix from the example dataset:

Confusion Matrix plot
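If you want to reproduce such a plot, here is a minimal, self-contained sketch on a synthetic imbalanced dataset (a stand-in for the article's data, with illustrative parameters):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% of samples in class 0
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)

# Correct classifications land on the diagonal of the plot
ConfusionMatrixDisplay.from_predictions(y_test, predictions)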
  • Precision: A measure of a classifier's exactness: the quality of the positive predictions the model makes.
  • Recall: A measure of a classifier's completeness: what proportion of actual positives is correctly classified?
  • F1 Score (or F-score): The harmonic mean of Precision and Recall, so this score takes both false positives and false negatives into account.

You can get an overview of these three metrics with the classification report from sklearn, shown below:

# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

Tip 2: Resampling your data

Before getting to this step: if it is possible to collect more data to balance the target classes, that would be great. If not…

A good starting point is to resample your imbalanced dataset manually. The following approaches are simple to implement, and it makes sense to try them before adopting synthetic resampling techniques:

  • Over-sampling: creating copies of minority-class samples to even up the classes.
  • Under-sampling: deleting samples from the majority classes until the class distribution is balanced.

This is an example of how you could apply under-sampling:

# Reducing the number of samples of the majority class
# Number of samples to delete
N = 20000
df = df.drop(df[df['label'].eq("majority-class")].sample(N).index)
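And here is a sketch of the manual over-sampling counterpart, duplicating minority-class rows with replacement (the column and class names are placeholders, as above):

import pandas as pd

# Number of minority-class copies to add
M = 10000
minority = df[df['label'].eq("minority-class")]
extra = minority.sample(M, replace=True)
df = pd.concat([df, extra], ignore_index=True)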

There are also synthetic under-sampling and over-sampling techniques; check this guide from the imbalanced-learn documentation to find out more about their implementation.
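As one illustration, here is a minimal sketch of synthetic over-sampling with SMOTE from imbalanced-learn, applied to the training split only for the reasons discussed at the end of this article (X_train and y_train are assumed to be numeric features and labels from an earlier split):

from imblearn.over_sampling import SMOTE

# Generate synthetic minority-class samples from the training data only
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)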

Tip 3: Stratified train-test split

In a classification setting, a stratified train-test split is often chosen to ensure that the train and test sets contain approximately the same percentage of samples of each target class as the complete set.

Stratified splits are desirable in some cases, like when you’re classifying an imbalanced dataset with a significant difference in the number of samples that belong to distinct classes.

Here is an example of two methods to implement a stratified train-test split:
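A minimal sketch of both, assuming a feature matrix X and a target vector y stored as NumPy arrays:

from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Method 1: train_test_split with the stratify argument
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Method 2: StratifiedShuffleSplit, which yields index arrays
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in splitter.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]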

Check this link to find out more about stratified train-test split methods from sklearn.

Tip 4: Choose models robust to class imbalance

Trying out different models is always better than sticking with a single model and endlessly tuning it.

Some models I would recommend are:

  • Tree-based algorithms often perform well on imbalanced datasets.
  • Boosting algorithms (e.g. AdaBoost, XGBoost, …) are well suited to imbalanced datasets because, at each successive training iteration, the weights of misclassified samples are increased, so the hard, often minority-class, examples get more attention (see the sketch below).
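A hedged sketch of both ideas: many sklearn tree ensembles accept a class_weight option that penalizes minority-class errors more heavily, and XGBoost exposes scale_pos_weight for binary problems (the values below are illustrative, not tuned; X_train and y_train as before):

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# "balanced" re-weights samples inversely to class frequency,
# making minority-class mistakes more costly during training
rf = RandomForestClassifier(class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)

# For binary targets, scale_pos_weight ~= n_negative / n_positive
xgb = XGBClassifier(scale_pos_weight=9)
xgb.fit(X_train, y_train)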

Avoid these mistakes

When you use any sampling technique (especially a synthetic one), split your data first and apply the sampling to the training data only. After training, evaluate on the test set, which contains only original samples. This protects your test data from leakage.

More generally, the test set should not be preprocessed together with the training data; this ensures there is no ‘peeking ahead’. Preprocess the training data separately and, once the model is created, apply the same preprocessing parameters fitted on the train set to the test set, as though the test set did not exist before.

For example, if you’re doing data pre-processing (e.g imputing missing values or oversampling ) with the mean, neighbouring , computing the mean over train+validation data will leak information about the validation set into your training process, and may cause your model to slightly overfit to the validation set.
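Here is a minimal sketch of the safe order of operations, using mean imputation as the fitted pre-processing step (X and y are assumed to be arrays with missing values encoded as NaN):

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Split first, so no statistic is ever fitted on test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Fit the imputer on the training data only...
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)

# ...then reuse the training-set means on the test data
X_test = imputer.transform(X_test)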

Conclusion

In this article you have seen different ways to deal with class imbalance in classification problems, along with the best practices I recommend trying out.

The takeaway from this guide is that data preparation is key to training robust predictive models.
