Imagine you create a classification model and right off the bat you get 90% accuracy. The results seem amazing, but when you dive a little deeper into your data you notice that almost all of it belongs to one class. Imbalanced data can cause a lot of frustration, especially when you discover that the great results you thought you were getting were an illusion created by the imbalanced classes.
WHAT IS MEANT BY IMBALANCED DATA?
Imbalanced data typically refers to classification tasks where the classes are not represented equally. For example, you may have a binary classification problem with 100 instances, of which 80 are labeled Class-1 and the remaining 20 are labeled Class-2. This is an imbalanced dataset, and the ratio of Class-1 to Class-2 instances is 4:1. Most real-world classification problems display some level of class imbalance, which happens when there are not enough instances of the data corresponding to one of the class labels. It is therefore imperative to choose the evaluation metric for your model correctly. Otherwise, you might end up optimizing a metric that tells you little about how well the model actually performs. In a real business scenario, this may lead to wasted effort. There are problems where class imbalance is not just common; it is bound to happen. For example, in datasets of fraudulent and non-fraudulent transactions, the number of fraudulent transactions is very likely to be much smaller than the number of non-fraudulent ones. And this is where the problem arises.
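To make the point about misleading metrics concrete, here is a minimal sketch using the 80/20 split from the example above. The "classifier" is a hypothetical baseline that always predicts the majority class; the variable names are illustrative only.

```python
from collections import Counter

# Toy labels matching the 80/20 example in the text (Class-1 vs Class-2).
y_true = ["Class-1"] * 80 + ["Class-2"] * 20

counts = Counter(y_true)
majority_label, _ = counts.most_common(1)[0]

# A hypothetical baseline that ignores its input and always predicts
# the majority class.
y_pred = [majority_label] * len(y_true)

# Accuracy: fraction of all instances predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of Class-2 instances that the baseline actually finds.
minority_found = sum(
    t == p for t, p in zip(y_true, y_pred) if t == "Class-2"
)
minority_recall = minority_found / counts["Class-2"]

print(f"class counts: {dict(counts)}")
print(f"accuracy: {accuracy:.2f}")
print(f"Class-2 recall: {minority_recall:.2f}")
```

The baseline reaches 80% accuracy without ever identifying a single Class-2 instance, which is why accuracy alone is a poor metric on imbalanced data.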
WHY ARE IMBALANCED DATASETS A SERIOUS PROBLEM TO TACKLE?
Although machine learning algorithms have shown great success in many real-world applications, learning from imbalanced data is still far from a solved problem. Learning from imbalanced data is referred to as Imbalanced Learning.
Following are the significant problems of Imbalanced Learning:
- When one class is underrepresented in the dataset, the class distribution is skewed.
- Because of the inherently complex characteristics of such datasets, learning from them requires new understandings, new approaches, new principles, and new tools to transform the data. Even then, an efficient solution to your business problem is not guaranteed. In the worst case, the effort is wasted, with nothing left to reuse.
APPROACHES FOR HANDLING IMBALANCED DATA
Defining four fundamental terms here:
- True Positive (TP): An instance that is positive and is classified correctly as positive
- True Negative (TN): An instance that is negative and is classified correctly as negative
- False Positive (FP): An instance that is negative but is classified wrongly as positive
- False Negative (FN): An instance that is positive but is classified incorrectly as negative
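The four definitions above can be sketched as a small counting function. This is an illustrative helper (the name `confusion_counts` and the toy labels are assumptions, not part of the original example); it treats one label as "positive" and everything else as "negative".

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem.

    `positive` is the label treated as the positive class; every other
    label is treated as negative.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # positive, classified as positive
            else:
                fp += 1  # negative, wrongly classified as positive
        else:
            if truth == positive:
                fn += 1  # positive, wrongly classified as negative
            else:
                tn += 1  # negative, classified as negative
    return tp, tn, fp, fn


# Toy data: 3 positive and 7 negative instances.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

The four counts always sum to the number of instances, which is a quick sanity check when building a confusion matrix by hand.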
Suppose you got the following True Positive and Negative Rates and False Positive and Negative Rates for Logistic Regression:
Now, assume that the True Positive and Negative Rates and False Positive and Negative Rates for Random Forest are as follows:
Just look at the number of negative classes correctly predicted (True Negatives) by both classifiers. As you are dealing with an imbalanced dataset, you need to give this number the highest priority (because Class-1 is dominant in the dataset). Considering that, Random Forest beats Logistic Regression easily. The representation above is popularly known as the Confusion Matrix. The following two terms are derived from the confusion matrix and are used when you evaluate a classifier.
Precision: Precision is the number of True Positives divided by the sum of True Positives and False Positives. Put another way, it is the number of correct positive predictions divided by the total number of positive predictions. It is also called the Positive Predictive Value (PPV). Precision can be thought of as a measure of a classifier’s exactness. A low precision indicates a large number of False Positives.
Recall: Recall is the number of True Positives divided by the sum of True Positives and False Negatives. Put another way, it is the number of correct positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate. Recall can be thought of as a measure of a classifier’s completeness. A low recall indicates many False Negatives.
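As a minimal sketch of the two definitions above, precision and recall can be computed directly from the confusion-matrix counts. The function names and the counts used here are made up for illustration; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision(tp, fp):
    """TP / (TP + FP): how many positive predictions were correct."""
    predicted_positive = tp + fp
    # Assumed convention: no positive predictions -> precision 0.0.
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """TP / (TP + FN): how many actual positives were found."""
    actual_positive = tp + fn
    # Assumed convention: no actual positives -> recall 0.0.
    return tp / actual_positive if actual_positive else 0.0


# Illustrative counts for a minority (positive) class.
tp, fp, fn = 15, 5, 25

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.3f} recall={r:.3f}")
```

With these counts the classifier is fairly exact (few False Positives) but not very complete (many False Negatives), which is a typical pattern for the minority class on imbalanced data.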
Some other metrics that can be useful in this context:
- AUC (Area Under the ROC Curve)
- ROC (Receiver Operating Characteristic) Curve
- Matthews Correlation Coefficient (MCC)
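Of the metrics listed above, MCC is the simplest to compute by hand because it depends only on the four confusion-matrix counts. The sketch below uses the standard MCC formula; the function name `mcc_from_counts` and the example counts are illustrative, and returning 0.0 for a zero denominator is an assumed convention.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        # Assumed convention when a row or column of the matrix is empty.
        return 0.0
    return numerator / denominator


# 100 instances, 20 of them positive (the minority class).
perfect = mcc_from_counts(tp=20, tn=80, fp=0, fn=0)
always_majority = mcc_from_counts(tp=0, tn=80, fp=0, fn=20)
inverted = mcc_from_counts(tp=0, tn=0, fp=80, fn=20)

print(perfect, always_majority, inverted)
```

A classifier that always predicts the majority class scores 80% accuracy on this data but an MCC of 0, which is why MCC is often suggested for imbalanced problems.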
RE-SAMPLING THE DATASET:
Dealing with imbalanced datasets includes various strategies, such as improving classification algorithms or balancing classes in the training data (essentially a data preprocessing step) before providing the data as input to the machine learning algorithm. The latter technique is preferred as it has broader application and adaptation. The time taken to enhance an algorithm is usually higher than the time taken to generate the required samples. For research purposes, however, both are used. The main idea of sampling classes is to either increase the number of samples of the minority class or decrease the number of samples of the majority class. This is done in order to obtain a fair balance in the number of instances for both classes.
There can be two main types of sampling:
- You can add copies of instances from the minority class which is called up-sampling (over-sampling/sampling with replacement), or
- You can delete instances from the majority class, which is called down-sampling (under-sampling)
When you randomly eliminate instances from the majority class of a dataset, without replacing them, until the classes are more evenly balanced, it is known as random under-sampling. The instances to remove are chosen at random, which is what makes the process random.
Advantages of this approach:
- It can help improve the runtime of the model and solve the memory problems by reducing the number of training data samples when the training data set is enormous.
Disadvantages:
- It can discard useful information about the data itself which could be necessary for building rule-based classifiers such as Random Forests.
- The sample chosen by random under-sampling may be a biased sample. And it will not be an accurate representation of the population in that case. Therefore, it can cause the classifiers to perform poorly on real unseen data.
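Random under-sampling as described above can be sketched in a few lines of standard-library Python. The function name `random_undersample`, its parameters, and the toy dataset are assumptions made for illustration; a production pipeline would more likely rely on a dedicated library.

```python
import random
from collections import Counter


def random_undersample(X, y, majority_label, n_keep, seed=0):
    """Keep a random subset of `n_keep` majority-class instances.

    All instances of the other class(es) are kept unchanged.
    Raises ValueError if `n_keep` exceeds the majority-class size.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    other_idx = [i for i, label in enumerate(y) if label != majority_label]

    # Sample WITHOUT replacement from the majority class.
    kept = rng.sample(majority_idx, n_keep)

    # Preserve the original ordering of the retained instances.
    idx = sorted(kept + other_idx)
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy dataset: 980 majority (label 0) and 20 minority (label 1) instances.
X = [[float(i)] for i in range(1000)]
y = [0] * 980 + [1] * 20

X_small, y_small = random_undersample(X, y, majority_label=0, n_keep=20)
print(Counter(y_small))
```

Note that 960 majority instances are thrown away here, which illustrates the information loss listed among the disadvantages.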
Just like random under-sampling, you can perform random over-sampling as well. In this case, without taking any help from the majority class, you increase the number of instances in the minority class by replicating them up to a constant degree. You do not decrease the number of instances in the majority class. Say you have a dataset with 1000 instances, where 980 instances belong to the majority class and the remaining 20 belong to the minority class. Now you over-sample the dataset by replicating the 20 instances 20 times. As a result, after over-sampling, the total number of instances in the minority class will be 400.
Advantages of this approach:
- Unlike under-sampling, this method leads to no information loss.
Disadvantages:
- It increases the likelihood of overfitting since it replicates the minority class events.
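The 980/20 example above can be sketched as follows. The function name `random_oversample` and its parameters are illustrative assumptions; the sketch draws extra minority instances at random with replacement until the minority class reaches a target size, which is one simple way to implement the replication described in the text.

```python
import random
from collections import Counter


def random_oversample(X, y, minority_label, n_target, seed=0):
    """Replicate random minority instances until there are `n_target`.

    Majority-class instances are left untouched.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]

    # Number of duplicated instances to add (never negative).
    n_extra = max(0, n_target - len(minority_idx))

    # Sample WITH replacement, so the same instance can be copied repeatedly.
    extra_idx = rng.choices(minority_idx, k=n_extra)

    X_out = list(X) + [X[i] for i in extra_idx]
    y_out = list(y) + [y[i] for i in extra_idx]
    return X_out, y_out


# Dataset from the text: 980 majority (0) and 20 minority (1) instances.
X = [[float(i)] for i in range(1000)]
y = [0] * 980 + [1] * 20

X_big, y_big = random_oversample(X, y, minority_label=1, n_target=400)
print(Counter(y_big))
```

Every added row is an exact copy of an existing minority instance, which is the source of the overfitting risk mentioned above.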
You can consider the following factors while thinking of applying under-sampling & over-sampling:
- Consider applying under-sampling when you have a lot of data
- Consider applying over-sampling when you don’t have a lot of data
- Consider applying random and non-random (e.g., stratified) sampling schemes.
- Consider applying different ratios of the class-labels (e.g., you don’t have to target a 1:1 ratio in a binary classification problem, try other ratios)
Generating Synthetic Samples:
A simple way to create synthetic samples is to sample the attributes from instances in the minority class randomly. There are systematic algorithms that you can use to generate synthetic samples. The most popular of such algorithms is called SMOTE or the Synthetic Minority Over-Sampling Technique. It was proposed in 2002, and the following info-graphic will give you a good idea about the synthetic samples:
SMOTE is an over-sampling method which creates “synthetic” examples rather than over-sampling with replacement. The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any or all of its k nearest minority class neighbors. Depending on the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. The heart of SMOTE is the construction of new minority class instances. The intuition behind the construction algorithm is simple. You already know that random over-sampling tends to cause overfitting and that, because of the repeated instances, the decision boundary gets tightened. What if you could generate similar samples instead of repeating them? “It has been shown that to a machine learning algorithm, these newly constructed instances are not exact copies, and thus it softens the decision boundary and thereby helping the algorithm to approximate the hypothesis more accurately.”
Advantages of this approach:
- It alleviates the overfitting caused by random over-sampling, as synthetic examples are generated rather than instances being replicated.
- No loss of information.
- It’s simple to implement and interpret.
Disadvantages:
- While generating synthetic examples, SMOTE does not take into consideration that neighboring examples can be from other classes. This can increase the overlapping of classes and can introduce additional noise.
- SMOTE is not very practical for high dimensional data.
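The interpolation idea behind SMOTE can be sketched with the standard library alone. This is a simplified illustration, not the reference algorithm: it picks the base instance at random for each synthetic sample instead of looping over every minority instance, and the function name `smote_sketch` and its parameters are assumptions.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic points between minority instances.

    Each synthetic point lies on the line segment joining a randomly
    chosen minority instance and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than points

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]

        # k nearest minority neighbors of `base` (Euclidean distance),
        # excluding the base instance itself.
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(neighbors)]

        # Random position along the segment from base to neighbor.
        gap = rng.random()
        synthetic.append(
            [b + gap * (n - b) for b, n in zip(base, neighbor)]
        )
    return synthetic


# Four minority points on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_synthetic=10, k=2)
print(len(new_points))
```

Because only minority instances are consulted, the sketch shares the first disadvantage listed above: it has no way of knowing whether a synthetic point lands in a region dominated by another class.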
All in all, you have seen what imbalanced data is, the kinds of problems it can create when designing and developing machine learning models, and why it is so crucial to tackle it. Knowing these approaches to handling imbalanced data can help you work with your datasets effectively. Processing imbalanced data is an active area of research, and it can open up new research problems for you to consider.