Classification Model Evaluation

In recent years, “Machine Learning” has become a buzzword used to describe all sorts of modeling. When studying Machine Learning, it is important to get acquainted with the different categories under that umbrella term. First, Machine Learning can be broken down into two broad categories, “Supervised Learning” and “Unsupervised Learning”. I will focus only on Supervised Learning in this article. Supervised Learning is a class of machine learning that “learns” a task from labeled training data. The labels are what separate Supervised Learning from Unsupervised Learning: Unsupervised Learning models interpret data using only the input data. Supervised Learning models can in turn be broken down into two categories, “Classification” and “Regression”. Today we will be focusing on Classification models.

Below is a simple image to help visualize this structure.

For the most part, Classification models are what we use to predict categorical data. Often this takes the form of a binary prediction, such as Yes or No, or Positive or Negative. This article will focus on how we can evaluate our Classification models. Many steps go into deploying a successful model: Preprocessing, EDA, Feature Engineering, Model Selection, Training, and then Evaluation.

Why Evaluate?

We need to evaluate our models so we can determine which is the most successful. There are many different Classification algorithms we could use for our model: Logistic Regression, K-Nearest Neighbors, and Decision Trees are just a few examples. Evaluation metrics can help us decide which algorithm is best for our predictions.

Evaluation metrics can also help us when tuning our parameters. Each model has many parameters that can be adjusted to give us more accurate predictions.

Last, evaluation metrics can help us decide which data features are important when building our model.

Confusion Matrix

While not exactly a metric, a Confusion Matrix is extremely important to understand when evaluating our classifier. Four terms are essential to understanding a confusion matrix.

  • True Positive: An outcome where the model correctly predicts the positive class.
  • True Negative: An outcome where the model correctly predicts the negative class.
  • False Positive: An outcome where the model falsely predicts a truly negative outcome to be positive. This is a Type I error.
  • False Negative: An outcome where the model falsely predicts a truly positive outcome to be negative. This is a Type II error.
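
The four outcomes above can be sketched as a small function. This is a minimal illustration, assuming labels are encoded as 1 for the positive class and 0 for the negative class. The name `outcome` is hypothetical and not part of any library.

```python
def outcome(actual, predicted):
    """Name the confusion-matrix cell for one prediction (1 = positive, 0 = negative)."""
    if predicted == 1:
        # The model said "positive": correct only if the sample really is positive.
        return "TP" if actual == 1 else "FP"  # FP is a Type I error
    # The model said "negative": correct only if the sample really is negative.
    return "TN" if actual == 0 else "FN"  # FN is a Type II error


# Four toy (actual, predicted) pairs, one for each outcome.
pairs = [(1, 1), (0, 0), (0, 1), (1, 0)]
labels = [outcome(actual, predicted) for actual, predicted in pairs]
print(labels)
```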

Below is a simple example of a confusion matrix.

This confusion matrix depicts the results of a model predicting whether an image is a “Cat” or a “Non-Cat”. We can tell from the confusion matrix that the data contains 85 images of a Cat and 138 images of a Non-Cat.

In Python, scikit-learn provides functions for many of the metrics we will need. “References” has a link to its documentation.

Below is an example of a scikit-learn computed confusion matrix for our Cat Predictor.

In the confusion matrix above:

  • True Positives (a Cat is correctly predicted): 63
  • True Negatives (a Non-Cat is correctly predicted): 120
  • False Positives (the model predicts Cat but the image is truly a Non-Cat): 18
  • False Negatives (the model predicts Non-Cat but the image is truly a Cat): 22
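
To make the layout concrete, the sketch below rebuilds label lists from the four counts listed above and tallies them into a 2x2 matrix. It assumes Cat is encoded as 1 and Non-Cat as 0, and it puts actual classes in rows and predicted classes in columns, which is the layout scikit-learn's `confusion_matrix` uses. The variable names are illustrative.

```python
# Counts from the Cat predictor's confusion matrix.
TP, TN, FP, FN = 63, 120, 18, 22

# Rebuild (actual, predicted) label lists that reproduce those counts.
# Assumption: 1 = Cat (positive class), 0 = Non-Cat (negative class).
y_test = [1] * TP + [0] * TN + [0] * FP + [1] * FN
y_pred_class = [1] * TP + [0] * TN + [1] * FP + [0] * FN

# Tally into a 2x2 matrix: rows are actual classes, columns are predicted classes.
matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(y_test, y_pred_class):
    matrix[actual][predicted] += 1

print(matrix)  # laid out as [[TN, FP], [FN, TP]]

# Row sums recover how many images of each class the data contains.
actual_non_cats = sum(matrix[0])
actual_cats = sum(matrix[1])
print(actual_cats, actual_non_cats)
```

The row sums are where the 85 Cat images and 138 Non-Cat images come from.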

It is important to always remember that with Classification models, context matters. For example, if 80% of your target variable is the positive class and your model is correct 81% of the time, it is doing barely better than a model that predicts positive every time.
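
One way to check this on the Cat example is to compare the model's accuracy with a baseline that always predicts the most common class. This is a minimal sketch using the counts from the confusion matrix above. The baseline comparison is added here for illustration.

```python
TP, TN, FP, FN = 63, 120, 18, 22
total = TP + TN + FP + FN

actual_cats = TP + FN        # images that are truly Cat
actual_non_cats = TN + FP    # images that are truly Non-Cat

# Accuracy of the trained model.
model_accuracy = (TP + TN) / total

# Accuracy of a "model" that ignores the image and always predicts
# whichever class is more common in the data.
baseline_accuracy = max(actual_cats, actual_non_cats) / total

print(f"model:    {model_accuracy:.1%}")
print(f"baseline: {baseline_accuracy:.1%}")
```

Here the model clears the always-Non-Cat baseline by roughly 20 percentage points, so its accuracy reflects more than the class balance.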

Evaluation Metrics

Note: These steps come after a Train/Test split, scaling, and running your model.

Classification Accuracy: The simplest evaluation metric; it shows how often your classifier is correct. To calculate it, we add up our correct predictions and divide by the total number of predictions. Note: This is not a good metric when dealing with highly unbalanced data.

from sklearn import metrics
print((TP + TN) / (TP + TN + FP + FN))  # from the confusion-matrix counts
print(metrics.accuracy_score(y_test, y_pred_class))  # same value from the labels
Cat Example Answer
82% Accuracy

Precision: Measures how precise our classifier is when predicting positives. Precision is calculated by dividing True Positives by the sum of True Positives and False Positives. Referring to the example above, precision tells us how often a picture the model labels as a cat really is a cat.

print(TP / (TP + FP))  # from the confusion-matrix counts
print(metrics.precision_score(y_test, y_pred_class))  # same value from the labels
Cat Example Answer
78% Precision

Recall: The fraction of samples from a class that are correctly predicted by the model. In other words, the proportion of actual positives identified correctly.

print(TP / (TP + FN))  # from the confusion-matrix counts
print(metrics.recall_score(y_test, y_pred_class))  # same value from the labels
Cat Example Answer
74% Recall

Specificity: When the actual value is negative, how often is the prediction correct?

print(TN / (TN + FP))  # from the confusion-matrix counts
Cat Example Answer
87% Specificity

False Positive Rate: When the actual value is negative, how often is the prediction incorrect?

print(FP / (TN + FP))  # from the confusion-matrix counts
Cat Example Answer
13% False Positive Rate
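
Pulling the formulas so far together, the sketch below computes each metric directly from the four Cat example counts. It uses only plain Python. With scikit-learn, `metrics.accuracy_score`, `metrics.precision_score` and `metrics.recall_score` applied to `y_test` and `y_pred_class` would be expected to give the same first three values. The dictionary keys are illustrative names.

```python
TP, TN, FP, FN = 63, 120, 18, 22

results = {
    "accuracy": (TP + TN) / (TP + TN + FP + FN),
    "precision": TP / (TP + FP),
    "recall": TP / (TP + FN),
    "specificity": TN / (TN + FP),
    "false_positive_rate": FP / (TN + FP),
}

for name, value in results.items():
    print(f"{name:>20}: {value:.3f}")
```

Specificity and False Positive Rate share the same denominator (all actual negatives), so the two always add up to 1.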

F1 Score: Depending on the application, you may want to give higher priority to recall or to precision. But there are many applications in which both are important, so it is natural to look for a way to combine the two into a single metric. The F1 score is the harmonic mean of precision and recall. A perfect F1 score would be 1 (perfect precision and recall) and the worst would be 0. It can be calculated as F1 = 2 * Precision * Recall / (Precision + Recall).
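
For the Cat example, the F1 score can be worked out from the same counts. The sketch below also prints the plain average of precision and recall. That comparison is added here for illustration and shows that the harmonic mean never exceeds the plain average. With scikit-learn, `metrics.f1_score(y_test, y_pred_class)` would be expected to return the same F1 value.

```python
TP, FP, FN = 63, 18, 22

precision = TP / (TP + FP)
recall = TP / (TP + FN)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Plain arithmetic mean, for comparison only.
average = (precision + recall) / 2

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F1={f1:.3f} average={average:.3f}")
```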

Above is some helpful code utilizing scikit-learn to help you get a good understanding of your Classification Model. Note: We are trying to predict the positive class (1, or Cats) in the above example.

There are many different kinds of Classification metrics; below is a chart depicting most of them. It would be extremely beneficial to get familiar with these metrics to help guide your decision-making process when building a Classification Model.

Deciding on Classification Metrics

When deciding which classification metric is important to you, look at the question you are trying to answer. For example, if you are trying to predict whether a patient has Coronavirus, minimizing False Negatives would be more important than minimizing False Positives. This is because someone who is positive for Coronavirus walking around thinking they are negative is extremely dangerous, compared with someone who thinks they are positive but is truly negative.

Another example would be stock traders. If someone is trying to predict whether a stock will go up tomorrow, then the most important outcome would be an accurate True Positive.
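
As a rough illustration of how the question drives the choice of metric, the sketch below compares two made-up classifiers on the same data. The counts for `model_a` and `model_b` are hypothetical and chosen only to show that the preferred model changes with the metric. Using recall for the screening case and precision for the trading case is one reasonable reading of the two examples above, not the only one.

```python
# Hypothetical confusion-matrix counts for two classifiers on the same
# 200 samples (100 actual positives, 100 actual negatives).
models = {
    "model_a": {"TP": 95, "FN": 5, "FP": 40, "TN": 60},  # catches most positives
    "model_b": {"TP": 70, "FN": 30, "FP": 5, "TN": 95},  # rarely raises false alarms
}


def recall(counts):
    # Share of actual positives that were found; high recall means few False Negatives.
    return counts["TP"] / (counts["TP"] + counts["FN"])


def precision(counts):
    # Share of positive predictions that were right; high precision means few False Positives.
    return counts["TP"] / (counts["TP"] + counts["FP"])


# Screening for a disease: missing a true positive is the costly mistake.
best_for_screening = max(models, key=lambda name: recall(models[name]))

# Acting on "the stock will go up": a wrong positive call is the costly mistake.
best_for_trading = max(models, key=lambda name: precision(models[name]))

print(best_for_screening, best_for_trading)
```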
