F1 Score: Capturing the Tension Between Precision and Recall

by Qiuyue Wang
with Greg Page

Background: A Predictive Model for Hotel Cancellations

This article’s purpose is to introduce readers to the concepts of precision, recall (also called sensitivity), and F1 score. To illustrate these concepts, we will use a classification model that predicts hotel room cancellations.

The dataset used to build this model, hotel-bookings.csv, can be found on Kaggle. In its original format, it includes 119,390 observations and 32 variables. The outcome variable is_canceled indicates whether a particular booking was canceled prior to the time of a guest’s scheduled stay. Among all the bookings in the dataset, approximately 37.55% were canceled, whereas 62.45% were not.

During the data cleaning and Exploratory Data Analysis (EDA) phase of this project, we removed several variables due to factors such as redundancy and lack of association with the target variable. We also performed some other small adjustments to the data, such as converting values labeled “Undefined” to “NA”, and removing some observations with very rare factor variable levels. In other cases, we collapsed the levels of factor variables to make them easier to handle. To prepare our continuous numeric variables for the naive Bayes algorithm, we binned them each into four groups, separated by quartile values.
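The quartile binning described above can be sketched in a few lines. The project itself was done in R and its code is not reproduced here, so this Python version is only an illustration: the lead_time values are hypothetical, and the treatment of values that fall exactly on a quartile boundary is an assumption.

```python
import bisect
import statistics

def quartile_bins(values):
    """Assign each value to one of four bins (0-3), split at the quartiles."""
    # With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
    cuts = statistics.quantiles(values, n=4)
    # bisect_left counts the cut points strictly below the value, so a value
    # equal to a cut point lands in the lower bin (an assumed convention).
    return [bisect.bisect_left(cuts, v) for v in values]

# Hypothetical lead times in days, not taken from the actual dataset.
lead_time = [0, 3, 7, 14, 30, 45, 60, 90, 120, 200, 300, 400]
bins = quartile_bins(lead_time)
print(bins)
```

Each numeric variable then behaves like a four-level factor, which is the kind of categorical input the naive Bayes model in this article was given.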

After partitioning the data into a roughly 60/40 split between training and validation records, we generated the model in R, using the naiveBayes() function from the e1071 package.

Sliding the Cutoff: The Tension Between Precision and Recall

When we used this model to predict cancellations among records in the validation set, we attained the classification statistics shown below:

There are many model statistics shown here, but the two that are of particular interest to us for this model are Pos Pred Value (also known as precision) and sensitivity (also known as recall). Note that our positive outcome class is a “1” for is_canceled (indicating that the guest canceled the reservation).

A classification model’s precision rate tells how often the model was correct among all the instances in which it predicted that a record would belong to the positive outcome class. The 2x2 confusion matrix near the top of the screenshot above indicates that our model predicted a total of 12,807 cancellations (9729 + 3078). Since 9729 of those guests did in fact cancel, our precision rate here is 75.97%.

A classification model’s recall rate (also known as sensitivity rate) can be thought of as the “True Positive Rate.” To calculate a model’s sensitivity, we start with the total number of records that actually belong to the positive outcome class (true positives plus false negatives) and find the percentage of those records that the model assigned to the positive outcome class. The confusion matrix shows us that there were 16,874 actual cancellations (9729 + 7145). Since we identified 9729 of them, the model’s sensitivity rate is 57.66%.
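As a quick check on the arithmetic, both rates can be recomputed from the three confusion matrix counts quoted in the text. This is a minimal Python sketch; the function and variable names are illustrative rather than taken from the original R workflow.

```python
def precision(tp, fp):
    """Share of predicted positives that were actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that the model predicted as positive."""
    return tp / (tp + fn)

# Counts quoted in the text for the default 0.50 cutoff.
tp = 9729  # predicted cancellation, guest canceled
fp = 3078  # predicted cancellation, guest did not cancel
fn = 7145  # predicted no cancellation, guest canceled

print(f"precision: {precision(tp, fp):.4f}")  # 0.7597
print(f"recall:    {recall(tp, fn):.4f}")     # 0.5766
```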

Each of the two metrics described above can be impacted when a modeler adjusts the default cutoff used for classifying records as likely to belong to the positive class. The default cutoff used in nearly all classification systems is 0.50 — such a system assigns records to the positive outcome class when their probability of positive class membership is 50% or greater.

Let’s suppose that hotel management tells us to focus on increasing the precision of this model. Our busiest season of the year is coming up, and they know that every successfully predicted cancellation offers them a chance to overbook a particular type of room, and thereby optimize revenue. Using the code shown below, we make an adjustment: now, our model will only predict a cancellation when it assesses a probability greater than 75 percent that the guest will cancel:

As a result of this higher cutoff threshold for “1” class predictions, we should expect to see fewer total predicted cancellations; however, among the cases in which cancellations are predicted, we should expect a higher share of them to be correct, which is to say higher precision.
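To make the mechanics of a cutoff concrete, here is a minimal Python sketch of turning predicted probabilities into class labels. It is not the authors' R code: the probabilities are hypothetical, and treating a probability exactly equal to the cutoff as a "0" is an assumed convention.

```python
def classify(probabilities, cutoff=0.50):
    """Label a record 1 (cancel) when its probability exceeds the cutoff."""
    # Strict inequality is an assumption; some tools use >= instead.
    return [1 if p > cutoff else 0 for p in probabilities]

# Hypothetical predicted cancellation probabilities for six bookings.
probs = [0.10, 0.35, 0.55, 0.62, 0.78, 0.91]

default_preds = classify(probs)              # default 0.50 cutoff
strict_preds = classify(probs, cutoff=0.75)  # stricter 0.75 cutoff

print(default_preds)  # [0, 0, 1, 1, 1, 1]
print(strict_preds)   # [0, 0, 0, 0, 1, 1]
```

The two bookings scored at 0.55 and 0.62 are predicted cancellations under the default cutoff but not under the stricter one, which is why the total number of predicted cancellations shrinks.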

Let’s take a look at the resulting impact to the model outcomes:

As expected, the higher cutoff value brought our precision rate considerably higher — now, it’s 89.44%. We only predicted 8295 total cancellations this time (a 35% drop from the previous total), and 7419 of those predictions proved correct. At first, the hotel management team is thrilled! Now, when we tell them that we think a prospective guest will cancel, they can overbook that particular room type without much worry, knowing that there’s a 9-in-10 chance that the person who booked it will not actually arrive.

However, this initial excitement turns a bit sour when they see how sensitivity was impacted by this new cutoff threshold. Moving the cutoff from 0.50 to 0.75 meant that far more actual cancellations slipped through the cracks. Our sensitivity rate of just 43.97% meant that there were more than twice as many actual cancellations as we had predicted (16,874 versus 8295).

As a result, several department heads complain about us: the culinary staff has had to prepare extra meals that were discarded, the housekeeping crew has spent many person-hours preparing linens that were never used, and the valet parking team has scheduled more crew hours than it needed. “Slide the cutoff way down now!” management demands. After some futile protestation, in which we warn about the likely impact to precision, we meekly agree to adjust the cutoff to 0.25:

Predictably, the lower cutoff threshold brought the sensitivity rate back up. Now, it’s much higher than it was after the first model iteration. However, that increased sensitivity has come at a cost: the precision rate has fallen to just 60%. While the support staff will appreciate our new sensitivity rate, the team that handles room overbookings grows increasingly frustrated. They plead with management to make us revert to the old cutoff value, and yes, you guessed it, the cycle begins anew.

Introducing the F1 Score as a Model Metric

As you saw in the previous section, adjustments to the cutoff value create a natural tension between precision and recall. A higher cutoff value means we are more conservative about predicting that someone will cancel a reservation — that makes us more likely to be correct when we do make such a call, but it also means that more actual cancellations are missed.
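The tension can be seen on even a tiny example. The Python sketch below sweeps three cutoffs over ten hypothetical scored records (invented for illustration, not drawn from the hotel data) and reports precision and recall at each one.

```python
def precision_recall(probs, actuals, cutoff):
    """Return (precision, recall) for the positive class at a given cutoff."""
    preds = [1 if p > cutoff else 0 for p in probs]
    tp = sum(1 for p, a in zip(preds, actuals) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(preds, actuals) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(preds, actuals) if p == 0 and a == 1)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# Hypothetical predicted probabilities and true outcomes (1 = canceled).
probs = [0.05, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80, 0.95]
actuals = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

for cutoff in (0.25, 0.50, 0.75):
    p, r = precision_recall(probs, actuals, cutoff)
    print(f"cutoff {cutoff:.2f}: precision = {p:.3f}, recall = {r:.3f}")
```

On this toy data, precision rises from 0.750 to 0.800 to 1.000 as the cutoff climbs, while recall falls from 1.000 to 0.667 to 0.333. Real data will not always move this cleanly, but the direction of the tradeoff is typical.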

The classification model metric known as the F1 Score captures the back-and-forth nature of this relationship. F1 Score is the harmonic mean of precision and recall. The formula for calculating F1 score is as follows:

2 * precision * recall / (precision + recall)

In our original version of the model, the F1 score was: 2 * .7597 * .5766 / (.7597 + .5766) = .6556. That model was considerably more balanced than our second one, whose F1 score was: 2 * .8944 * .4397 / (.8944 + .4397) = .5896.
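The same calculation can be wrapped in a small helper. This is a minimal Python sketch using the rounded precision and recall values quoted in the text; the function name is illustrative.

```python
def f1_score(precision, recall):
    """F1 as defined in the text: 2 * precision * recall / (precision + recall)."""
    return 2 * precision * recall / (precision + recall)

f1_default = f1_score(0.7597, 0.5766)  # cutoff 0.50
f1_strict = f1_score(0.8944, 0.4397)   # cutoff 0.75

print(round(f1_default, 4))  # 0.6556
print(round(f1_strict, 4))   # 0.5896
```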

You may see this formula and wonder, “Why not just use a simple arithmetic mean?”

The harmonic mean approach for calculating F1 penalizes a mismatched model in a way that an arithmetic mean does not. To demonstrate that difference with this model, an arithmetic mean of the first set of precision and recall values (.6682) would have been similar to the F1 score (.6556). For the model with precision of .8944 and sensitivity of .4397, a simple mean would have been almost identical: .6671. The F1 score for that model, however, was about 10% lower than the original one (.5896 versus .6556). In the third iteration of the model, the F1 score would have risen above the original, to .6713, as the precision and recall rates were .6004 and .7612, respectively.
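The comparison can be reproduced with Python's standard statistics module, whose harmonic_mean function gives the same result as the F1 formula when it is passed a precision and recall pair. The figures are the rounded rates quoted in the text; the dictionary labels are only illustrative.

```python
from statistics import harmonic_mean, mean

# (precision, recall) pairs quoted in the text for the three cutoffs.
iterations = {
    "cutoff 0.50": (0.7597, 0.5766),
    "cutoff 0.75": (0.8944, 0.4397),
    "cutoff 0.25": (0.6004, 0.7612),
}

results = {}
for label, (precision, recall) in iterations.items():
    arithmetic = mean([precision, recall])
    f1 = harmonic_mean([precision, recall])  # same as 2*p*r / (p + r)
    results[label] = (arithmetic, f1)
    print(f"{label}: arithmetic mean = {arithmetic:.4f}, F1 = {f1:.4f}")
```

The arithmetic means for the 0.50 and 0.75 cutoffs are nearly the same, while the F1 score drops noticeably for the more lopsided 0.75 model. The harmonic mean is never larger than the arithmetic mean, and the gap widens as the two rates move apart.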

So What’s a Good F1 Score?

Since every data model is unique, and especially because the purpose behind every data model is unique, there is no universal answer to this question. In fact, some modelers may not even wish to maximize F1 score at all, as precision and recall may not be equally desirable in every scenario.

For example, imagine that we are telemarketers selling expensive vacation packages on behalf of a Caribbean resort. We earn a hefty commission with each package we sell, and the marginal cost to us for reaching out to a prospect by phone is minuscule. In this scenario, precision would be almost meaningless to us, as the cost of being wrong about labeling someone as a potential buyer is so low. Sensitivity, on the other hand, would mean everything: it would be catastrophic for us to let someone who actually wants a vacation package go uncontacted because we mislabeled him in the model.

In different instances, these priorities could reverse. If we were insurance fraud investigators working with a limited budget and limited resources, we would be unable to investigate the veracity of every claim. Given these limitations, precision would be of the utmost importance — when we label a claim as likely to be fraudulent (and therefore deserving of investigation), we would like to be very confident in that prediction. Even if this led us towards a high cutoff threshold that reduced our accuracy and our sensitivity, the emphasis on precision would be a wise business decision — it would limit our total caseload while also concentrating our resources on the cases that merited the most attention.

In most classification situations, however, a preference for either precision or recall won’t be so clear-cut. At the hotel, for example, too much emphasis on either metric is going to cause problems for someone. If the hotel model emphasizes precision, it will have an easier time managing overbookings, but a harder time with service staff provisioning; if the model tilts toward recall, those problems simply trade places.

Cutoff threshold adjustments sometimes remind us of an old Rolling Stones lyric: “You can’t always get what you want.” The F1 Score, which quantifies the nature of the precision and recall tradeoffs that arise in classification modeling, is a useful tool for expressing a model’s ability to balance these sometimes-competing aims.

The author is a Master’s Degree student in Applied Business Analytics at Boston University. She will graduate in Fall 2020. Her co-author is a Senior Lecturer in Applied Business Analytics at Boston University.