Hard-Voting and Soft-Voting Classification Ensembles: An Introduction
by Qiuyue Wang
with Greg Page
Background: Classifying the Quality of Red Wine
This article aims to introduce the reader to two important machine learning methodologies: the Hard-Voting Classification Ensemble and the Soft-Voting Classification Ensemble. To illustrate these concepts, we modeled numeric wine quality ratings as a classification problem, using three different algorithms to create models. Then, we pooled those models together into the two aforementioned types of ensembles.
The wine quality dataset, which can be found at the University of California-Irvine Machine Learning Repository, contains data regarding the physicochemical properties of the Portuguese “Vinho Verde” red and white wines.
To focus our analysis, we only used the red wine data, which includes 1599 rows and 12 columns. We used 11 of these columns as input variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The 12th column, quality, is a discrete numerical rating given to each wine.
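For readers following along at home, one way to pull in the data is shown below. The file lives in the UCI repository as a semicolon-delimited CSV (the URL reflects the repository's layout at the time of writing):

```python
import pandas as pd

# The red wine file in the UCI repository is semicolon-delimited
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(url, sep=";")

print(wine.shape)                      # (1599, 12)
print(wine["quality"].value_counts())  # ratings cluster heavily at 5 and 6
```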
The table and barplot below offer some important insights about the distribution of the quality variable. We can see here that the vast majority of these ratings landed near the middle of the distribution, with 5 and 6 being the predominant scores.
To prepare this data for classification modeling, we binned all the wines into two groups, labeling all wines with quality ratings at or above the median as “good,” and all those below the median as “bad.” To run the classification models in scikit-learn, we re-coded the good and bad classes as 1 and 0, respectively. After checking for NaN values and ensuring that none of our inputs were highly correlated with one another, we partitioned the data into training and test sets and moved into the modeling phase.
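A minimal sketch of those preparation steps, continuing from the data frame loaded above (the partition ratio and random seed below are our own assumptions, not the article's stated choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Median split: quality at or above the median -> "good" (1); below -> "bad" (0)
median_quality = wine["quality"].median()
wine["good"] = np.where(wine["quality"] >= median_quality, 1, 0)

X = wine.drop(columns=["quality", "good"])
y = wine["good"]

# A 60/40 split with a fixed seed for reproducibility (our assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)
```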
Three Separate Classification Approaches
First, we fit a logistic regression model, using the LogisticRegression module from scikit-learn. That model’s classification report, shown below, indicates an accuracy of 74 percent against the test set, with very similar values for the other model metrics.
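A sketch of this step, assuming the partition above (the max_iter setting is our own addition, included so the default solver converges on this data):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

log_clf = LogisticRegression(max_iter=1000)  # raised from the default of 100
log_clf.fit(X_train, y_train)

print(classification_report(y_test, log_clf.predict(X_test)))
```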
Next, we used scikit-learn’s GaussianNB module to fit a Gaussian Naive Bayes model to the same data. The resulting model showed slightly better performance than the logistic regression model against the test set (74.375% vs. 73.90625%). However, when rounded to two decimal places of precision, as shown in the screenshot below, the classification report from our second model looks almost identical to our first one. The Gaussian Naive Bayes model eked out a small improvement in specificity (recall for the 0 class in the table below indicates specificity).
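The corresponding sketch for this step; a basic GaussianNB fit needs no tuning parameters:

```python
from sklearn.naive_bayes import GaussianNB

nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

print(classification_report(y_test, nb_clf.predict(X_test)))
```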
Third, we used scikit-learn’s RandomForestClassifier module to generate a random forest model for our data. A random forest is an ensemble of many individual tree models. For each split within such a model, a tree is only allowed to choose from among a limited set of randomly selected variables (hence the “random” in “random forest”). The final random forest model aggregates the predictions made by each of its many trees.
After performing a bit of hyperparameter tuning to prevent tree overgrowth, we built our model, and then ran it against our test data. The random forest outperformed both of the first two models, notching an accuracy rate of 76.72%. Its full classification report is shown below.
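The article does not list the exact hyperparameters it settled on, so the values below are illustrative stand-ins for the kind of settings that rein in tree overgrowth:

```python
from sklearn.ensemble import RandomForestClassifier

# max_depth and min_samples_leaf cap how far each tree can grow;
# these particular values are illustrative, not the article's own
rnd_clf = RandomForestClassifier(
    n_estimators=100, max_depth=6, min_samples_leaf=5, random_state=42)
rnd_clf.fit(X_train, y_train)

print(classification_report(y_test, rnd_clf.predict(X_test)))
```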
In our next step, we imported scikit-learn’s VotingClassifier module, to see how a combination of the methods shown above would impact model accuracy.
The Hard Voting Classifier
A Hard Voting Classifier (HVC) is an ensemble method, which means that it uses multiple individual models to make its predictions. First, each individual model makes its prediction, which is then counted as one “vote” in a running tally. The ensemble assigns a record to an outcome class based on the majority of votes for that record among the models.
To illustrate this with a toy example, suppose that we are classifying records as belonging to either the “0” or “1” class in a binary outcome scenario, using an ensemble built from three separate component models. The table below demonstrates the way this process would work for each of the first five observations in the dataset — note how a simple majority among the categorical outcomes predicted by Models A, B, and C determines the category selected in the final column.
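As a complement to that table, here is a minimal sketch of the counting logic with hypothetical votes (Observation 1 receives 0, 1, and 0, matching the example we revisit later in this article):

```python
import numpy as np

# Hypothetical votes from Models A, B, and C for five observations
votes = np.array([
    [0, 1, 0],   # Observation 1: two votes for 0 -> ensemble says 0
    [1, 1, 0],   # Observation 2: two votes for 1 -> ensemble says 1
    [1, 1, 1],   # Observation 3: unanimous 1
    [0, 0, 1],   # Observation 4: two votes for 0 -> ensemble says 0
    [0, 1, 1],   # Observation 5: two votes for 1 -> ensemble says 1
])

# With three voters, class 1 wins whenever it collects at least two votes
hard_votes = (votes.sum(axis=1) >= 2).astype(int)
print(hard_votes)  # [0 1 1 0 1]
```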
The code cell shown below this paragraph demonstrates the process for implementing a hard voting classifier in scikit-learn. Since a random forest model is already an ensemble, the voting_clf object created here could be called a meta-ensemble, or an “ensemble of ensembles.”
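A sketch consistent with that cell, reusing the three fitted models from above (the estimator labels are our own):

```python
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("nb", nb_clf), ("rf", rnd_clf)],
    voting="hard")  # each model casts one categorical vote per record
voting_clf.fit(X_train, y_train)

print(classification_report(y_test, voting_clf.predict(X_test)))
```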
From this comparison, we can see that the HVC outperformed the logistic regression and Gaussian Naive Bayes models. However, at 75.47%, the HVC’s accuracy score came up a bit short compared with the random forest.
The Soft Voting Classifier
A Soft Voting Classifier (SVC) relies on probabilistic outcome values generated by classification algorithms. The SVC averages each component model’s predicted probability of “1” class membership for a record, and then assigns a classification outcome prediction based on that average.
With classification modeling, we often focus mainly on the predicted categorical outcome, but most scikit-learn classification modules also generate specific probabilities of membership in either outcome class — these probabilities can be accessed via the predict_proba() method. From the screenshot shown below, for instance, we can see that the Gaussian Naive Bayes model predicts an 85% probability that the first record in the test set will belong to the positive class, a 27% probability that the second record will belong to the positive class, and so on.
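A call like the one below retrieves those positive-class probabilities; the exact values (85%, 27%, and so on) come from the article's own run and will vary with the partition:

```python
# Each row of predict_proba() holds [P(class 0), P(class 1)] for one record;
# the second column is the probability of membership in the positive class
proba = nb_clf.predict_proba(X_test)
print(proba[:5, 1])  # positive-class probabilities for the first five records
```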
To demonstrate the way that a soft voting ensemble works, we will return to our toy example with three component models. Here, we will assume that the classification threshold is 0.50 — any record whose average probability of “1” class membership is 0.50 or greater will be assigned by the SVC to the positive outcome class.
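Continuing the toy example, the arithmetic for a single record might look like this (the three probabilities are hypothetical, chosen so that Observation 1 plays out the way described in the next paragraph):

```python
import numpy as np

# Hypothetical P(class 1) values from Models A, B, and C for Observation 1:
# two models lean toward 0, but Model B is emphatic about 1
probs = np.array([0.42, 0.97, 0.18])

avg = probs.mean()            # (0.42 + 0.97 + 0.18) / 3 = 0.5233...
soft_vote = int(avg >= 0.50)  # 1, because the average clears the threshold
print(avg, soft_vote)
```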
By comparing the ways in which the HVC and SVC classified Observation 1, we can see an important distinction between these two approaches — in effect, vote-weighting. An SVC enables a particularly strong individual prediction to significantly influence the ensemble’s overall prediction.
The HVC classified Observation 1 as a “0” based on the simple majority vote among the three component models (0, 1, and 0, respectively). When we drill down to probabilistic predictions with the SVC, however, the high probability assigned by Model B brought the overall arithmetic mean above the threshold. The prediction for this record flipped from 0 to 1.
Setting up an SVC looks nearly identical to setting up the HVC, with one exception — the voting parameter’s input is changed from ‘hard’ to ‘soft’.
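A sketch, again reusing the fitted component models from above:

```python
soft_voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("nb", nb_clf), ("rf", rnd_clf)],
    voting="soft")  # average predicted probabilities instead of counting votes
soft_voting_clf.fit(X_train, y_train)

print(classification_report(y_test, soft_voting_clf.predict(X_test)))
```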
With an accuracy rate of 77.19%, the SVC emerged as the winner among all the classification approaches that we used here.
Conclusions
Ensemble modeling is a popular machine learning methodology that draws upon a principle sometimes known as the “wisdom of crowds.” Much like a homeowner who might seek out a second appraisal before listing her house on the market, or a patient who takes advice from more than one doctor, a data analyst may wish to draw upon different models in order to generate predictions.
In our experience with the wine quality data, we started with three standalone models, two of which were single models (logistic regression and Gaussian Naive Bayes), and one of which was already an ensemble (random forest). When using accuracy as the comparison benchmark, we found that the Soft Voting Classifier outperformed all of the standalone models.
Naturally, when people learn about ensembles, and especially when they achieve success using such methods for modeling, they sometimes begin to wonder, “If these are so effective, why would someone ever want to just use a single model?” As impressive as ensembles can be, there are times when a single model is the more appropriate choice.
A single model is preferable for situations in which a particular methodology is uniquely capable of explaining the data. For example, suppose we are modeling a very reliable, highly linear relationship between a group of input variables and a continuous numeric outcome. In this case, an ordinary least-squares regression model will likely be the perfect tool for the job. Adding extra models could be like putting unnecessary ingredients into an already-great dinner recipe — it won’t add value, and could even be a net-negative.
We should also stop to consider the simplicity advantage that comes with using a single model.
For any modeling type that generates coefficient values (like linear or logistic regression), it becomes easy for the modeler to explain the input-outcome relationship to an audience in a straightforward, clear way. Standing up in a business meeting and saying, “Our model predicts that a 27% increase in consumer rebate promotions this month will reduce next quarter’s expected churn by 50%” is much more direct than a deep-dive into the finer quirks of a multi-part ensemble.
Yet another important consideration with ensembles is the correlation among their predictions. In general, larger differences in the processes used by the individual ensemble components lead to more robust predictions. To return to the medical patient analogy for a moment, it is not likely that the patient would solicit that second opinion from a doctor who shares an office with the first one, lives in the same city, and went to the same medical school.
We hope you have enjoyed this overview and description of two particular ensemble types. As you might suspect, there are far more ensemble methodologies beyond the ones mentioned here. Experience tends to be the best teacher, so we encourage you to get out there and explore the world of ensemble modeling with your next project.
The author is a Master’s Degree student in Applied Business Analytics. She will graduate in Fall 2020. Her co-author is a Senior Lecturer in Applied Business Analytics at Boston University.