Beyond Supervised ML: which features are really important? Trial reveals surprising patterns
Updated: Feb 17, 2019
Trial and error reveal surprising patterns of success—and failure—using different BL and Supervised ML models.
When we consider ML (Machine Learning), the ranking of variables by importance is, or at least seems to be, of the utmost importance. What follows is a comparative analysis that seeks to address this issue.
In examining classification learning (such as Random Forest, Logistic Regression, or Naive-Bayes) we hope to find prediction accuracy. We also look at the features these models use, to find out which features are most important and how they rank. When we obtain a features importance ranking, we hope that it matches our intuition. For example, when we examine models predicting income, we anticipate that education and location will be important variables.
In this analysis, I demonstrate that the often neglected assumption of “features independence” invalidates the variable importance rankings generated with Naive-Bayes models and with analogous Supervised Machine Learning models like Logistic Regression. This holds for any problem or dataset that violates the features independence assumption, which probably includes most real-life problems.
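For reference, the “features independence” assumption can be written as the standard Naive-Bayes factorization. The notation is mine and is not taken from the original analysis: y is the target (here “Response”) and x_1, ..., x_n are the features.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Each feature contributes its own factor. When two features carry largely the same information, that evidence is counted twice, which is one way a ranking derived from such a model can become distorted.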
I compared Bayesian Learning (BL) and a series of Supervised Machine Learning (ML) models: Logistic Regression, original Random Forest, and Random Forest featuring 3 types of unbalanced data optimization (Up-sampling, Down-sampling, and SMOTE).
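As a rough illustration of the up-sampling and down-sampling mentioned above, here is a minimal Python sketch using only the standard library. It is not the code used in this analysis. The function name `resample` and the toy data are hypothetical, and SMOTE is not covered because it synthesizes new rows rather than duplicating existing ones.

```python
import random

def resample(rows, label_of, mode="up", seed=0):
    """Balance a dataset by random up-sampling or down-sampling of classes."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    sizes = [len(members) for members in by_class.values()]
    # Up-sampling grows every class to the largest size,
    # down-sampling shrinks every class to the smallest size.
    target = max(sizes) if mode == "up" else min(sizes)
    balanced = []
    for members in by_class.values():
        if len(members) >= target:
            # Already large enough: draw `target` rows without replacement.
            balanced.extend(rng.sample(members, target))
        else:
            # Too small: keep every original row and add random duplicates.
            extra = [rng.choice(members) for _ in range(target - len(members))]
            balanced.extend(members + extra)
    return balanced

# Hypothetical toy data: 8 "No" responses and 2 "Yes" responses.
toy = [("No", i) for i in range(8)] + [("Yes", i) for i in range(2)]
up = resample(toy, label_of=lambda r: r[0], mode="up")
down = resample(toy, label_of=lambda r: r[0], mode="down")
print(len(up), len(down))
```

In practice, resampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.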
Both BL and ML demonstrate the importance of variables in addition to classification. However, BL is also able to infer causes. This phenomenon is called Bayesian Inference.
This, to me, suggests that we should determine features importance based on Bayesian Learning rather than Machine Learning. I agree that Bayesian Learning is more cumbersome: very often it is unlikely to generate a single best model, and its graphical models often need further improvement based on human expert knowledge. Dealing with BL graph non-uniqueness will be a topic for another blog, though.
Data Used in My Analysis
In order to conduct my analysis, I used publicly available email marketing campaign data from IBM. This represents a typical Business to Business (B2B) model. It contains information about clients who were contacted about purchasing auto insurance policies. In this case, the only parameter that the email campaign manager can change manually is WHEN the client is contacted: should they be contacted right before the existing policy expires, or further in advance?
Supervised ML relies on the Naive-Bayes “features independence” assumption, which in turn yields results suggesting that the date of contact is important. However, when the same dataset is examined with BL, it shows the opposite. I will demonstrate with two experiments why I believe that BL, which showed that the date of contact was unimportant, provided the more valuable result.
In the first experiment, I removed from the original dataset the features that Supervised ML designated as “important” but BL denoted as “non-important”. I identified some 6 such features by looking at the BL graphical model and selecting features that were either independent of “Response” or dependent but located far from “Response”. The classification prediction on this decimated data didn’t change. That alone suggests that what ML designated as important was not really consequential.
I conducted a second experiment in which I removed the features designated as “important” by BL and “unimportant” by ML. “Important” by BL would be the features directly linked or close to “Response”. Removing these features resulted in a dramatic failure in classification accuracy, which indicates that those features designated as important by BL really WERE important.
While Supervised ML provides a ranking for features importance, that importance can be close to meaningless when the dataset doesn’t agree with the features independence assumption. BL graphical model based ranking is more accurate.
Bayesian network learning strives to find the graph that optimizes a quantitative score, for example by maximizing the likelihood of the data, often with a penalty on graph complexity. Finding the optimal graph is an NP-hard problem, so various heuristic search strategies are deployed which, in general, produce a solution that is non-unique.
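To make the idea of a graph score concrete, one commonly used score for discrete networks is BIC: the maximized log-likelihood minus a complexity penalty. This post does not say which score was used, so take the formula below as an illustration only, with notation that is my own assumption.

```latex
\mathrm{BIC}(G \mid D) =
  \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i}
    N_{ijk} \log \frac{N_{ijk}}{N_{ij}}
  \;-\; \frac{\log N}{2} \sum_{i=1}^{n} q_i \,(r_i - 1)
```

Here N is the number of rows, r_i is the number of states of variable i, q_i is the number of configurations of its parents in G, N_{ijk} counts the rows where variable i is in state k while its parents are in configuration j, and N_{ij} = \sum_k N_{ijk}. Different graphs can reach the same score, which is one source of the non-uniqueness mentioned above.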
Take a look at the graph on the left. Note that many of the recovered dependencies actually make a great deal of sense, intuitively. For example, a person’s location (city, suburb, rural) should be linked to education, income, vehicle size, and total claim amount. We can also note that some of the links don’t seem to make sense intuitively. For example, the relationship between gender and location doesn’t seem to make much sense at first glance. On the other hand, the link between gender and income (via location) does. It is common practice to update BL graphical models “manually”; in this case I would suggest linking gender directly to Location’s other children.
Bayesian Networks: Classification Accuracy
Classification is but one use of Bayesian Learning, which also serves for causal inference and for determining the independence of variables/features. While the primary purpose for creating a Bayesian graph is Bayesian Inference, we can also use a Bayesian graph to judge classification accuracy.
Below, you’ll see three classification accuracy graphs (from left to right: Bayesian, Naive-Bayes, Logistic Regression). Naturally, each was developed using exactly the same dataset. I needed to quantize the original dataset, which introduced a small quantization error.
The results are consistent and similar, which provides us with another glimpse into the variables independence assumption with regard to Naive-Bayes and Logistic Regression. That is, this assumption doesn’t particularly influence the prediction accuracy of classification! That is the main reason why it is so often dismissed and forgotten. However, the assumption is crucial when evaluating the importance of variables!
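The quantization step can be sketched as follows. The post does not say which discretization scheme was used, so this equal-width binning in Python is only one plausible choice. The helper `equal_width_bins` and the income values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in one bin.
        return [0 for _ in values]
    # min(...) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical incomes, discretized into 3 levels.
incomes = [0, 12000, 25000, 40000, 61000, 99000]
levels = equal_width_bins(incomes, 3)
print(levels)  # [0, 0, 0, 1, 1, 2]
```

The quantization error comes from treating every value inside a bin as the same level, so fewer bins mean a coarser approximation.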
Features: Importance as designated by ML vs BL
Here is the list of the 20 most important features according to ML Random Forest ("over") learning:
i6.SCCall Center 12.46
Newly Decimated Dataset 1:
I removed the 14 variables ranked as most important by ML, taken from the top of its variables importance list.
Regarding BL, notice that some of them were independent of the target (outside of the Markov blanket), such as "i1.EDnDL". Some were distant from the target even though they were within the Markov blanket, such as "i1.MoPolInc", "i3.MoLaClm", "i3.NoP".
Here is a list of the dismissed features based on ML:
c("i1.TCA" , "i1.CLfV", "i1.Inc" , "i1.EDnDL" , "i1.MoPolInc" , "i4.ROT" , "i3.MpPreAu" , "i3.MoLaClm" , "i3.ES" , "i3.NoP" , "Location.Code" , "i6.MS" , "i6.Ed" , "i6.SC" )
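For readers unfamiliar with the term, the Markov blanket of a node in a Bayesian network consists of its parents, its children, and the other parents of its children. The Python sketch below computes it for a small made-up graph. The node names and edges are hypothetical and are not the graph learned in this post.

```python
def markov_blanket(parents, target):
    """Parents, children, and the children's other parents of `target`.

    `parents` maps every node to the list of its parents in the DAG.
    """
    children = {node for node, ps in parents.items() if target in ps}
    blanket = set(parents.get(target, ())) | children
    for child in children:
        # Co-parents: other nodes that share a child with the target.
        blanket |= set(parents[child])
    blanket.discard(target)
    return blanket

# Hypothetical DAG written as node -> list of parents.
toy_dag = {
    "Location": [],
    "Income": ["Location"],
    "Education": ["Location"],
    "Response": ["Income"],
    "Renewal": ["Response", "Education"],
    "Gender": [],
}
blanket = markov_blanket(toy_dag, "Response")
print(sorted(blanket))  # ['Education', 'Income', 'Renewal']
```

In this toy graph, "Location" and "Gender" fall outside the blanket: once the blanket nodes are known, they add no further information about "Response".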
Newly Decimated Dataset 2:
In my second experiment, I removed the features closest to the target in the BL graph, including all those with 1st and 2nd order links.
Notice that some of those features were deemed unimportant by supervised Random Forest learning, such as "i6.VC", "i6.Cov", "i6.Ed", "i6.Gr", "VehcSz".
Here is the list of the dismissed features based on BL:
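Selecting features by their 1st and 2nd order links to the target amounts to a breadth-first search over the graph while ignoring edge direction. Here is a minimal Python sketch of that idea. The edge list is a made-up example and `hops_from` is a hypothetical helper, not part of any BL library.

```python
from collections import deque

def hops_from(edges, target):
    """Breadth-first distance from `target`, ignoring edge direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

# Hypothetical edges (parent, child); not the graph from this post.
toy_edges = [
    ("Location", "Income"), ("Income", "Response"),
    ("Response", "Renewal"), ("Location", "Education"),
    ("Education", "VehicleSize"),
]
dist = hops_from(toy_edges, "Response")
close = sorted(n for n, d in dist.items() if 1 <= d <= 2)
print(close)  # ['Income', 'Location', 'Renewal']
```

Nodes that never appear in `dist` are disconnected from the target, which corresponds to the independent features in the first experiment.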
c("i1.TCA" , "i1.Inc" , "i3.ES" , "i6.MS" , "i4.ROT" , "i6.SC" , "Location.Code" ,
"i6.VC" , "i3.MpPreAu" , "i6.Cov" , "i1.CLfV" , "i6.Ed" , "i6.Gr" , "VehcSz" )
Final Experiment: Accuracy and Confidence Comparison
My plan was to decimate the original dataset and see how well the new decimated dataset performed with regard to classification. If the classification accuracy (using a variety of metrics) didn’t change, that would mean that the features removed were unimportant, because the classification learning either just ignored the missing data or easily “compensated” for the information loss from other features.
But when the classification accuracy (using a variety of metrics) failed, that means that the features removed were important, and the classification learning could not recover from the information loss.
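The logic of this experiment can be sketched in a few lines of Python. Everything here is a simplified stand-in: the “classifier” is a plain lookup table scored on its own training data, and the column names and rows are hypothetical. It shows only the remove-and-compare mechanics, not the models used in this post.

```python
from collections import Counter, defaultdict

def fit_predict_accuracy(rows, labels, keep):
    """Training accuracy of a lookup-table classifier using only `keep` columns."""
    table = defaultdict(Counter)
    for row, y in zip(rows, labels):
        table[tuple(row[c] for c in keep)][y] += 1
    # Predict the majority label seen for each combination of kept values.
    hits = sum(
        table[tuple(row[c] for c in keep)].most_common(1)[0][0] == y
        for row, y in zip(rows, labels)
    )
    return hits / len(rows)

# Hypothetical data: "coverage" determines the label, "contact_day" does not.
rows = [
    {"coverage": "basic", "contact_day": "early"},
    {"coverage": "basic", "contact_day": "late"},
    {"coverage": "premium", "contact_day": "early"},
    {"coverage": "premium", "contact_day": "late"},
]
labels = ["No", "No", "Yes", "Yes"]
all_cols = ["coverage", "contact_day"]

def ablate(drop):
    """Accuracy after removing the columns listed in `drop`."""
    keep = [c for c in all_cols if c not in drop]
    return fit_predict_accuracy(rows, labels, keep)

baseline = ablate([])
without_contact = ablate(["contact_day"])
without_coverage = ablate(["coverage"])
print(baseline, without_contact, without_coverage)  # 1.0 1.0 0.5
```

Dropping the uninformative column leaves accuracy unchanged, while dropping the informative one reduces it to chance. A real comparison would use held-out data and several metrics, as described above.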
You can see comparisons of all three datasets—original/all features; experiment 1; experiment 2—at the end of this post.
All too often, features independence is something we overlook when using the most popular Supervised ML methods: Logistic Regression, Naive-Bayes, Random Forest, and their variants. We overlook the features independence assumption because it doesn’t necessarily make a difference when it comes to prediction accuracy and confidence. We’re satisfied with the accuracy gains we get from boosting and data up-sampling techniques used with Naive-Bayes and Random Forest classification.
However, we shouldn’t ignore the features independence assumption when we’re examining the importance of features! I hope that this blog demonstrates why.
How to use these results in your work as a Data Scientist
ML, by adopting the assumption of features independence, essentially draws its recommendations on features importance from naive pairwise correlations with the target. BL, once we have an established graphical model, essentially reveals the true variable importance. Such importance is based either on ‘d-separation’ for independent features or on the distance from the target within the Markov blanket. Perhaps the main problem with using BL for such an evaluation is that there is hardly ever a unique graphical model; different graphical models can fit the data equally well. In this case I would suggest looking at a set of BL models. From my experience, for problems that are well defined by the existing variables and have no critically important latent variables, such graphical models will look surprisingly equivalent.
Comparison of Prediction
The top row shows the original dataset and classification learning.
The second row is the 1st decimated dataset, with features removed based on ML.
The third row is the 2nd decimated dataset, with features removed based on BL.
Prediction comparison with Random Forest
Left: Original dataset Center: Decimated-1 Right: Decimated-2
Here, you can see three plots of Random Forest with up-sampling applied to each dataset. Area Under Curve (AUC) shows the same quality of classification prediction for both the original and the decimated-1 dataset. The decimated-2 dataset, on the other hand, shows much poorer results, even with overfitting for the same number of Trees in Random Forest learning.
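As a reminder of what AUC measures, it can be computed directly as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The Python sketch below uses this pairwise definition on made-up labels and scores. It is an illustration, not the code behind the plots.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking, so a model whose AUC falls toward 0.5 after features are removed has lost most of its discriminating information.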
Confidence intervals with Calibration
Left: Original dataset Center: Decimated-1 Right: Decimated-2
There’s very little noticeable difference between the Left and Center graphics, aside from a slight loss of quality in the decimated dataset in the third bin, while the second bin actually looks closest to the main diagonal. On the Right, you can see that the confidence intervals are far worse, and much further removed from the main diagonal.
Prediction comparison with Logistic Regression
Left: Original dataset Center: Decimated-1 Right: Decimated-2
The decimated-2 dataset again shows much poorer results, with almost no gain in accuracy.