Model interpretability — Making your model confess: Feature Importance
Following the sequence of posts about model interpretability, it is time to talk about a different method to explain model predictions: Feature Importance, or more precisely, Permutation Feature Importance. It belongs to the family of model-agnostic methods, which, as explained before, are methods that don’t rely on any particularity of the model we want to interpret.
For information about the other methods for interpretability see: Model interpretability — Making your model confess: Shapley values
Feature Importance: Motivation
One of the most basic questions we might ask about a model is which features have the biggest impact on its predictions. This concept is called feature importance, and it is based on the idea that more important features have a bigger impact. But how can we tell how much impact a feature has on the prediction? To answer this, we have to look at the problem from another perspective: “if a feature is important, then when it is missing, the accuracy of the model will decrease”. This method is also called Mean Decrease Accuracy (MDA, Breiman (2001)).
“Permutation” Feature Importance
So far we have a way to know if a feature is important: look at the error that is introduced to the model when the feature is missing. But how can we evaluate a model without some of its features? Most models cannot natively handle missing data: they deal with floats and can’t operate on literal nulls. This is where the word “Permutation” in “Permutation Feature Importance” comes in. Instead of removing the feature, we shuffle its values, which breaks the relationship between the feature and the target. A feature is important if permuting its values increases the model error, because the model relied on the feature for the prediction. In the same way, a feature is unimportant if permuting its values leaves the model error unchanged, because the model ignored the feature for the prediction.
Spoiler alert: we will see later in this post that there are other ways to solve this problem, including “Impurity Feature Importance” and “Conditional Feature Importance”. Those two methods are not model-agnostic, since they only work with tree-based models. Keep reading.
Permutation Feature Importance basic algorithm:
For each column in the dataset:
- Shuffle the values in the column.
- Make predictions using the resulting dataset.
- Use these predictions and the true target values to calculate how much the loss function suffered from shuffling. That performance deterioration measures the importance of the variable you just shuffled.
- Undo the shuffle and return the data to the original order.
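The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation of any particular library: the function name permutation_importance_sketch, the toy data and the hand-written predict function are all assumptions made for the example, and each column is shuffled just once.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, metric, seed=0):
    """Return the drop in `metric` caused by shuffling each column of X once."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        # Working on a copy plays the role of "undo the shuffle".
        X_shuffled = X.copy()
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        importances[col] = baseline - metric(y, predict(X_shuffled))
    return importances

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Toy data: the label depends only on column 0; column 1 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

# Hypothetical "model" that thresholds column 0 and ignores column 1.
def predict(data):
    return (data[:, 0] > 0).astype(int)

importances = permutation_importance_sketch(predict, X, y, accuracy)
print(importances)
```

Because the toy model never looks at column 1, shuffling that column cannot change any prediction, while shuffling column 0 pushes the accuracy down to roughly chance level.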
Interpretation is pretty straightforward. Features with a high calculated value are the most important. The value shows how much the model performance decreased with a random shuffling (in this case, using “accuracy” as the performance metric). Interestingly enough, you can occasionally see negative values for importance. In those cases, the predictions on the shuffled (or noisy) data happened to be more accurate than the predictions on the real data. This happens when the feature didn’t matter and should have had an importance close to 0, but random chance caused the predictions on shuffled data to be more accurate.
Let’s see an example
We will continue working with the same dataset I introduced in my first post. This dataset comes from a linguistics use case where the objective is to understand the aspects that make a verb more likely to be used in the past conjugation or in the present conjugation when referring to an event in the past. We are training a model to predict in which conjugation the verb is more likely to be.
import eli5
from eli5.sklearn import PermutationImportance

# We are computing the importance using the test set. In this case
# X_test contains all the features for the test set and y_test
# contains the target column 'tense'
features = X_test.columns.tolist()

# rf is the model that was fitted
perm = PermutationImportance(rf, random_state=123).fit(X_test, y_test)

eli5.show_weights(perm, feature_names=features)
The generated output is as follows:
The previous table shows the feature importance of each of the columns. The weight column represents the importance of the feature measured as the MDA. The value after the ± sign reflects the standard deviation of that importance. It tries to measure the amount of randomness in our permutation importance calculation by repeating the process with multiple shuffles. The weight is therefore the mean decrease in performance across the multiple shuffles rather than the result of a single trial.
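If you want to look at the repeated shuffles yourself, scikit-learn offers a similar tool in sklearn.inspection.permutation_importance. The sketch below is not the eli5 call used in this post, and it runs on synthetic stand-in data rather than the verb tense dataset; the model settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the dataset used in this post).
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle every column 10 times and score each shuffle with accuracy.
result = permutation_importance(model, X_test, y_test, scoring="accuracy",
                                n_repeats=10, random_state=0)

# One row per feature, one column per shuffle.
print(result.importances.shape)
for i in range(X.shape[1]):
    mean = result.importances_mean[i]
    std = result.importances_std[i]
    print(f"feature {i}: {mean:.3f} +/- {std:.3f}")
```

The importances_mean and importances_std attributes are simply the mean and standard deviation of the per-shuffle results stored in importances, which is the same idea as the weight and ± columns described above.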
We can also get the same data in a data frame format as follows:
importances = eli5.explain_weights_df(perm, feature_names=features)
Let’s see the data in a graph:
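import seaborn as sns
import matplotlib.pyplot as plt
# Parts of this snippet were lost; the y, yerr and pointplot arguments below
# are reconstructed, assuming the data frame has 'weight' and 'std' columns
plt.errorbar(x=importances['feature'], y=importances['weight'],
             yerr=importances['std'], ecolor='r', capsize=8, fmt='none')
sns.pointplot(x='feature', y='weight', data=importances,
              dodge=True, join=False, ci=None)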
This graph is telling us the importance of each of the features when classifying something as present or past. According to it, main_v, which encodes the main verb of the sentence, is the most important feature. If you remember, when we used Shapley values, the most important feature was different. The difference comes down to the characteristics and assumptions of each method.
Let’s have a look at them:
- Importance is very easy to understand, since it relates directly to the error introduced by not paying attention to the feature.
- Automatically takes into account all interactions with other features, since permutations will also destroy the interaction effect. However, this also means that the degradation in performance due to losing such an interaction will be counted multiple times, once for each feature involved. If A interacts with B, then the loss will be measured both in A and in B (not necessarily a problem).
- The calculated feature importance is tied to the error of the model itself. This is not exactly bad if your model doesn’t show signs of overfitting since the model variance and the feature importance will correlate well. However, the calculated weights may start to diverge when your model evolves or is retrained.
- You need access to the actual outcome target. If someone only gives you the model and unlabeled data (not the ground truth), you can’t compute the permutation feature importance.
- The permutation feature importance measure depends on shuffling the feature, which adds randomness and, when features are correlated, the permutation feature importance measure can be biased by unrealistic data instances.
- Average values in the importance measure may not be easy to interpret: If a feature has medium permutation importance, that could mean it has a large effect for a few predictions, but no effect in general, or a medium effect for all predictions.
Impurity Feature Importance
The permutation importance follows the rationale that a random permutation is supposed to mimic the absence of the feature from the model. That method relies on the Mean Decrease Accuracy (MDA). The alternative importance measure is the one used in random forests, the impurity importance, which is based on the principle of impurity reduction, or Mean Decrease Impurity (MDI), that is followed in most traditional classification tree algorithms. This method has the advantage of not requiring uncorrelated predictors.
However, there are a couple of problems with this approach:
- Impurity-based importance is biased towards high cardinality features (Strobl C et al (2007), Bias in Random Forest Variable Importance Measures)
- It is only applicable to tree-based algorithms.
Impurity Feature Importance is implemented in the sklearn package. There is an example on their site.
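As a quick illustration, a fitted random forest in scikit-learn exposes its impurity-based importances through the feature_importances_ attribute. The data below is synthetic and the hyperparameters are arbitrary; treat it as a sketch rather than a recommended setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the two informative features come first.
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based (MDI) importances, normalized so that they sum to 1.
mdi = forest.feature_importances_
for i, value in enumerate(mdi):
    print(f"feature {i}: {value:.3f}")
```

Note that these values come from the training process itself, so no labeled test set and no shuffling are involved.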
Conditional Permutation Feature Importance
Conditional Permutation Feature Importance tries to marry the best of the two previous methods: provide a reliable measure of variable importance while not assuming uncorrelated predictors. To meet this aim, Strobl et al. (2008) suggest a conditional permutation scheme, where each feature is permuted only within groups of observations in which the rest of the features (the ones conditioned on) have similar values, in order to preserve the correlation structure between the feature and the other predictor variables.
Although conditioning is straightforward whenever the variables to be conditioned on are categorical, conditioning on continuous variables is challenging, since they need to be discretized into ranges of values of reasonable size. Random forests provide a way to achieve this, since they mimic this behavior in the way the trees are constructed. However (there is always a however), this method is very compute intensive because of the number of trees you have to grow.
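To get a feeling for what “permuting within groups” does, here is a deliberately simplified sketch. It conditions on a single continuous feature that is discretized into quantile bins, which is an assumption made for the example and not the tree-based partitioning described above; the function name conditional_permutation and the toy data are hypothetical.

```python
import numpy as np

def conditional_permutation(x, z, n_bins=5, seed=0):
    """Shuffle x only among rows whose z value falls in the same quantile bin."""
    rng = np.random.default_rng(seed)
    # Discretize the continuous conditioning feature into quantile bins.
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(z, edges)
    x_perm = x.copy()
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        x_perm[idx] = rng.permutation(x[idx])
    return x_perm

rng = np.random.default_rng(1)
z = rng.normal(size=2000)
x = z + 0.3 * rng.normal(size=2000)  # x is strongly correlated with z

x_marginal = rng.permutation(x)                # ordinary shuffle
x_conditional = conditional_permutation(x, z)  # shuffle within bins of z

corr_original = np.corrcoef(x, z)[0, 1]
corr_marginal = np.corrcoef(x_marginal, z)[0, 1]
corr_conditional = np.corrcoef(x_conditional, z)[0, 1]
print(f"original:    {corr_original:.2f}")
print(f"marginal:    {corr_marginal:.2f}")
print(f"conditional: {corr_conditional:.2f}")
```

The ordinary shuffle wipes out the correlation between the two features, while the within-bin shuffle keeps most of it, so the shuffled rows look much more like realistic data instances.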