Permutation importance is a model-agnostic method for estimating feature importance in machine learning models. It works by shuffling (permuting) the values of one feature at a time and measuring the resulting decrease in model performance. A feature with a high permutation importance score is one that, when shuffled, causes the model's performance to drop significantly, which indicates that the model relies on it for its predictions. A feature whose shuffling barely changes performance contributes little.
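To make the shuffling procedure concrete, here is a minimal from-scratch sketch that uses only the Python standard library. The function name manual_permutation_importance, the toy data, and the threshold-based predict rule are illustrative assumptions and not part of any library; accuracy is assumed as the performance metric.

```python
import random


def manual_permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return the mean accuracy drop per feature when that feature is shuffled.

    Hypothetical helper for illustration: `predict` maps one row to a label,
    `X` is a list of rows (lists), and `y` is a list of true labels.
    """
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(row) == target for row, target in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (assumed): the label depends on feature 0 only
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def predict(row):
    # Stand-in "model" that looks only at feature 0
    return int(row[0] > 0.5)


scores = manual_permutation_importance(predict, X, y)
for j, score in enumerate(scores):
    print(f"Feature {j}: {score:.3f}")
```

Because the stand-in model ignores feature 1, shuffling that column cannot change any prediction, so its score is exactly zero, while shuffling feature 0 produces a large drop in accuracy.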
Here is an example of how to calculate permutation importance in Python using the scikit-learn library:
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Generate some data for classification
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_classes=2, random_state=42)
# Create a random forest classifier
clf = RandomForestClassifier(random_state=42)
# Train the classifier on the data
clf.fit(X, y)
# Calculate the permutation importance of each feature
result = permutation_importance(clf, X, y, n_repeats=10, random_state=42)
# Print the feature importance scores
for i, mean in enumerate(result.importances_mean):
    print(f"Feature {i}: {mean:.4f}")
In this example, we first generate some synthetic data for classification using the make_classification function. We then create a random forest classifier and train it on the data. Finally, we calculate the permutation importance of each feature using the permutation_importance function and print out the resulting feature importance scores.
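The permutation_importance function takes several arguments:
estimator (clf in the example): The fitted classifier or model for which to calculate the feature importance.
X: The input data on which importance is computed. The example reuses the training data; a held-out set is often preferable, because scores computed on training data can overstate features the model has overfit to.
y: The target labels corresponding to X.
n_repeats: The number of times to shuffle each feature and measure the resulting decrease in performance (the default is 5).
random_state: The random seed used for shuffling the data, which makes the results reproducible.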
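As a hedged variation on the call, the sketch below computes importance on a held-out split and names the metric explicitly through the scoring argument. The 25% test size and the choice of "accuracy" are assumptions made for illustration; when scoring is omitted, the estimator's default scorer is used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Same synthetic data as in the main example
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=42)

# Hold out a quarter of the data (split size is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Importance is measured on data the model has not seen during training
result = permutation_importance(
    clf, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=42
)

for i, mean in enumerate(result.importances_mean):
    print(f"Feature {i}: {mean:.4f}")
```

Scores from a held-out split reflect how much each feature helps the model generalize, rather than how much it was used to fit the training set.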
The function returns a Bunch, a dictionary-like object whose fields can also be accessed as attributes, with three fields:
importances: An array of shape (n_features, n_repeats) containing the raw permutation importance scores for each feature and repeat.
importances_mean: An array of shape (n_features,) containing the mean permutation importance score for each feature across all repeats.
importances_std: An array of shape (n_features,) containing the standard deviation of the permutation importance score for each feature across all repeats.
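The returned fields can be combined to rank features and to show how stable each score is across repeats. The following is a minimal sketch under the same setup as the example above; the ranking and the "+/-" output format are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=42)

# One row per feature, one column per repeat
print(result.importances.shape)

# importances_mean is the per-feature average of the raw scores
row_means = result.importances.mean(axis=1)
print(np.allclose(row_means, result.importances_mean))

# Rank features from most to least important
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    mean = result.importances_mean[i]
    std = result.importances_std[i]
    print(f"Feature {i}: {mean:.4f} +/- {std:.4f}")
```

A feature whose mean is small relative to its standard deviation is hard to distinguish from noise, so it is worth reading the two fields together.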