Extensive EDA and Model Explainability using SHAP🔎
Getting insights from a black box machine learning model
Hi Readers,
In this publication, I have shared my approach to tackling Kaggle’s Tabular May 2021 competition. The main focus of this analysis is not to score very well but to gain insights from the data about the behavior of each feature and their impact on the class prediction. I have used SHAP to get deeper insights into this data.
The Kaggle competition link can be found here.
It’s a multiclass classification problem with Logloss as the primary metric.
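For reference, log loss heavily penalizes confident but wrong probability estimates. Here is a minimal sketch of how the metric is computed with scikit-learn, using made-up toy values (not competition data):
# Toy illustration of the competition metric (multiclass log loss).
# The labels and probabilities below are made-up values, not competition data.
import numpy as np
from sklearn.metrics import log_loss
y_true = [1, 2, 3, 4]                        # true class labels
y_prob = np.array([[0.7, 0.1, 0.1, 0.1],     # predicted probability per class
                   [0.1, 0.7, 0.1, 0.1],
                   [0.1, 0.1, 0.7, 0.1],
                   [0.1, 0.1, 0.1, 0.7]])
print(log_loss(y_true, y_prob, labels=[1, 2, 3, 4]))  # ~0.357, i.e. -mean(log(0.7))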
Please clap and follow if you like it!
Table of Contents
- Importing Libraries
- Reading the data files
- Overview
- Exploratory Data Analysis (EDA)
- Scaling
- Correlation Check
- Outlier Treatment
- Modeling
- Model Explainability using SHAP
Importing Libraries
#Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import shap
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import log_loss
from statistics import mean
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 100)
sns.set_palette("coolwarm_r", 4)
Reading the data files
#Reading the data files
train = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')
Overview
print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

Shape of train data: (100000, 52)
Missing values count: 0

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 52 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 100000 non-null int64
1 feature_0 100000 non-null int64
2 feature_1 100000 non-null int64
3 feature_2 100000 non-null int64
4 feature_3 100000 non-null int64
5 feature_4 100000 non-null int64
6 feature_5 100000 non-null int64
7 feature_6 100000 non-null int64
8 feature_7 100000 non-null int64
9 feature_8 100000 non-null int64
10 feature_9 100000 non-null int64
11 feature_10 100000 non-null int64
12 feature_11 100000 non-null int64
13 feature_12 100000 non-null int64
14 feature_13 100000 non-null int64
15 feature_14 100000 non-null int64
16 feature_15 100000 non-null int64
17 feature_16 100000 non-null int64
18 feature_17 100000 non-null int64
19 feature_18 100000 non-null int64
20 feature_19 100000 non-null int64
21 feature_20 100000 non-null int64
22 feature_21 100000 non-null int64
23 feature_22 100000 non-null int64
24 feature_23 100000 non-null int64
25 feature_24 100000 non-null int64
26 feature_25 100000 non-null int64
27 feature_26 100000 non-null int64
28 feature_27 100000 non-null int64
29 feature_28 100000 non-null int64
30 feature_29 100000 non-null int64
31 feature_30 100000 non-null int64
32 feature_31 100000 non-null int64
33 feature_32 100000 non-null int64
34 feature_33 100000 non-null int64
35 feature_34 100000 non-null int64
36 feature_35 100000 non-null int64
37 feature_36 100000 non-null int64
38 feature_37 100000 non-null int64
39 feature_38 100000 non-null int64
40 feature_39 100000 non-null int64
41 feature_40 100000 non-null int64
42 feature_41 100000 non-null int64
43 feature_42 100000 non-null int64
44 feature_43 100000 non-null int64
45 feature_44 100000 non-null int64
46 feature_45 100000 non-null int64
47 feature_46 100000 non-null int64
48 feature_47 100000 non-null int64
49 feature_48 100000 non-null int64
50 feature_49 100000 non-null int64
51 target 100000 non-null object
dtypes: int64(51), object(1)
memory usage: 39.7+ MB
- Training data has 100000 records and 50 features.
- Column ‘id’ is the primary key.
- It’s a multiclass classification problem and ‘target’ is our target variable.
- All the features are numerical in this data.
- There are no missing values in the data.
- The features appear to be discrete, since their cardinality is low relative to the number of records (verified in the sketch below).
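A quick per-feature cardinality check backs up the last point; a small sketch (not in the original notebook):
# Sanity check: distinct values per feature.
# Low counts relative to 100000 rows indicate discrete features.
feature_cols = [c for c in train.columns if c.startswith('feature_')]
print(train[feature_cols].nunique().sort_values(ascending=False).head(10))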
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')

Shape of test data: (50000, 51)
Missing values count: 0
- Test data has 50000 records and 50 features.
Exploratory Data Analysis (EDA)
# Checking the distribution of each feature
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']
cat_columns = train.select_dtypes(include=['object']).columns

train.describe().T.style.bar(subset=['mean'], color='royalblue')\
    .background_gradient(subset=['std'], cmap='coolwarm_r')\
    .background_gradient(subset=['50%'], cmap='coolwarm_r')\
    .background_gradient(subset=['min'], cmap='coolwarm_r')\
    .background_gradient(subset=['max'], cmap='coolwarm_r')
Observations
- Most features have a median of 0, i.e., more than half of their values are 0.
- Only feature_14 and feature_38 have non-zero values at the 50th percentile.
- Only a handful of features have negative values. It will be interesting to see their importance in prediction.
Analyzing Target Feature
#Checking the distribution of target variable
target3 = train['target'].value_counts()['Class_4']
target2 = train['target'].value_counts()['Class_3']
target1 = train['target'].value_counts()['Class_2']
target0 = train['target'].value_counts()['Class_1']
target3per = target3 / train.shape[0] * 100
target2per = target2 / train.shape[0] * 100
target1per = target1 / train.shape[0] * 100
target0per = target0 / train.shape[0] * 100
print('{} of {} records have target 1 and it is the {:.2f}% of the training set.'.format(target0, train.shape[0], target0per))
print('{} of {} records have target 2 and it is the {:.2f}% of the training set.'.format(target1, train.shape[0], target1per))
print('{} of {} records have target 3 and it is the {:.2f}% of the training set.'.format(target2, train.shape[0], target2per))
print('{} of {} records have target 4 and it is the {:.2f}% of the training set.\n'.format(target3, train.shape[0], target3per))
plt.figure(figsize=(8,6))
sns.countplot(train['target'], palette = 'coolwarm_r', order = sorted(train['target'].unique()))
plt.xlabel('Target', size=12, labelpad=15)
plt.ylabel('Count', size=12, labelpad=15)
plt.xticks((0, 1, 2, 3), ['1 ({0:.2f}%)'.format(target0per), '2 ({0:.2f}%)'.format(target1per), '3 ({0:.2f}%)'.format(target2per), '4 ({0:.2f}%)'.format(target3per)])
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)
plt.title('Training Set Target Distribution', size=15, y=1.05)
plt.show()

8490 of 100000 records have target 1 and it is the 8.49% of the training set.
57497 of 100000 records have target 2 and it is the 57.50% of the training set.
21420 of 100000 records have target 3 and it is the 21.42% of the training set.
12593 of 100000 records have target 4 and it is the 12.59% of the training set.
Observations
- The distribution of the classes is imbalanced.
- More than 50% of the records belong to Class_2.
- The smallest class is Class_1, with only 8.5% of the records.
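One way to deal with this imbalance (not used in this notebook) would be to pass balanced class weights to the model; a sketch using scikit-learn's compute_class_weight:
# Sketch: balanced class weights as a possible counter to the imbalance.
# Shown for reference only; not applied in the rest of this notebook.
from sklearn.utils.class_weight import compute_class_weight
classes = np.unique(train['target'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=train['target'])
print(dict(zip(classes, weights.round(2))))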
# Label Encoding the classes
train.loc[train['target'] == 'Class_1', 'target'] = '1'
train.loc[train['target'] == 'Class_2', 'target'] = '2'
train.loc[train['target'] == 'Class_3', 'target'] = '3'
train.loc[train['target'] == 'Class_4', 'target'] = '4'
train['target'] = train['target'].astype(int)
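As a side note, the same encoding can be done in one line by stripping the 'Class_' prefix; a sketch of this alternative (use it instead of, not in addition to, the block above):
# Alternative one-step encoding: strip the 'Class_' prefix and cast to int
train['target'] = train['target'].str.replace('Class_', '').astype(int)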
Continuous Features
len(num_columns)

50
All of the 50 features are numerical in this data.
# Checking the distribution of continuous features
i = 1
fig, ax = plt.subplots(10, 5, figsize=(40,30))
for feature in tqdm(num_columns):
    plt.subplot(10, 5, i)
    sns.kdeplot(data = train, y = feature)
    plt.xlabel(f'{feature} - Skew: {round(train[feature].skew(), 2)}', size=20)
    i += 1
fig.tight_layout()
plt.show()
Observations
- We can see a big peak at the value 0 in every feature.
- The features are sparse, much like one-hot encoded columns.
- There is skewness in all the features, but let's not treat it since the values are discrete rather than continuous.
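For a quick numeric view of that skewness instead of reading it off the plots, a small sketch:
# Rank features by skewness to see which are the most asymmetric
skewness = train[num_columns].skew().sort_values(ascending=False)
print(skewness.head(5))  # most right-skewed features
print(skewness.tail(5))  # least skewed features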
# Checking how the top values of each feature split across the target classes
i = 1
fig, ax = plt.subplots(10, 5, figsize=(50,30))
for feature in tqdm(num_columns):
    plt.subplot(10, 5, i)
    sns.countplot(data = train, x = feature, order = train[feature].value_counts()[:4].index, hue = 'target', palette = 'coolwarm_r')
    plt.xlabel(feature, size=25)
    plt.legend(loc='upper right', prop={'size': 15})
    i += 1
fig.tight_layout()
plt.show()
Observations
- We cannot get much insight here, since within each top value the classes appear in roughly the same proportion as in the overall target distribution.
- Clearly 0 alone won’t help the model in classification.
- It will be interesting to see if the model is able to pick any other value apart from 0 which can help in classification.
Analyzing Zeros
Since 0 covers most of the cell values in this data, let’s check if there is an interesting pattern with zeros in this data.
zero_data = ((train.drop('target', axis = 1)==0).sum() / len(train) * 100)[::-1]
fig, ax = plt.subplots(1,1,figsize=(10, 19))
ax.barh(zero_data.index, 100, color='lightgrey', height=0.6)
barh = ax.barh(zero_data.index, zero_data, height=0.6, color='royalblue')
ax.bar_label(barh, fmt='%.01f %%')
ax.spines[['left', 'bottom', 'right']].set_visible(False)
ax.set_xticks([])
ax.set_title('# of Zeros (by feature)', loc='center', fontweight='bold', fontsize=15)
plt.show()
Observations
- Some features have more than 90% zero values; these very sparse features won't help the model much (the sketch below shows how to list them).
- Features 14, 15, 27, and 38 look most promising since they have a mix of negative, positive, and zero values.
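If we wanted to act on the first point, a small sketch could list the ultra-sparse features as removal candidates (we keep all features in this notebook):
# Sketch: flag features that are more than 90% zeros.
# They are kept in this notebook; listed here for reference only.
zero_frac = (train.drop('target', axis = 1) == 0).mean()
ultra_sparse = zero_frac[zero_frac > 0.90].index.tolist()
print(len(ultra_sparse), 'ultra-sparse features:', ultra_sparse)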
Correlation Check
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']
cat_columns = train.select_dtypes(include=['object']).columns

#Let's check how the features are inter-related to each other and with the target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(60,60))
ax.set_title("Correlation Matrix", fontsize=30)
corr = train[num_columns + ['target']].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm_r',
            cbar_kws={"shrink": .8}, vmin=0, vmax=1)
ax.tick_params(axis='x', labelsize=20, labelrotation=90)
ax.tick_params(axis='y', labelsize=20, labelrotation=0)
plt.show()
Observations
- None of the features show any meaningful linear correlation with each other or with the target variable.
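Since linear correlation can miss nonlinear or sparse relationships, one sanity check (not part of the original pipeline) is mutual information between each feature and the target; a sketch:
# Sketch: mutual information as a nonlinearity-aware dependence measure
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(train[num_columns], train['target'], discrete_features=True, random_state=0)
print(pd.Series(mi, index=num_columns).sort_values(ascending=False).head(10))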
Scaling
Some features have large maximum values. Let's bring everything onto a standard scale.
#Scaling the data using StandardScaler
#Note: the scaler is fit on the training data only and then applied to the test data,
#so both sets share the same transformation.
scaler = StandardScaler()
train[num_columns] = scaler.fit_transform(train[num_columns])
test[num_columns] = scaler.transform(test[num_columns])
Outlier Treatment
# Outlier Analysis
iqr_factor = [3]  # add 1.5 here to also check the milder threshold
list1, list2 = [], []
for factor in iqr_factor:
    count = 0
    print(f'Outliers for {factor} IQR :')
    print('-------------------------------------')
    for col in num_columns:
        # IQR-based limits for flagging outliers
        IQR = train[col].quantile(0.75) - train[col].quantile(0.25)
        lower_lim = train[col].quantile(0.25) - factor*IQR
        upper_lim = train[col].quantile(0.75) + factor*IQR
        cond = train[(train[col] < lower_lim) | (train[col] > upper_lim)].shape[0]
        if cond > 0 and factor == 1.5:
            list1.append(train[(train[col] < lower_lim) | (train[col] > upper_lim)].index.tolist())
        elif cond > 0 and factor == 3:
            list2.append(train[(train[col] < lower_lim) | (train[col] > upper_lim)].index.tolist())
        if cond > 0: print(f'{col:<30} : ', cond); count += cond
    print(f'\nTOTAL OUTLIERS FOR {factor} IQR : {count}')
    print('')

Outliers for 3 IQR :
-------------------------------------
feature_0 : 11527
feature_1 : 10986
feature_2 : 6508
feature_3 : 21893
feature_4 : 11619
feature_5 : 9753
feature_6 : 19125
feature_7 : 8701
feature_8 : 19954
feature_9 : 9138
feature_10 : 13610
feature_11 : 21618
feature_12 : 22586
feature_13 : 5088
feature_14 : 4662
feature_15 : 2486
feature_16 : 14319
feature_17 : 24222
feature_18 : 24991
feature_19 : 9751
feature_20 : 11818
feature_21 : 20728
feature_22 : 12850
feature_23 : 16979
feature_24 : 6148
feature_25 : 18774
feature_26 : 11839
feature_27 : 8759
feature_28 : 961
feature_29 : 8649
feature_30 : 13579
feature_31 : 9770
feature_32 : 9583
feature_33 : 15587
feature_34 : 6773
feature_35 : 24115
feature_36 : 8864
feature_37 : 19175
feature_38 : 2820
feature_39 : 12762
feature_40 : 4015
feature_41 : 19741
feature_42 : 16534
feature_43 : 15701
feature_44 : 6749
feature_45 : 12711
feature_46 : 13892
feature_47 : 14732
feature_48 : 5305
feature_49 : 19486
TOTAL OUTLIERS FOR 3 IQR : 641936
Observations
- The above table shows the number of outliers in each feature.
- But these are not actual outliers: since the data is very sparse, most non-zero values get flagged as outliers here.
- Let’s keep these outliers as they are since these are the ones that will help the model in classification.
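For completeness, if we did want to treat them, IQR-based clipping (winsorizing) would be a typical option; a sketch of what that could look like (deliberately not applied here):
# Sketch: IQR-based clipping, NOT applied in this notebook because the
# flagged values are the informative ones.
def clip_iqr(df, cols, factor=3):
    df = df.copy()
    for col in cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - factor * iqr, q3 + factor * iqr)
    return df
# Hypothetical usage: train_clipped = clip_iqr(train, num_columns)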
Modeling
Let’s try different ML models and see which performs best.
train = train.reset_index(drop = True)

# Storing the target variable separately
X_train = train.drop('target', axis = 1)
X_test = test
y_train = train['target']
print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))

X_train shape: (100000, 50)
y_train shape: (100000,)
X_test shape: (50000, 50)
#Stratified K fold Cross Validation
def train_and_validate(model, N):
    scores = []
    # Extract the model's class name from its repr for logging
    regex = r'^[^(]+'
    match = re.findall(regex, str(model))
    print(f'Running {N} Fold CV with {match[0]} Model.')
    preds = np.zeros((test.shape[0], 4))
    importances = pd.DataFrame(np.zeros((X_train.shape[1], N)),
                               columns=['Fold_{}'.format(i) for i in range(1, N + 1)],
                               index=train.drop('target', axis = 1).columns)
    skf = StratifiedKFold(n_splits=N, random_state=N, shuffle=True)
    for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), 1):
        print('Fold {}\n'.format(fold))
        # Fitting the model
        model.fit(X_train.iloc[trn_idx], y_train[trn_idx])
        # Computing Train logloss score
        trn_logloss_score = log_loss(y_train[trn_idx], model.predict_proba(X_train.iloc[trn_idx]))
        # Computing Validation logloss score
        val_logloss_score = log_loss(y_train[val_idx], model.predict_proba(X_train.iloc[val_idx]))
        scores.append((trn_logloss_score, val_logloss_score))
        # Averaging the test predictions across folds
        preds += model.predict_proba(X_test) / skf.n_splits
        importances.iloc[:, fold - 1] = model.feature_importances_
        print(scores[-1])
    trlogloss = mean([i[0] for i in scores])
    cvlogloss = mean([i[1] for i in scores])
    print(f'Average Training logloss: {trlogloss}, Average CV logloss: {cvlogloss}')
    print("*" * 40)
    print("\n")
    return trlogloss, cvlogloss, importances, preds, model
#Testing multiple ML models using stratified K fold CV
df_row = []
N = 3
for i in [DecisionTreeClassifier(),
          LGBMClassifier(),
          RandomForestClassifier(n_estimators = 10, max_depth = 10)]:
    trlogloss, cvlogloss, importances, preds, model = train_and_validate(i, N)
    regex = r'^[^(]+'
    match = re.findall(regex, str(i))
    df_row.append([match[0], trlogloss, cvlogloss])

df = pd.DataFrame(df_row, columns = ['Model', f'{N} Fold Training logloss', f'{N} Fold CV logloss'])
df
Observations
- The LGBM model scored the lowest log loss.
- However, the gap between training and CV log loss is smallest for Random Forest, making it the best performing model here in that respect.
- Random Forest generalizes well on this data and does not overfit much.
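That said, one way to narrow LGBM's train/CV gap while keeping its lower log loss would be to regularize it; a hypothetical configuration (these parameter values are illustrative guesses, not tuned results):
# Sketch: a more regularized LGBM to reduce overfitting (illustrative, untuned values)
lgbm_reg = LGBMClassifier(n_estimators=300, learning_rate=0.05,
                          num_leaves=31,         # smaller trees generalize better
                          min_child_samples=50,  # require more samples per leaf
                          subsample=0.8, colsample_bytree=0.8,  # row/feature subsampling
                          reg_alpha=1.0, reg_lambda=1.0,        # L1/L2 regularization
                          random_state=0)
# Hypothetical usage:
# trlogloss, cvlogloss, importances, preds, _ = train_and_validate(lgbm_reg, N)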
#Plotting the RandomForest importances
importances['Mean_Importance'] = importances.mean(axis=1)
importances.sort_values(by='Mean_Importance', inplace=True, ascending=False)
plt.figure(figsize=(8,8))
sns.barplot(x='Mean_Importance', y=importances.head(15).index, data=importances.head(15), palette = 'coolwarm_r')
plt.xlabel('')
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.title('Top 15 features', size=10)
plt.show()
Observations
- As expected from the EDA, features 38 and 14 appear among the most important features, since they behaved differently from all the other features in our data.
- It’s interesting to see features 2, 15, and 6 appearing at the top of the importance list. Let’s explore more about these features using SHAP.
Let’s try making a submission with the RandomForest model and see the performance on the leaderboard.
#Creating the submission with Random Forest Model
model = RandomForestClassifier(n_estimators = 10, max_depth = 10)
trlogloss, cvlogloss, importances, preds, _ = train_and_validate(model, 5)
sample.iloc[:, 1:] = preds
sample.to_csv('submission.csv', index = False)
We got a log loss of 1.104 on the leaderboard after submitting the above CSV. Let's check if the LGBM model gets a better score.
#Creating the submission with LGBM Model
model = LGBMClassifier()
trlogloss, cvlogloss, importances, preds, _ = train_and_validate(model, 5)
sample.iloc[:, 1:] = preds
sample.to_csv('submission.csv', index = False)
Great! LGBM scored a log loss of 1.088 which is an improvement over the Random Forest Model.
Let’s try to understand more about the behavior of the features using SHAP.
Model Explainability using SHAP
#Fitting the SHAP on our model and training data
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

#Plotting the SHAP summary. (Note: Class 0 in the analysis below corresponds to Class_1 in the data, and so on.)
shap.summary_plot(shap_values, X_train, color=plt.get_cmap("tab20c"))
Observations
- Features 31, 24, and 16 have a major impact in predicting class 3.
- Features 15, 11, 2, 1, and 33 help most in detecting class 2.
- Features 37, 25, and 28 have the greatest impact in deciding class 0.
- Features 6, 37, 28, and 34 have a major impact in predicting class 1.
shap.summary_plot(shap_values[0], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)
Observations
- The red dots clustered at the center indicate that zero values have almost no impact on predicting class 1.
- Features 6, 15, and 41 are positively correlated with class 1.
- Features 25 and 19 show a slight negative correlation with class 1.
- Features 17 and 1 have the least impact on class 1.
shap.summary_plot(shap_values[1], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)
Observations
- The red dots clustered at the center indicate that zero values have almost no impact on predicting class 2.
- Features 19, 35, 29, 14, and 28 show a positive correlation with class 2.
- Features 15, 6, 10, 42, and 30 show a negative correlation with class 2.
- Feature 2 has the least impact on class 2.
shap.summary_plot(shap_values[2], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)
Observations
- The red dots are not entirely centered, indicating that zeros have some impact on class 3.
- Features 43, 14, and 42 show a positive correlation with class 3.
- Features 15, 38, 11, and 0 show a negative correlation with class 3.
- Feature 32 has the least impact on class 3.
shap.summary_plot(shap_values[3], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)
Observations
- Zero values in features 31, 14, and 24 have a bit of impact on class 4.
- No feature shows a strong positive correlation with class 4.
- Features 31, 14, 24, 16, 23, and 7 show a negative correlation with class 4.
- Features 22, 17, 32, and 9 have the least impact on class 4.
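To drill further into a single feature, a SHAP dependence plot shows how a feature's value maps to its SHAP value for one class; a sketch for feature_15 against class index 1 (Class_2):
# Sketch: SHAP dependence plot for one feature and one class (index 1 -> Class_2)
shap.dependence_plot('feature_15', shap_values[1], X_train, interaction_index=None)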
Most useless features in this data
As per the above analysis, we can safely conclude that feature_32, feature_17, and feature_1 are the least informative features for predicting any class in this data.
The End!
Thank you for reading this publication. I hope you have learned something new today. Kindly share feedback if you find any flaws or have a better approach.
At last, please clap this publication if you liked it! Thanks in advance.
Links:
Kaggle Kernel link.
Kaggle Profile link.
LinkedIn Profile Link.