Extensive EDA and Model Explainability using SHAP🔎
Getting insights from a black box machine learning model
Hi Readers,
In this publication, I have shared my approach to tackling Kaggle’s Tabular May 2021 competition. The main focus of this analysis is not to score very well but to gain insights from the data about the behavior of each feature and their impact on the class prediction. I have used SHAP to get deeper insights into this data.
The Kaggle competition link can be found here.
It’s a multiclass classification problem with Logloss as the primary metric.
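For reference, log loss heavily penalizes confident but wrong probability estimates. Here is a minimal sketch of how the metric is computed with scikit-learn, using made-up toy values (not competition data):
# Toy illustration of the competition metric (multiclass log loss).
# The labels and probabilities below are made-up values, not competition data.
import numpy as np
from sklearn.metrics import log_loss
y_true = [1, 2, 3, 4]                        # true class labels
y_prob = np.array([[0.7, 0.1, 0.1, 0.1],     # predicted probability per class
                   [0.1, 0.7, 0.1, 0.1],
                   [0.1, 0.1, 0.7, 0.1],
                   [0.1, 0.1, 0.1, 0.7]])
print(log_loss(y_true, y_prob, labels=[1, 2, 3, 4]))  # ~0.357, i.e. -mean(log(0.7))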
Please clap and follow if you like it!
Table of Contents
- Importing Libraries
- Reading the data files
- Overview
- Exploratory Data Analysis (EDA)
- Scaling
- Correlation Check
- Outlier Treatment
- Modeling
- Model Explainability using SHAP
Importing Libraries
#Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import shap
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import log_loss
from statistics import mean
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 100)
sns.set_palette("coolwarm_r", 4)
Reading the data files
#Reading the data files
train = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')
Overview
print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

Shape of train data: (100000, 52)
Missing values count: 0

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 52 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 100000 non-null int64
1 feature_0 100000 non-null int64
2 feature_1 100000 non-null int64
3 feature_2 100000 non-null int64
4 feature_3 100000 non-null int64
5 feature_4 100000 non-null int64
6 feature_5 100000 non-null int64
7 feature_6 100000 non-null int64
8 feature_7 100000 non-null int64
9 feature_8 100000 non-null int64
10 feature_9 100000 non-null int64
11 feature_10 100000 non-null int64
12 feature_11 100000 non-null int64
13 feature_12 100000 non-null int64
14 feature_13 100000 non-null int64
15 feature_14 100000 non-null int64
16 feature_15 100000 non-null int64
17 feature_16 100000 non-null int64
18 feature_17 100000 non-null int64
19 feature_18 100000 non-null int64
20 feature_19 100000 non-null int64
21 feature_20 100000 non-null int64
22 feature_21 100000 non-null int64
23 feature_22 100000 non-null int64
24 feature_23 100000 non-null int64
25 feature_24 100000 non-null int64
26 feature_25 100000 non-null int64
27 feature_26 100000 non-null int64
28 feature_27 100000 non-null int64
29 feature_28 100000 non-null int64
30 feature_29 100000 non-null int64
31 feature_30 100000 non-null int64
32 feature_31 100000 non-null int64
33 feature_32 100000 non-null int64
34 feature_33 100000 non-null int64
35 feature_34 100000 non-null int64
36 feature_35 100000 non-null int64
37 feature_36 100000 non-null int64
38 feature_37 100000 non-null int64
39 feature_38 100000 non-null int64
40 feature_39 100000 non-null int64
41 feature_40 100000 non-null int64
42 feature_41 100000 non-null int64
43 feature_42 100000 non-null int64
44 feature_43 100000 non-null int64
45 feature_44 100000 non-null int64
46 feature_45 100000 non-null int64
47 feature_46 100000 non-null int64
48 feature_47 100000 non-null int64
49 feature_48 100000 non-null int64
50 feature_49 100000 non-null int64
51 target 100000 non-null object
dtypes: int64(51), object(1)
memory usage: 39.7+ MB
- Training data has 100000 records and 50 features.
- Column ‘id’ is the primary key.
- It’s a multiclass classification problem and ‘target’ is our target variable.
- All the features are numerical in this data.
- There are no missing values in the data.
- The features appear to be discrete, since their cardinality is low relative to the number of records (verified in the sketch below).
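A quick per-feature cardinality check backs up the last point; a small sketch (not in the original notebook):
# Sanity check: distinct values per feature.
# Low counts relative to 100000 rows indicate discrete features.
feature_cols = [c for c in train.columns if c.startswith('feature_')]
print(train[feature_cols].nunique().sort_values(ascending=False).head(10))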
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')

Shape of test data: (50000, 51)
Missing values count: 0
- Test data has 50000 records and 50 features.
Exploratory Data Analysis (EDA)
# Checking the distribution of each feature
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']
cat_columns = train.select_dtypes(include=['object']).columns

train.describe().T.style.bar(subset=['mean'], color='royalblue')\
    .background_gradient(subset=['std'], cmap='coolwarm_r')\
    .background_gradient(subset=['50%'], cmap='coolwarm_r')\
    .background_gradient(subset=['min'], cmap='coolwarm_r')\
    .background_gradient(subset=['max'], cmap='coolwarm_r')
Observations
- Most features have a median of 0, i.e., more than half of their values are 0.
- Only feature_14 and feature_38 have non-zero values at the 50th percentile.
- Only a handful of features have negative values. It will be interesting to see their importance in prediction.
Analyzing Target Feature
#Checking the distribution of target variable
target3 = train['target'].value_counts()['Class_4']
target2 = train['target'].value_counts()['Class_3']
target1 = train['target'].value_counts()['Class_2']
target0 = train['target'].value_counts()['Class_1']
target3per = target3 / train.shape[0] * 100
target2per = target2 / train.shape[0] * 100
target1per = target1 / train.shape[0] * 100
target0per = target0 / train.shape[0] * 100
print('{} of {} records have target 1 and it is the {:.2f}% of the training set.'.format(target0, train.shape[0], target0per))
print('{} of {} records have target 2 and it is the {:.2f}% of the training set.'.format(target1, train.shape[0], target1per))
print('{} of {} records have target 3 and it is the {:.2f}% of the training set.'.format(target2, train.shape[0], target2per))
print('{} of {} records have target 4 and it is the {:.2f}% of the training set.\n'.format(target3, train.shape[0], target3per))
plt.figure(figsize=(8,6))
sns.countplot(train['target'], palette = 'coolwarm_r', order = sorted(train['target'].unique()))
plt.xlabel('Target', size=12, labelpad=15)
plt.ylabel('Count', size=12, labelpad=15)
plt.xticks((0, 1, 2, 3), ['1 ({0:.2f}%)'.format(target0per), '2 ({0:.2f}%)'.format(target1per), '3 ({0:.2f}%)'.format(target2per), '4 ({0:.2f}%)'.format(target3per)])
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)
plt.title('Training Set Target Distribution', size=15, y=1.05)
plt.show()

8490 of 100000 records have target 1 and it is the 8.49% of the training set.
57497 of 100000 records have target 2 and it is the 57.50% of the training set.
21420 of 100000 records have target 3 and it is the 21.42% of the training set.
12593 of 100000 records have target 4 and it is the 12.59% of the training set.
Observations
- The distribution of the classes is imbalanced.
- More than 50% of the records belong to Class_2.
- The smallest class is Class_1, with only 8.5% of the records.
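One way to deal with this imbalance (not used in this notebook) would be to pass balanced class weights to the model; a sketch using scikit-learn's compute_class_weight:
# Sketch: balanced class weights as a possible counter to the imbalance.
# Shown for reference only; not applied in the rest of this notebook.
from sklearn.utils.class_weight import compute_class_weight
classes = np.unique(train['target'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=train['target'])
print(dict(zip(classes, weights.round(2))))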
# Label Encoding the classes
train.loc[train['target'] == 'Class_1', 'target'] = '1'
train.loc[train['target'] == 'Class_2', 'target'] = '2'
train.loc[train['target'] == 'Class_3', 'target'] = '3'
train.loc[train['target'] == 'Class_4', 'target'] = '4'
train['target'] = train['target'].astype(int)
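As a side note, the same encoding can be done in one line by stripping the 'Class_' prefix; a sketch of this alternative (use it instead of, not in addition to, the block above):
# Alternative one-step encoding: strip the 'Class_' prefix and cast to int
train['target'] = train['target'].str.replace('Class_', '').astype(int)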
Continuous Features
len(num_columns)

50
All of the 50 features are numerical in this data.
# Checking the distribution of continuous features
i = 1
fig, ax = plt.subplots(10, 5, figsize=(40,30))
for feature in tqdm(num_columns):
    plt.subplot(10, 5, i)
    sns.kdeplot(data = train, y = feature)
    plt.xlabel(f'{feature} - Skew: {round(train[feature].skew(), 2)}', size=20)
    i += 1
fig.tight_layout()
plt.show()
Observations
- We can see a big peak at the value 0 in every feature.
- The features are sparse, much like one-hot encoded columns.
- There is skewness in all the features, but let's not treat it since the values are discrete rather than continuous.
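For a quick numeric view of that skewness instead of reading it off the plots, a small sketch:
# Rank features by skewness to see which are the most asymmetric
skewness = train[num_columns].skew().sort_values(ascending=False)
print(skewness.head(5))  # most right-skewed features
print(skewness.tail(5))  # least skewed features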
# Checking how the top values of each feature split across the target classes
i = 1
fig, ax = plt.subplots(10, 5, figsize=(50,30))
for feature in tqdm(num_columns):
    plt.subplot(10, 5, i)
    sns.countplot(data = train, x = feature, order = train[feature].value_counts()[:4].index, hue = 'target', palette = 'coolwarm_r')
    plt.xlabel(feature, size=25)
    plt.legend(loc='upper right', prop={'size': 15})
    i += 1
fig.tight_layout()
plt.show()
Observations
- We cannot get much insight here, since within each top value the classes appear in roughly the same proportion as in the overall target distribution.
- Clearly 0 alone won’t help the model in classification.
- It will be interesting to see if the model is able to pick any other value apart from 0 which can help in classification.
Analyzing Zeros
Since 0 covers most of the cell values in this data, let’s check if there is an interesting pattern with zeros in this data.
zero_data = ((train.drop('target', axis = 1)==0).sum() / len(train) * 100)[::-1]
fig, ax = plt.subplots(1,1,figsize=(10, 19))
ax.barh(zero_data.index, 100, color='lightgrey', height=0.6)
barh = ax.barh(zero_data.index, zero_data, height=0.6, color='royalblue')
ax.bar_label(barh, fmt='%.01f %%')
ax.spines[['left', 'bottom', 'right']].set_visible(False)
ax.set_xticks([])
ax.set_title('# of Zeros (by feature)', loc='center', fontweight='bold', fontsize=15)
plt.show()
Observations
- Some features have more than 90% zero values; these very sparse features won't help the model much (the sketch below shows how to list them).
- Features 14, 15, 27, and 38 look most promising since they have a mix of negative, positive, and zero values.
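If we wanted to act on the first point, a small sketch could list the ultra-sparse features as removal candidates (we keep all features in this notebook):
# Sketch: flag features that are more than 90% zeros.
# They are kept in this notebook; listed here for reference only.
zero_frac = (train.drop('target', axis = 1) == 0).mean()
ultra_sparse = zero_frac[zero_frac > 0.90].index.tolist()
print(len(ultra_sparse), 'ultra-sparse features:', ultra_sparse)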
Correlation Check
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']
cat_columns = train.select_dtypes(include=['object']).columns

#Let's check how the features are inter-related to each other and with the target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(60,60))
ax.set_title("Correlation Matrix", fontsize=30)
corr = train[num_columns + ['target']].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm_r',
            cbar_kws={"shrink": .8}, vmin=0, vmax=1)
ax.tick_params(axis='x', labelsize=20, labelrotation=90)
ax.tick_params(axis='y', labelsize=20, labelrotation=0)
plt.show()
Observations
- None of the features show any meaningful linear correlation with each other or with the target variable.
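Since linear correlation can miss nonlinear or sparse relationships, one sanity check (not part of the original pipeline) is mutual information between each feature and the target; a sketch:
# Sketch: mutual information as a nonlinearity-aware dependence measure
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(train[num_columns], train['target'], discrete_features=True, random_state=0)
print(pd.Series(mi, index=num_columns).sort_values(ascending=False).head(10))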
Scaling
Some features have large maximum values. Let's bring everything onto a standard scale.
#Scaling the data using StandardScaler
#Note: the scaler is fit on the training data only and then applied to the test data,
#so both sets share the same transformation.
scaler = StandardScaler()
train[num_columns] = scaler.fit_transform(train[num_columns])
test[num_columns] = scaler.transform(test[num_columns])
Outlier Treatment
# Outlier Analysis
iqr_factor = [3]  # add 1.5 here to also check the milder threshold
list1, list2 = [], []
for factor in iqr_factor:
    count = 0
    print(f'Outliers for {factor} IQR :')
    print('-------------------------------------')
    for col in num_columns:
        # IQR-based limits for flagging outliers
        IQR = train[col].quantile(0.75) - train[col].quantile(0.25)
        lower_lim = train[col].quantile(0.25) - factor*IQR
        upper_lim = train[col].quantile(0.75) + factor*IQR
        cond = train[(train[col] < lower_lim) | (train[col] > upper_lim)].shape[0]
        if cond > 0 and factor == 1.5:
            list1.append(train[(train[col] < lower_lim) | (train[col] > upper_lim)].index.tolist())
        elif cond > 0 and factor == 3:
            list2.append(train[(train[col] < lower_lim) | (train[col] > upper_lim)].index.tolist())
        if cond > 0: print(f'{col:<30} : ', cond); count += cond
    print(f'\nTOTAL OUTLIERS FOR {factor} IQR : {count}')
    print('')

Outliers for 3 IQR :
-------------------------------------
feature_0 : 11527
feature_1 : 10986
feature_2 : 6508
feature_3 : 21893
feature_4 : 11619
feature_5 : 9753
feature_6 : 19125
feature_7 : 8701
feature_8 : 19954
feature_9 : 9138
feature_10 : 13610
feature_11 : 21618
feature_12 : 22586
feature_13 : 5088
feature_14 : 4662
feature_15 : 2486
feature_16 : 14319
feature_17 : 24222
feature_18 : 24991
feature_19 : 9751
feature_20 : 11818
feature_21 : 20728
feature_22 : 12850
feature_23 : 16979
feature_24 : 6148
feature_25 : 18774
feature_26 : 11839
feature_27 : 8759
feature_28 : 961
feature_29 : 8649
feature_30 : 13579
feature_31 : 9770
feature_32 : 9583
feature_33 : 15587
feature_34 : 6773
feature_35 : 24115
feature_36 : 8864
feature_37 : 19175
feature_38 : 2820
feature_39 : 12762
feature_40 : 4015
feature_41 : 19741
feature_42 : 16534
feature_43 : 15701
feature_44 : 6749
feature_45 : 12711
feature_46 : 13892
feature_47 : 14732
feature_48 : 5305
feature_49 : 19486
TOTAL OUTLIERS FOR 3 IQR : 641936
Observations
- The above table shows the number of outliers in each feature.
- But these are not actual outliers: since the data is very sparse, most non-zero values get flagged as outliers here.
- Let’s keep these outliers as they are since these are the ones that will help the model in classification.
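For completeness, if we did want to treat them, IQR-based clipping (winsorizing) would be a typical option; a sketch of what that could look like (deliberately not applied here):
# Sketch: IQR-based clipping, NOT applied in this notebook because the
# flagged values are the informative ones.
def clip_iqr(df, cols, factor=3):
    df = df.copy()
    for col in cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - factor * iqr, q3 + factor * iqr)
    return df
# Hypothetical usage: train_clipped = clip_iqr(train, num_columns)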
Modeling
Let’s try different ML models and see which performs best.
train = train.reset_index(drop = True)

# Storing the target variable separately
X_train = train.drop('target', axis = 1)
X_test = test
y_train = train['target']
print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))

X_train shape: (100000, 50)
y_train shape: (100000,)
X_test shape: (50000, 50)
#Stratified K fold Cross Validation
def train_and_validate(model, N):
    scores = []
    # Extract the model's class name from its repr for logging
    regex = r'^[^(]+'
    match = re.findall(regex, str(model))
    print(f'Running {N} Fold CV with {match[0]} Model.')
    preds = np.zeros((test.shape[0], 4))
    importances = pd.DataFrame(np.zeros((X_train.shape[1], N)),
                               columns=['Fold_{}'.format(i) for i in range(1, N + 1)],
                               index=train.drop('target', axis = 1).columns)
    skf = StratifiedKFold(n_splits=N, random_state=N, shuffle=True)
    for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), 1):
        print('Fold {}\n'.format(fold))
        # Fitting the model
        model.fit(X_train.iloc[trn_idx], y_train[trn_idx])
        # Computing Train logloss score
        trn_logloss_score = log_loss(y_train[trn_idx], model.predict_proba(X_train.iloc[trn_idx]))
        # Computing Validation logloss score
        val_logloss_score = log_loss(y_train[val_idx], model.predict_proba(X_train.iloc[val_idx]))
        scores.append((trn_logloss_score, val_logloss_score))
        # Averaging the test predictions across folds
        preds += model.predict_proba(X_test) / skf.n_splits
        importances.iloc[:, fold - 1] = model.feature_importances_
        print(scores[-1])
    trlogloss = mean([i[0] for i in scores])
    cvlogloss = mean([i[1] for i in scores])
    print(f'Average Training logloss: {trlogloss}, Average CV logloss: {cvlogloss}')
    print("*" * 40)
    print("\n")
    return trlogloss, cvlogloss, importances, preds, model
#Testing multiple ML models using stratified K fold CV
df_row = []
N = 3
for i in [DecisionTreeClassifier(),
          LGBMClassifier(),
          RandomForestClassifier(n_estimators = 10, max_depth = 10)]:
    trlogloss, cvlogloss, importances, preds, model = train_and_validate(i, N)
    regex = r'^[^(]+'
    match = re.findall(regex, str(i))
    df_row.append([match[0], trlogloss, cvlogloss])

df = pd.DataFrame(df_row, columns = ['Model', f'{N} Fold Training logloss', f'{N} Fold CV logloss'])
df
Observations
- The LGBM model scored the lowest log loss.
- However, the gap between training and CV log loss is smallest for Random Forest, making it the best performing model here in that respect.
- Random Forest generalizes well on this data and does not overfit much.
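That said, one way to narrow LGBM's train/CV gap while keeping its lower log loss would be to regularize it; a hypothetical configuration (these parameter values are illustrative guesses, not tuned results):
# Sketch: a more regularized LGBM to reduce overfitting (illustrative, untuned values)
lgbm_reg = LGBMClassifier(n_estimators=300, learning_rate=0.05,
                          num_leaves=31,         # smaller trees generalize better
                          min_child_samples=50,  # require more samples per leaf
                          subsample=0.8, colsample_bytree=0.8,  # row/feature subsampling
                          reg_alpha=1.0, reg_lambda=1.0,        # L1/L2 regularization
                          random_state=0)
# Hypothetical usage:
# trlogloss, cvlogloss, importances, preds, _ = train_and_validate(lgbm_reg, N)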
#Plotting the RandomForest importances
importances['Mean_Importance'] = importances.mean(axis=1)
importances.sort_values(by='Mean_Importance', inplace=True, ascending=False)
plt.figure(figsize=(8,8))
sns.barplot(x='Mean_Importance', y=importances.head(15).index, data=importances.head(15), palette = 'coolwarm_r')
plt.xlabel('')
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.title('Top 15 features', size=10)
plt.show()
Observations
- As expected from the EDA, features 38 and 14 appear among the most important features, since they behaved differently from all the other features in our data.
- It’s interesting to see features 2, 15, and 6 appearing at the top of the importance list. Let’s explore more about these features using SHAP.
Let’s try making a submission with the RandomForest model and see the performance on the leaderboard.
#Creating the submission with Random Forest Model
model = RandomForestClassifier(n_estimators = 10, max_depth = 10)
trlogloss, cvlogloss, importances, preds, _ = train_and_validate(model, 5)
sample.iloc[:, 1:] = preds
sample.to_csv('submission.csv', index = False)
We got a log loss of 1.104 on the leaderboard after submitting the above CSV. Let's check if the LGBM model gets a better score.
#Creating the submission with LGBM Model
model = LGBMClassifier()
trlogloss, cvlogloss, importances, preds, _ = train_and_validate(model, 5)
sample.iloc[:, 1:] = preds
sample.to_csv('submission.csv', index = False)
Great! LGBM scored a log loss of 1.088 which is an improvement over the Random Forest Model.
Let’s try to understand more about the behavior of the features using SHAP.
Model Explainability using SHAP
#Fitting the SHAP on our model and training data
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

#Plotting the SHAP summary. (Note: Class 0 in the analysis below corresponds to Class_1 in the data, and so on.)
shap.summary_plot(shap_values, X_train, color=plt.get_cmap("tab20c"))
Observations
- Features 31, 24, and 16 have a major impact in predicting class 3.
- Features 15, 11, 2, 1, and 33 help most in detecting class 2.
- Features 37, 25, and 28 have the greatest impact in deciding class 0.
- Features 6, 37, 28, and 34 have a major impact in predicting class 1.
shap.summary_plot(shap_values[0], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)
Observations
- The red dots clustered at the center indicate that zero values have almost no impact on predicting class 1.
- Features 6, 15, and 41 are positively correlated with class 1.
- Features 25 and 19 show a slight negative correlation with class 1.
- Features 17 and 1 have the least impact on class 1.
shap.summary_plot(shap_values[1], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)
Observations
- The red dots clustered at the center indicate that zero values have almost no impact on predicting class 2.
- Features 19, 35, 29, 14, and 28 show a positive correlation with class 2.
- Features 15, 6, 10, 42, and 30 show a negative correlation with class 2.
- Feature 2 has the least impact on class 2.
shap.summary_plot(shap_values[2], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)
Observations
- The red dots are not entirely centered, indicating that zeros have some impact on class 3.
- Features 43, 14, and 42 show a positive correlation with class 3.
- Features 15, 38, 11, and 0 show a negative correlation with class 3.
- Feature 32 has the least impact on class 3.
shap.summary_plot(shap_values[3], X_train, show = False, cmap = 'coolwarm_r')
plt.gcf().axes[-1].set_aspect(100)
plt.gcf().axes[-1].set_box_aspect(100)
Observations
- Zero values in features 31, 14, and 24 have a bit of impact on class 4.
- No feature shows a strong positive correlation with class 4.
- Features 31, 14, 24, 16, 23, and 7 show a negative correlation with class 4.
- Features 22, 17, 32, and 9 have the least impact on class 4.
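To drill further into a single feature, a SHAP dependence plot shows how a feature's value maps to its SHAP value for one class; a sketch for feature_15 against class index 1 (Class_2):
# Sketch: SHAP dependence plot for one feature and one class (index 1 -> Class_2)
shap.dependence_plot('feature_15', shap_values[1], X_train, interaction_index=None)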
Most useless features in this data
As per the above analysis, we can safely conclude that feature_32, feature_17, and feature_1 are the least informative features for predicting any class in this data.
The End!
Thank you for reading this publication. I hope you have learned something new today. Kindly share feedback if you find any flaws or have a better approach.
At last, please clap this publication if you liked it! Thanks in advance.
Links:
Kaggle Kernel link.
Kaggle Profile link.
LinkedIn Profile Link.