Robust Validation Approach for Imbalanced Data

Learning practical data science with Kaggle Competitions

Shagun Kala
12 min read · May 2, 2022

In this blog, I share the technique I used to correctly validate models on an imbalanced dataset, along with my complete solution to Kaggle’s Tabular Playground Series March 2021 competition.

The Kaggle Competition link can be found here.

It is a classification problem and AUC is the evaluation metric.

To ensure correct model validation, the difference between the model’s performance (AUC in this case) on our validation data and on unseen data (the leaderboard test set) should be as small as possible. That can only happen when the model has learned a generalized solution and is not overfitting to the training data.

We will use the Stratified K Fold Cross Validation technique to pick the best-generalized model on this imbalanced data; the quick sketch below shows why stratification matters here.
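To see why stratification helps with an imbalanced target, here is a minimal, self-contained sketch using made-up toy data (not the competition data): StratifiedKFold keeps roughly the same positive rate in every validation fold, so each fold’s AUC is measured on a realistic class mix.

#Toy illustration of stratified folds on imbalanced labels
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([1]*26 + [0]*74)   #~26% positives, similar to this competition's target
X_toy = np.random.rand(100, 3)      #dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X_toy, y_toy), 1):
    print(f'Fold {fold}: positive rate in validation fold = {y_toy[val_idx].mean():.2f}')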

Table of Contents

  • Importing Libraries
  • Reading the data files
  • Exploring the data
  • Exploratory Data Analysis (EDA)
  • Scaling
  • Correlation Check
  • Outlier Treatment
  • Feature Engineering
  • Modeling and Validation
  • LGBM Hyperparameter Tuning with Optuna

Importing Libraries

#Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import roc_curve, auc, roc_auc_score
from statistics import mean
from imblearn.over_sampling import SMOTE

from sklearn.mixture import GaussianMixture

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)

sns.set_palette("muted")

Reading the data files

#Reading the data files

train = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-mar-2021/sample_submission.csv')

Exploring the data

print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

train.head()
Shape of train data: (300000, 32)
Missing values count: 0
train.info()
print ("*"*40)
train.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 300000 non-null int64
1 cat0 300000 non-null object
2 cat1 300000 non-null object
3 cat2 300000 non-null object
4 cat3 300000 non-null object
5 cat4 300000 non-null object
6 cat5 300000 non-null object
7 cat6 300000 non-null object
8 cat7 300000 non-null object
9 cat8 300000 non-null object
10 cat9 300000 non-null object
11 cat10 300000 non-null object
12 cat11 300000 non-null object
13 cat12 300000 non-null object
14 cat13 300000 non-null object
15 cat14 300000 non-null object
16 cat15 300000 non-null object
17 cat16 300000 non-null object
18 cat17 300000 non-null object
19 cat18 300000 non-null object
20 cont0 300000 non-null float64
21 cont1 300000 non-null float64
22 cont2 300000 non-null float64
23 cont3 300000 non-null float64
24 cont4 300000 non-null float64
25 cont5 300000 non-null float64
26 cont6 300000 non-null float64
27 cont7 300000 non-null float64
28 cont8 300000 non-null float64
29 cont9 300000 non-null float64
30 cont10 300000 non-null float64
31 target 300000 non-null int64
dtypes: float64(11), int64(2), object(19)
memory usage: 73.2+ MB
****************************************
id 300000
cat0 2
cat1 15
cat2 19
cat3 13
cat4 20
cat5 84
cat6 16
cat7 51
cat8 61
cat9 19
cat10 299
cat11 2
cat12 2
cat13 2
cat14 2
cat15 4
cat16 4
cat17 4
cat18 4
cont0 299874
cont1 299861
cont2 299872
cont3 299818
cont4 299876
cont5 299791
cont6 299843
cont7 299880
cont8 299849
cont9 299859
cont10 299823
target 2
dtype: int64
  • Training data has 300000 records and 32 features.
  • Column ‘id’ is the primary key.
  • It’s a binary classification problem since we need to predict the binary ‘target’ feature.
  • There are 11 numerical features which are already scaled and 19 categorical features in the data.
  • There is no missing value in the data.
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')

test.head()
Shape of test data: (200000, 31)
Missing values count: 0
test.info()
print ("*"*40)
test.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 200000 non-null int64
1 cat0 200000 non-null object
2 cat1 200000 non-null object
3 cat2 200000 non-null object
4 cat3 200000 non-null object
5 cat4 200000 non-null object
6 cat5 200000 non-null object
7 cat6 200000 non-null object
8 cat7 200000 non-null object
9 cat8 200000 non-null object
10 cat9 200000 non-null object
11 cat10 200000 non-null object
12 cat11 200000 non-null object
13 cat12 200000 non-null object
14 cat13 200000 non-null object
15 cat14 200000 non-null object
16 cat15 200000 non-null object
17 cat16 200000 non-null object
18 cat17 200000 non-null object
19 cat18 200000 non-null object
20 cont0 200000 non-null float64
21 cont1 200000 non-null float64
22 cont2 200000 non-null float64
23 cont3 200000 non-null float64
24 cont4 200000 non-null float64
25 cont5 200000 non-null float64
26 cont6 200000 non-null float64
27 cont7 200000 non-null float64
28 cont8 200000 non-null float64
29 cont9 200000 non-null float64
30 cont10 200000 non-null float64
dtypes: float64(11), int64(1), object(19)
memory usage: 47.3+ MB
****************************************
id 200000
cat0 2
cat1 15
cat2 19
cat3 13
cat4 20
cat5 84
cat6 16
cat7 51
cat8 61
cat9 19
cat10 295
cat11 2
cat12 2
cat13 2
cat14 2
cat15 4
cat16 4
cat17 4
cat18 4
cont0 199941
cont1 199945
cont2 199945
cont3 199927
cont4 199944
cont5 199907
cont6 199918
cont7 199953
cont8 199928
cont9 199946
cont10 199927
dtype: int64
  • Test data has 200000 records and 31 features.
  • Column ‘id’ is the primary key.
  • There are 11 numerical features which are already scaled and 19 categorical features in the data.
  • There is no missing value in the data.
sample.head()
  • We need to submit the predicted probability values for each id in the test data.

Exploratory Data Analysis (EDA)

# Setting index as 'id'
train = train.set_index('id')
test = test.set_index('id')
#Checking if there is any difference between the behaviour of train and test data
train.describe() - test.describe()

There is no major difference between the train and test feature distributions. This is a good sign and will help us validate the model reliably.

#Checking shape and cardinalities

train.shape, train.nunique()
((300000, 31),
cat0 2
cat1 15
cat2 19
cat3 13
cat4 20
cat5 84
cat6 16
cat7 51
cat8 61
cat9 19
cat10 299
cat11 2
cat12 2
cat13 2
cat14 2
cat15 4
cat16 4
cat17 4
cat18 4
cont0 299874
cont1 299861
cont2 299872
cont3 299818
cont4 299876
cont5 299791
cont6 299843
cont7 299880
cont8 299849
cont9 299859
cont10 299823
target 2
dtype: int64)

Features cat5, cat7, cat8, cat10 have high cardinality.
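As a quick programmatic check (the cutoff of 50 unique values is just an illustrative threshold, not anything prescribed by the competition):

#Listing high-cardinality categorical columns (illustrative threshold of 50)
high_card = [c for c in train.select_dtypes(include=['object']).columns if train[c].nunique() > 50]
print(high_card)   #['cat5', 'cat7', 'cat8', 'cat10']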

num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

Target Feature

#Let's check the distribution of target variable

target1 = train['target'].value_counts()[1]
target0 = train['target'].value_counts()[0]
target1per = target1 / train.shape[0] * 100
target0per = target0 / train.shape[0] * 100

print('{} of {} records have target 1 and it is the {:.2f}% of the training set.'.format(target1, train.shape[0], target1per))
print('{} of {} records have target 0 and it is the {:.2f}% of the training set.'.format(target0, train.shape[0], target0per))

plt.figure(figsize=(10, 8))
sns.countplot(train['target'])

plt.xlabel('Target', size=12, labelpad=15)
plt.ylabel('Count', size=12, labelpad=15)
plt.xticks((0, 1), ['0 ({0:.2f}%)'.format(target0per), '1 ({0:.2f}%)'.format(target1per)])
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)

plt.title('Training Set Target Distribution', size=15, y=1.05)

plt.show()
79461 of 300000 records have target 1 and it is the 26.49% of the training set.
220539 of 300000 records have target 0 and it is the 73.51% of the training set.

The distribution of the target variable is imbalanced. We can try generating synthetic samples for the minority class using SMOTE; a brief sketch of what that would look like follows.
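Here is a minimal sketch of SMOTE oversampling. It assumes X_train and y_train are the fully encoded feature matrix and target built later in this notebook, since SMOTE needs numeric features, and it should only ever be fit on the training folds, never on validation or test data.

#Sketch: oversampling the minority class with SMOTE (assumes encoded numeric features)
sm = SMOTE(random_state=2021)
X_res, y_res = sm.fit_resample(X_train, y_train)
print(y_res.value_counts(normalize=True))   #classes are now balanced ~50/50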

Continuous Features

# Checking the distribution of continuous features

i = 1
fig, ax = plt.subplots(4, 3, figsize=(14, 14))

for feature in num_columns:
    plt.subplot(4, 3, i)
    sns.kdeplot(data = train, y = feature, vertical=True, hue='target', legend = True, shade = True)
    plt.xlabel(f'{feature} - Skew: {round(train[feature].skew(), 2)}')
    i += 1

fig.tight_layout()

fig.delaxes(ax[3,2])

plt.show()
  • No feature is highly skewed.
  • All continuous features are multimodal in nature.
  • We can observe differences in peaks between target 1 and target 0. This should help the model in classifying the target accurately.

Categorical Features

train.head()
# Checking the distribution of categorical features

fig, axs = plt.subplots(ncols=5, nrows=4, figsize=(20, 20))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(cat_columns, 1):
    plt.subplot(5, 4, i)
    sns.countplot(x=feature, hue='target', data=train)

    plt.xlabel('{}'.format(feature), size=20, labelpad=5)
    plt.ylabel('Count', size=20, labelpad=15)
    plt.tick_params(axis='x', labelsize=20)
    plt.tick_params(axis='y', labelsize=20)

    plt.legend(['0', '1'], loc='upper right', prop={'size': 18})

plt.show()
  • We can observe that a few categories dominate each feature, while many other categories appear only rarely. Such rare categories are not useful for the models.
  • Let’s club the insignificant categories together to reduce the cardinality.
#Clubbing the insignificant categories together

for i in cat_columns:
    x = train[i].value_counts()*100/train.shape[0]
    for j in x[x < 1].index:
        train.loc[train[i] == j, i] = 'Clubbed'
        test.loc[test[i] == j, i] = 'Clubbed'
# Checking the distribution of categorical features after clubbing

fig, axs = plt.subplots(ncols=5, nrows=4, figsize=(20, 20))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(cat_columns, 1):
    plt.subplot(5, 4, i)
    sns.countplot(x=feature, hue='target', data=train)

    plt.xlabel('{}'.format(feature), size=20, labelpad=5)
    plt.ylabel('Count', size=20, labelpad=15)
    plt.tick_params(axis='x', labelsize=20)
    plt.tick_params(axis='y', labelsize=20)

    plt.legend(['0', '1'], loc='upper right', prop={'size': 18})

plt.show()

Scaling

train.describe()

All continuous features are already scaled in the dataset.

Correlation Check

num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns
#Let's check how the features are inter-related to each other and with target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 10))
ax.set_title("Correlation Matrix", fontsize=16)

corr = train[num_columns + ['target']].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))

sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
cbar_kws={"shrink": .8}, vmin=0, vmax=1)

for tick in ax.xaxis.get_major_ticks():
    tick.label.set_fontsize(12)
    tick.label.set_rotation(90)
for tick in ax.yaxis.get_major_ticks():
    tick.label.set_fontsize(12)
    tick.label.set_rotation(0)

plt.show()
  • (cont1 & cont2), (cont0 & cont10), (cont7 & cont10), (cont0 & cont7) are highly correlated with each other.
  • None of the features show a strong correlation with the target feature.
# Removing the correlated variables

train = train.drop(['cont2', 'cont10'], axis = 1)
test = test.drop(['cont2', 'cont10'], axis = 1)
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

Outlier Treatment

#Checking for mild outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train

((train < Q1_train - 1.5*IQR_train) | (train > Q3_train + 1.5*IQR_train)).agg([sum, 'mean', 'count'])
#Checking for extreme outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train

((train < Q1_train - 3*IQR_train) | (train > Q3_train + 3*IQR_train)).agg([sum, 'mean', 'count'])
  • There is no extreme outlier present in this data. But it has some mild outliers.
  • Let’s replace the mild outliers with median values.
#Replacing outliers with median value

def replace_outliers(data):
    for col in data.columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        median_ = data[col].median()

        data.loc[((data[col] < Q1 - 1.5*IQR) | (data[col] > Q3 + 1.5*IQR)), col] = median_
    return data

train[num_columns] = replace_outliers(train[num_columns])

Feature Engineering

Continuous Features

# Splitting and labelencoding the multimodal continuous variables

tr_size = len(train)
df_full = pd.concat([train, test])

for i in num_columns:
    df_full[i] = pd.qcut(df_full[i], 7)
    df_full[i] = LabelEncoder().fit_transform(df_full[i])

train = df_full[:tr_size]
test = df_full[tr_size:]
# Checking the distribution of continuous features

fig, axs = plt.subplots(4, 3, figsize=(14,14))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(num_columns, 1):
    plt.subplot(4, 3, i)
    sns.countplot(x=feature, hue='target', data=train)

    plt.xlabel('{}'.format(feature), size=12, labelpad=5)
    plt.ylabel('Count', size=12, labelpad=15)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)

    plt.legend(['0', '1'], loc='upper right', prop={'size': 12})

fig.delaxes(axs[3,0])
fig.delaxes(axs[3,1])
fig.delaxes(axs[3,2])

plt.show()
  • We have turned the multimodal continuous features into ordinal categorical features (a quick check follows below).
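A quick illustrative check (using cont0 as an example column) confirms that each former continuous feature now takes only the seven ordered integer labels produced by qcut and LabelEncoder:

#Each binned feature now takes the ordinal values 0..6
print(train['cont0'].value_counts().sort_index())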

Categorical Features

#Applying one hot encoding to categorical features

tr_size = len(train)
df_all = pd.concat([train, test])
df_all = pd.get_dummies(df_all, columns=cat_columns)

train = df_all[:tr_size]
test = df_all[tr_size:]
test = test.drop('target', axis = 1, errors = 'ignore')
train.shape, test.shape
((300000, 170), (200000, 169))

Modeling and Validation

Let’s try different ML models and see which performs best.

train = train.reset_index(drop = True)

# Storing the target variable separately

X_train = train.drop('target', axis = 1)
X_test = test
y_train = train['target']

print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))
X_train shape: (300000, 169)
y_train shape: (300000,)
X_test shape: (200000, 169)

Stratified K fold Cross Validation

def train_and_validate(model, N):

    regex = '^[^\(]+'
    match = re.findall(regex, str(model))
    print(f'Running {N} Fold CV with {match[0]} Model.')

    probs = pd.DataFrame(np.zeros((len(X_test), N * 2)),
                         columns=['Fold_{}_Prob_{}'.format(i, j) for i in range(1, N + 1) for j in range(2)])
    importances = pd.DataFrame(np.zeros((X_train.shape[1], N)),
                               columns=['Fold_{}'.format(i) for i in range(1, N + 1)],
                               index=train.drop('target', axis = 1).columns)
    fprs, tprs, scores = [], [], []

    skf = StratifiedKFold(n_splits=N, random_state=N, shuffle=True)

    for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), 1):
        print('Fold {}\n'.format(fold))

        # Fitting the model
        model.fit(X_train.iloc[trn_idx], y_train[trn_idx])

        # Computing Train AUC score
        trn_fpr, trn_tpr, trn_thresholds = roc_curve(y_train[trn_idx], model.predict_proba(X_train.iloc[trn_idx])[:, 1])
        trn_auc_score = auc(trn_fpr, trn_tpr)
        # Computing Validation AUC score
        val_fpr, val_tpr, val_thresholds = roc_curve(y_train[val_idx], model.predict_proba(X_train.iloc[val_idx])[:, 1])
        val_auc_score = auc(val_fpr, val_tpr)

        scores.append((trn_auc_score, val_auc_score))
        fprs.append(val_fpr)
        tprs.append(val_tpr)

        # X_test probabilities
        probs.loc[:, 'Fold_{}_Prob_0'.format(fold)] = model.predict_proba(X_test)[:, 0]
        probs.loc[:, 'Fold_{}_Prob_1'.format(fold)] = model.predict_proba(X_test)[:, 1]
        importances.iloc[:, fold - 1] = model.feature_importances_

        print(scores[-1])

    trauc = mean([i[0] for i in scores])
    cvauc = mean([i[1] for i in scores])
    print(f'Average Training AUC: {trauc}, Average CV AUC: {cvauc}')
    print("*"*40)
    print("\n")

    return trauc, cvauc, importances, probs
#Testing multiple ML models using stratified K fold CV

df_row = []
N = 3

for i in [
        LGBMClassifier(),
        RandomForestClassifier(n_estimators = 10, max_depth = 30),
        XGBClassifier(verbosity = 0)]:

    trauc, cvauc, importances, probs = train_and_validate(i, N)

    regex = '^[^\(]+'
    match = re.findall(regex, str(i))

    df_row.append([match[0], trauc, cvauc])

df = pd.DataFrame(df_row, columns = ['Model', f'{N} Fold Training AUC', f'{N} Fold CV AUC'])
df
Running 3 Fold CV with LGBMClassifier Model.
Average Training AUC: 0.893543370281815, Average CV AUC: 0.8876388122150632
****************************************

Running 3 Fold CV with RandomForestClassifier Model.
Average Training AUC: 0.9952066033283958, Average CV AUC: 0.8646785996260924
****************************************

Running 3 Fold CV with XGBClassifier Model.
Average Training AUC: 0.9098102519515038, Average CV AUC: 0.888680388946887
Model Performance Summary

We can observe that the XGBoost CV AUC is the highest, but if you look closer, the gap between Training AUC and CV AUC is smallest for the LGBM Classifier. Hence, we will choose LGBM as our best-performing model since it overfits the least; the snippet below makes this selection rule explicit.
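Here is a small snippet (using the summary DataFrame df and the fold count N defined above) that ranks the models by their train-CV gap:

#Ranking models by the gap between training and CV AUC (a smaller gap means less overfitting)
df['Train-CV Gap'] = df[f'{N} Fold Training AUC'] - df[f'{N} Fold CV AUC']
print(df.sort_values('Train-CV Gap')[['Model', f'{N} Fold CV AUC', 'Train-CV Gap']])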

#Plotting the XGBoost importances

importances['Mean_Importance'] = importances.mean(axis=1)
importances.sort_values(by='Mean_Importance', inplace=True, ascending=False)

plt.figure(figsize=(8,8))
sns.barplot(x='Mean_Importance', y=importances.head(15).index, data=importances.head(15))

plt.xlabel('')
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.title('Classifier Mean Feature Importance Between Folds', size=10)

plt.show()

Let’s try tuning the LGBM parameters using Optuna.

LGBM Hyperparameter Tuning using Optuna

## Install optuna library
# !pip install optuna
#Importing optuna library
import optuna
#Function for hyperparameter tuning using optuna

def objective(trial, data=X_train, target=y_train):
    seed = 2021
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)

    for train_index, valid_index in split.split(data, target):
        X_train = data.iloc[train_index]
        y_train = target.iloc[train_index]
        X_valid = data.iloc[valid_index]
        y_valid = target.iloc[valid_index]

    lgbm_params = {
        'reg_alpha': trial.suggest_float('reg_alpha', 0.001, 10.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.001, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 11, 333),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'max_depth': trial.suggest_int('max_depth', 5, 30),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.005, 0.01, 0.02, 0.05, 0.1]),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.5),
        'n_estimators': trial.suggest_int('n_estimators', 100, 5000),
        'random_state': seed,
        'boosting_type': 'gbdt',
        'metric': 'AUC',
        #'device': 'gpu'
    }

    model = LGBMClassifier(**lgbm_params)

    model.fit(
        X_train,
        y_train,
        early_stopping_rounds=100,
        eval_set=[(X_valid, y_valid)],
        verbose=False
    )

    y_valid_pred = model.predict_proba(X_valid)[:, 1]

    roc_auc = roc_auc_score(y_valid, y_valid_pred)

    return roc_auc
#Hyperparameter tuning to maximize the validation AUC

study = optuna.create_study(direction = 'maximize')
study.optimize(objective, n_trials = 10)
#Checking the best set of hyperparameters

print(f"\tBest value (rmse): {study.best_value:.5f}")
print(f"\tBest params:")

for key, value in study.best_params.items():
print(f"\t\t{key}: {value}")
Best value (rmse): 0.89454
Best params:
reg_alpha: 0.7719427188223845
reg_lambda: 4.148696295661259
num_leaves: 214
min_child_samples: 76
max_depth: 15
learning_rate: 0.05
colsample_bytree: 0.27115565543222925
n_estimators: 2051
#Storing final parameters

params=study.best_params
#Training the best model
trauc, cvauc, importances, probs = train_and_validate(LGBMClassifier(**params), 3)
Running 3 Fold CV with LGBMClassifier Model.
Fold 1

(0.9621877692654366, 0.8883507023028498)
Fold 2

(0.9617237292285297, 0.8876533776034585)
Fold 3

(0.962315446666268, 0.8890917406289454)
Average Training AUC: 0.9620756483867448, Average CV AUC: 0.8883652735117512
****************************************
#Creating the submission
cols = [i for i in probs.columns if i.endswith('1')]

probs = probs[cols]

sample['target'] = probs.mean(axis = 1)   #averaging the predicted probabilities across the 3 folds
sample.to_csv('submission.csv', index = False)

Awesome! We got a leaderboard score of 0.89328 after tuning the LGBM Classifier, which is very close to our CV AUC. This confirms that the validation scheme generalizes well to unseen data.

To conclude: the AUC could still be improved further by stacking models together; a rough sketch of that idea follows.
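As one possible direction (not something tried in this notebook), scikit-learn's StackingClassifier could blend the tuned LGBM with XGBoost under a logistic-regression meta-learner. This is a rough sketch, assuming the X_train, y_train, X_test and params objects defined above:

#Sketch: stacking LGBM and XGBoost with a logistic regression meta-learner
from sklearn.ensemble import StackingClassifier

stack = StackingClassifier(
    estimators=[('lgbm', LGBMClassifier(**params)), ('xgb', XGBClassifier(verbosity=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method='predict_proba',
    cv=3,
    n_jobs=-1)

stack.fit(X_train, y_train)
stack_probs = stack.predict_proba(X_test)[:, 1]   #stacked probabilities for a new submission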

Kaggle Submission

Techniques that did not work:

  • The continuous features are multimodal in nature, but Gaussian Mixture Modeling still didn’t improve the score.
  • Standard scaling didn’t help in improving the score.
  • Applying SMOTE didn’t improve the leaderboard score.

The End!

Thank you for reading this publication. I have learned a lot from this exercise, and I hope you have learned something too. Please share feedback if you find any flaws or have a better approach.

Lastly, please clap for this publication if you liked it! Thanks in advance.

Links:

Kaggle Kernel link.

Kaggle Profile link.

LinkedIn Profile Link.
