Robust Validation Approach for Imbalanced Data

Learning practical data science with Kaggle Competitions

Shagun Kala
12 min read · May 2, 2022

In this blog, I share the technique I used to correctly validate models on an imbalanced dataset, along with my complete solution to Kaggle’s Tabular Playground Series March 2021 competition.

The Kaggle Competition link can be found here.

It is a classification problem and AUC is the evaluation metric.

To ensure correct model validation, the difference between the model’s performance (AUC in this case) on our validation data and on unseen data (the leaderboard test set) should be as small as possible. That can only happen when the model has learned a generalized solution and is not overfitting to the training data.

We will use the Stratified K Fold Cross Validation technique to pick the best-generalized model on this imbalanced data; the quick sketch below shows why stratification matters here.
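To see why stratification helps with an imbalanced target, here is a minimal, self-contained sketch using made-up toy data (not the competition data): StratifiedKFold keeps roughly the same positive rate in every validation fold, so each fold’s AUC is measured on a realistic class mix.

#Toy illustration of stratified folds on imbalanced labels
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([1]*26 + [0]*74)   #~26% positives, similar to this competition's target
X_toy = np.random.rand(100, 3)      #dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X_toy, y_toy), 1):
    print(f'Fold {fold}: positive rate in validation fold = {y_toy[val_idx].mean():.2f}')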

Table of Contents

  • Importing Libraries
  • Reading the data files
  • Exploring the data
  • Exploratory Data Analysis (EDA)
  • Scaling
  • Correlation Check
  • Outlier Treatment
  • Feature Engineering
  • Modeling and Validation
  • LGBM Hyperparameter Tuning with Optuna

Importing Libraries

#Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import roc_curve, auc, roc_auc_score
from statistics import mean
from imblearn.over_sampling import SMOTE

from sklearn.mixture import GaussianMixture

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)

sns.set_palette("muted")

Reading the data files

#Reading the data files

train = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-mar-2021/sample_submission.csv')

Exploring the data

print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

train.head()
Shape of train data: (300000, 32)
Missing values count: 0
train.info()
print ("*"*40)
train.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 300000 non-null int64
1 cat0 300000 non-null object
2 cat1 300000 non-null object
3 cat2 300000 non-null object
4 cat3 300000 non-null object
5 cat4 300000 non-null object
6 cat5 300000 non-null object
7 cat6 300000 non-null object
8 cat7 300000 non-null object
9 cat8 300000 non-null object
10 cat9 300000 non-null object
11 cat10 300000 non-null object
12 cat11 300000 non-null object
13 cat12 300000 non-null object
14 cat13 300000 non-null object
15 cat14 300000 non-null object
16 cat15 300000 non-null object
17 cat16 300000 non-null object
18 cat17 300000 non-null object
19 cat18 300000 non-null object
20 cont0 300000 non-null float64
21 cont1 300000 non-null float64
22 cont2 300000 non-null float64
23 cont3 300000 non-null float64
24 cont4 300000 non-null float64
25 cont5 300000 non-null float64
26 cont6 300000 non-null float64
27 cont7 300000 non-null float64
28 cont8 300000 non-null float64
29 cont9 300000 non-null float64
30 cont10 300000 non-null float64
31 target 300000 non-null int64
dtypes: float64(11), int64(2), object(19)
memory usage: 73.2+ MB
****************************************
id 300000
cat0 2
cat1 15
cat2 19
cat3 13
cat4 20
cat5 84
cat6 16
cat7 51
cat8 61
cat9 19
cat10 299
cat11 2
cat12 2
cat13 2
cat14 2
cat15 4
cat16 4
cat17 4
cat18 4
cont0 299874
cont1 299861
cont2 299872
cont3 299818
cont4 299876
cont5 299791
cont6 299843
cont7 299880
cont8 299849
cont9 299859
cont10 299823
target 2
dtype: int64
  • Training data has 300000 records and 32 features.
  • Column ‘id’ is the primary key.
  • It’s a binary classification problem since we need to predict the binary ‘target’ feature.
  • There are 11 numerical features which are already scaled and 19 categorical features in the data.
  • There is no missing value in the data.
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')

test.head()
Shape of test data: (200000, 31)
Missing values count: 0
test.info()
print ("*"*40)
test.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 200000 non-null int64
1 cat0 200000 non-null object
2 cat1 200000 non-null object
3 cat2 200000 non-null object
4 cat3 200000 non-null object
5 cat4 200000 non-null object
6 cat5 200000 non-null object
7 cat6 200000 non-null object
8 cat7 200000 non-null object
9 cat8 200000 non-null object
10 cat9 200000 non-null object
11 cat10 200000 non-null object
12 cat11 200000 non-null object
13 cat12 200000 non-null object
14 cat13 200000 non-null object
15 cat14 200000 non-null object
16 cat15 200000 non-null object
17 cat16 200000 non-null object
18 cat17 200000 non-null object
19 cat18 200000 non-null object
20 cont0 200000 non-null float64
21 cont1 200000 non-null float64
22 cont2 200000 non-null float64
23 cont3 200000 non-null float64
24 cont4 200000 non-null float64
25 cont5 200000 non-null float64
26 cont6 200000 non-null float64
27 cont7 200000 non-null float64
28 cont8 200000 non-null float64
29 cont9 200000 non-null float64
30 cont10 200000 non-null float64
dtypes: float64(11), int64(1), object(19)
memory usage: 47.3+ MB
****************************************
id 200000
cat0 2
cat1 15
cat2 19
cat3 13
cat4 20
cat5 84
cat6 16
cat7 51
cat8 61
cat9 19
cat10 295
cat11 2
cat12 2
cat13 2
cat14 2
cat15 4
cat16 4
cat17 4
cat18 4
cont0 199941
cont1 199945
cont2 199945
cont3 199927
cont4 199944
cont5 199907
cont6 199918
cont7 199953
cont8 199928
cont9 199946
cont10 199927
dtype: int64
  • Test data has 200000 records and 31 features.
  • Column ‘id’ is the primary key.
  • There are 11 numerical features which are already scaled and 19 categorical features in the data.
  • There is no missing value in the data.
sample.head()
  • We need to submit the predicted probability values for each id in the test data.

Exploratory Data Analysis (EDA)

# Setting index as 'id'
train = train.set_index('id')
test = test.set_index('id')
#Checking if there is any difference between the behaviour of train and test data
train.describe() - test.describe()

There is no major difference between the train and test feature distributions. This is a good sign and will help us validate the model reliably.

#Checking shape and cardinalities

train.shape, train.nunique()
((300000, 31),
cat0 2
cat1 15
cat2 19
cat3 13
cat4 20
cat5 84
cat6 16
cat7 51
cat8 61
cat9 19
cat10 299
cat11 2
cat12 2
cat13 2
cat14 2
cat15 4
cat16 4
cat17 4
cat18 4
cont0 299874
cont1 299861
cont2 299872
cont3 299818
cont4 299876
cont5 299791
cont6 299843
cont7 299880
cont8 299849
cont9 299859
cont10 299823
target 2
dtype: int64)

Features cat5, cat7, cat8, cat10 have high cardinality.
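As a quick programmatic check (the cutoff of 50 unique values is just an illustrative threshold, not anything prescribed by the competition):

#Listing high-cardinality categorical columns (illustrative threshold of 50)
high_card = [c for c in train.select_dtypes(include=['object']).columns if train[c].nunique() > 50]
print(high_card)   #['cat5', 'cat7', 'cat8', 'cat10']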

num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

Target Feature

#Let's check the distribution of target variable

target1 = train['target'].value_counts()[1]
target0 = train['target'].value_counts()[0]
target1per = target1 / train.shape[0] * 100
target0per = target0 / train.shape[0] * 100

print('{} of {} records have target 1 and it is the {:.2f}% of the training set.'.format(target1, train.shape[0], target1per))
print('{} of {} records have target 0 and it is the {:.2f}% of the training set.'.format(target0, train.shape[0], target0per))

plt.figure(figsize=(10, 8))
sns.countplot(train['target'])

plt.xlabel('Target', size=12, labelpad=15)
plt.ylabel('Count', size=12, labelpad=15)
plt.xticks((0, 1), ['0 ({0:.2f}%)'.format(target0per), '1 ({0:.2f}%)'.format(target1per)])
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)

plt.title('Training Set Target Distribution', size=15, y=1.05)

plt.show()
79461 of 300000 records have target 1 and it is the 26.49% of the training set.
220539 of 300000 records have target 0 and it is the 73.51% of the training set.

The distribution of the target variable is imbalanced. We can try generating synthetic samples for the minority class using SMOTE; a brief sketch of what that would look like follows.
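Here is a minimal sketch of SMOTE oversampling. It assumes X_train and y_train are the fully encoded feature matrix and target built later in this notebook, since SMOTE needs numeric features, and it should only ever be fit on the training folds, never on validation or test data.

#Sketch: oversampling the minority class with SMOTE (assumes encoded numeric features)
sm = SMOTE(random_state=2021)
X_res, y_res = sm.fit_resample(X_train, y_train)
print(y_res.value_counts(normalize=True))   #classes are now balanced ~50/50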

Continuous Features

# Checking the distribution of continuous features

i = 1
fig, ax = plt.subplots(4, 3, figsize=(14, 14))

for feature in num_columns:
    plt.subplot(4, 3, i)
    sns.kdeplot(data = train, y = feature, vertical=True, hue='target', legend = True, shade = True)
    plt.xlabel(f'{feature} - Skew: {round(train[feature].skew(), 2)}')
    i += 1

fig.tight_layout()

fig.delaxes(ax[3,2])

plt.show()
  • No feature is highly skewed.
  • All continuous features are multimodal in nature.
  • We can observe differences in peaks between target 1 and target 0. This should help the model in classifying the target accurately.

Categorical Features

train.head()
# Checking the distribution of categorical features

fig, axs = plt.subplots(ncols=5, nrows=4, figsize=(20, 20))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(cat_columns, 1):
    plt.subplot(5, 4, i)
    sns.countplot(x=feature, hue='target', data=train)

    plt.xlabel('{}'.format(feature), size=20, labelpad=5)
    plt.ylabel('Count', size=20, labelpad=15)
    plt.tick_params(axis='x', labelsize=20)
    plt.tick_params(axis='y', labelsize=20)

    plt.legend(['0', '1'], loc='upper right', prop={'size': 18})

plt.show()
  • We can observe that a few categories dominate each feature, while many other categories appear only rarely. Such rare categories are not useful for the models.
  • Let’s club the insignificant categories together to reduce the cardinality.
#Clubbing the insignificant categories together

for i in cat_columns:
    x = train[i].value_counts()*100/train.shape[0]
    for j in x[x < 1].index:
        train.loc[train[i] == j, i] = 'Clubbed'
        test.loc[test[i] == j, i] = 'Clubbed'
# Checking the distribution of categorical features after clubbing

fig, axs = plt.subplots(ncols=5, nrows=4, figsize=(20, 20))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(cat_columns, 1):
    plt.subplot(5, 4, i)
    sns.countplot(x=feature, hue='target', data=train)

    plt.xlabel('{}'.format(feature), size=20, labelpad=5)
    plt.ylabel('Count', size=20, labelpad=15)
    plt.tick_params(axis='x', labelsize=20)
    plt.tick_params(axis='y', labelsize=20)

    plt.legend(['0', '1'], loc='upper right', prop={'size': 18})

plt.show()

Scaling

train.describe()

All continuous features are already scaled in the dataset.

Correlation Check

num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns
#Let's check how the features are inter-related to each other and with target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 10))
ax.set_title("Correlation Matrix", fontsize=16)

corr = train[num_columns + ['target']].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))

sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
cbar_kws={"shrink": .8}, vmin=0, vmax=1)

for tick in ax.xaxis.get_major_ticks():
    tick.label.set_fontsize(12)
    tick.label.set_rotation(90)
for tick in ax.yaxis.get_major_ticks():
    tick.label.set_fontsize(12)
    tick.label.set_rotation(0)

plt.show()
  • (cont1 & cont2), (cont0 & cont10), (cont7 & cont10), (cont0 & cont7) are highly correlated with each other.
  • None of the features show a strong correlation with the target feature.
# Removing the correlated variables

train = train.drop(['cont2', 'cont10'], axis = 1)
test = test.drop(['cont2', 'cont10'], axis = 1)
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

Outlier Treatment

#Checking for mild outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train

((train < Q1_train - 1.5*IQR_train) | (train > Q3_train + 1.5*IQR_train)).agg([sum, 'mean', 'count'])
#Checking for extreme outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train

((train < Q1_train - 3*IQR_train) | (train > Q3_train + 3*IQR_train)).agg([sum, 'mean', 'count'])
  • There is no extreme outlier present in this data. But it has some mild outliers.
  • Let’s replace the mild outliers with median values.
#Replacing outliers with median value

def replace_outliers(data):
    for col in data.columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        median_ = data[col].median()

        data.loc[((data[col] < Q1 - 1.5*IQR) | (data[col] > Q3 + 1.5*IQR)), col] = median_
    return data

train[num_columns] = replace_outliers(train[num_columns])

Feature Engineering

Continuous Features

# Splitting and labelencoding the multimodal continuous variables

tr_size = len(train)
df_full = pd.concat([train, test])

for i in num_columns:
    df_full[i] = pd.qcut(df_full[i], 7)
    df_full[i] = LabelEncoder().fit_transform(df_full[i])

train = df_full[:tr_size]
test = df_full[tr_size:]
# Checking the distribution of continuous features

fig, axs = plt.subplots(4, 3, figsize=(14,14))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(num_columns, 1):
    plt.subplot(4, 3, i)
    sns.countplot(x=feature, hue='target', data=train)

    plt.xlabel('{}'.format(feature), size=12, labelpad=5)
    plt.ylabel('Count', size=12, labelpad=15)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)

    plt.legend(['0', '1'], loc='upper right', prop={'size': 12})

fig.delaxes(axs[3,0])
fig.delaxes(axs[3,1])
fig.delaxes(axs[3,2])

plt.show()
  • We have turned the multimodal continuous features into ordinal categorical features (a quick check follows below).
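A quick illustrative check (using cont0 as an example column) confirms that each former continuous feature now takes only the seven ordered integer labels produced by qcut and LabelEncoder:

#Each binned feature now takes the ordinal values 0..6
print(train['cont0'].value_counts().sort_index())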

Categorical Features

#Applying one hot encoding to categorical features

tr_size = len(train)
df_all = pd.concat([train, test])
df_all = pd.get_dummies(df_all, columns=cat_columns)

train = df_all[:tr_size]
test = df_all[tr_size:]
test = test.drop('target', axis = 1, errors = 'ignore')
train.shape, test.shape
((300000, 170), (200000, 169))

Modeling and Validation

Let’s try different ML models and see which performs best.

train = train.reset_index(drop = True)

# Storing the target variable separately

X_train = train.drop('target', axis = 1)
X_test = test
y_train = train['target']

print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))
X_train shape: (300000, 169)
y_train shape: (300000,)
X_test shape: (200000, 169)

Stratified K fold Cross Validation

def train_and_validate(model, N):

    regex = '^[^\(]+'
    match = re.findall(regex, str(model))
    print(f'Running {N} Fold CV with {match[0]} Model.')

    probs = pd.DataFrame(np.zeros((len(X_test), N * 2)),
                         columns=['Fold_{}_Prob_{}'.format(i, j) for i in range(1, N + 1) for j in range(2)])
    importances = pd.DataFrame(np.zeros((X_train.shape[1], N)),
                               columns=['Fold_{}'.format(i) for i in range(1, N + 1)],
                               index=train.drop('target', axis = 1).columns)
    fprs, tprs, scores = [], [], []

    skf = StratifiedKFold(n_splits=N, random_state=N, shuffle=True)

    for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), 1):
        print('Fold {}\n'.format(fold))

        # Fitting the model
        model.fit(X_train.iloc[trn_idx], y_train[trn_idx])

        # Computing Train AUC score
        trn_fpr, trn_tpr, trn_thresholds = roc_curve(y_train[trn_idx], model.predict_proba(X_train.iloc[trn_idx])[:, 1])
        trn_auc_score = auc(trn_fpr, trn_tpr)
        # Computing Validation AUC score
        val_fpr, val_tpr, val_thresholds = roc_curve(y_train[val_idx], model.predict_proba(X_train.iloc[val_idx])[:, 1])
        val_auc_score = auc(val_fpr, val_tpr)

        scores.append((trn_auc_score, val_auc_score))
        fprs.append(val_fpr)
        tprs.append(val_tpr)

        # X_test probabilities
        probs.loc[:, 'Fold_{}_Prob_0'.format(fold)] = model.predict_proba(X_test)[:, 0]
        probs.loc[:, 'Fold_{}_Prob_1'.format(fold)] = model.predict_proba(X_test)[:, 1]
        importances.iloc[:, fold - 1] = model.feature_importances_

        print(scores[-1])

    trauc = mean([i[0] for i in scores])
    cvauc = mean([i[1] for i in scores])
    print(f'Average Training AUC: {trauc}, Average CV AUC: {cvauc}')
    print("*"*40)
    print("\n")

    return trauc, cvauc, importances, probs
#Testing multiple ML models using stratified K fold CV

df_row = []
N = 3

for i in [
        LGBMClassifier(),
        RandomForestClassifier(n_estimators = 10, max_depth = 30),
        XGBClassifier(verbosity = 0)]:

    trauc, cvauc, importances, probs = train_and_validate(i, N)

    regex = '^[^\(]+'
    match = re.findall(regex, str(i))

    df_row.append([match[0], trauc, cvauc])

df = pd.DataFrame(df_row, columns = ['Model', f'{N} Fold Training AUC', f'{N} Fold CV AUC'])
df
Running 3 Fold CV with LGBMClassifier Model.
Average Training AUC: 0.893543370281815, Average CV AUC: 0.8876388122150632
****************************************

Running 3 Fold CV with RandomForestClassifier Model.
Average Training AUC: 0.9952066033283958, Average CV AUC: 0.8646785996260924
****************************************

Running 3 Fold CV with XGBClassifier Model.
Average Training AUC: 0.9098102519515038, Average CV AUC: 0.888680388946887
Model Performance Summary

We can observe that the XGBoost CV AUC is the highest, but if you look closer, the gap between Training AUC and CV AUC is smallest for the LGBM Classifier. Hence, we will choose LGBM as our best-performing model since it overfits the least; the snippet below makes this selection rule explicit.
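Here is a small snippet (using the summary DataFrame df and the fold count N defined above) that ranks the models by their train-CV gap:

#Ranking models by the gap between training and CV AUC (a smaller gap means less overfitting)
df['Train-CV Gap'] = df[f'{N} Fold Training AUC'] - df[f'{N} Fold CV AUC']
print(df.sort_values('Train-CV Gap')[['Model', f'{N} Fold CV AUC', 'Train-CV Gap']])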

#Plotting the XGBoost importances

importances['Mean_Importance'] = importances.mean(axis=1)
importances.sort_values(by='Mean_Importance', inplace=True, ascending=False)

plt.figure(figsize=(8,8))
sns.barplot(x='Mean_Importance', y=importances.head(15).index, data=importances.head(15))

plt.xlabel('')
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.title('Classifier Mean Feature Importance Between Folds', size=10)

plt.show()

Let’s try tuning the LGBM parameters using Optuna.

LGBM Hyperparameter Tuning using Optuna

## Install optuna library
# !pip install optuna
#Importing optuna library
import optuna
#Function for hyperparameter tuning using optuna

def objective(trial, data=X_train, target=y_train):
    seed = 2021
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)

    for train_index, valid_index in split.split(data, target):
        X_train = data.iloc[train_index]
        y_train = target.iloc[train_index]
        X_valid = data.iloc[valid_index]
        y_valid = target.iloc[valid_index]

    lgbm_params = {
        'reg_alpha': trial.suggest_float('reg_alpha', 0.001, 10.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.001, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 11, 333),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'max_depth': trial.suggest_int('max_depth', 5, 30),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.005, 0.01, 0.02, 0.05, 0.1]),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.5),
        'n_estimators': trial.suggest_int('n_estimators', 100, 5000),
        'random_state': seed,
        'boosting_type': 'gbdt',
        'metric': 'AUC',
        #'device': 'gpu'
    }

    model = LGBMClassifier(**lgbm_params)

    model.fit(
        X_train,
        y_train,
        early_stopping_rounds=100,
        eval_set=[(X_valid, y_valid)],
        verbose=False
    )

    y_valid_pred = model.predict_proba(X_valid)[:, 1]

    roc_auc = roc_auc_score(y_valid, y_valid_pred)

    return roc_auc
#Hyperparameter tuning to maximize the validation AUC

study = optuna.create_study(direction = 'maximize')
study.optimize(objective, n_trials = 10)
#Checking the best set of hyperparameters

print(f"\tBest value (rmse): {study.best_value:.5f}")
print(f"\tBest params:")

for key, value in study.best_params.items():
print(f"\t\t{key}: {value}")
Best value (rmse): 0.89454
Best params:
reg_alpha: 0.7719427188223845
reg_lambda: 4.148696295661259
num_leaves: 214
min_child_samples: 76
max_depth: 15
learning_rate: 0.05
colsample_bytree: 0.27115565543222925
n_estimators: 2051
#Storing final parameters

params=study.best_params
#Training the best model
trauc, cvauc, importances, probs = train_and_validate(LGBMClassifier(**params), 3)
Running 3 Fold CV with LGBMClassifier Model.
Fold 1

(0.9621877692654366, 0.8883507023028498)
Fold 2

(0.9617237292285297, 0.8876533776034585)
Fold 3

(0.962315446666268, 0.8890917406289454)
Average Training AUC: 0.9620756483867448, Average CV AUC: 0.8883652735117512
****************************************
#Creating the submission
cols = [i for i in probs.columns if i.endswith('1')]

probs = probs[cols]

sample['target'] = probs.mean(axis = 1)   #averaging the predicted probabilities across the 3 folds
sample.to_csv('submission.csv', index = False)

Awesome! We got a leaderboard score of 0.89328 after tuning the LGBM Classifier, which is very close to our CV AUC. This confirms that the validation scheme generalizes well to unseen data.

To conclude: the AUC could still be improved further by stacking models together; a rough sketch of that idea follows.
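As one possible direction (not something tried in this notebook), scikit-learn's StackingClassifier could blend the tuned LGBM with XGBoost under a logistic-regression meta-learner. This is a rough sketch, assuming the X_train, y_train, X_test and params objects defined above:

#Sketch: stacking LGBM and XGBoost with a logistic regression meta-learner
from sklearn.ensemble import StackingClassifier

stack = StackingClassifier(
    estimators=[('lgbm', LGBMClassifier(**params)), ('xgb', XGBClassifier(verbosity=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method='predict_proba',
    cv=3,
    n_jobs=-1)

stack.fit(X_train, y_train)
stack_probs = stack.predict_proba(X_test)[:, 1]   #stacked probabilities for a new submission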

Kaggle Submission

Techniques that did not work:

  • The continuous features are multimodal in nature, but Gaussian Mixture Modeling still didn’t improve the score.
  • Standard scaling didn’t help in improving the score.
  • Applying SMOTE didn’t improve the leaderboard score.

The End!

Thank you for reading this publication. I have learned a lot from this exercise, and I hope you have learned something too. Please share feedback if you find any flaws or have a better approach.

Lastly, please clap for this publication if you liked it! Thanks in advance.

Links:

Kaggle Kernel link.

Kaggle Profile link.

LinkedIn Profile Link.
