My Approach to Kaggle’s Jan 2021 Tabular Competition
Learning practical data science with Kaggle Competitions
In this blog, I share the approach I took to crack Kaggle's Jan 2021 Tabular Playground Competition, finishing in the top 10% of the leaderboard.
The Kaggle Competition link can be found here.
Importing Libraries
#Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from lightgbm import LGBMRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")
sns.set_palette("RdYlBu_r")
Reading the data files
#Reading the data files (change the paths if running on Google Colab)
train = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv')
Exploring the data
print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')
train.head()
Shape of train data: (300000, 16)
Missing values count: 0
train.info()
print('\n')
train.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 300000 non-null int64
1 cont1 300000 non-null float64
2 cont2 300000 non-null float64
3 cont3 300000 non-null float64
4 cont4 300000 non-null float64
5 cont5 300000 non-null float64
6 cont6 300000 non-null float64
7 cont7 300000 non-null float64
8 cont8 300000 non-null float64
9 cont9 300000 non-null float64
10 cont10 300000 non-null float64
11 cont11 300000 non-null float64
12 cont12 300000 non-null float64
13 cont13 300000 non-null float64
14 cont14 300000 non-null float64
15 target 300000 non-null float64
dtypes: float64(15), int64(1)
memory usage: 36.6 MB
id 300000
cont1 299865
cont2 299906
cont3 299745
cont4 299892
cont5 299730
cont6 299830
cont7 299876
cont8 299853
cont9 299651
cont10 299851
cont11 299887
cont12 299886
cont13 299728
cont14 299868
target 299811
dtype: int64
- The training data has 300000 records and 16 columns.
- The ‘id’ column is the primary key.
- It's a regression problem, since the ‘target’ feature we need to predict is continuous.
- There are 14 continuous numerical features, all already on a comparable scale.
- There are no missing values, and all features are numerical.
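Before moving on, here is a quick sanity check on those claims. This snippet is my addition, not part of the original notebook; it simply confirms the value ranges and missing counts directly.
#Sanity check (not in the original notebook): verify feature ranges and missingness
feature_cols = [c for c in train.columns if c.startswith('cont')]
print(train[feature_cols].agg(['min', 'max']).T)  #all 14 features span a similar, narrow range
print(train.isna().any().any())                   #False -> no missing values anywhere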
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')
test.head()
Shape of test data: (200000, 15)
Missing values count: 0
test.info()
print('\n')
test.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 200000 non-null int64
1 cont1 200000 non-null float64
2 cont2 200000 non-null float64
3 cont3 200000 non-null float64
4 cont4 200000 non-null float64
5 cont5 200000 non-null float64
6 cont6 200000 non-null float64
7 cont7 200000 non-null float64
8 cont8 200000 non-null float64
9 cont9 200000 non-null float64
10 cont10 200000 non-null float64
11 cont11 200000 non-null float64
12 cont12 200000 non-null float64
13 cont13 200000 non-null float64
14 cont14 200000 non-null float64
dtypes: float64(14), int64(1)
memory usage: 22.9 MB
id 200000
cont1 199933
cont2 199957
cont3 199886
cont4 199957
cont5 199871
cont6 199936
cont7 199947
cont8 199935
cont9 199835
cont10 199939
cont11 199955
cont12 199952
cont13 199894
cont14 199928
dtype: int64
- The test data has 200000 records and 15 columns. The ‘target’ column is absent, as expected.
- The ‘id’ column is the primary key.
- There are 14 continuous numerical features, all already on a comparable scale.
- There are no missing values, and all features are numerical.
sample.head()
- We need to submit the predicted target value for each id in the test data.
Pre-Modeling
Before jumping into EDA, let's do a dry run to see how a naive model and some basic models perform.
Train Test Split
#Separating the target variable and removing the 'id' column
y = train['target']
X = train.drop(['target', 'id'], axis = 1)

#Splitting the training data into 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
Naive Model
This naive model ‘predicts’ the median target value for every record in the test data.
This step is important because it establishes a benchmark score that we can then improve on.
# Let's get a benchmark score
model_dummy = DummyRegressor(strategy='median')
model_dummy.fit(X_train, y_train)
y_dummy = model_dummy.predict(X_test)
score_dummy = mean_squared_error(y_test, y_dummy, squared=False)
print(f'{score_dummy:0.5f}')
0.73385

#Submitting the prediction
sample['target'] = model_dummy.predict(test.drop('id', axis = 1))
sample.to_csv('dummy.csv', index = False)
After submitting the results, we get a leaderboard score of 0.73487.
Simple ML Models
Let’s start with some simple ML models to see how well they perform with respect to the naive model score.
model_names = ["Linear", "Lasso", "Ridge", "Decision Tree"]
models = [
LinearRegression(fit_intercept=True),
Lasso(fit_intercept=True),
Ridge(fit_intercept=True),
DecisionTreeRegressor()]
for name, model in zip(model_names, models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = mean_squared_error(y_test, y_pred, squared=False)
    print(f'{name}: RMSE: {score}')
Linear: RMSE: 0.72627260117725
Lasso: RMSE: 0.7332225702531663
Ridge: RMSE: 0.7262727324966892
Decision Tree: RMSE: 1.0034690945784999

#Submitting the results from the best performing model so far.
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
sample['target'] = model.predict(test.drop('id', axis = 1))
sample.to_csv('simple_ml_model.csv', index = False)
After submitting the results, we get a leaderboard score of 0.72703. Not a bad start! We have beaten the benchmark score by 0.00784.
Let’s take this improved score as our new benchmark.
Exploratory Data Analysis (EDA)
#Setting the 'id' primary key as an index
train = train.set_index('id')
test = test.set_index('id')

#Checking if there is any difference between the behaviour of train and test data
train.describe() - test.describe()
The summary statistics of the train and test features are very close. This is a good sign: it means a local validation split should track the leaderboard well.
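To put a number on that similarity, a two-sample Kolmogorov-Smirnov test per feature works well; a small KS statistic means the train and test samples are hard to tell apart. This check is my addition, not part of the original notebook.
#Optional (not in the original): quantify train/test similarity per feature
from scipy.stats import ks_2samp
for col in test.columns:
    stat, p_value = ks_2samp(train[col], test[col])
    print(f'{col}: KS statistic = {stat:.4f}')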
#Let's check the distribution of target variable
sns.distplot(train['target'], kde=True, bins=120, label='train')
plt.xlabel('Target', fontsize=9); plt.legend()
The distribution of the target variable is bimodal.
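One way to back that up numerically (my addition, not in the original) is to compare the BIC of one- and two-component Gaussian mixtures fitted to the target; a clearly lower BIC for two components supports the bimodality claim.
#Sanity check (not in the original): BIC of 1- vs 2-component mixtures
from sklearn.mixture import GaussianMixture
target_vals = train['target'].to_numpy().reshape(-1, 1)
for k in (1, 2):
    gm = GaussianMixture(n_components=k, random_state=0).fit(target_vals)
    print(f'{k} component(s): BIC = {gm.bic(target_vals):.0f}')  #lower is better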
# Checking the distribution of other features
i = 1
fig, ax = plt.subplots(5, 3, figsize=(14, 24))
for feature in test.columns:
    plt.subplot(5, 3, i)
    sns.distplot(train[feature], kde=True, bins=120, label='train')
    sns.distplot(test[feature], kde=True, bins=120, label='test')
    plt.xlabel(feature, fontsize=9); plt.legend()
    i += 1
plt.show()
- Just like the target variable, all the other features are either bimodal or multimodal.
- The train and test distributions overlap closely.
Scaling
#Standardizing the features (zero mean, unit variance). Note this rescales the data but does not make it normally distributed.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train.drop('target', axis = 1))
train[train.drop('target', axis = 1).columns] = scaler.transform(train.drop('target', axis = 1))
test_scaled = scaler.transform(test)
test = pd.DataFrame(test_scaled, index=test.index, columns=test.columns)
Correlation Check
#Let's check how the features are correlated with each other and with the target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 10))
ax.set_title("Correlation Matrix", fontsize=16)
sns.heatmap(train[train.columns[train.columns != 'id']].corr(), vmin=-1, vmax=1, annot=True, cmap = 'Blues')
for tick in ax.xaxis.get_major_ticks():
    tick.label.set_fontsize(14)
    tick.label.set_rotation(90)
for tick in ax.yaxis.get_major_ticks():
    tick.label.set_fontsize(14)
    tick.label.set_rotation(0)
plt.show()
- Features cont6 and cont9 are strongly correlated with several other features and only weakly correlated with the target. Let's drop them (see the sketch below for how to surface such pairs programmatically).
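To make the call reproducible rather than eyeballed from the heatmap, here is a small helper. It is my addition, not part of the original notebook; sorting the upper triangle of the absolute correlation matrix surfaces the most redundant pairs.
#Helper (not in the original): list feature pairs with high absolute correlation
corr = train.drop('target', axis = 1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  #keep each pair once
print(upper.stack().sort_values(ascending=False).head(10))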
#Dropping the correlated features
train = train.drop(['cont6', 'cont9'], axis = 1)
test = test.drop(['cont6', 'cont9'], axis = 1)
Outlier Treatment
# Checking outliers using Box Plots
i = 1
fig, ax = plt.subplots(5, 3, figsize=(14, 24))
for feature in train.columns:
    plt.subplot(5, 3, i)
    sns.boxplot(train[feature])
    plt.xlabel(feature, fontsize=9)
    i += 1
plt.show()
#Checking for mild outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train
((train < Q1_train - 1.5*IQR_train) | (train > Q3_train + 1.5*IQR_train)).agg([sum, 'mean', 'count'])
#Checking for extreme outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train
((train < Q1_train - 3*IQR_train) | (train > Q3_train + 3*IQR_train)).agg([sum, 'mean', 'count'])
The target feature has some extreme outliers, and ‘cont7’ and ‘cont10’ have some mild ones.
Let's remove the records whose target value is an extreme outlier, and replace extreme outliers in the feature columns with their column medians (the 3*IQR rule used below).
# Removing records with extreme outliers in the target variable (note the upper bound uses Q3, not Q1)
train = train.drop(train[(train['target'] < (Q1_train - 3*IQR_train)['target']) | (train['target'] > (Q3_train + 3*IQR_train)['target'])].index)
Removed 2 records.

#Replacing extreme outliers with the median value
def replace_outliers(data):
    for col in data.columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        median_ = data[col].median()
        data.loc[((data[col] < Q1 - 3*IQR) | (data[col] > Q3 + 3*IQR)), col] = median_
    return data
train[train.drop('target', axis = 1).columns] = replace_outliers(train.drop('target', axis = 1))

#Checking the distribution of target variable again
sns.distplot(train['target'], kde=True, bins=120, label='train')
plt.xlabel('Target', fontsize=9); plt.legend()
The target distribution is much less skewed now.
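A one-line check (my addition, not in the original) puts a number on that: pandas' skew() should now sit closer to zero than before the outlier removal.
#Quick check (not in the original): skewness of the cleaned target
print(f"Target skewness: {train['target'].skew():.4f}")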
Modeling
Let's try the ensemble models (Random Forest, LightGBM, XGBoost) this time.
train = train.reset_index()

#Separating the target variable and removing the 'id' column
y = train['target']
X = train.drop(['target', 'id'], axis = 1)

#Splitting the train data in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

model_names = ["LGBM", "Random Forest", "XGBoost"]
models = [
LGBMRegressor(),
RandomForestRegressor(n_estimators = 10, max_depth = 10),
XGBRegressor()]
for name, model in zip(model_names, models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = mean_squared_error(y_test, y_pred, squared=False)
    print(f'{name}: RMSE: {score}')
LGBM: RMSE: 0.7006733203452739
Random Forest: RMSE: 0.7093144916127819
XGBoost: RMSE: 0.7021882306288594
Woah!! Looks like the LightGBM model is fitting this data really well. Let’s try submitting this model's results.
model = LGBMRegressor()
model.fit(X, y)
sample['target'] = model.predict(test.drop('id', axis = 1, errors = 'ignore'))
sample.to_csv('lgbm.csv', index = False)
Great! We have got a leaderboard score of 0.70453. Much better than our previous benchmark score.
Since the LGBM model is showing good potential, let’s dive deep into the hyperparameter tuning of this best model.
LGBM Hyperparameter Tuning using Optuna
## Install the optuna library if needed
# !pip install optuna

#Importing optuna library
import optuna

#Function for hyperparameter tuning using optuna
def objective(trial, data=X, target=y):
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.2, random_state=42)
    param = {
        'metric': 'rmse',
        'random_state': 48,
        'n_estimators': 2000,
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0),
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 10.0),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4, 0.5, 0.6, 0.7, 0.8, 1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.006, 0.008, 0.01, 0.014, 0.017, 0.02]),
        'max_depth': trial.suggest_categorical('max_depth', [10, 20, 100]),
        'num_leaves': trial.suggest_int('num_leaves', 1, 1000),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 300),
        #Note: the trial logs this parameter under the name 'min_data_per_groups'
        'cat_smooth': trial.suggest_int('min_data_per_groups', 1, 100)
    }
    model = LGBMRegressor(**param)
    model.fit(train_x, train_y, eval_set=[(test_x, test_y)], early_stopping_rounds=100, verbose=False)
    preds = model.predict(test_x)
    rmse = mean_squared_error(test_y, preds, squared=False)
    return rmse

#Hyperparameter tuning to minimize the RMSE for predictions
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
Number of finished trials: 10
Best trial: {'reg_alpha': 0.052633347414000095, 'reg_lambda': 0.01340929161546982, 'colsample_bytree': 0.5, 'subsample': 1.0, 'learning_rate': 0.01, 'max_depth': 100, 'num_leaves': 160, 'min_child_samples': 156, 'min_data_per_groups': 68}

#Checking the best set of hyperparameters
print(f"\tBest value (rmse): {study.best_value:.5f}")
print(f"\tBest params:")
for key, value in study.best_params.items():
    print(f"\t\t{key}: {value}")
Best value (rmse): 0.69792
Best params:
reg_alpha: 0.052633347414000095
reg_lambda: 0.01340929161546982
colsample_bytree: 0.5
subsample: 1.0
learning_rate: 0.01
max_depth: 100
num_leaves: 160
min_child_samples: 156
min_data_per_groups: 68

#Adding some additional parameters
params = study.best_params
params['random_state'] = 48
params['n_estimators'] = 2000
params['metric'] = 'rmse'

#Training LGBM with the best set of hyperparameters
model = LGBMRegressor(**params)
model.fit(X, y)
sample['target'] = model.predict(test.drop('id', axis = 1, errors = 'ignore'))
sample.to_csv('submission.csv', index = False)
Awesome! The leaderboard score has improved to 0.69932 after tuning the LGBM Regressor. This score is among the top 10% on the leaderboard!
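As an optional follow-up (my addition, not part of the original notebook), Optuna ships plotting helpers that show how the best RMSE evolved over the trials and which hyperparameters mattered most. They return Plotly figures, so plotly needs to be installed.
#Optional (not in the original): inspect the Optuna search
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_param_importances(study).show()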
The End!
Thank you for reading this post. I learned a lot from this exercise, and I hope you have learned something too. Please share feedback if you spot any flaws or have a better approach.
Finally, please clap for this post if you liked it! Thanks in advance.
Links:
Kaggle Kernel link.
Kaggle Profile link.
LinkedIn Profile Link.