My Approach to Kaggle’s Jan 2021 Tabular Competition
Learning practical data science with Kaggle Competitions
In this blog, I share the approach I took to crack Kaggle's Jan 2021 Tabular Playground Competition, finishing in the top 10% of the leaderboard.
The Kaggle Competition link can be found here.
Importing Libraries
#Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from lightgbm import LGBMRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")
sns.set_palette("RdYlBu_r")
Reading the data files
#Reading the data files (change the paths if running on Google Colab)
train = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv')
Exploring the data
print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')
train.head()
Shape of train data: (300000, 16)
Missing values count: 0
train.info()
print('\n')
train.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 300000 non-null int64
1 cont1 300000 non-null float64
2 cont2 300000 non-null float64
3 cont3 300000 non-null float64
4 cont4 300000 non-null float64
5 cont5 300000 non-null float64
6 cont6 300000 non-null float64
7 cont7 300000 non-null float64
8 cont8 300000 non-null float64
9 cont9 300000 non-null float64
10 cont10 300000 non-null float64
11 cont11 300000 non-null float64
12 cont12 300000 non-null float64
13 cont13 300000 non-null float64
14 cont14 300000 non-null float64
15 target 300000 non-null float64
dtypes: float64(15), int64(1)
memory usage: 36.6 MB
id 300000
cont1 299865
cont2 299906
cont3 299745
cont4 299892
cont5 299730
cont6 299830
cont7 299876
cont8 299853
cont9 299651
cont10 299851
cont11 299887
cont12 299886
cont13 299728
cont14 299868
target 299811
dtype: int64
- The training data has 300000 records and 16 columns.
- The ‘id’ column is the primary key.
- It's a regression problem, since the ‘target’ feature we need to predict is continuous.
- There are 14 continuous numerical features, all already on a comparable scale.
- There are no missing values, and all features are numerical.
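Before moving on, here is a quick sanity check on those claims. This snippet is my addition, not part of the original notebook; it simply confirms the value ranges and missing counts directly.
#Sanity check (not in the original notebook): verify feature ranges and missingness
feature_cols = [c for c in train.columns if c.startswith('cont')]
print(train[feature_cols].agg(['min', 'max']).T)  #all 14 features span a similar, narrow range
print(train.isna().any().any())                   #False -> no missing values anywhere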
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')
test.head()
Shape of test data: (200000, 15)
Missing values count: 0
test.info()
print('\n')
test.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 200000 non-null int64
1 cont1 200000 non-null float64
2 cont2 200000 non-null float64
3 cont3 200000 non-null float64
4 cont4 200000 non-null float64
5 cont5 200000 non-null float64
6 cont6 200000 non-null float64
7 cont7 200000 non-null float64
8 cont8 200000 non-null float64
9 cont9 200000 non-null float64
10 cont10 200000 non-null float64
11 cont11 200000 non-null float64
12 cont12 200000 non-null float64
13 cont13 200000 non-null float64
14 cont14 200000 non-null float64
dtypes: float64(14), int64(1)
memory usage: 22.9 MB
id 200000
cont1 199933
cont2 199957
cont3 199886
cont4 199957
cont5 199871
cont6 199936
cont7 199947
cont8 199935
cont9 199835
cont10 199939
cont11 199955
cont12 199952
cont13 199894
cont14 199928
dtype: int64
- The test data has 200000 records and 15 columns. The ‘target’ column is absent, as expected.
- The ‘id’ column is the primary key.
- There are 14 continuous numerical features, all already on a comparable scale.
- There are no missing values, and all features are numerical.
sample.head()
- We need to submit the predicted target value for each id in the test data.
Pre-Modeling
Before jumping into EDA, let's do a dry run to see how a naive model and some basic models perform.
Train Test Split
#Separating the target variable and removing the 'id' column
y = train['target']
X = train.drop(['target', 'id'], axis = 1)

#Splitting the training data into 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
Naive Model
This naive model ‘predicts’ the median target value for every record in the test data.
This step is important because it establishes a benchmark score that we can then improve on.
# Let's get a benchmark score
model_dummy = DummyRegressor(strategy='median')
model_dummy.fit(X_train, y_train)
y_dummy = model_dummy.predict(X_test)
score_dummy = mean_squared_error(y_test, y_dummy, squared=False)
print(f'{score_dummy:0.5f}')
0.73385

#Submitting the prediction
sample['target'] = model_dummy.predict(test.drop('id', axis = 1))
sample.to_csv('dummy.csv', index = False)
After submitting the results, we get a leaderboard score of 0.73487.
Simple ML Models
Let’s start with some simple ML models to see how well they perform with respect to the naive model score.
model_names = ["Linear", "Lasso", "Ridge", "Decision Tree"]
models = [
LinearRegression(fit_intercept=True),
Lasso(fit_intercept=True),
Ridge(fit_intercept=True),
DecisionTreeRegressor()]
for name, model in zip(model_names, models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = mean_squared_error(y_test, y_pred, squared=False)
    print(f'{name}: RMSE: {score}')
Linear: RMSE: 0.72627260117725
Lasso: RMSE: 0.7332225702531663
Ridge: RMSE: 0.7262727324966892
Decision Tree: RMSE: 1.0034690945784999

#Submitting the results from the best performing model so far.
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
sample['target'] = model.predict(test.drop('id', axis = 1))
sample.to_csv('simple_ml_model.csv', index = False)
After submitting the results, we get a leaderboard score of 0.72703. Not a bad start! We have beaten the benchmark score by 0.00784.
Let’s take this improved score as our new benchmark.
Exploratory Data Analysis (EDA)
#Setting the 'id' primary key as an index
train = train.set_index('id')
test = test.set_index('id')

#Checking if there is any difference between the behaviour of train and test data
train.describe() - test.describe()
The summary statistics of the train and test features are very close. This is a good sign: it means a local validation split should track the leaderboard well.
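To put a number on that similarity, a two-sample Kolmogorov-Smirnov test per feature works well; a small KS statistic means the train and test samples are hard to tell apart. This check is my addition, not part of the original notebook.
#Optional (not in the original): quantify train/test similarity per feature
from scipy.stats import ks_2samp
for col in test.columns:
    stat, p_value = ks_2samp(train[col], test[col])
    print(f'{col}: KS statistic = {stat:.4f}')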
#Let's check the distribution of target variable
sns.distplot(train['target'], kde=True, bins=120, label='train')
plt.xlabel('Target', fontsize=9); plt.legend()
The distribution of the target variable is bimodal.
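One way to back that up numerically (my addition, not in the original) is to compare the BIC of one- and two-component Gaussian mixtures fitted to the target; a clearly lower BIC for two components supports the bimodality claim.
#Sanity check (not in the original): BIC of 1- vs 2-component mixtures
from sklearn.mixture import GaussianMixture
target_vals = train['target'].to_numpy().reshape(-1, 1)
for k in (1, 2):
    gm = GaussianMixture(n_components=k, random_state=0).fit(target_vals)
    print(f'{k} component(s): BIC = {gm.bic(target_vals):.0f}')  #lower is better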
# Checking the distribution of other features
i = 1
fig, ax = plt.subplots(5, 3, figsize=(14, 24))
for feature in test.columns:
    plt.subplot(5, 3, i)
    sns.distplot(train[feature], kde=True, bins=120, label='train')
    sns.distplot(test[feature], kde=True, bins=120, label='test')
    plt.xlabel(feature, fontsize=9); plt.legend()
    i += 1
plt.show()
- Just like the target variable, all the other features are either bimodal or multimodal.
- The train and test distributions overlap closely.
Scaling
#Standardizing the features (zero mean, unit variance). Note this rescales the data but does not make it normally distributed.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train.drop('target', axis = 1))
train[train.drop('target', axis = 1).columns] = scaler.transform(train.drop('target', axis = 1))
test_scaled = scaler.transform(test)
test = pd.DataFrame(test_scaled, index=test.index, columns=test.columns)
Correlation Check
#Let's check how the features are correlated with each other and with the target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 10))
ax.set_title("Correlation Matrix", fontsize=16)
sns.heatmap(train[train.columns[train.columns != 'id']].corr(), vmin=-1, vmax=1, annot=True, cmap = 'Blues')
for tick in ax.xaxis.get_major_ticks():
    tick.label.set_fontsize(14)
    tick.label.set_rotation(90)
for tick in ax.yaxis.get_major_ticks():
    tick.label.set_fontsize(14)
    tick.label.set_rotation(0)
plt.show()
- Features cont6 and cont9 are strongly correlated with several other features and only weakly correlated with the target. Let's drop them (see the sketch below for how to surface such pairs programmatically).
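To make the call reproducible rather than eyeballed from the heatmap, here is a small helper. It is my addition, not part of the original notebook; sorting the upper triangle of the absolute correlation matrix surfaces the most redundant pairs.
#Helper (not in the original): list feature pairs with high absolute correlation
corr = train.drop('target', axis = 1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  #keep each pair once
print(upper.stack().sort_values(ascending=False).head(10))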
#Dropping the correlated features
train = train.drop(['cont6', 'cont9'], axis = 1)
test = test.drop(['cont6', 'cont9'], axis = 1)
Outlier Treatment
# Checking outliers using Box Plots
i = 1
fig, ax = plt.subplots(5, 3, figsize=(14, 24))
for feature in train.columns:
    plt.subplot(5, 3, i)
    sns.boxplot(train[feature])
    plt.xlabel(feature, fontsize=9)
    i += 1
plt.show()
#Checking for mild outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train
((train < Q1_train - 1.5*IQR_train) | (train > Q3_train + 1.5*IQR_train)).agg([sum, 'mean', 'count'])
#Checking for extreme outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train
((train < Q1_train - 3*IQR_train) | (train > Q3_train + 3*IQR_train)).agg([sum, 'mean', 'count'])
The target feature has some extreme outliers, and ‘cont7’ and ‘cont10’ have some mild ones.
Let's remove the records whose target value is an extreme outlier, and replace extreme outliers in the feature columns with their column medians (the 3*IQR rule used below).
# Removing records with extreme outliers in the target variable (note the upper bound uses Q3, not Q1)
train = train.drop(train[(train['target'] < (Q1_train - 3*IQR_train)['target']) | (train['target'] > (Q3_train + 3*IQR_train)['target'])].index)
Removed 2 records.

#Replacing extreme outliers with the median value
def replace_outliers(data):
    for col in data.columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        median_ = data[col].median()
        data.loc[((data[col] < Q1 - 3*IQR) | (data[col] > Q3 + 3*IQR)), col] = median_
    return data
train[train.drop('target', axis = 1).columns] = replace_outliers(train.drop('target', axis = 1))

#Checking the distribution of target variable again
sns.distplot(train['target'], kde=True, bins=120, label='train')
plt.xlabel('Target', fontsize=9); plt.legend()
The target distribution is much less skewed now.
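A one-line check (my addition, not in the original) puts a number on that: pandas' skew() should now sit closer to zero than before the outlier removal.
#Quick check (not in the original): skewness of the cleaned target
print(f"Target skewness: {train['target'].skew():.4f}")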
Modeling
Let's try the ensemble models (Random Forest, LightGBM, XGBoost) this time.
train = train.reset_index()

#Separating the target variable and removing the 'id' column
y = train['target']
X = train.drop(['target', 'id'], axis = 1)

#Splitting the train data in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

model_names = ["LGBM", "Random Forest", "XGBoost"]
models = [
LGBMRegressor(),
RandomForestRegressor(n_estimators = 10, max_depth = 10),
XGBRegressor()]
for name, model in zip(model_names, models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = mean_squared_error(y_test, y_pred, squared=False)
    print(f'{name}: RMSE: {score}')
LGBM: RMSE: 0.7006733203452739
Random Forest: RMSE: 0.7093144916127819
XGBoost: RMSE: 0.7021882306288594
Woah!! Looks like the LightGBM model is fitting this data really well. Let’s try submitting this model's results.
model = LGBMRegressor()
model.fit(X, y)
sample['target'] = model.predict(test.drop('id', axis = 1, errors = 'ignore'))
sample.to_csv('lgbm.csv', index = False)
Great! We have got a leaderboard score of 0.70453. Much better than our previous benchmark score.
Since the LGBM model is showing good potential, let’s dive deep into the hyperparameter tuning of this best model.
LGBM Hyperparameter Tuning using Optuna
## Install the optuna library if needed
# !pip install optuna

#Importing optuna library
import optuna

#Function for hyperparameter tuning using optuna
def objective(trial, data=X, target=y):
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.2, random_state=42)
    param = {
        'metric': 'rmse',
        'random_state': 48,
        'n_estimators': 2000,
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0),
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 10.0),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4, 0.5, 0.6, 0.7, 0.8, 1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.006, 0.008, 0.01, 0.014, 0.017, 0.02]),
        'max_depth': trial.suggest_categorical('max_depth', [10, 20, 100]),
        'num_leaves': trial.suggest_int('num_leaves', 1, 1000),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 300),
        #Note: the trial logs this parameter under the name 'min_data_per_groups'
        'cat_smooth': trial.suggest_int('min_data_per_groups', 1, 100)
    }
    model = LGBMRegressor(**param)
    model.fit(train_x, train_y, eval_set=[(test_x, test_y)], early_stopping_rounds=100, verbose=False)
    preds = model.predict(test_x)
    rmse = mean_squared_error(test_y, preds, squared=False)
    return rmse

#Hyperparameter tuning to minimize the RMSE for predictions
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
Number of finished trials: 10
Best trial: {'reg_alpha': 0.052633347414000095, 'reg_lambda': 0.01340929161546982, 'colsample_bytree': 0.5, 'subsample': 1.0, 'learning_rate': 0.01, 'max_depth': 100, 'num_leaves': 160, 'min_child_samples': 156, 'min_data_per_groups': 68}

#Checking the best set of hyperparameters
print(f"\tBest value (rmse): {study.best_value:.5f}")
print(f"\tBest params:")
for key, value in study.best_params.items():
    print(f"\t\t{key}: {value}")
Best value (rmse): 0.69792
Best params:
reg_alpha: 0.052633347414000095
reg_lambda: 0.01340929161546982
colsample_bytree: 0.5
subsample: 1.0
learning_rate: 0.01
max_depth: 100
num_leaves: 160
min_child_samples: 156
min_data_per_groups: 68

#Adding some additional parameters
params = study.best_params
params['random_state'] = 48
params['n_estimators'] = 2000
params['metric'] = 'rmse'

#Training LGBM with the best set of hyperparameters
model = LGBMRegressor(**params)
model.fit(X, y)
sample['target'] = model.predict(test.drop('id', axis = 1, errors = 'ignore'))
sample.to_csv('submission.csv', index = False)
Awesome! The leaderboard score has improved to 0.69932 after tuning the LGBM Regressor. This score is among the top 10% on the leaderboard!
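As an optional follow-up (my addition, not part of the original notebook), Optuna ships plotting helpers that show how the best RMSE evolved over the trials and which hyperparameters mattered most. They return Plotly figures, so plotly needs to be installed.
#Optional (not in the original): inspect the Optuna search
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_param_importances(study).show()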
The End!
Thank you for reading this post. I learned a lot from this exercise, and I hope you have learned something too. Please share feedback if you spot any flaws or have a better approach.
Finally, please clap for this post if you liked it! Thanks in advance.
Links:
Kaggle Kernel link.
Kaggle Profile link.
LinkedIn Profile Link.