My Approach to Handling Multimodally Distributed Features

Learning practical data science with Kaggle Competitions

Shagun Kala
10 min read · Apr 26, 2022

In this blog, I am sharing the approach I took to crack Kaggle's February 2021 Tabular Playground competition.

The Kaggle Competition link can be found here.

Evaluation Metric used: RMSE
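For reference, RMSE is the square root of the mean squared difference between predictions and true values; lower is better. Here is a minimal sketch of how it is computed with scikit-learn, using toy values purely for illustration:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([7.2, 7.5, 7.9])  # hypothetical target values
y_pred = np.array([7.0, 7.6, 8.1])  # hypothetical predictions
rmse = mean_squared_error(y_true, y_pred, squared=False)  # squared=False returns RMSE
print(f'RMSE: {rmse:.4f}')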


Table of Contents

  • Importing Libraries
  • Reading the data files
  • Exploring the data
  • Exploratory Data Analysis (EDA)
  • Feature Engineering
  • Modeling
  • LGBM Hyperparameter Tuning with Optuna

Importing Libraries

#Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.svm import SVR
from lightgbm import LGBMRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

from sklearn.mixture import GaussianMixture

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)

sns.set_palette("muted")

Reading the data files

#Reading the data files (Change the paths if running on google colab)

train = pd.read_csv('../input/tabular-playground-series-feb-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-feb-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-feb-2021/sample_submission.csv')

Exploring the data

print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

train.head()
Shape of train data: (300000, 26)
Missing values count: 0
train.info()
print('\n')
train.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 300000 non-null int64
1 cat0 300000 non-null object
2 cat1 300000 non-null object
3 cat2 300000 non-null object
4 cat3 300000 non-null object
5 cat4 300000 non-null object
6 cat5 300000 non-null object
7 cat6 300000 non-null object
8 cat7 300000 non-null object
9 cat8 300000 non-null object
10 cat9 300000 non-null object
11 cont0 300000 non-null float64
12 cont1 300000 non-null float64
13 cont2 300000 non-null float64
14 cont3 300000 non-null float64
15 cont4 300000 non-null float64
16 cont5 300000 non-null float64
17 cont6 300000 non-null float64
18 cont7 300000 non-null float64
19 cont8 300000 non-null float64
20 cont9 300000 non-null float64
21 cont10 300000 non-null float64
22 cont11 300000 non-null float64
23 cont12 300000 non-null float64
24 cont13 300000 non-null float64
25 target 300000 non-null float64
dtypes: float64(15), int64(1), object(10)
memory usage: 59.5+ MB

id 300000
cat0 2
cat1 2
cat2 2
cat3 4
cat4 4
cat5 4
cat6 8
cat7 8
cat8 7
cat9 15
cont0 299830
cont1 299642
cont2 299707
cont3 299796
cont4 299736
cont5 299857
cont6 299875
cont7 299832
cont8 299765
cont9 299863
cont10 299894
cont11 299877
cont12 299824
cont13 299866
target 299648
dtype: int64
  • Training data has 300000 records and 26 columns.
  • Column ‘id’ is the primary key.
  • It’s a regression problem, since the ‘target’ feature we need to predict is continuous.
  • There are 14 numerical features, already scaled, and 10 categorical features in the data.
  • There are no missing values in the data.
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')

test.head()
Shape of test data: (200000, 25)
Missing values count: 0
test.info()
print('\n')
test.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 200000 non-null int64
1 cat0 200000 non-null object
2 cat1 200000 non-null object
3 cat2 200000 non-null object
4 cat3 200000 non-null object
5 cat4 200000 non-null object
6 cat5 200000 non-null object
7 cat6 200000 non-null object
8 cat7 200000 non-null object
9 cat8 200000 non-null object
10 cat9 200000 non-null object
11 cont0 200000 non-null float64
12 cont1 200000 non-null float64
13 cont2 200000 non-null float64
14 cont3 200000 non-null float64
15 cont4 200000 non-null float64
16 cont5 200000 non-null float64
17 cont6 200000 non-null float64
18 cont7 200000 non-null float64
19 cont8 200000 non-null float64
20 cont9 200000 non-null float64
21 cont10 200000 non-null float64
22 cont11 200000 non-null float64
23 cont12 200000 non-null float64
24 cont13 200000 non-null float64
dtypes: float64(14), int64(1), object(10)
memory usage: 38.1+ MB

id 200000
cat0 2
cat1 2
cat2 2
cat3 4
cat4 4
cat5 4
cat6 7
cat7 8
cat8 7
cat9 15
cont0 199937
cont1 199835
cont2 199875
cont3 199902
cont4 199903
cont5 199929
cont6 199927
cont7 199926
cont8 199915
cont9 199944
cont10 199948
cont11 199946
cont12 199916
cont13 199949
dtype: int64
  • Test data has 200000 records and 25 columns. The ‘target’ feature is absent, as expected.
  • Column ‘id’ is the primary key.
  • There are 14 numerical features, already scaled, and 10 categorical features in the data. Note that cat6 has only 7 unique values in test versus 8 in train.
  • There are no missing values in the data.
sample.head()
  • We need to submit the predicted target value for each id in the test data.

Exploratory Data Analysis (EDA)

train = train.set_index('id')
test = test.set_index('id')
#Checking if there is any difference between the behaviour of train and test data
train.describe() - test.describe()

There is no major difference between the feature distributions of the train and test sets. This is a good sign: our local validation scores should track the leaderboard well.
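Beyond eyeballing the describe() difference, a two-sample Kolmogorov–Smirnov test per continuous feature can quantify how similar the two distributions are. This is a small extra check I'm sketching here, not part of the original notebook (scipy is assumed available):

#Two-sample KS test per continuous feature; a small KS statistic means
#the train and test distributions are hard to tell apart
from scipy.stats import ks_2samp

for col in [c for c in train.columns if c.startswith('cont')]:
    stat, p = ks_2samp(train[col], test[col])
    print(f'{col}: KS statistic = {stat:.4f}, p-value = {p:.3f}')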

num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns
#Let's check the distribution of target variable

sns.distplot(train['target'], kde=True, bins=120, label="Skew: %.2f"%(train['target'].skew()))
plt.xlabel('Target', fontsize=12); plt.legend()

The distribution of the target variable is bimodal.
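Since a Gaussian Mixture Model is used later for the features, the same idea can quantify the target's two modes. A quick illustrative sketch (not used anywhere downstream):

#Fit a 2-component GMM to the target to locate the two modes
gm = GaussianMixture(n_components=2, random_state=0).fit(train[['target']])
print('Mode means:', gm.means_.ravel())
print('Mixing weights:', gm.weights_)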

Continuous Features

# Checking the distribution of continuous features

i = 1
fig, ax = plt.subplots(4, 4, figsize=(14, 14))

for feature in num_columns:
    plt.subplot(4, 4, i)
    sns.distplot(train[feature], kde=True, bins=120, label="Skew: %.2f"%(train[feature].skew()))
    plt.xlabel(feature, fontsize=9); plt.legend(loc="best")
    i += 1

fig.tight_layout()

# Only 14 features: remove the two unused axes in the 4x4 grid
fig.delaxes(ax[3,2])
fig.delaxes(ax[3,3])

plt.show()
  • No feature is highly skewed.
  • All continuous features are multimodal in nature.
#Scatterplot of each continuous feature against the target
fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(num_columns):
    plt.subplot(5, 3, i+1)
    sns.scatterplot(x=feature, y="target", data=train, s=1)
    plt.xlabel(feature, fontsize=12)

fig.delaxes(ax[4,2])
plt.show()
  • We can observe some clusters in these scatter plots.
  • The cont1 feature has some clearly defined clusters.
  • We should try a clustering approach in the feature engineering section.

Categorical Features

train.head()
# Checking the distribution of categorical features

i = 1
fig, ax = plt.subplots(3, 4, figsize=(15,12))

for feature in cat_columns:
    plt.subplot(3, 4, i)
    sns.histplot(x=feature, data=train)
    plt.xlabel(feature, fontsize=9)
    i += 1

fig.suptitle('Distribution of Categorical Features')
plt.tight_layout()

# Only 10 features: remove the two unused axes in the 3x4 grid
fig.delaxes(ax[2,2])
fig.delaxes(ax[2,3])

plt.show()
  • We can observe that some categories dominate their features heavily. Such near-constant features carry little signal for the models; the sketch below quantifies this.
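One way to quantify that dominance is to print the share of the most frequent level per categorical feature; anything close to 100% carries almost no signal. A small sketch (my addition, not from the original notebook):

#Share of the most frequent category in each categorical feature
for col in cat_columns:
    top_share = train[col].value_counts(normalize=True).iloc[0]
    print(f'{col}: top category covers {top_share:.1%} of rows')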

Scaling

All continuous features are already scaled in the dataset; the quick check below confirms it.
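The min and max of every continuous feature should lie in a narrow, comparable range. A minimal sketch:

#Min/max of each continuous feature; a narrow common range means
#no additional scaling is needed
train[num_columns].describe().loc[['min', 'max']]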

Correlation Check

#Let's check how the features are inter-related to each other and with the target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 10))
ax.set_title("Correlation Matrix", fontsize=16)

corr = train[num_columns + ['target']].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool is deprecated; use the builtin bool

sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
            cbar_kws={"shrink": .8}, vmin=0, vmax=1)

ax.tick_params(axis='x', labelsize=12, labelrotation=90)
ax.tick_params(axis='y', labelsize=12, labelrotation=0)

plt.show()
  • None of the features are highly correlated with each other.
  • None of the features show a strong linear correlation with the target; a nonlinear check is sketched below.
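Pearson correlation only captures linear relationships, so near-zero values do not rule out the nonlinear structure we saw in the scatter plots. A sketch of a nonlinear check using mutual information (my addition, not part of the original notebook):

#Mutual information picks up nonlinear dependence that Pearson misses
from sklearn.feature_selection import mutual_info_regression

mi = mutual_info_regression(train[num_columns], train['target'], random_state=0)
for col, score in sorted(zip(num_columns, mi), key=lambda t: -t[1]):
    print(f'{col}: MI = {score:.4f}')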

Outlier Treatment

#Checking for mild outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train

((train < Q1_train - 1.5*IQR_train) | (train > Q3_train + 1.5*IQR_train)).agg([sum, 'mean', 'count'])
#Checking for extreme outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train

((train < Q1_train - 3*IQR_train) | (train > Q3_train + 3*IQR_train)).agg([sum, 'mean', 'count'])

The target feature has some extreme outliers, but there are no significant outliers in the other features.

Let's remove the records with extreme outliers in the target feature and replace the mild outliers in the other features with their median values.

# Removing records with extreme outliers in the target variable
# (note: the upper bound uses Q3 + 3*IQR)
train = train.drop(train[(train['target'] < (Q1_train - 3*IQR_train)['target']) | (train['target'] > (Q3_train + 3*IQR_train)['target'])].index)

Removed 3 records.

train_num = train.select_dtypes(exclude=['object'])

#Replacing mild outliers with the median value
def replace_outliers(data):
    for col in data.columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        median_ = data[col].median()
        # Any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] becomes the median
        data.loc[((data[col] < Q1 - 1.5*IQR) | (data[col] > Q3 + 1.5*IQR)), col] = median_
    return data

train[train_num.drop('target', axis=1).columns] = replace_outliers(train_num.drop('target', axis=1))
#Checking the distribution of target variable again
sns.distplot(train['target'], kde=True, bins=120, label='train')
plt.xlabel('Target', fontsize=9); plt.legend()

Feature Engineering

Continuous Features

#Defining the number of clusters per feature from the scatter plots above,
#then using a Gaussian Mixture Model to assign each point to a cluster

inits = [4,11,8,6,6,6,4,8,8,9,8,5,8,9]
gmms = []
for feature, init in zip(num_columns, inits):
    # Fit on the train values, then predict cluster labels for both train and test
    gmm_ = GaussianMixture(n_components=init, random_state=0).fit(train[[feature]])
    gmms.append(gmm_)
    train[f'{feature}_gmm'] = gmm_.predict(train[[feature]])
    test[f'{feature}_gmm'] = gmm_.predict(test[[feature]])
#Plotting the scatter plots again, coloured by cluster

fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(num_columns):
    plt.subplot(5, 3, i+1)
    sns.scatterplot(x=feature, y="target", data=train,
                    hue=f'{feature}_gmm', s=1, palette='muted')
    plt.xlabel(feature, fontsize=12)

fig.delaxes(ax[4,2])
plt.show()
#Let's plot the histograms as well, coloured by cluster
fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(num_columns):
    plt.subplot(5, 3, i+1)
    sns.histplot(x=feature, data=train[::100], hue=f'{feature}_gmm',
                 kde=True, bins=100, palette='muted')
    plt.xlabel(feature, fontsize=9)

fig.delaxes(ax[4,2])
plt.show()
  • We can see how well the Gaussian Mixture Model has identified these clusters. The new cluster-label features should help our models score well on this data. (An automatic way to pick the number of components is sketched below.)
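The component counts in inits were read off the scatter plots by eye. As an alternative, the number of components per feature could be chosen automatically by minimizing BIC; here is a sketch of that idea (a possible refinement, not what this notebook actually does):

#Pick n_components for one feature by minimizing BIC over a candidate range
def pick_n_components(values, max_components=12):
    X_ = values.reshape(-1, 1)
    bics = []
    for k in range(2, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X_)
        bics.append(gmm.bic(X_))
    return 2 + int(np.argmin(bics))

#Example usage: pick_n_components(train['cont1'].values)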

Categorical Features

#Applying label encoding on the categorical features
#(fitting on train and test values combined, so unseen test categories cannot break transform)

for feature in cat_columns:
    le = LabelEncoder()
    le.fit(pd.concat([train[feature], test[feature]]))
    train[feature] = le.transform(train[feature])
    test[feature] = le.transform(test[feature])

Modeling

Let’s try different ML models and see which performs best.

train = train.reset_index(drop=True)

#Separating the target variable from the features
y = train['target']
X = train.drop(['target'], axis=1)

# Splitting the train data in an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
model_names = ["Linear", "Lasso", "Ridge", "Decision Tree", "LGBM", "Random Forest", "XGBoost"]

models = [
    LinearRegression(fit_intercept=True),
    Lasso(fit_intercept=True),
    Ridge(fit_intercept=True),
    DecisionTreeRegressor(),
    LGBMRegressor(),
    RandomForestRegressor(n_estimators=10, max_depth=50),
    XGBRegressor()]

for name, model in zip(model_names, models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = mean_squared_error(y_test, y_pred, squared=False)  # squared=False gives RMSE
    print(f'{name}: RMSE: {score}')
Linear: RMSE: 0.8679301934565669
Lasso: RMSE: 0.8889082968790681
Ridge: RMSE: 0.8679302181948934
Decision Tree: RMSE: 1.2303531438161233
LGBM: RMSE: 0.8484144148561924
Random Forest: RMSE: 0.9005961602464834
XGBoost: RMSE: 0.8507598896392843

Best performing model: LightGBM. It fits this data noticeably better than the other models, so let's submit its predictions on the test data.

#Sanity check: train and test have identical feature columns
X_train.columns.symmetric_difference(test.columns)
Index([], dtype='object')

train.shape, test.shape
((299997, 39), (200000, 38))

test = test.reset_index(drop=True)

model = LGBMRegressor()
model.fit(X_train, y_train)
sample['target'] = model.predict(test.drop('id', axis=1, errors='ignore'))
sample.to_csv('lgbm.csv', index=False)

Great! We got a leaderboard RMSE score of 0.85081.

Since the LGBM model is showing good potential, let's dive deep into tuning its hyperparameters.

LGBM Hyperparameter Tuning with Optuna

## Install optuna library
# !pip install optuna
#Importing optuna library
import optuna
#Function for hyperparameter tuning using optuna

def objective(trial, data=X, target=y):

    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.2, random_state=42)
    param = {
        'metric': 'rmse',
        'random_state': 48,
        'n_estimators': 2000,
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0),
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 10.0),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.006,0.008,0.01,0.014,0.017,0.02]),
        'max_depth': trial.suggest_categorical('max_depth', [10,20,100]),
        'num_leaves': trial.suggest_int('num_leaves', 1, 1000),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 300),
        # Note: the Optuna key 'min_data_per_groups' stores LightGBM's cat_smooth value
        'cat_smooth': trial.suggest_int('min_data_per_groups', 1, 100)
    }
    model = LGBMRegressor(**param)
    model.fit(train_x, train_y, eval_set=[(test_x, test_y)], early_stopping_rounds=100, verbose=False)
    preds = model.predict(test_x)

    rmse = mean_squared_error(test_y, preds, squared=False)

    return rmse
#Hyperparameter tuning to minimize the RMSE for predictions

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

Number of finished trials: 10
Best trial: {'reg_alpha': 2.2944935017828656, 'reg_lambda': 0.019608626617733788, 'colsample_bytree': 0.3, 'subsample': 0.6, 'learning_rate': 0.008, 'max_depth': 10, 'num_leaves': 629, 'min_child_samples': 191, 'min_data_per_groups': 48}
#Checking the best set of hyperparameters

print(f"\tBest value (rmse): {study.best_value:.5f}")
print(f"\tBest params:")

for key, value in study.best_params.items():
    print(f"\t\t{key}: {value}")
Best value (rmse): 0.84490
Best params:
reg_alpha: 2.2944935017828656
reg_lambda: 0.019608626617733788
colsample_bytree: 0.3
subsample: 0.6
learning_rate: 0.008
max_depth: 10
num_leaves: 629
min_child_samples: 191
min_data_per_groups: 48
#Adding some additional parameters

params = study.best_params
# The objective stored cat_smooth under the key 'min_data_per_groups';
# rename it so LightGBM actually receives it when we refit
params['cat_smooth'] = params.pop('min_data_per_groups')
params['random_state'] = 48
params['n_estimators'] = 2000
params['metric'] = 'rmse'
#Training LGBM with best set of hyperparameters

model = LGBMRegressor(**params)
model.fit(X, y)
sample['target'] = model.predict(test.drop('id', axis = 1, errors = 'ignore'))
sample.to_csv('submission.csv', index = False)

Awesome! We got a leaderboard RMSE score of 0.84854 after tuning the LGBM Regressor.

However, the score can likely be improved further by stacking several models together; a sketch follows.
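Here is a minimal sketch of what stacking could look like with scikit-learn's StackingRegressor, combining the two strongest base models under a ridge meta-learner (an assumption about a possible next step, not something tried in this post; the output filename is illustrative):

#Sketch: stack LGBM and XGBoost under a ridge meta-learner
from sklearn.ensemble import StackingRegressor

stack = StackingRegressor(
    estimators=[('lgbm', LGBMRegressor(**params)),
                ('xgb', XGBRegressor())],
    final_estimator=Ridge(),
    cv=5)
stack.fit(X, y)
sample['target'] = stack.predict(test.drop('id', axis=1, errors='ignore'))
sample.to_csv('stacked_submission.csv', index=False)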

The End!

Thank you for reading this post. I learned a lot from this exercise; I hope you did too. Please share feedback if you spot any flaws or have a better approach.

Finally, please clap for this post if you liked it! Thanks in advance.

Links:

Kaggle Kernel link.

Kaggle Profile link.

LinkedIn Profile Link.
