My Approach to Handling Multimodally Distributed Features

Learning practical data science with Kaggle Competitions

Shagun Kala
10 min read · Apr 26, 2022

In this blog, I am sharing the approach I took to crack Kaggle's February 2021 Tabular Playground competition.

The Kaggle Competition link can be found here.

Evaluation Metric used: RMSE
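For reference, RMSE is the square root of the mean squared difference between predictions and true values; lower is better. Here is a minimal sketch of how it is computed with scikit-learn, using toy values purely for illustration:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([7.2, 7.5, 7.9])  # hypothetical target values
y_pred = np.array([7.0, 7.6, 8.1])  # hypothetical predictions
rmse = mean_squared_error(y_true, y_pred, squared=False)  # squared=False returns RMSE
print(f'RMSE: {rmse:.4f}')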


Table of Contents

  • Importing Libraries
  • Reading the data files
  • Exploring the data
  • Exploratory Data Analysis (EDA)
  • Feature Engineering
  • Modeling
  • LGBM Hyperparameter Tuning with Optuna

Importing Libraries

#Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.svm import SVR
from lightgbm import LGBMRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

from sklearn.mixture import GaussianMixture

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)

sns.set_palette("muted")

Reading the data files

#Reading the data files (Change the paths if running on google colab)

train = pd.read_csv('../input/tabular-playground-series-feb-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-feb-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-feb-2021/sample_submission.csv')

Exploring the data

print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

train.head()
Shape of train data: (300000, 26)
Missing values count: 0
train.info()
print('\n')
train.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 300000 non-null int64
1 cat0 300000 non-null object
2 cat1 300000 non-null object
3 cat2 300000 non-null object
4 cat3 300000 non-null object
5 cat4 300000 non-null object
6 cat5 300000 non-null object
7 cat6 300000 non-null object
8 cat7 300000 non-null object
9 cat8 300000 non-null object
10 cat9 300000 non-null object
11 cont0 300000 non-null float64
12 cont1 300000 non-null float64
13 cont2 300000 non-null float64
14 cont3 300000 non-null float64
15 cont4 300000 non-null float64
16 cont5 300000 non-null float64
17 cont6 300000 non-null float64
18 cont7 300000 non-null float64
19 cont8 300000 non-null float64
20 cont9 300000 non-null float64
21 cont10 300000 non-null float64
22 cont11 300000 non-null float64
23 cont12 300000 non-null float64
24 cont13 300000 non-null float64
25 target 300000 non-null float64
dtypes: float64(15), int64(1), object(10)
memory usage: 59.5+ MB

id 300000
cat0 2
cat1 2
cat2 2
cat3 4
cat4 4
cat5 4
cat6 8
cat7 8
cat8 7
cat9 15
cont0 299830
cont1 299642
cont2 299707
cont3 299796
cont4 299736
cont5 299857
cont6 299875
cont7 299832
cont8 299765
cont9 299863
cont10 299894
cont11 299877
cont12 299824
cont13 299866
target 299648
dtype: int64
  • Training data has 300000 records and 26 columns.
  • Column ‘id’ is the primary key.
  • It’s a regression problem, since the ‘target’ feature we need to predict is continuous.
  • There are 14 numerical features, already scaled, and 10 categorical features in the data.
  • There are no missing values in the data.
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')

test.head()
Shape of test data: (200000, 25)
Missing values count: 0
test.info()
print('\n')
test.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 200000 non-null int64
1 cat0 200000 non-null object
2 cat1 200000 non-null object
3 cat2 200000 non-null object
4 cat3 200000 non-null object
5 cat4 200000 non-null object
6 cat5 200000 non-null object
7 cat6 200000 non-null object
8 cat7 200000 non-null object
9 cat8 200000 non-null object
10 cat9 200000 non-null object
11 cont0 200000 non-null float64
12 cont1 200000 non-null float64
13 cont2 200000 non-null float64
14 cont3 200000 non-null float64
15 cont4 200000 non-null float64
16 cont5 200000 non-null float64
17 cont6 200000 non-null float64
18 cont7 200000 non-null float64
19 cont8 200000 non-null float64
20 cont9 200000 non-null float64
21 cont10 200000 non-null float64
22 cont11 200000 non-null float64
23 cont12 200000 non-null float64
24 cont13 200000 non-null float64
dtypes: float64(14), int64(1), object(10)
memory usage: 38.1+ MB

id 200000
cat0 2
cat1 2
cat2 2
cat3 4
cat4 4
cat5 4
cat6 7
cat7 8
cat8 7
cat9 15
cont0 199937
cont1 199835
cont2 199875
cont3 199902
cont4 199903
cont5 199929
cont6 199927
cont7 199926
cont8 199915
cont9 199944
cont10 199948
cont11 199946
cont12 199916
cont13 199949
dtype: int64
  • Test data has 200000 records and 25 columns. The ‘target’ feature is absent, as expected.
  • Column ‘id’ is the primary key.
  • There are 14 numerical features, already scaled, and 10 categorical features in the data. Note that cat6 has only 7 unique values in test versus 8 in train.
  • There are no missing values in the data.
sample.head()
  • We need to submit the predicted target value for each id in the test data.

Exploratory Data Analysis (EDA)

train = train.set_index('id')
test = test.set_index('id')
#Checking if there is any difference between the behaviour of train and test data
train.describe() - test.describe()

There is no major difference between the feature distributions of the train and test sets. This is a good sign: our local validation scores should track the leaderboard well.
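Beyond eyeballing the describe() difference, a two-sample Kolmogorov–Smirnov test per continuous feature can quantify how similar the two distributions are. This is a small extra check I'm sketching here, not part of the original notebook (scipy is assumed available):

#Two-sample KS test per continuous feature; a small KS statistic means
#the train and test distributions are hard to tell apart
from scipy.stats import ks_2samp

for col in [c for c in train.columns if c.startswith('cont')]:
    stat, p = ks_2samp(train[col], test[col])
    print(f'{col}: KS statistic = {stat:.4f}, p-value = {p:.3f}')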

num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns
#Let's check the distribution of target variable

sns.distplot(train['target'], kde=True, bins=120, label="Skew: %.2f"%(train['target'].skew()))
plt.xlabel('Target', fontsize=12); plt.legend()

The distribution of the target variable is bimodal.
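Since a Gaussian Mixture Model is used later for the features, the same idea can quantify the target's two modes. A quick illustrative sketch (not used anywhere downstream):

#Fit a 2-component GMM to the target to locate the two modes
gm = GaussianMixture(n_components=2, random_state=0).fit(train[['target']])
print('Mode means:', gm.means_.ravel())
print('Mixing weights:', gm.weights_)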

Continuous Features

# Checking the distribution of continuous features

i = 1
fig, ax = plt.subplots(4, 4, figsize=(14, 14))

for feature in num_columns:
    plt.subplot(4, 4, i)
    sns.distplot(train[feature], kde=True, bins=120, label="Skew: %.2f"%(train[feature].skew()))
    plt.xlabel(feature, fontsize=9); plt.legend(loc="best")
    i += 1

fig.tight_layout()

# Only 14 features: remove the two unused axes in the 4x4 grid
fig.delaxes(ax[3,2])
fig.delaxes(ax[3,3])

plt.show()
  • No feature is highly skewed.
  • All continuous features are multimodal in nature.
#Scatterplot of each continuous feature against the target
fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(num_columns):
    plt.subplot(5, 3, i+1)
    sns.scatterplot(x=feature, y="target", data=train, s=1)
    plt.xlabel(feature, fontsize=12)

fig.delaxes(ax[4,2])
plt.show()
  • We can observe some clusters in these scatter plots.
  • The cont1 feature has some clearly defined clusters.
  • We should try a clustering approach in the feature engineering section.

Categorical Features

train.head()
# Checking the distribution of categorical features

i = 1
fig, ax = plt.subplots(3, 4, figsize=(15,12))

for feature in cat_columns:
    plt.subplot(3, 4, i)
    sns.histplot(x=feature, data=train)
    plt.xlabel(feature, fontsize=9)
    i += 1

fig.suptitle('Distribution of Categorical Features')
plt.tight_layout()

# Only 10 features: remove the two unused axes in the 3x4 grid
fig.delaxes(ax[2,2])
fig.delaxes(ax[2,3])

plt.show()
  • We can observe that some categories dominate their features heavily. Such near-constant features carry little signal for the models; the sketch below quantifies this.
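One way to quantify that dominance is to print the share of the most frequent level per categorical feature; anything close to 100% carries almost no signal. A small sketch (my addition, not from the original notebook):

#Share of the most frequent category in each categorical feature
for col in cat_columns:
    top_share = train[col].value_counts(normalize=True).iloc[0]
    print(f'{col}: top category covers {top_share:.1%} of rows')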

Scaling

All continuous features are already scaled in the dataset; the quick check below confirms it.
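The min and max of every continuous feature should lie in a narrow, comparable range. A minimal sketch:

#Min/max of each continuous feature; a narrow common range means
#no additional scaling is needed
train[num_columns].describe().loc[['min', 'max']]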

Correlation Check

#Let's check how the features are inter-related to each other and with the target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 10))
ax.set_title("Correlation Matrix", fontsize=16)

corr = train[num_columns + ['target']].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool is deprecated; use the builtin bool

sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
            cbar_kws={"shrink": .8}, vmin=0, vmax=1)

ax.tick_params(axis='x', labelsize=12, labelrotation=90)
ax.tick_params(axis='y', labelsize=12, labelrotation=0)

plt.show()
  • None of the features are highly correlated with each other.
  • None of the features show a strong linear correlation with the target; a nonlinear check is sketched below.
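Pearson correlation only captures linear relationships, so near-zero values do not rule out the nonlinear structure we saw in the scatter plots. A sketch of a nonlinear check using mutual information (my addition, not part of the original notebook):

#Mutual information picks up nonlinear dependence that Pearson misses
from sklearn.feature_selection import mutual_info_regression

mi = mutual_info_regression(train[num_columns], train['target'], random_state=0)
for col, score in sorted(zip(num_columns, mi), key=lambda t: -t[1]):
    print(f'{col}: MI = {score:.4f}')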

Outlier Treatment

#Checking for mild outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train

((train < Q1_train - 1.5*IQR_train) | (train > Q3_train + 1.5*IQR_train)).agg([sum, 'mean', 'count'])
#Checking for extreme outliers
Q1_train = train.quantile(0.25)
Q3_train = train.quantile(0.75)
IQR_train = Q3_train - Q1_train

((train < Q1_train - 3*IQR_train) | (train > Q3_train + 3*IQR_train)).agg([sum, 'mean', 'count'])

The target feature has some extreme outliers, but there are no significant outliers in the other features.

Let's remove the records with extreme outliers in the target feature and replace the mild outliers in the other features with their median values.

# Removing records with extreme outliers in the target variable
# (note: the upper bound uses Q3 + 3*IQR)
train = train.drop(train[(train['target'] < (Q1_train - 3*IQR_train)['target']) | (train['target'] > (Q3_train + 3*IQR_train)['target'])].index)

Removed 3 records.

train_num = train.select_dtypes(exclude=['object'])

#Replacing mild outliers with the median value
def replace_outliers(data):
    for col in data.columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        median_ = data[col].median()
        # Any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] becomes the median
        data.loc[((data[col] < Q1 - 1.5*IQR) | (data[col] > Q3 + 1.5*IQR)), col] = median_
    return data

train[train_num.drop('target', axis=1).columns] = replace_outliers(train_num.drop('target', axis=1))
#Checking the distribution of target variable again
sns.distplot(train['target'], kde=True, bins=120, label='train')
plt.xlabel('Target', fontsize=9); plt.legend()

Feature Engineering

Continuous Features

#Defining the number of clusters per feature from the scatter plots above,
#then using a Gaussian Mixture Model to assign each point to a cluster

inits = [4,11,8,6,6,6,4,8,8,9,8,5,8,9]
gmms = []
for feature, init in zip(num_columns, inits):
    # Fit on the train values, then predict cluster labels for both train and test
    gmm_ = GaussianMixture(n_components=init, random_state=0).fit(train[[feature]])
    gmms.append(gmm_)
    train[f'{feature}_gmm'] = gmm_.predict(train[[feature]])
    test[f'{feature}_gmm'] = gmm_.predict(test[[feature]])
#Plotting the scatter plots again, coloured by cluster

fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(num_columns):
    plt.subplot(5, 3, i+1)
    sns.scatterplot(x=feature, y="target", data=train,
                    hue=f'{feature}_gmm', s=1, palette='muted')
    plt.xlabel(feature, fontsize=12)

fig.delaxes(ax[4,2])
plt.show()
#Let's plot the histograms as well, coloured by cluster
fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(num_columns):
    plt.subplot(5, 3, i+1)
    sns.histplot(x=feature, data=train[::100], hue=f'{feature}_gmm',
                 kde=True, bins=100, palette='muted')
    plt.xlabel(feature, fontsize=9)

fig.delaxes(ax[4,2])
plt.show()
  • We can see how well the Gaussian Mixture Model has identified these clusters. The new cluster-label features should help our models score well on this data. (An automatic way to pick the number of components is sketched below.)
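The component counts in inits were read off the scatter plots by eye. As an alternative, the number of components per feature could be chosen automatically by minimizing BIC; here is a sketch of that idea (a possible refinement, not what this notebook actually does):

#Pick n_components for one feature by minimizing BIC over a candidate range
def pick_n_components(values, max_components=12):
    X_ = values.reshape(-1, 1)
    bics = []
    for k in range(2, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X_)
        bics.append(gmm.bic(X_))
    return 2 + int(np.argmin(bics))

#Example usage: pick_n_components(train['cont1'].values)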

Categorical Features

#Applying label encoding on the categorical features
#(fitting on train and test values combined, so unseen test categories cannot break transform)

for feature in cat_columns:
    le = LabelEncoder()
    le.fit(pd.concat([train[feature], test[feature]]))
    train[feature] = le.transform(train[feature])
    test[feature] = le.transform(test[feature])

Modeling

Let’s try different ML models and see which performs best.

train = train.reset_index(drop=True)

#Separating the target variable from the features
y = train['target']
X = train.drop(['target'], axis=1)

# Splitting the train data in an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
model_names = ["Linear", "Lasso", "Ridge", "Decision Tree", "LGBM", "Random Forest", "XGBoost"]

models = [
    LinearRegression(fit_intercept=True),
    Lasso(fit_intercept=True),
    Ridge(fit_intercept=True),
    DecisionTreeRegressor(),
    LGBMRegressor(),
    RandomForestRegressor(n_estimators=10, max_depth=50),
    XGBRegressor()]

for name, model in zip(model_names, models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = mean_squared_error(y_test, y_pred, squared=False)  # squared=False gives RMSE
    print(f'{name}: RMSE: {score}')
Linear: RMSE: 0.8679301934565669
Lasso: RMSE: 0.8889082968790681
Ridge: RMSE: 0.8679302181948934
Decision Tree: RMSE: 1.2303531438161233
LGBM: RMSE: 0.8484144148561924
Random Forest: RMSE: 0.9005961602464834
XGBoost: RMSE: 0.8507598896392843

Best performing model: LightGBM. It fits this data noticeably better than the other models, so let's submit its predictions on the test data.

#Sanity check: train and test have identical feature columns
X_train.columns.symmetric_difference(test.columns)
Index([], dtype='object')

train.shape, test.shape
((299997, 39), (200000, 38))

test = test.reset_index(drop=True)

model = LGBMRegressor()
model.fit(X_train, y_train)
sample['target'] = model.predict(test.drop('id', axis=1, errors='ignore'))
sample.to_csv('lgbm.csv', index=False)

Great! We got a leaderboard RMSE score of 0.85081.

Since the LGBM model is showing good potential, let's dive deep into tuning its hyperparameters.

LGBM Hyperparameter Tuning with Optuna

## Install optuna library
# !pip install optuna
#Importing optuna library
import optuna
#Function for hyperparameter tuning using optuna

def objective(trial, data=X, target=y):

    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.2, random_state=42)
    param = {
        'metric': 'rmse',
        'random_state': 48,
        'n_estimators': 2000,
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0),
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 10.0),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.006,0.008,0.01,0.014,0.017,0.02]),
        'max_depth': trial.suggest_categorical('max_depth', [10,20,100]),
        'num_leaves': trial.suggest_int('num_leaves', 1, 1000),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 300),
        # Note: the Optuna key 'min_data_per_groups' stores LightGBM's cat_smooth value
        'cat_smooth': trial.suggest_int('min_data_per_groups', 1, 100)
    }
    model = LGBMRegressor(**param)
    model.fit(train_x, train_y, eval_set=[(test_x, test_y)], early_stopping_rounds=100, verbose=False)
    preds = model.predict(test_x)

    rmse = mean_squared_error(test_y, preds, squared=False)

    return rmse
#Hyperparameter tuning to minimize the RMSE for predictions

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

Number of finished trials: 10
Best trial: {'reg_alpha': 2.2944935017828656, 'reg_lambda': 0.019608626617733788, 'colsample_bytree': 0.3, 'subsample': 0.6, 'learning_rate': 0.008, 'max_depth': 10, 'num_leaves': 629, 'min_child_samples': 191, 'min_data_per_groups': 48}
#Checking the best set of hyperparameters

print(f"\tBest value (rmse): {study.best_value:.5f}")
print(f"\tBest params:")

for key, value in study.best_params.items():
    print(f"\t\t{key}: {value}")
Best value (rmse): 0.84490
Best params:
reg_alpha: 2.2944935017828656
reg_lambda: 0.019608626617733788
colsample_bytree: 0.3
subsample: 0.6
learning_rate: 0.008
max_depth: 10
num_leaves: 629
min_child_samples: 191
min_data_per_groups: 48
#Adding some additional parameters

params = study.best_params
# The objective stored cat_smooth under the key 'min_data_per_groups';
# rename it so LightGBM actually receives it when we refit
params['cat_smooth'] = params.pop('min_data_per_groups')
params['random_state'] = 48
params['n_estimators'] = 2000
params['metric'] = 'rmse'
#Training LGBM with best set of hyperparameters

model = LGBMRegressor(**params)
model.fit(X, y)
sample['target'] = model.predict(test.drop('id', axis = 1, errors = 'ignore'))
sample.to_csv('submission.csv', index = False)

Awesome! We got a leaderboard RMSE score of 0.84854 after tuning the LGBM Regressor.

However, the score can likely be improved further by stacking several models together; a sketch follows.
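Here is a minimal sketch of what stacking could look like with scikit-learn's StackingRegressor, combining the two strongest base models under a ridge meta-learner (an assumption about a possible next step, not something tried in this post; the output filename is illustrative):

#Sketch: stack LGBM and XGBoost under a ridge meta-learner
from sklearn.ensemble import StackingRegressor

stack = StackingRegressor(
    estimators=[('lgbm', LGBMRegressor(**params)),
                ('xgb', XGBRegressor())],
    final_estimator=Ridge(),
    cv=5)
stack.fit(X, y)
sample['target'] = stack.predict(test.drop('id', axis=1, errors='ignore'))
sample.to_csv('stacked_submission.csv', index=False)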

The End!

Thank you for reading this post. I learned a lot from this exercise; I hope you did too. Please share feedback if you spot any flaws or have a better approach.

Finally, please clap for this post if you liked it! Thanks in advance.

Links:

Kaggle Kernel link.

Kaggle Profile link.

LinkedIn Profile Link.
