Grant Access using Machine Learning
Amazon Employee Access Challenge

Table of Contents
- Introduction
- Usage of ML for this problem
- Data Inspection
- Scoring Metric
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Building ML Models
- Kaggle Submission
- Deployment
- Further Improvements
- Code Reference
- Contact Links
- References
Introduction
An employee may need to apply for access to different resources during his or her career at a company. For giants like Google and Amazon, whose employee and resource structures are highly complicated, the application review process is generally handled by human administrators. In this project, based on historical access decisions made by human administrators at Amazon Inc. between 2010 and 2011, we aim to build an employee access control system that automatically approves or rejects an employee's resource application.
Usage of ML for this problem
Machine Learning is used when traditional programming methods cannot deal with the problem efficiently. The problem at hand cannot be solved with hand-written rules, since an organization can have any number of roles and departments. So we aim to develop a Machine Learning model that takes an employee's access request as input, containing details about the employee's attributes such as role, department, manager ID, etc., and decides whether or not to grant access.
Data Inspection
Source of Data: https://www.kaggle.com/c/amazon-employee-access-challenge
This data contains real historical employee access records collected between 2010 and 2011, when access decisions were made by human administrators at Amazon Inc. The dataset consists of two files, train.csv and test.csv. train.csv contains 32,769 data points, whereas test.csv contains 58,921 data points.


Features Overview:
- RESOURCE - An ID for each resource
- MGR_ID - The EMPLOYEE ID of the manager of the current EMPLOYEE ID record. An employee may have only one manager at a time.
- ROLE_ROLLUP_1 - Company role grouping category ID 1 (e.g. US Engineering)
- ROLE_ROLLUP_2 - Company role grouping category ID 2 (e.g. US Retail)
- ROLE_DEPTNAME - Company role department description (e.g. Retail)
- ROLE_TITLE - Company role business title description (e.g. Senior Engineering Retail Manager)
- ROLE_FAMILY_DESC - Company role family extended description (e.g. Retail Manager, Software Engineering)
- ROLE_FAMILY - Company role family description (e.g. Retail Manager)
- ROLE_CODE - Company role code; this code is unique to each role (e.g. Manager)
We can observe here that all our features are categorical and nominal in nature, i.e. they have no particular ordering.
Checking for Cardinality of features:

An important observation: all the categorical features have high cardinality. This will shape our feature engineering.
Checking for Missing Data:

We do not have any missing data here.
Checking for Duplicate Rows:

We do not have any duplicate data here.
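For reference, these three checks can be reproduced with a few lines of pandas; the file name below follows the Kaggle dataset.

```python
# Sketch: cardinality, missing-value, and duplicate checks with pandas.
import pandas as pd

train = pd.read_csv("train.csv")          # Kaggle training file
print(train.nunique())                    # cardinality of each feature
print(train.isnull().sum())               # missing values per column (all zeros here)
print(train.duplicated().sum())           # number of duplicate rows (zero here)
```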
Scoring Metric
The scoring metric used in this Kaggle competition is AUC Score.
What is the AUC Score?

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and those without it.
The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and FPR is on the x-axis.
Defining the terms used in the AUC-ROC curve:
- True Positive Rate (TPR / Recall / Sensitivity) = TP / (TP + FN)
- False Positive Rate (FPR) = FP / (FP + TN), i.e. 1 - Specificity
An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability, while an AUC of 0.5 means the model has no ability to separate the classes at all.
But why are we using AUC Score here?
Unlike accuracy, the AUC score is not distorted by class imbalance. Since the dataset given in this challenge is highly imbalanced, the AUC score is a better metric than accuracy here.
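As a quick illustration (not the case-study code), the AUC score can be computed with scikit-learn; the data below is synthetic, with a class imbalance similar to this problem.

```python
# Sketch: computing the AUC score with scikit-learn on synthetic, imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded (X, y) data used in the case study.
X, y = make_classification(n_samples=5000, weights=[0.94], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_valid)[:, 1]   # probability of the positive class
print("Validation AUC:", roc_auc_score(y_valid, probs))
```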
Exploratory Data Analysis (EDA)
Exploratory Data Analysis is the process of exploring data, generating insights, testing hypotheses, checking assumptions, and revealing underlying hidden patterns in the data.
Univariate Analysis
Target Feature:

We can observe that our dataset is highly imbalanced, with a class ratio of roughly 94:6, and that this is a binary classification problem.
'ACTION' is our target variable. ACTION = 1: Access Request Accepted.
ACTION = 0: Access Request Rejected.
Other Features:

1. RESOURCE
Here, we can notice that when the RESOURCE value is between 60000 and 90000, the probability of the application being rejected is higher.
2. ROLE_FAMILY

We can notice two peaks here, and the majority of the ROLE_FAMILY values lie within them. Within these two peaks, the probability of the application being rejected is also much higher.
3. MGR_ID

Here too we can notice one big peak, and we can safely say that when the MGR_ID value is less than 25000, the probability of the access application being rejected is much higher.
4. ROLE_ROLLUP


The values in these two ROLE_ROLLUP features are very concentrated. But we can observe a big difference between these two features.
In ROLE_ROLLUP_1 the peak is touched by ACTION = 1 whereas in ROLE_ROLLUP_2 it is touched by ACTION = 0.
Values in the ROLE_ROLLUP_1 feature correspond mostly to access applications being accepted, whereas values in ROLE_ROLLUP_2 correspond mostly to applications being rejected.
We plotted the remaining features in a similar way but could not find anything worth noting.
A pairplot is also a very useful way to perform univariate and bivariate analyses all at once, but it gets cumbersome as the number of features increases.
Bivariate Analysis

This is one key finding we observed by plotting a pairplot. The ROLE_CODE and ROLE_TITLE features behave very similarly and are highly dependent on each other.
Hence, we can conclude that we can safely remove one of these features from our further analysis.
Multivariate Analysis

Through this correlation heatmap, we can observe that the features are not highly correlated and each of the features is important in its own place.
Feature Engineering
“More data beats clever algorithms, but better data beats more data.”
Feature engineering is a process of transforming the given data into a form that is easier to interpret. Here, we are interested in making our data more transparent for a machine learning model.
Feature Engineering is a work of art in data science and machine learning.
Some of the feature engineering techniques are discussed below.
Encoding Techniques
1. Label Encoding
Label Encoding refers to converting the labels into a numeric form so as to make them machine-readable. Machine learning algorithms can then decide in a better way how those labels should be handled. It is an important pre-processing step for structured datasets in supervised learning.
Limitation of Label Encoding
Label encoding converts the data into a machine-readable form, but it assigns a unique number (starting from 0) to each class of data. This may introduce an artificial ordering during training: a label with a higher value may be treated as having higher priority than a label with a lower value.
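A minimal sketch of label encoding with scikit-learn (the column name is only illustrative):

```python
# Sketch: label encoding one categorical column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"ROLE_DEPTNAME": ["Retail", "Engineering", "Retail", "Finance"]})
df["ROLE_DEPTNAME_LE"] = LabelEncoder().fit_transform(df["ROLE_DEPTNAME"])
print(df)
```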

2. Binary Encoding
Binary Encoding creates fewer features than one-hot encoding while preserving some uniqueness of the values in the column. It can work well with higher-dimensionality ordinal data. For nominal data, a hashing algorithm with more fine-grained control usually makes more sense.
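A small sketch using the category_encoders library (pip install category_encoders); the column name is illustrative.

```python
# Sketch: binary encoding one categorical column.
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"ROLE_FAMILY": ["A", "B", "C", "A", "D"]})
encoder = ce.BinaryEncoder(cols=["ROLE_FAMILY"])
print(encoder.fit_transform(df))
```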

3. One-Hot Encoding
One-hot Encoding creates one column for each value to compare against all other values. For each new column, a row gets a 1 if the row contained that column’s value and a 0 if it did not. Here’s how it looks:

Limitation of One-hot Encoding
One-hot encoding can create very high dimensionality depending on the number of categorical features you have and the number of categories per feature.
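A sketch of one-hot encoding high-cardinality columns as a sparse matrix with scikit-learn (the tiny DataFrame is illustrative):

```python
# Sketch: one-hot encoding with sparse output to keep memory manageable.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"RESOURCE": [101, 102, 101, 205],
                   "MGR_ID": [1, 2, 2, 3]})
ohe = OneHotEncoder(handle_unknown="ignore")   # returns a sparse matrix by default
X_ohe = ohe.fit_transform(df)
print(X_ohe.shape)
```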
4. Frequency Encoding
Frequency encoding converts categories to the frequencies with which they appear in the data. This way it converts categorical features to numerical data while giving more weight to the categories with higher frequency.
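A small sketch of frequency encoding with pandas (the column name is illustrative):

```python
# Sketch: map each category to its relative frequency in the column.
import pandas as pd

df = pd.DataFrame({"ROLE_TITLE": ["Mgr", "Eng", "Eng", "Analyst", "Eng"]})
freq = df["ROLE_TITLE"].value_counts(normalize=True)
df["ROLE_TITLE_FREQ"] = df["ROLE_TITLE"].map(freq)
print(df)
```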

5. Hashing Encoding
This encoding technique is mainly used when we have nominal categorical variables with high cardinality. Nominal Categorical Variables are unordered and they have no numerical importance. The vector obtained from this technique is very similar to a one-hot encoded vector but with much lower dimensions.
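A sketch of feature hashing with scikit-learn's FeatureHasher; the number of output dimensions is an assumption.

```python
# Sketch: hash one categorical column into a fixed-size sparse vector space.
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({"ROLE_DEPTNAME": ["Retail", "Engineering", "Finance", "Retail"]})
hasher = FeatureHasher(n_features=16, input_type="string")
# Each sample is a list containing that row's category string.
X_hashed = hasher.transform([[value] for value in df["ROLE_DEPTNAME"]])
print(X_hashed.shape)   # (4, 16)
```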



Hybrid Features
Since these features are already encoded as arbitrary IDs and are nominal, adding or subtracting feature values won't help much. However, we can try concatenating them with each other. We will also add count frequencies for each feature as numeric hybrid features. A sketch of these ideas follows the list below.
1. Creating Duplets of Features
2. Creating Triplets of Features
3. Using Count Frequencies as Standardized Numerical Features
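Here is the sketch referenced above; the add_hybrid_features helper and the demo columns are hypothetical, and the exact feature combinations used in the case study may differ.

```python
# Sketch: duplets, triplets, and standardized count-frequency features.
from itertools import combinations
import pandas as pd
from sklearn.preprocessing import StandardScaler

def add_hybrid_features(df, cols):
    df = df.copy()
    # Duplets and triplets: concatenate the string values of 2 or 3 columns.
    for r in (2, 3):
        for combo in combinations(cols, r):
            name = "_".join(combo)
            df[name] = df[list(combo)].astype(str).apply("_".join, axis=1)
    # Count frequency of each original feature, standardized as a numeric feature.
    for c in cols:
        counts = df[c].map(df[c].value_counts())
        df[c + "_count"] = StandardScaler().fit_transform(counts.to_frame()).ravel()
    return df

demo = pd.DataFrame({"RESOURCE": [1, 2, 1], "MGR_ID": [10, 10, 20], "ROLE_FAMILY": [7, 7, 8]})
print(add_hybrid_features(demo, ["RESOURCE", "MGR_ID", "ROLE_FAMILY"]).columns.tolist())
```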
Building ML Models
We have removed the ROLE_CODE feature from both our train and test data due to the outcome we got from EDA.
Let’s have an overview of all the ML models used in this case study.
1. Logistic Regression

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. The logistic regression model predicts P(Y=1) as a function of X. It uses the sigmoid function to predict probability values.
2. Random Forest (Ensemble: Bagging)

The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.
3. XGBoost (Ensemble: Boosting)

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision-tree-based algorithms are considered best-in-class. XGBoost has been a proven model in data science competitions and hackathons for its accuracy, speed, and scale.
4. K-Nearest Neighbour
The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification and regression predictive problems. The following two properties define KNN well:

- Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase; instead, it uses all of the data during classification.
- Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn't assume anything about the underlying data.
5. CatBoost Classifier
CatBoost is a high-performance open-source library for gradient boosting on decision trees.
It provides a ready-made classifier that follows scikit-learn conventions and deals with categorical features automatically. It can easily integrate with deep learning frameworks like Google's TensorFlow and Apple's Core ML. It was developed by Yandex researchers and engineers and is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks.
It is especially powerful in two ways:

- It yields state-of-the-art results without extensive data training typically required by other machine learning methods, and
- Provides powerful out-of-the-box support for the more descriptive data formats that accompany many business problems.
CatBoost uses oblivious decision trees, where the same splitting criterion is used across an entire level of the tree. Such trees are balanced, less prone to overfitting, and allow speeding up prediction significantly at testing time.
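A minimal sketch of a CatBoostClassifier handling categorical columns natively (pip install catboost); the data and hyperparameters here are purely illustrative.

```python
# Sketch: CatBoost consuming raw categorical columns without manual encoding.
import pandas as pd
from catboost import CatBoostClassifier

X = pd.DataFrame({"RESOURCE": ["101", "102", "101", "205", "102", "101"],
                  "ROLE_DEPTNAME": ["Retail", "Eng", "Retail", "Finance", "Eng", "Retail"]})
y = [1, 1, 0, 1, 0, 1]

model = CatBoostClassifier(iterations=100, eval_metric="AUC", verbose=0)
model.fit(X, y, cat_features=list(X.columns))   # both columns treated as categorical
print(model.predict_proba(X)[:, 1])
```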
6. Extra Trees Classifier
The Extremely Randomized Trees Classifier (Extra Trees Classifier) is a type of ensemble learning technique that aggregates the results of multiple de-correlated decision trees collected in a "forest" to output its classification result. In concept, it is very similar to a Random Forest Classifier and only differs from it in the manner of construction of the decision trees in the forest.
7. Ensemble: Stacking

Stacking or Stacked Generalization is an ensemble machine learning algorithm. It uses a meta-learning algorithm to learn how to best combine the predictions from two or more base machine learning algorithms.
The benefit of stacking is that it can harness the capabilities of a range of well-performing models on a classification or regression task and make predictions that have better performance than any single model in the ensemble.
In this case study, we have used Logistic Regression as the meta-learning model, and Logistic Regression, CatBoost, XGBoost, Random Forest, KNN, and ExtraTrees as the base learning models.
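A sketch of how such a stack could be wired up with scikit-learn's StackingClassifier (xgboost and catboost must be installed); the hyperparameters are illustrative and the case study's exact setup may differ.

```python
# Sketch: stacking the base learners under a Logistic Regression meta-learner.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

base_learners = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=300)),
    ("et", ExtraTreesClassifier(n_estimators=300)),
    ("knn", KNeighborsClassifier(n_neighbors=15)),
    ("xgb", XGBClassifier(n_estimators=300, eval_metric="logloss")),
    ("cat", CatBoostClassifier(iterations=300, verbose=0)),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    stack_method="predict_proba",
    cv=10,
)
# Usage: stack.fit(X_train, y_train); stack.predict_proba(X_valid)[:, 1]
```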
Base Model
The base model is used to set a benchmark on the performance metric against which more advanced models are compared. For the base model, we use only the original features and not the hybrid features.
Logistic Regression worked best among the models tried, such as SVM, Decision Trees, and Naive Bayes. Here is a summary of the results using various encoding techniques with the Logistic Regression model.

One-Hot Encoding gave best results with the Logistic Regression model.
Base Model - Best AUC Score: 0.8497
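A sketch of this base model setup (one-hot encoding plus Logistic Regression, scored with cross-validated AUC); the hyperparameters are illustrative, not the tuned values from the case study.

```python
# Sketch: base model pipeline on the original Kaggle training file.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["ACTION"]), train["ACTION"]

base_model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(C=1.0, max_iter=1000),
)
scores = cross_val_score(base_model, X, y, cv=10, scoring="roc_auc")
print("Mean CV AUC:", scores.mean())
```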
Advanced Models
Let's discuss a few of the key concepts used in the advanced models.
1. 10-fold Cross-Validation: We used 10-fold CV to minimize the variance in the results and to detect even minor changes in the model's performance after any modification. This is a really useful technique to rely on, especially when we need to compare the performance of models in a Kaggle competition.
2. Forward Feature Selection: This is another very useful technique to squeeze out even a small rise in model performance. It also helps in reducing the training time by selecting the most important features out of many. A small sketch of greedy forward feature selection, scored with 10-fold CV AUC, is shown below.
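This is a sketch under the assumption that candidate_blocks is a dict mapping a feature name to its encoded column block (a 2-D array or sparse matrix); it is not the exact code from the case study.

```python
# Sketch: greedy forward feature selection using 10-fold CV AUC as the criterion.
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(candidate_blocks, y, model=None, cv=10):
    model = model or LogisticRegression(max_iter=1000)
    selected, best_score = [], 0.0
    improved = True
    while improved:
        improved = False
        for name, block in candidate_blocks.items():
            if name in selected:
                continue
            # Try the current selection plus one candidate block.
            X_try = hstack([candidate_blocks[s] for s in selected] + [block]).tocsr()
            score = cross_val_score(model, X_try, y, cv=cv, scoring="roc_auc").mean()
            if score > best_score:
                best_score, best_name, improved = score, name, True
        if improved:
            selected.append(best_name)
            print(f"added {best_name}, CV AUC = {best_score:.4f}")
    return selected, best_score
```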
Let’s check how all the ML models performed in the following pretty table.

We got the best results from the Logistic Regression model using the more advanced (hybrid) features together with selection of the most important features.
However, our stacked model did not perform well, which can be taken as an area for further improvement in this case study.
Best Performing Model: Logistic Regression with Greedy Forward Feature Selection
Best Cross-Validation AUC Score: 0.8943
Best Kaggle Submission AUC Score: 0.9052
Kaggle Submission
Following is the image of the Kaggle submission for our best performing model.

This AUC score would place us within the top 15% of the Kaggle competition leaderboard.
Deployment
The final Pipeline of this solution using the best model weights can be found in the GitHub code reference.
The following are the snapshots of the deployed model on localhost using Flask API.
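For illustration, a Flask prediction endpoint for this model could look roughly like the sketch below; the model file name and form fields are assumptions, not the exact deployed code.

```python
# Sketch: minimal Flask API serving access-request predictions.
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("final_model.pkl")   # hypothetical saved pipeline

# ROLE_CODE is excluded, as in the case study.
FEATURES = ["RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2",
            "ROLE_DEPTNAME", "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY"]

@app.route("/predict", methods=["POST"])
def predict():
    # Expects one access request with the eight categorical fields as form data.
    row = [[request.form[f] for f in FEATURES]]
    prob = model.predict_proba(row)[0, 1]
    return jsonify({"action": int(prob >= 0.5), "probability": float(prob)})

if __name__ == "__main__":
    app.run(debug=True)
```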



Further Improvements
Here are some directions which could improve the results of the solution discussed above:
- More rigorous hyperparameter tuning of the XGBoost, Random Forest, and ExtraTrees models.
- Investigating why the stacked model underperforms; overfitting, possibly caused by data leakage, might be the reason.
- Using more advanced encoding techniques like SVD encoding, target encoding, etc., could further improve the model results.
Code Reference
Complete code of the case study can be found here.
https://github.com/Shagun-25/Amazon-Employee-Access-Challenge
Contact Links
Email Id: kala.shagun@gmail.com
Linkedin: https://www.linkedin.com/in/shagun-kala-061a3b127/
References
- https://www.kaggle.com/prashant111/catboost-classifier-tutorial
- https://towardsdatascience.com/understanding-random-forest-58381e0602d2
- https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d
- https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
- https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
- https://www.kaggle.com/dimitreoliveira/model-stacking-feature-engineering-and-eda
- https://www.kaggle.com/dmitrylarko/notebooks
- https://www.kaggle.com/c/amazon-employee-access-challenge/discussion/4838
- https://www.appliedaicourse.com/