End-to-End Document Classification using BERT and CI/CD Pipelines
GitHub Actions + DistilBERT + Flask + Docker + AWS + Runners
This article is a comprehensive guide to developing an end-to-end, production-grade machine learning project using a BERT classifier, GitHub Actions, the Flask framework, Docker, and AWS deployment. To demonstrate the workflow, I have used the 20 Newsgroup dataset from Kaggle, which is a multi-class classification problem, and fine-tuned a DistilBERT model to classify the text files into the 20 different newsgroups.
Dataset link: https://www.kaggle.com/datasets/au1206/20-newsgroup-original
Following are the steps we will be walking through in this project.
Broad Steps:
- Setup of environment and project structure with GitHub
- Building Logger and Exception functionality
- Data Ingestion, Data Transformation, and Model Trainer
- Prediction Pipeline file
- Local Model Deployment using Flask
- Containerization using Docker
- Application deployment in AWS EC2 Instance using CI/CD Pipelines and GitHub Actions
1. Setup of environment and project structure with GitHub
The first step is to create a new environment in VS Code and a new repository on GitHub. Use the following commands in the VS Code terminal to create a new environment:
conda create -p venv python==3.9 -y
conda activate venv/
Use the following commands to sync the new GitHub repository with our project in VS Code.
git init
git add README.md
git commit -m "first commit"
git branch -M main
git remote add origin <GITHUB_REPOSITORY_LINK>
git config --global user.name <GITHUB_USERNAME>
git config --global user.email <GITHUB_EMAIL>
git push -u origin main
Overview of the project structure:
Let’s start by creating the following setup files:
a. .gitignore file
A .gitignore file specifies intentionally untracked files that Git should ignore. A good practice is to add the environment folder (‘venv’ in our case) to this file so it is not committed to GitHub.
b. setup.py file
The setup script is the center of all activity in building, distributing, and installing the project as a Python package. Its main purpose is to describe the project to setuptools so that the various packaging commands that operate on your modules do the right thing.
c. requirements.txt file
A requirements.txt file lists all the libraries, modules, and packages used while developing a particular project. The “-e .” entry tells pip to also install the project itself (as described by setup.py) in editable mode when installing from requirements.txt; a minimal setup.py sketch is shown below.
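For reference, here is a minimal sketch of what such a setup.py could look like, assuming a setuptools-based project; the package name, version, and file path are placeholders, not the exact values from the original repository.

from setuptools import find_packages, setup

HYPHEN_E_DOT = "-e ."

def get_requirements(file_path):
    """Read requirements.txt and drop the '-e .' entry, which only tells pip
    to install this project itself in editable mode."""
    with open(file_path) as file_obj:
        requirements = [line.strip() for line in file_obj if line.strip()]
    if HYPHEN_E_DOT in requirements:
        requirements.remove(HYPHEN_E_DOT)
    return requirements

setup(
    name="document_classification",   # placeholder project name
    version="0.0.1",
    packages=find_packages(),
    install_requires=get_requirements("requirements.txt"),
)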
2. Building Logger and Exception functionality
The ‘logger’ object creates and controls logging statements in the project. ‘Exceptions’ are objects that Python uses to signal errors that occur while a program is running: they arise when a syntactically correct Python program hits an error at runtime, and Python creates an exception object whenever this happens.
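As an illustration, here is a minimal sketch of the two utilities, assuming a logger.py that writes timestamped log files into a logs/ folder and an exception.py with a custom exception class; the file names and log format are assumptions, not the exact code from the repository.

# logger.py: write timestamped log files into a logs/ folder
import logging
import os
from datetime import datetime

LOG_FILE = f"{datetime.now().strftime('%m_%d_%Y_%H_%M_%S')}.log"
LOGS_DIR = os.path.join(os.getcwd(), "logs")
os.makedirs(LOGS_DIR, exist_ok=True)

logging.basicConfig(
    filename=os.path.join(LOGS_DIR, LOG_FILE),
    format="[ %(asctime)s ] %(lineno)d %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)

# exception.py: wrap any error with the file name and line number where it occurred
import sys

class CustomException(Exception):
    def __init__(self, error, error_detail: sys):
        super().__init__(str(error))
        _, _, tb = error_detail.exc_info()
        self.message = (
            f"Error in {tb.tb_frame.f_code.co_filename} "
            f"at line {tb.tb_lineno}: {error}"
        )

    def __str__(self):
        return self.message

A typical call site then wraps risky code in try/except, raises CustomException(e, sys) from the except block, and records progress with logging.info().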
3. Data Ingestion, Data Transformation, and Model Trainer
These are the core files of our project. The entire machine learning workflow involving data loading, feature selection, preprocessing, and model training happens inside these files.
a. Data Ingestion
Data ingestion is the process of pulling data from a variety of sources into your system so that it can be easily explored, maintained, and processed. In this file, we read the raw text files and create the train and test DataFrames.
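A minimal sketch of this step could look as follows, assuming the raw dataset is unpacked into one folder per newsgroup and the splits are written to an artifacts/ folder; the paths, split ratio, and file encoding are assumptions.

import os
import pandas as pd
from sklearn.model_selection import train_test_split

def ingest_data(raw_dir="data/20_newsgroup", artifacts_dir="artifacts"):
    # Walk the newsgroup folders (one folder per class) and collect (text, label) pairs
    records = []
    for label in sorted(os.listdir(raw_dir)):
        class_dir = os.path.join(raw_dir, label)
        if not os.path.isdir(class_dir):
            continue
        for file_name in os.listdir(class_dir):
            with open(os.path.join(class_dir, file_name), encoding="latin-1") as f:
                records.append({"text": f.read(), "label": label})

    df = pd.DataFrame(records)
    train_df, test_df = train_test_split(
        df, test_size=0.2, stratify=df["label"], random_state=42
    )

    # Persist the splits so later pipeline stages can pick them up
    os.makedirs(artifacts_dir, exist_ok=True)
    train_df.to_csv(os.path.join(artifacts_dir, "train.csv"), index=False)
    test_df.to_csv(os.path.join(artifacts_dir, "test.csv"), index=False)
    return train_df, test_df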
b. Data Transformation
Data transformation is the process of converting raw data into a format or structure that is better suited to the model or algorithm, and it is an essential step in feature engineering. Here we preprocess the text using a column transformer pipeline, which we save as a pickle file so it can be reused at prediction time.
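As an illustration, a stripped-down version of this step might clean the text, encode the 20 class labels, and pickle the fitted encoder for the prediction pipeline; the cleaning rules and the artifacts/preprocessor.pkl path are assumptions, not the exact transformer used in the project.

import os
import pickle
import re
from sklearn.preprocessing import LabelEncoder

def clean_text(text):
    # Very light cleaning: drop e-mail addresses and non-letters, collapse whitespace
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def transform_data(train_df, test_df, preprocessor_path="artifacts/preprocessor.pkl"):
    train_df = train_df.assign(text=train_df["text"].apply(clean_text))
    test_df = test_df.assign(text=test_df["text"].apply(clean_text))

    # Map the 20 newsgroup names to integer targets and keep the fitted encoder
    label_encoder = LabelEncoder()
    train_df["target"] = label_encoder.fit_transform(train_df["label"])
    test_df["target"] = label_encoder.transform(test_df["label"])

    os.makedirs(os.path.dirname(preprocessor_path), exist_ok=True)
    with open(preprocessor_path, "wb") as f:
        pickle.dump(label_encoder, f)
    return train_df, test_df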
c. Model Trainer
In this file, we fine-tune the pre-trained DistilBERT (uncased) model to classify the documents into the 20 newsgroup classes and evaluate model performance using the F1 score. The fine-tuned model is saved as a .h5 file for later predictions.
About DistilBERT:
DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance.
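The fine-tuning step described above could be sketched as follows with TensorFlow/Keras and the Hugging Face transformers library; the hyperparameters, maximum sequence length, and artifact paths are assumptions, and train_df/test_df are the DataFrames produced by the earlier stages.

import numpy as np
import tensorflow as tf
from sklearn.metrics import f1_score
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

def train_model(train_df, test_df, weights_path="artifacts/model.h5"):
    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    model = TFDistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=20
    )

    # Tokenize the cleaned text into input_ids / attention_mask tensors
    train_enc = tokenizer(list(train_df["text"]), truncation=True,
                          padding=True, max_length=256, return_tensors="tf")
    test_enc = tokenizer(list(test_df["text"]), truncation=True,
                         padding=True, max_length=256, return_tensors="tf")

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(dict(train_enc), np.array(train_df["target"]), epochs=2, batch_size=16)

    # Evaluate with the F1 score and save the fine-tuned weights as a .h5 file
    logits = model.predict(dict(test_enc)).logits
    preds = np.argmax(logits, axis=1)
    print("F1 score:", f1_score(test_df["target"], preds, average="weighted"))
    model.save_weights(weights_path)
    return model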
4. Prediction Pipeline file
This file automates the prediction workflow: it receives the input, preprocesses it with the saved preprocessing pickle file, and makes predictions with the saved .h5 model file.
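Here is a minimal sketch of such a pipeline, assuming the artifacts produced by the sketches above (the pickled label encoder and the fine-tuned DistilBERT weights); the class and file names are illustrative only.

import pickle
import numpy as np
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

class PredictPipeline:
    def __init__(self, weights_path="artifacts/model.h5",
                 preprocessor_path="artifacts/preprocessor.pkl"):
        # Load the fitted label encoder and rebuild the model with the saved weights
        with open(preprocessor_path, "rb") as f:
            self.label_encoder = pickle.load(f)
        self.tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
        self.model = TFDistilBertForSequenceClassification.from_pretrained(
            "distilbert-base-uncased", num_labels=20
        )
        self.model.load_weights(weights_path)

    def predict(self, text: str) -> str:
        # Tokenize the raw input, run the model, and map the logits back to a newsgroup name
        enc = self.tokenizer([text], truncation=True, padding=True,
                             max_length=256, return_tensors="tf")
        logits = self.model(dict(enc)).logits
        pred = int(np.argmax(logits, axis=1)[0])
        return self.label_encoder.inverse_transform([pred])[0]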
5. Local Model Deployment using Flask
We will be using the Flask framework to deploy the model on localhost. The following two files are required for this task:
a. home.html
This HTML file handles the webpage form interface to accept the input from the user.
b. application.py
This file reads the input received from the HTML form, runs the prediction pipeline on it, and displays the predicted result on the webpage.
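Here is a minimal sketch of such an application.py, assuming the form in home.html posts a single field named "text" and that the PredictPipeline class from the sketch above is importable; the field name, module name, route, and port are assumptions.

from flask import Flask, render_template, request

application = Flask(__name__)   # AWS-friendly name; 'app' is an alias
app = application

predict_pipeline = None         # created lazily so the app starts quickly

@app.route("/", methods=["GET", "POST"])
def home():
    global predict_pipeline
    result = None
    if request.method == "POST":
        if predict_pipeline is None:
            from prediction_pipeline import PredictPipeline   # hypothetical module name
            predict_pipeline = PredictPipeline()
        text = request.form.get("text", "")
        result = predict_pipeline.predict(text)
    return render_template("home.html", result=result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

Running "python application.py" then serves the form locally on port 8080; use whatever port your Dockerfile and EC2 security group expose.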
The next part requires a basic understanding of CI/CD pipelines.
What are CI/CD pipelines?
With a CI/CD pipeline, development teams can make changes to code that are then automatically tested and pushed out for delivery and deployment.
6. Containerization using Docker
About Docker:
Docker is a containerization platform used to package an application together with all its dependencies into containers, so that the application runs seamlessly in any environment, whether in development, testing, or production. In short, Docker is a tool designed to make it easier to create, deploy, and run applications using containers.
First, create a Dockerfile to containerize the application and its dependencies.
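A minimal Dockerfile sketch could look like the following, assuming the Flask app lives in application.py and listens on port 8080; the base image, port, and entry point are assumptions, not the exact Dockerfile from the repository.

FROM python:3.9-slim
WORKDIR /app
# Copy the project first so that "-e ." in requirements.txt can find setup.py
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 8080
CMD ["python", "application.py"]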
Now, use the following commands to build a Docker image from the Dockerfile, list the local images, and push the image to Docker Hub.
docker build -t <USERNAME>/document_classification .
docker images
docker push <USERNAME>/document_classification:latest
Note that you will need to log in to Docker Hub before pushing the image, using the command below.
docker login
7. Application deployment in AWS EC2 Instance using CI/CD Pipelines and GitHub Actions
The image below summarizes the CI/CD workflow with GitHub Actions:
Follow the steps below to push the Docker image to Amazon ECR and deploy it on the EC2 instance:
- Create a new “Deploy to Amazon ECS” workflow using GitHub Actions and commit the YAML file.
- In AWS IAM, create a new user with the “AmazonEC2ContainerRegistryFullAccess” and “AmazonEC2FullAccess” permissions.
- Get the access key for that user in AWS IAM.
- Create an Amazon Elastic Container Registry (ECR) repository.
- Launch an Amazon EC2 instance; I have used a t2.large instance for this project.
- Connect to the EC2 instance and install Docker using the following commands:
sudo apt-get update -y
sudo apt-get upgrade
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker ubuntu
newgrp docker
- Now, in the GitHub repository, create a new self-hosted runner and set it up on the EC2 instance using the following commands:
$ mkdir actions-runner && cd actions-runner
# Download the latest runner package
$ curl -o actions-runner-linux-x64-2.304.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.304.0/actions-runner-linux-x64-2.304.0.tar.gz
# Optional: Validate the hash
$ echo "292e8770bdeafca135c2c06cd5426f9dda49a775568f45fcc25cc2b576afc12f  actions-runner-linux-x64-2.304.0.tar.gz" | shasum -a 256 -c
# Extract the installer
$ tar xzf ./actions-runner-linux-x64-2.304.0.tar.gz
Configure
# Create the runner and start the configuration experience
$ ./config.sh --url https://github.com/Shagun-25/Newsgroup_Classification_end_to_end --token APQSZTFJTP3WHJZNMTNC5CLEKJB6K
# Last step, run it!
$ ./run.sh
Using your self-hosted runner
# Use this YAML in your workflow file for each job
runs-on: self-hosted
The runner is now running successfully and is waiting for any new commits in GitHub.
- The last step is to create Actions secrets in the GitHub repository to connect GitHub Actions with AWS using the AWS credentials. Create the following five secrets:
AWS_ACCESS_KEY_ID: IAM Access ID
AWS_ECR_LOGIN_URI: ECR URL
AWS_REGION: AWS region where the EC2 instance and ECR repository are located (e.g., us-east-1)
AWS_SECRET_ACCESS_KEY: IAM Access Key
ECR_REPOSITORY_NAME: ECR Repository Name
Now, whenever we push a change to the GitHub repository, the CI/CD pipeline will automatically run and deploy the updated code to AWS.
Now, we are ready to open the public IP of the EC2 instance (on the port the application exposes) in a browser and test the deployed application.
Final deployed application in AWS:
The End!
Thank you for reading this tutorial. I have learned a lot from this exercise, and I hope you have learned something too. Please share feedback if you find any flaws or have a better approach.
Finally, please clap for this article if you liked it! Thanks in advance.
Code Reference
Contact Links
Email Id: kala.shagun@gmail.com
LinkedIn: https://www.linkedin.com/in/shagun-kala