YouTube Virality Prediction using BERT and CatBoost Ensemble

May 13, 2024

Project’s Objective

In today’s digital landscape, content creators face an ever-increasing challenge in capturing the attention of their target audience amidst the vast sea of online content. One of the most significant hurdles they encounter is the daunting task of identifying content ideas with the potential to go viral. The elusive nature of virality often leaves creators grappling with uncertainty, as they strive to craft engaging content that resonates with viewers. This uncertainty can lead to inefficiencies in content production, as creators may invest time and resources into ideas that fail to gain traction, ultimately hindering their growth and reach.

Our analysis addresses this problem by giving content creators a tool to predict the virality potential of a YouTube video before publication. The model applies machine learning to the title, description, thumbnail, and additional metadata to forecast the like-to-view ratio, which we treat as an indicator of a video’s potential to go viral. With this estimate in hand, creators can decide which content ideas to pursue, improving their chances of producing videos that resonate with their audience and drive engagement.

Importing Libraries

import os
import gc
import regex as re
import spacy
import xgboost
import lightgbm
import catboost
import itertools
import numpy as np
import pandas as pd
import codecs,string
import matplotlib.pyplot as plt
from sklearn import decomposition
from sklearn import model_selection
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
from sklearn import metrics, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

import torch
from transformers import BertTokenizer, BertModel
from torch.utils.data import DataLoader, Dataset
from torch import nn, optim
from sklearn.metrics import mean_absolute_error

gc.enable()

Reading the data files

We load the video data from parquet files sourced from a Kaggle dataset and use it for analysis and modeling.

The Kaggle dataset link used in this analysis can be found here.

Data at a Glance

The data has 92,275 records and 20 features.
Column ‘video_id’ is the primary key.
This is a regression problem, since the ‘target’ feature we need to predict is continuous.
There are no missing values in this data set.

Data Preprocessing

1. Understanding the data:

Before we dive into the preprocessing steps, let’s take a quick look at the structure of our dataset. We have various types of columns:

Categorical Columns: These include attributes like video ID, channel ID, category ID, and indicators for comments and ratings being disabled.
Numerical Columns: Mainly, we have the duration of the videos in seconds.
Date Columns: These include the published date and trending date of the videos.
Drop Columns: Columns like ‘id’ that we won’t need for our analysis.

2. Encoding Categorical Columns:

Categorical data, such as video and channel IDs, needs to be converted into numerical format for our machine learning algorithms to process. We achieve this using Label Encoding. Each unique category is assigned a numerical value, thereby transforming text data into a numerical representation.
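As a minimal sketch of this step, assuming scikit-learn's LabelEncoder and a toy frame (the column names follow the description above, the values are made up):

```python
import pandas as pd
from sklearn import preprocessing

# Toy frame; the values are invented for illustration.
df = pd.DataFrame({
    "channel_id": ["UC_a", "UC_b", "UC_a", "UC_c"],
    "comments_disabled": [False, True, False, False],
})

categorical_cols = ["channel_id", "comments_disabled"]  # assumed subset

encoders = {}
for col in categorical_cols:
    enc = preprocessing.LabelEncoder()
    # Cast to str so boolean and mixed-type values are handled uniformly.
    df[col] = enc.fit_transform(df[col].astype(str))
    encoders[col] = enc  # keep the fitted encoder for later reuse

print(df)
```

LabelEncoder assigns integers in sorted order of the category values, so the codes carry no meaning beyond identity.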

3. Normalizing Numerical Columns:

Normalization is crucial for numerical columns like video durations. By scaling these values between 0 and 1, we ensure that no single feature dominates the others, preventing bias in our analysis.
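One way to do this scaling is scikit-learn's MinMaxScaler. The sketch below assumes a hypothetical column name, duration_seconds, and made-up values:

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical column name and values.
df = pd.DataFrame({"duration_seconds": [30.0, 60.0, 600.0, 1230.0]})

# MinMaxScaler maps the column minimum to 0 and the maximum to 1.
scaler = preprocessing.MinMaxScaler()
df["duration_scaled"] = scaler.fit_transform(df[["duration_seconds"]]).ravel()

print(df)
```

If the scaler is fitted on training data only and then applied to validation data with scaler.transform, no information from the validation rows leaks into the scaling.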

4. Creating Date Time Features:

Date and time data are rich sources of information. By converting our date columns into datetime objects, we unlock a plethora of possibilities. We extract various features like year, week of year, month, day of week, and whether it’s a weekend or not. These features could reveal interesting patterns, like videos published on weekends garnering more views.
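The features listed above can be sketched with the pandas .dt accessor. The column name publishedAt and the feature names are assumptions for illustration:

```python
import pandas as pd

# Assumed column name; the timestamps are made up.
df = pd.DataFrame({"publishedAt": ["2021-08-07 15:00:00", "2021-08-09 09:30:00"]})
df["publishedAt"] = pd.to_datetime(df["publishedAt"])

dt = df["publishedAt"].dt
df["published_year"] = dt.year
df["published_month"] = dt.month
df["published_weekofyear"] = dt.isocalendar().week.astype(int)  # ISO week number
df["published_dayofweek"] = dt.dayofweek                        # Monday = 0, Sunday = 6
df["published_is_weekend"] = (dt.dayofweek >= 5).astype(int)    # Saturday or Sunday

print(df)
```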

5. Calculating Video Age:

The age of a video, measured by the difference in days between its publication and trending date, provides valuable insights into its virality. We calculate this age to understand how engagement evolves over time.
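A small sketch of the age calculation, assuming the two date columns are named publishedAt and trending_date and that age is counted in whole calendar days:

```python
import pandas as pd

# Assumed column names; the dates are made up.
df = pd.DataFrame({
    "publishedAt": pd.to_datetime(["2021-08-01 18:00:00", "2021-08-05 06:00:00"]),
    "trending_date": pd.to_datetime(["2021-08-03", "2021-08-05"]),
})

# Normalize both timestamps to midnight so the difference counts calendar days.
age = df["trending_date"].dt.normalize() - df["publishedAt"].dt.normalize()
df["video_age_days"] = age.dt.days

print(df)
```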

6. Introducing Boolean Columns:

Boolean columns are binary indicators that capture specific characteristics of the videos. For instance, we create a column to denote whether a video is shorter than 60 seconds, a factor that often correlates with higher engagement. We also explore if missing duration data itself holds predictive power.
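These indicators could look like the following sketch. The column names are hypothetical, and the toy data includes one missing duration only to show how the flag behaves:

```python
import numpy as np
import pandas as pd

# Hypothetical column name; one value is missing on purpose.
df = pd.DataFrame({"duration_seconds": [45.0, 59.0, 60.0, np.nan, 900.0]})

# Flag rows where the duration is unknown.
df["duration_missing"] = df["duration_seconds"].isna().astype(int)

# Flag videos shorter than 60 seconds; a missing duration compares as False.
df["is_short"] = (df["duration_seconds"] < 60).astype(int)

print(df)
```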

7. Leveraging Channel ID Information:

Understanding the influence of channels is key to deciphering YouTube trends. We create features like ‘channel_occurance’ and ‘channel_unique_video_count’ to quantify a channel’s impact on video engagement.
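The exact definitions of these two features are not spelled out here, so the sketch below is one plausible reading: ‘channel_occurance’ as the number of rows per channel and ‘channel_unique_video_count’ as the number of distinct videos per channel. The toy data lets a video appear twice so the two features differ; when every video_id is unique they coincide.

```python
import pandas as pd

# Toy data; video v1 appears twice to show the difference between the features.
df = pd.DataFrame({
    "video_id":   ["v1", "v1", "v2", "v3", "v4"],
    "channel_id": ["c1", "c1", "c1", "c2", "c3"],
})

grouped = df.groupby("channel_id")["video_id"]

# Number of rows belonging to each channel (assumed definition).
df["channel_occurance"] = grouped.transform("count")

# Number of distinct videos per channel (assumed definition).
df["channel_unique_video_count"] = grouped.transform("nunique")

print(df)
```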

8. Text Data Cleaning:

Lastly, we consolidate textual information from various columns like channel title, video title, description, and hashtags into a single text column. Beforehand, we clean this text data by removing URLs, special characters, and unnecessary tags, ensuring our analysis focuses on relevant content.
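A rough sketch of such a cleaning step using regular expressions. The patterns and the helper names clean_text and combine_text are illustrative assumptions, not the exact rules used in this project:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
TAG_RE = re.compile(r"<[^>]+>")                 # HTML-style tags
NON_TEXT_RE = re.compile(r"[^\w\s#]")           # keep letters, digits, spaces, '#'
SPACE_RE = re.compile(r"\s+")

def clean_text(text: str) -> str:
    """Remove URLs, tags, and special characters, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

def combine_text(*parts: str) -> str:
    """Clean each non-empty part and join them into one string."""
    return " ".join(clean_text(p) for p in parts if p)

print(clean_text("Watch NOW!!! <b>here</b> https://t.co/abc #shorts"))
print(combine_text("MrBeast", "I Gave $1,000!", ""))
```

In Python 3, \w matches Unicode letters, so text in non-Latin scripts survives this cleaning.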

Feature Engineering

1. Decoding Linguistic Diversity

Understanding the linguistic diversity within video descriptions is crucial for catering to diverse audiences. We’ve implemented advanced techniques to identify popular languages beyond English, such as Arabic, Korean, Japanese, and Hindi. This ensures our model remains adaptable and inclusive, resonating with viewers worldwide.
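The detection method is not detailed here. One simple approach, shown below as an assumption rather than the method actually used, is to flag the presence of characters from each script's Unicode block. It detects scripts, not languages: Devanagari is also used for languages other than Hindi, Arabic script for languages other than Arabic, and the Japanese pattern only covers kana.

```python
import re

# Unicode block ranges for each script (a heuristic, not a language detector).
SCRIPT_PATTERNS = {
    "arabic":   re.compile(r"[\u0600-\u06FF]"),
    "korean":   re.compile(r"[\uAC00-\uD7AF\u1100-\u11FF]"),  # Hangul syllables, Jamo
    "japanese": re.compile(r"[\u3040-\u309F\u30A0-\u30FF]"),  # hiragana, katakana
    "hindi":    re.compile(r"[\u0900-\u097F]"),               # Devanagari
}

def script_flags(text: str) -> dict:
    """Return one 0/1 flag per script found anywhere in the text."""
    return {
        f"has_{name}": int(bool(pattern.search(text)))
        for name, pattern in SCRIPT_PATTERNS.items()
    }

print(script_flags("\uc548\ub155 hello"))  # Korean greeting plus English
print(script_flags("plain english"))
```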

2. Quantifying Textual Complexity

The complexity of textual descriptions significantly influences viewer engagement. By quantifying metrics like word count and character count, we gain insights into the depth and richness of each description. Longer descriptions often provide more context, aiding viewers in their decision-making process.
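Both counts are one-liners with the pandas string accessor. The feature names here are hypothetical:

```python
import pandas as pd

# Made-up descriptions, including an empty one.
df = pd.DataFrame({"description": ["New video out now", "", "Subscribe!  Like and share"]})

# Number of characters, including spaces and punctuation.
df["description_char_count"] = df["description"].str.len()

# Number of whitespace-separated tokens; an empty string yields 0.
df["description_word_count"] = df["description"].str.split().str.len()

print(df)
```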

3. Optimizing Feature Space

Consolidating multiple text-related columns into one eliminates redundancy and streamlines our feature space. We’ve carefully curated relevant features while discarding redundant information like thumbnail links and engagement metrics. This ensures our model focuses on the most pertinent information for accurate predictions.

4. Strategic Validation Folds

Validation is critical for assessing model performance. We use 5-fold cross-validation and bin the continuous target so that the folds can be stratified on the bins. This keeps the target distribution similar across folds, which makes the scores from different folds easier to compare.
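A sketch of stratified folds on a binned target, using synthetic data. The number of bins, the random seed, and the use of quantile bins via pd.qcut are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Synthetic like-to-view ratios standing in for the real target.
rng = np.random.default_rng(0)
df = pd.DataFrame({"target": rng.beta(2, 30, size=1000)})

# Bin the continuous target into quantile bins (bin count is an assumption).
n_bins = 10
df["target_bin"] = pd.qcut(df["target"], q=n_bins, labels=False, duplicates="drop")

# Stratify on the bins so each fold sees a similar target distribution.
df["fold"] = -1
skf = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(skf.split(df, df["target_bin"])):
    df.loc[valid_idx, "fold"] = fold

print(df.groupby("fold")["target"].agg(["count", "mean"]))
```

The row indices from split are positional, so this assignment relies on the frame having a default 0..n-1 index.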

Models

1. CatBoost Regressor

CatBoost is a gradient boosting algorithm that handles categorical features natively, which makes it well suited to datasets with mixed variable types. Its ordered boosting technique is designed to reduce the risk of overfitting. For predicting YouTube video virality, this lets the model capture relationships among the engineered features without extensive manual encoding.

model = catboost.CatBoostRegressor(max_depth=10, verbose=0, task_type="GPU")

2. BERT Regressor

BERT is a pre-trained natural language processing model that reads text bidirectionally, so each token’s representation depends on the words both before and after it. This contextual understanding makes it a good fit for regression on text, such as predicting a video’s virality potential. We use it to forecast the like-to-view ratio from each video’s text.

Creating Dataset Class in Pytorch:

class CustomDataset(Dataset):
    def __init__(self, texts, targets, tokenizer, max_len):
        self.texts = texts
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        target = float(self.targets[idx])
        # Calling the tokenizer directly replaces the deprecated encode_plus.
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'target': torch.tensor(target, dtype=torch.float)
        }

Tokenizer and Model:

We use the bert-base-uncased model, the smaller of the two original BERT sizes, with about 110 million parameters. It was pre-trained on a large corpus of unpublished books and English Wikipedia articles and has a vocabulary of roughly 30,000 WordPiece tokens. Starting from the pre-trained weights, we adapt the model to our regression task.

Load tokenizer and model:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

BERT Model Architecture

class BERTRegression(nn.Module):
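    def __init__(self, model_name='bert-base-uncased'):
        super().__init__()
        # Load a fresh encoder per instance so that models trained on
        # different folds do not share (and keep updating) the same weights.
        self.bert = BertModel.from_pretrained(model_name)
        self.fc = nn.Linear(self.bert.config.hidden_size, 1)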

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        return self.fc(pooled_output)

Loss Curves for K-Fold training:

3. Ensemble Model

To improve predictive performance, we combine the outputs of the CatBoost and BERT regression models by averaging their predictions. The two models complement each other: CatBoost is strong on categorical and tabular features, while BERT works directly on the text. Averaging lets each model offset some of the other’s errors, which should make the combined prediction more robust than either model alone.

# Ensemble of both predictions
bert_preds = np.vstack(bert_preds).reshape(-1)
ensemble_preds = (catboost_preds + bert_preds) / 2
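To see why averaging can help, here is a toy comparison with made-up numbers (not results from this project), using the same mean_absolute_error metric:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up targets and predictions, chosen so the two models err in opposite directions.
y_true = np.array([0.02, 0.05, 0.10, 0.04])
catboost_preds = np.array([0.03, 0.04, 0.08, 0.05])
bert_preds = np.array([0.01, 0.07, 0.11, 0.02])

# Simple unweighted average of the two models.
ensemble_preds = (catboost_preds + bert_preds) / 2

mae_cat = mean_absolute_error(y_true, catboost_preds)
mae_bert = mean_absolute_error(y_true, bert_preds)
mae_ens = mean_absolute_error(y_true, ensemble_preds)

print(f"CatBoost MAE: {mae_cat:.5f}")
print(f"BERT MAE:     {mae_bert:.5f}")
print(f"Ensemble MAE: {mae_ens:.5f}")
```

By the triangle inequality, the MAE of the averaged predictions can never exceed the mean of the two individual MAEs. It can still be worse than the better single model when both models err in the same direction.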

Conclusion

Comparing the validation Mean Absolute Error (MAE) of the three methods (CatBoost, BERT, and Ensemble) shows a clear pattern. The Ensemble achieves the lowest MAE across the folds, outperforming both CatBoost and BERT alone, while BERT’s MAE fluctuates more from fold to fold. This underscores how combining complementary models can make predictions more robust.

The End!

Thank you for reading this analysis. I learned a lot from this exercise, and I hope you learned something too. Please share feedback if you find any flaws or have a better approach.

Finally, please clap for this publication if you liked it! Thanks in advance.