Stock Market Prediction using News Sentiments- I

An End-to-End Project on Time Series Analysis using Machine Learning.

9 min read · Aug 16, 2020

This is Part-1 of a two-part series.
DISCLAIMER: The information and methods published in this article are meant solely for informational and educational purposes. The contents of this article are not trading advice and should not be used for real trading.

Table of Contents

Part-1

Introduction
Why are we using News Sentiments?
Scoring Metric
Data Collection
Data Inspection
Data Cleaning and Pre-Processing
Feature Engineering
Train Test Split
Exploratory Data Analysis (EDA)
Making time-series stationary

Part-2

Modeling
Anomaly Detection
Final Model Pipeline for Deployment
Post-Training Quantization
Quantized Model Pipeline for Deployment
Conclusion
Future Improvements
Code Reference
Contact Links
Papers Referred and Other References

Introduction

Predicting stock market prices has been a topic of interest among both analysts and researchers for a long time. Stock prices are hard to predict because of their highly volatile nature, which depends on diverse political and economic factors, changes of leadership, investor sentiment, and many other factors. Predicting stock prices based on either historical data or textual information alone has proven to be insufficient.

Why are we using News Sentiments?

Market sentiment is a qualitative measure of the attitude and mood of investors to financial markets in general, and specific sectors or assets in particular. Positive and negative sentiment drive price action, and also create trading and investment opportunities for active traders and long-term investors.

Existing studies in sentiment analysis have found that there is a strong correlation between the movement of stock prices and the publication of news articles. Several sentiment analysis studies have been attempted at various levels using algorithms such as Support Vector Machines, Naive Bayes Regression, and deep learning. The accuracy of deep learning algorithms depends upon the amount of training data provided.

Scoring Metric

Since this is a regression problem and the index values are large, we will use Root Mean Square Error (RMSE), which is expressed in the same units as the index, to compare the performance of various models.
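As a minimal sketch of the metric, RMSE can be computed with NumPy as shown here. The function name rmse and the sample prices are illustrative assumptions, not values from this project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical closing prices and forecasts, for illustration only.
actual = [10000.0, 10100.0, 10050.0]
predicted = [10020.0, 10080.0, 10090.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.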

Data Collection

We are using the Nifty 50 Index for the stock price data and stock news from a popular Twitter handle.

Nifty 50 Index Data

The historical data of Nifty 50 Stock Price for the last 20 years i.e. from 01/01/2000 to 31/12/2019 was scraped using BeautifulSoup from investing.com.

Twitter News Data

Tweets from the popular news handle @NDTVProfit were collected using the Twint scraper library. Historical data for the last 5 years, i.e. from 01/01/2015 to 31/12/2019, was collected.

Data Inspection

Nifty 50 Index Data

This data contains the stock price for each day on which the stock market was open. We have 4,972 dates, each with the closing Nifty 50 Index price for that day.

Twitter News Data

This data contains all the tweets that NDTV Profit posted in the last 5 years. We have data for 64,278 tweets. Each tweet comes with a lot of extra information such as username, URL, photos, mentions, likes, and retweets, but in this case study we will focus only on the dates and the tweets posted on each day.

Data Cleaning & Pre-Processing

Nifty 50 Index Data

The date column is not in a proper DateTime datatype, and the prices contain ‘,’ separators that we need to remove so that the prices can be treated as numeric values.

Let’s fix these problems.
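A minimal sketch of these two fixes with pandas follows. The column names "Date" and "Price", the date format, and the sample rows are assumptions made for illustration; the scraped table may be laid out differently.

```python
import pandas as pd

# Made-up sample rows; column names and date format are assumptions.
raw = pd.DataFrame({
    "Date": ["Jan 04, 2000", "Jan 03, 2000"],
    "Price": ["1,638.70", "1,592.20"],
})

def clean_index_data(df):
    out = df.copy()
    # Parse the date strings into a proper datetime column.
    out["Date"] = pd.to_datetime(out["Date"], format="%b %d, %Y")
    # Drop the thousands separators and convert prices to floats.
    out["Price"] = out["Price"].str.replace(",", "", regex=False).astype(float)
    # Keep the rows in chronological order for time-series work.
    return out.sort_values("Date").reset_index(drop=True)

nifty = clean_index_data(raw)
print(nifty.dtypes)
```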

Twitter News Data

Twitter is a social media platform where formal language is often ignored. However, since this is a news channel, we expect the tweets to be written in formal language.

Some Raw Tweets after Scraping:

We need to perform the following operations on the tweets in order to clean them:

Convert them into lowercase.
Remove all the links starting with http, https, or pic.twitter.com.
Remove all the special characters and emoticons.
Remove all the hashtag (#) and @ symbols.
Remove these words: ETMarkets, ndtv, moneycontrol, marketsupdate, biznews, NewsAlert, Click here for LIVE updates.
Remove all the numbers.
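The cleaning steps listed above can be sketched with the standard re module. The function name clean_tweet, the regular expressions, and the order of the steps are one reasonable reading of the list, not necessarily the exact code used in this project.

```python
import re

# Source-specific words and phrases to drop, lowercased to match the text.
NOISE_PHRASES = [
    "click here for live updates",
    "etmarkets", "ndtv", "moneycontrol", "marketsupdate",
    "biznews", "newsalert",
]

def clean_tweet(text):
    # 1. Lowercase everything.
    text = text.lower()
    # 2. Remove http(s) links and pic.twitter.com image links.
    text = re.sub(r"(https?://\S+|pic\.twitter\.com/\S+)", " ", text)
    # 3. Remove the source-specific words and phrases.
    for phrase in NOISE_PHRASES:
        text = text.replace(phrase, " ")
    # 4. Keep only letters and whitespace. This drops #, @, emoticons,
    #    punctuation, and numbers in one pass.
    text = re.sub(r"[^a-z\s]", " ", text)
    # 5. Collapse the leftover runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

sample = "Sensex gains 300 points, #Nifty above 12,000 https://t.co/abc #NewsAlert"
print(clean_tweet(sample))
```

Removing the links before the special characters matters: once the punctuation is stripped, a URL is no longer recognizable as a link.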

After cleaning, we need to combine all the tweets for each day into a single paragraph.
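One way to build the per-day paragraph is a pandas groupby that joins the tweets of each date. The column names "date" and "tweet" and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical cleaned tweets; column names are assumptions.
tweets = pd.DataFrame({
    "date": ["2019-12-30", "2019-12-31", "2019-12-30"],
    "tweet": ["sensex gains points", "nifty ends lower", "bank stocks rise"],
})

# Join all tweets that share a date into one space-separated paragraph.
daily = (
    tweets.groupby("date", sort=True)["tweet"]
    .apply(" ".join)
    .reset_index()
)
print(daily)
```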

After all the cleaning and pre-processing our data looks like this:

Feature Engineering

We have not tokenized the text, removed stopwords, or extracted bigrams because we will be using a pre-trained sentiment analyzer, VADER, since our data is unlabeled. We are choosing VADER because it works especially well with social media text.

VADER gives a compound score between -1 and 1 for each paragraph. A score close to -1 signifies negative news and a score close to 1 signifies positive news. Positive news should raise the index prices and vice versa.

Train Test Split

Since this is time-series data, we cannot split it randomly, because a random split would let the model train on dates that come after the ones it is tested on. Hence, we use the latest 20% of the data as the test set and the first 80% as the train set. Train Data Shape: (3977, 2), Test Data Shape: (995, 2)
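A chronological split can be sketched as follows. The placeholder frame only reproduces the row count from the article, its dates and prices are not real, and the function name chronological_split is an assumption. The data must already be sorted by date in ascending order.

```python
import pandas as pd

def chronological_split(df, train_fraction=0.8):
    # Everything before the cut-off is train, everything after is test.
    split_at = int(len(df) * train_fraction)
    return df.iloc[:split_at], df.iloc[split_at:]

# Placeholder frame with the same number of rows as the Nifty data.
data = pd.DataFrame({
    "Date": pd.date_range("2000-01-03", periods=4972, freq="B"),
    "Price": range(4972),
})

train, test = chronological_split(data)
print(train.shape, test.shape)  # (3977, 2) (995, 2)
```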

Exploratory Data Analysis (EDA)

Exploratory Data Analysis is the process of exploring data, generating insights, testing hypotheses, checking assumptions, and revealing underlying hidden patterns in the data.

Twitter News Data

Checking for Missing Data

We do not have any missing data here.

Checking for Duplicate Rows

We do not have any duplicate data here.
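Both checks can be done in a couple of pandas calls. The sketch below uses a made-up frame that deliberately contains one missing value and one duplicate row so that the counts are visible; the function name is an assumption.

```python
import pandas as pd

def data_quality_report(df):
    return {
        # Total number of missing cells across all columns.
        "missing_values": int(df.isnull().sum().sum()),
        # Number of rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Made-up frame with one missing tweet and one repeated row.
sample = pd.DataFrame({
    "date": ["2019-12-30", "2019-12-30", "2019-12-31"],
    "tweet": ["nifty ends lower", "nifty ends lower", None],
})
print(data_quality_report(sample))
```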

Sentiments Distribution

Here, Sentiment = 1: Positive News and Sentiment = -1: Negative News.

We can observe more positive news than negative news, which is consistent with the overall upward trend of the Nifty. 38% of the tweets here are negative. The results look realistic so far.

Word Clouds

We can observe that words like gain, top, rise, and surge appear in positive tweets, and words like lower, fall, hit, drop, and slump appear in negative tweets. It looks like bank stocks fluctuate the most.

Probability Distribution of Sentiment Scores by VADER

For the majority of news, VADER is confident in detecting either positive or negative sentiment, since most of the points lie near the boundaries of the score range. This suggests confident predictions from the VADER library, although confidence alone does not prove that the labels are correct.

Date-wise Distribution of News Sentiments

We can observe that there was more positive news before 2018 than after 2018. The number of news items also increased as we moved towards 2020.

Nifty 50 Index Data

Checking for Missing Data

We do not have any missing data here.

Checking for Duplicate Rows

We do not have any duplicate data here.

Probability Distribution of Nifty index Closing Prices

The Nifty index price hovered around the 1,000 and 5,000 levels for most of the past 20 years. It rarely went past the 13,000 level.

Stationarity of a Time Series

There are three basic criteria for deciding whether a time series is stationary. For a time series to be called stationary, its statistical properties, such as mean and variance, should remain constant over time.

Following are the 3 qualities of a stationary time series:

Constant mean
Constant variance
Autocovariance that does not depend on time. Autocovariance is the covariance between the time series and lagged time series.

Let’s visualize and check the seasonality and trend of our time series first.

Trend: This time series shows an upward trend, so it is non-stationary. We need to convert it to a stationary series to forecast accurately. Let’s also check for seasonality.

Seasonality: The time series has a slight seasonal variation.

We can observe a decline in prices in the latter half of the year. From January to June we can see a general upward trend. The first 6 months are relatively safer for investing, and one should sell by June or July.

If one observes a downward trend in the graph for the first 6 months of the year, then chances are that it will continue to drop further in the next 6 months. In this case, one should either sell as soon as possible or keep holding the stock for a longer period.

Now let’s check the stationarity of time series. It can be checked using the following methods:

Plotting Rolling Statistics: We choose a window, let’s say of size 6, and then compute the rolling mean and rolling variance over it to check whether they stay constant.
Dickey-Fuller Test: The test results comprise a Test Statistic and some Critical Values for different significance levels. If the test statistic is less than the critical value, we can say that the time series is stationary.

We will use hypothesis testing here.

The Null Hypothesis is that the time series has a unit root, which means it is non-stationary. The Alternative Hypothesis is that the time series is stationary.

To reject the null hypothesis and treat the time series as stationary, we need a p-value of less than 5%.

a. Our first criterion for stationarity is a constant mean. We fail here because the mean is not constant, as you can see from the plot (black line) above.

b. The second criterion is a constant variance. It looks constant (green graph above).

c. The third criterion is the test statistic: if it is less than the critical value, we can say that the time series is stationary.

Let’s look: test statistic = 0.674 and critical values = {‘1%’: -3.431667761145687, ‘5%’: -2.8621223070279247, ‘10%’: -2.5670799628923104}.

The test statistic is greater than all the critical values, so we cannot reject the null hypothesis.

As a result, we conclude that our time series is not stationary. Let’s make the time series stationary next.

Two methods which can help us make it stationary:

Moving Average Method
Differencing Method
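The two methods can be sketched with pandas as follows. The function names, the window of 6, and the lag of 1 are assumptions; the article does not state the exact parameters or whether any transform was applied beforehand.

```python
import pandas as pd

def moving_average_detrend(series, window=6):
    # Subtract the rolling mean so that the slow-moving trend is removed.
    return (series - series.rolling(window).mean()).dropna()

def difference(series, lag=1):
    # Replace each value with its change from the value `lag` steps earlier.
    return series.diff(lag).dropna()

# A toy series with a pure linear trend, for illustration only.
trend = pd.Series([float(i) for i in range(10)])
print(moving_average_detrend(trend, window=3).tolist())
print(difference(trend).tolist())
```

On a pure linear trend both methods leave a constant series, which is the behavior we want: the trend is gone and the mean no longer changes over time.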

With the moving average method, the mean is constant over time, no trend is visible, and the p-value is less than 5%. However, the test statistic is not less than the critical value, and the variance is not constant. Our time series is still not stationary.

Let’s try the differencing method.

Much better results! The differencing method wins here because it produces a more stationary time series than the moving average method: its test statistic is lower.

We will be using this stationary time series for forecasting. STAY TUNED!

Link to Part-2 of this blog: https://medium.com/@kala.shagun/stock-market-prediction-using-news-sentiments-dc4c24c976f7

Code Reference

Shagun-25/Nifty-Index-Prediction-Using-News-Sentiments

github.com

Contact Links

Email Id: kala.shagun@gmail.com
Linkedin: https://www.linkedin.com/in/shagun-kala-061a3b127/

Papers Referred

Stock Price Prediction Using News Sentiment Analysis: http://davidanastasiu.net/pdf/papers/2019-MohanMSVA-BDS-stock.pdf
Sentiment Analysis of Twitter Data for Predicting Stock Market Movements: http://arxiv.org/pdf/1610.09225v1.pdf

Written by Shagun Kala
