Stock Market Prediction using News Sentiments- II

An End-to-End Project on Time Series Analysis using Machine Learning.

Shagun Kala
9 min read · Aug 16, 2020

This is Part-2 of a two-part series. Please read Part-1 here: https://medium.com/@kala.shagun/stock-market-prediction-using-news-sentiments-f9101e5ee1f4

Modeling

The ML models used here were selected with production requirements in mind, since we want to deploy the model. A time series model has to be retrained in production whenever new data points arrive in order to keep its predictions accurate, so we will use only models that are cheap to train, i.e. models that retrain quickly on new data.

1. ARIMA

An ARIMA model is a class of statistical models for analyzing and forecasting time series data.

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average and adds the notion of integration.

This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:

  • AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
  • I: Integrated. The use of differencing of raw observations (e.g. subtracting the observation at the previous time step from the current observation) in order to make the time series stationary.
  • MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

Each of these components is explicitly specified in the model as a parameter. The parameters of the ARIMA model are defined as follows:

  • p: The number of lag observations included in the model, also called the lag order.
  • d: The number of times that the raw observations are differenced, also called the degree of differencing.
  • q: The size of the moving average window, also called the order of moving average.

Let’s plot the autocorrelation (ACF) and partial autocorrelation (PACF) functions to identify the values of these parameters.

p: the lag value at which the PACF chart crosses the upper confidence bound for the first time. Looking closely at the chart, p = 2 in this case.

q: the lag value at which the ACF chart crosses the upper confidence bound for the first time. Looking closely at the chart, q = 2 in this case.

d: in the differencing step, a shift of 1 period produced a stationary time series, so we will use d = 1.
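To make the idea of a correlation "crossing the confidence bound" concrete, here is a minimal pure-Python sketch of the sample autocorrelation at a given lag. It is not the article's original code: the function names `autocorrelation` and `confidence_bound` and the sample series are hypothetical, and the bound 1.96/sqrt(N) is a common large-sample approximation of the 95% band, which plotting libraries may compute differently.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


def confidence_bound(n, z=1.96):
    """Approximate 95% bound for the autocorrelation of white noise of length n."""
    return z / math.sqrt(n)


series = [1.0, 2.0, 3.0, 4.0]          # made-up values, for illustration only
acf_lag1 = autocorrelation(series, 1)  # 1.25 / 5.0
bound = confidence_bound(100)          # bound for a series of 100 observations
```

A lag is usually read as significant when the absolute autocorrelation at that lag is larger than the bound.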

We use ARIMA to forecast the stationary series obtained from differencing, and then invert the differencing to bring the forecasts back to the scale of the original time series.
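First-order differencing (d = 1) and its inversion can be sketched as follows. This is a minimal pure-Python illustration rather than the code used in the project; the helper names `difference` and `invert_difference` and the sample prices are hypothetical.

```python
from itertools import accumulate


def difference(series, lag=1):
    """Return the differenced series: y[t] = x[t] - x[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def invert_difference(first_value, diffs):
    """Rebuild a series from its first value and its first-order differences."""
    return list(accumulate(diffs, initial=first_value))


prices = [100.0, 102.0, 101.0, 105.0, 110.0]    # made-up closing prices
diffs = difference(prices)                       # day-over-day changes
restored = invert_difference(prices[0], diffs)   # back on the original scale
```

The same cumulative sum is what brings a forecast of differences back to price level: the last known price plays the role of `first_value`.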

Forecasting using ARIMA Model

RMSE from ARIMA = 1707.77
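Every model in this article is compared using RMSE (root mean squared error), so it helps to see how the metric is computed. The sketch below is a plain-Python illustration; the function name `rmse` and the sample numbers are hypothetical and unrelated to the reported scores.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [10.0, 12.0, 14.0]      # made-up true values
predicted = [10.0, 15.0, 10.0]   # made-up forecasts
score = rmse(actual, predicted)  # sqrt((0 + 9 + 16) / 3)
```

RMSE is expressed in the same units as the series, so a lower value means the forecasts sit closer to the actual index values.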

Let’s try to improve the prediction using more advanced methods.

2. SARIMAX

An ARIMA model captures the trend in the data but ignores seasonal variation. SARIMAX (Seasonal ARIMA with eXogenous regressors) is an extension of ARIMA that also models seasonal variation. Our data does not show strong seasonality, but it is still worth a try.

RMSE from SARIMAX = 964.97

Whoa! RMSE dropped from 1707 to 964. SARIMAX works really well here.

3. Facebook Prophet

Prophet is an open-source library published by Facebook that is based on decomposable (trend + seasonality + holidays) models. It lets us make time-series predictions with good accuracy using simple, intuitive parameters, and it supports custom seasonality and holiday effects.

RMSE from Facebook Prophet = 709.70

Nice! RMSE has dropped further, from 964 to 709. That is still far from an acceptable prediction, so we will try deep learning models next.

Before going ahead, let’s look at some useful plots Facebook Prophet provides:

Our data has some seasonal information present. This is why SARIMAX also performed well.

The following points can be observed from the above graphs:

  1. Our data shows an upward trend.
  2. The stock price rises on Saturday and stays almost flat on weekdays.
  3. There is a high chance of observing the 52-week low in the stock price around the end of August and the start of September.
  4. The stock price fluctuates throughout the day.

4. LSTM Model

Finance is highly nonlinear and sometimes stock price data can even seem completely random. Traditional time series methods such as ARIMA and SARIMAX are effective only when the series is stationary, a restrictive assumption that requires the series to be preprocessed by taking log returns (or other transforms). The main issue arises when implementing these models in a live trading system, as there is no guarantee of stationarity as new data is added.

Neural networks (sequential models such as LSTM and GRU) address this, since they do not require the series to be stationary. They are also effective at finding relationships in data and using them to predict (or classify) new data.

The LSTM model also needs validation data to tune its parameters, so let’s split the data again.

Code used to prepare a dataset for LSTM Model

To prepare the data, the stock prices are first scaled using a MinMax scaler. Each sample gives the LSTM 60 features: X = the stock prices of the last 60 consecutive days, and Y = the actual stock price on the 61st day.
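The windowing described here can be sketched in plain Python as follows. This is an illustrative reconstruction under stated assumptions, not the project's actual preparation code: the helpers `min_max_scale` and `make_windows` and the synthetic price series are hypothetical, and a real pipeline would typically use a library scaler fitted on the training split.

```python
def min_max_scale(values):
    """Scale values linearly into the range [0, 1]."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def make_windows(series, window=60):
    """Build (X, y) pairs: `window` consecutive values -> the next value."""
    X, y = [], []
    for end in range(window, len(series)):
        X.append(series[end - window:end])  # previous `window` days
        y.append(series[end])               # the day right after them
    return X, y


prices = [float(p) for p in range(1000, 1100)]  # 100 made-up daily prices
scaled = min_max_scale(prices)
X, y = make_windows(scaled, window=60)          # 40 samples of 60 features
```

Keras LSTM layers expect input shaped as (samples, timesteps, features), so windows like these would normally be reshaped to (n, 60, 1) before training.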

Since our dataset is not very large, we will build a simple single-layer model.

RMSE from LSTM = 285.53

Hats off to this deep learning marvel. RMSE has gone down to 285, compared to 709 from the Facebook Prophet model. The LSTM model has predicted very accurately. Let’s try a more advanced LSTM variation.

5. LSTM with news polarity

We will use only 5 years of stock data for this model, since news is available only for the 5-year period 2015–2019.

This model takes 61 features: X = the stock prices of the last 60 consecutive days (60 features) plus the news sentiment of the 60th day, and Y = the actual stock price on the 61st day. All stock prices are scaled here as well.
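One way to assemble such a 61-feature sample is sketched below. It is a hypothetical illustration, not the author's code: the function `make_windows_with_sentiment` and the toy inputs are made up, and it assumes the prices are already scaled and that `sentiment` holds one score per day, aligned with `prices`.

```python
def make_windows_with_sentiment(prices, sentiment, window=60):
    """Build samples of `window` prices plus the sentiment of the window's last day."""
    X, y = [], []
    for end in range(window, len(prices)):
        features = list(prices[end - window:end])  # 60 scaled prices
        features.append(sentiment[end - 1])        # sentiment of the 60th day
        X.append(features)
        y.append(prices[end])                      # price on the 61st day
    return X, y


prices = [i / 100 for i in range(65)]        # 65 made-up scaled prices
sentiment = [0.0] * 59 + [0.5] + [0.0] * 5   # made-up daily sentiment scores
X, y = make_windows_with_sentiment(prices, sentiment)
```

Each row of `X` then has 61 entries, with the sentiment score in the last position.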

RMSE from LSTM with news polarity = 170.91

Wow! RMSE has dropped further, from 285 to 170.91. News sentiment has helped the LSTM improve its predictions.

Predicted results are looking very accurate now. We will pick this last model as our best model.

This best model takes the stock prices of the last 60 days along with the VADER compound sentiment score of the news for the last day, and predicts the stock price for the next day.

Summary

Best Model: LSTM with News Sentiments

RMSE from Best Model: 170.91

Anomaly Detection

In this section, we will try to find anomalies in our stock price data, i.e. points that our best model did not learn correctly.

Let’s plot the errors to identify the outliers.

Treating an error of up to 3% as acceptable, let’s find the anomalies.
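A simple way to apply such a threshold is sketched below. The article does not show how the 3% error is measured, so this sketch assumes it is the absolute error relative to the actual price; the function `find_anomalies` and the sample values are hypothetical.

```python
def find_anomalies(actual, predicted, tolerance=0.03):
    """Return indices where the relative prediction error exceeds `tolerance`."""
    anomalies = []
    for index, (a, p) in enumerate(zip(actual, predicted)):
        relative_error = abs(a - p) / abs(a)  # assumed error definition
        if relative_error > tolerance:
            anomalies.append(index)
    return anomalies


actual = [10000.0, 10100.0, 9500.0, 10200.0]      # made-up index values
predicted = [10050.0, 10120.0, 10000.0, 10150.0]  # made-up model outputs
anomalies = find_anomalies(actual, predicted)     # only the third day is off by > 3%
```

The returned indices can then be mapped back to dates to mark the anomalous days on the error plot and on the price chart.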

Anomalies identified on Error Plot
Anomalies identified on Nifty Stock Data

We can observe that anomalies occur where there is a steep rise or drop in stock prices. This can happen because of a major event on those days. Let’s analyze the tweets for the days with anomalies.

Word Cloud for Positive News
Word Cloud for Negative News

We can see that the most common word in positive tweets is ‘cut’, which carries negative sentiment, while common words in negative tweets include ‘highest’, ‘biggest’ and ‘peak’, all of which carry positive sentiment.

We can conclude that the VADER Sentiment Analyzer did not assign the correct sentiment, and hence the correct sentiment score, to these tweets.

Final Model Pipeline for Deployment

The normal (unquantized) best model takes 649 ms to predict the stock price for the next day.

Post-training quantization

Quantization refers to techniques for performing computations and storing tensors at lower bit widths than floating-point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating-point values. This allows the model to run faster, but it comes at the cost of some accuracy.

By default, the model weights are saved in Float32 format. They can be reduced to Float16 or Int8 to speed up the calculations, but because the values are approximated we can expect a small drop in accuracy.
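The trade-off between size and precision can be seen on a handful of numbers. The sketch below is a standalone illustration using Python's `struct` module and is not a TFLite conversion; the helper `to_float16` and the sample weights are hypothetical.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


weights = [0.1, -0.2537, 1.5, 0.000123]  # made-up model weights
weights_fp16 = [to_float16(w) for w in weights]
errors = [abs(w - q) for w, q in zip(weights, weights_fp16)]

# Storage needed for the same weights in each format.
bytes_fp32 = len(struct.pack(f"<{len(weights)}f", *weights))
bytes_fp16 = len(struct.pack(f"<{len(weights)}e", *weights))
```

Half precision needs half the bytes of Float32, and each weight moves by a small rounding error, which is where the slight loss of accuracy comes from.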

As measured above, our model currently takes hundreds of milliseconds (649 ms in the run above) to predict the next price. With the help of quantization techniques, we will try to reduce this runtime.

Converting Models into TFLite Models and Saving Them

For quantization, we will convert the Float32 weights into Float16 weights to make the prediction calculation faster.

Quantized Model Pipeline for Deployment

The quantized model takes 60.8 ms to predict the stock price for the next day. That’s a huge improvement! But notice the increase in RMSE as well.

Performance comparison between the normal and quantized models:

We can clearly observe that prediction time drops significantly with the quantized model. However, accuracy has decreased, as the RMSE is higher for the quantized model.

Quantization is a great technique when we need faster predictions and can tolerate some loss of accuracy. Quantized models also take up less space. They are well suited to mobile and online applications where we need quick results.

Conclusion

In this case study, we learned how to handle and process time-series data and build deep learning models with a production perspective. Stock price series are considered among the most challenging time series, and we were able to predict the Nifty index with high accuracy. We also learned how to optimize the model in the post-training phase to make it ready for deployment.

Further Improvements

Here are some leads that could improve the results of the solution discussed above:

  1. Collect news data for more years to have more data points.
  2. Deep learning models work very well with large amounts of data, and our stock price data is limited. For a more extensive analysis, we could use hourly instead of daily stock prices to increase the number of data points, which should improve accuracy.
  3. Experiment more with the LSTM architecture and hyperparameters to improve the model accuracy.
  4. Instead of using a pre-trained VADER Sentiment Analyzer, we can train our own model by first creating training data. This custom-trained model should give better sentiment results since it will be trained on the language of stock market news.
  5. Recent research suggests that GANs and reinforcement learning can also be used to predict the stock market better.
  6. Anomalies can be handled better by retraining the model with corrected sentiment scores from a custom-trained sentiment analyzer.
  7. For even faster prediction, the model can be quantized to int8, but this will reduce accuracy significantly.

Code Reference

Contact Links

Email Id: kala.shagun@gmail.com
Linkedin: https://www.linkedin.com/in/shagun-kala-061a3b127/

Papers Referred

  1. Stock Price Prediction Using News Sentiment Analysis: http://davidanastasiu.net/pdf/papers/2019-MohanMSVA-BDS-stock.pdf
  2. Sentiment Analysis of Twitter Data for Predicting Stock Market Movements: http://arxiv.org/pdf/1610.09225v1.pdf

Other References

  1. AppliedAICourse.com
  2. www.tensorflow.org/lite/performance/post_training_quantization
  3. https://github.com/sonalimedani/TF_Quantization/blob/master/quantization.ipynb
  4. https://towardsdatascience.com/end-to-end-time-series-analysis-and-modelling-8c34f09a3014
  5. https://medium.com/analytics-vidhya/stock-prices-prediction-using-machine-learning-and-deep-learning-techniques-with-python-codes-a630c0d3f137
  6. https://udibhaskar.github.io/practical-ml/debugging%20nn/neural%20network/overfit/underfit/2020/02/03/Effective_Training_and_Debugging_of_a_Neural_Networks.html
  7. https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
