COVID-19 Sentiment Analysis with Logistic Regression and LSTM

Background:

Using the COVID-19 tweets dataset from Kaggle, build sentiment analysis models with logistic regression and an LSTM.

Data Preprocessing:

I imported the libraries needed for preprocessing.


I loaded the CSV files with latin1 encoding, since reading them as UTF-8 (or without specifying an encoding) raises errors. I then used the .info() method to get an overview of the dataset.
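A minimal sketch of the loading step; the file names are assumptions based on the public Kaggle dataset:

```python
import pandas as pd

# latin1 avoids the decode errors raised when reading as UTF-8
train = pd.read_csv("Corona_NLP_train.csv", encoding="latin1")
test = pd.read_csv("Corona_NLP_test.csv", encoding="latin1")

train.info()  # column names, dtypes, and non-null counts
```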


Then I checked the first five rows of both the training and test datasets to see the column headings and sample values.
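Something like:

```python
# First five rows of each split
print(train.head())
print(test.head())
```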

Looking at the sentiment labels, there are five classes: Positive, Negative, Extremely Positive, Extremely Negative, and Neutral. I reduced these five classes to three (Positive, Negative, and Neutral) and applied the change to both the training and test sets. Using matplotlib.pyplot, I visualized the updated class distribution.
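A sketch of the merge and the plot; the exact label strings are assumptions based on the class names above:

```python
import matplotlib.pyplot as plt

# Fold the two "extreme" labels into their plain counterparts
three_way = {
    "Extremely Positive": "Positive",
    "Extremely Negative": "Negative",
}
for df in (train, test):
    df["Sentiment"] = df["Sentiment"].replace(three_way)

# Bar chart of the reduced class distribution
train["Sentiment"].value_counts().plot(kind="bar")
plt.title("Sentiment distribution after merging classes")
plt.show()
```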

I checked for null values in both the training and test sets and found that the Location column had them. The training and test sets had shapes of (41157, 6) and (3798, 6) rows and columns respectively.
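The checks might look like this:

```python
# Per-column null counts; Location is the column with missing values
print(train.isnull().sum())
print(test.isnull().sum())

print(train.shape)  # (41157, 6)
print(test.shape)   # (3798, 6)
```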

Data Cleaning:

This project works with a text dataset, so the cleaning has to be thorough to prevent errors downstream. I started by importing the libraries required for cleaning.

I removed stop words, as well as noise such as HTML tags, URLs, mentions, and digits, and applied these changes to both the training and test sets. The analysis is based on the Tweet and Sentiment columns, so I kept those and dropped the others. I then visualized the updated dataset.
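A sketch of the cleaning pass, assuming NLTK stop words and a tweet column named OriginalTweet (as in the Kaggle dataset):

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+|#\w+", " ", text)              # mentions and hashtags
    text = re.sub(r"<.*?>", " ", text)                  # HTML tags
    text = re.sub(r"\d+", " ", text)                    # digits
    text = re.sub(r"[^a-z\s]", " ", text)               # leftover punctuation
    return " ".join(w for w in text.split() if w not in stop_words)

for df in (train, test):
    df["OriginalTweet"] = df["OriginalTweet"].apply(clean_tweet)

# Keep only the two columns the analysis relies on
train = train[["OriginalTweet", "Sentiment"]]
test = test[["OriginalTweet", "Sentiment"]]
```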

I lemmatized both the training and the test sets.
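For example, with NLTK's WordNet lemmatizer (one reasonable choice; the original lemmatizer is not shown):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    return " ".join(lemmatizer.lemmatize(w) for w in text.split())

for df in (train, test):
    df["OriginalTweet"] = df["OriginalTweet"].apply(lemmatize)
```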

Splitting and vectorization:

I imported the scikit-learn utilities that handle the splitting and vectorization of the datasets.

I split the dataset and vectorized the text. I chose the TF-IDF vectorizer because it weighs each word by how frequent it is in a tweet relative to how common it is across the whole corpus. I then visualized the features, their weights, and their frequency of occurrence.
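A sketch of the split and TF-IDF step; the split ratio and random seed are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_val, y_train, y_val = train_test_split(
    train["OriginalTweet"], train["Sentiment"],
    test_size=0.2, random_state=42,
)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # fit on training text only
X_val_vec = vectorizer.transform(X_val)

# Vocabulary size, i.e. the number of TF-IDF features
print(len(vectorizer.get_feature_names_out()))
```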

Building and Training:

Logistic Regression Model:

I imported the libraries required for modeling.

Running the classification report on the training dataset gave the following result.

I also tested on the validation dataset using two parameter settings. The first setting, penalty = 'l1' with solver = 'lbfgs', gave a lower accuracy. I then switched to the second setting, penalty = 'l2' with solver = 'saga', and got a higher accuracy.
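A sketch of the better-performing configuration; max_iter is an assumption added so the saga solver converges:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

clf = LogisticRegression(penalty="l2", solver="saga", max_iter=1000)
clf.fit(X_train_vec, y_train)

val_pred = clf.predict(X_val_vec)
print(accuracy_score(y_val, val_pred))
print(classification_report(y_val, val_pred))
```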

To check the report on the test set, I first had to vectorize it with the fitted vectorizer. Making predictions with the model produced the following report.
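In code, that step might look like:

```python
from sklearn.metrics import classification_report

# Transform the test tweets with the already-fitted vectorizer
X_test_vec = vectorizer.transform(test["OriginalTweet"])
test_pred = clf.predict(X_test_vec)
print(classification_report(test["Sentiment"], test_pred))
```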

Model visualization:

I used seaborn and a confusion matrix to visualize the result.
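A typical heatmap of the confusion matrix, assuming the three merged labels:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

labels = ["Negative", "Neutral", "Positive"]
cm = confusion_matrix(test["Sentiment"], test_pred, labels=labels)

sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```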

LSTM Model:

This is the second model on the same dataset; here I used a tokenizer to convert the text into integer sequences. The required libraries are imported first.

I also padded the sequences so they all have equal length.
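A sketch with Keras' tokenizer and padding utilities; the vocabulary size and sequence length are assumptions, not the original values:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 20000  # assumed
max_len = 60        # assumed

tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(train["OriginalTweet"])

X_seq = tokenizer.texts_to_sequences(train["OriginalTweet"])
X_pad = pad_sequences(X_seq, maxlen=max_len, padding="post")

# Integer-encode the labels for sparse categorical cross-entropy
label_map = {"Negative": 0, "Neutral": 1, "Positive": 2}
y = train["Sentiment"].map(label_map).values
```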

Building LSTM model:

I used the Sequential API to build the network and compiled the model with sparse categorical cross-entropy loss, the Adam optimizer, and accuracy as the metric. Summary of the model:
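A minimal architecture consistent with that description; the layer sizes are assumptions, since the original summary is not shown:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(vocab_size, 64, input_length=max_len),
    LSTM(64),
    Dense(3, activation="softmax"),  # three sentiment classes
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
model.summary()
```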

I split the dataset and trained the model for two epochs.
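For example (validation fraction and batch size assumed):

```python
from sklearn.model_selection import train_test_split

X_tr, X_val_seq, y_tr, y_val_seq = train_test_split(
    X_pad, y, test_size=0.2, random_state=42)

history = model.fit(X_tr, y_tr,
                    validation_data=(X_val_seq, y_val_seq),
                    epochs=2, batch_size=64)
```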

Prediction:

I made predictions on the test dataset after padding it and got roughly 84% accuracy (83.99%).
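The evaluation step might look like:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad the test tweets with the same tokenizer and length as training
X_test_pad = pad_sequences(
    tokenizer.texts_to_sequences(test["OriginalTweet"]),
    maxlen=max_len, padding="post")
y_test = test["Sentiment"].map(label_map).values

loss, acc = model.evaluate(X_test_pad, y_test)
print(f"Test accuracy: {acc:.2%}")  # about 84% in this run
```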

Visualization:

Using a confusion matrix and seaborn, I visualized the model's results.
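Since the network outputs softmax probabilities, the predictions are converted to class indices first; a sketch:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# argmax over the softmax outputs gives the predicted class index
lstm_pred = np.argmax(model.predict(X_test_pad), axis=1)

cm = confusion_matrix(y_test, lstm_pred)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=list(label_map), yticklabels=list(label_map))
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```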

Conclusion:

This was a take-home assignment from an interview earlier this week. The notebook can be found on GitHub and Kaggle. For suggestions or corrections, I can be reached through LinkedIn.

Thank you for reading.