Background:
Using the COVID-19 dataset from Kaggle, build sentiment analysis models using logistic regression and an LSTM.
Data Preprocessing:
I imported the libraries needed for preprocessing.
I loaded the CSV file with latin1 encoding, since reading it as UTF-8 (or with no encoding specified) raises errors. I used the .info() method to get an overview of the dataset.
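The encoding issue can be sketched with an in-memory byte string standing in for the real CSV files (the column names here are assumptions about the dataset):

```python
import io

import pandas as pd

# "café" contains a latin-1 byte (0xE9) that is invalid as UTF-8, so
# encoding="latin1" is needed; pandas' default UTF-8 decode would error.
raw = "UserName,OriginalTweet,Sentiment\n1,café prices soar,Negative\n".encode("latin1")

df = pd.read_csv(io.BytesIO(raw), encoding="latin1")
df.info()          # column names, dtypes, non-null counts
print(df.head())   # first rows of the frame
```

With the real files, the same call is `pd.read_csv("...csv", encoding="latin1")`.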
Then I checked the first five rows of both the training and test datasets to see the heading of each column.
Looking at the Sentiment column, it has five classes: Positive, Negative, Extremely Positive, Extremely Negative and Neutral. I reduced these five classes to three: Positive, Negative and Neutral, folding the extreme labels into their base classes, and applied the change to both the training and test sets. Using matplotlib.pyplot, I visualized the updated distribution.
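The five-to-three mapping can be sketched with pandas' `replace`; the exact label spellings are assumptions about the dataset:

```python
import pandas as pd

# Tiny stand-in for the Sentiment column of the real training set.
train = pd.DataFrame({"Sentiment": [
    "Extremely Positive", "Positive", "Neutral",
    "Negative", "Extremely Negative",
]})

# Fold the extreme labels into their base classes.
mapping = {"Extremely Positive": "Positive", "Extremely Negative": "Negative"}
train["Sentiment"] = train["Sentiment"].replace(mapping)

print(train["Sentiment"].value_counts())
```

The same `replace(mapping)` call would be applied to the test set as well, so both share the three-class label space.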
I checked for null values in both the training and test sets and found that the Location column had nulls. The training and test sets had shapes (41157, 6) and (3798, 6) (rows, columns) respectively.
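The null check itself is a one-liner; this toy frame only mimics the Location column's missing values:

```python
import numpy as np
import pandas as pd

# Two-row stand-in for the training set; Location has a missing value.
train = pd.DataFrame({
    "OriginalTweet": ["stores running empty", "stay home stay safe"],
    "Location": ["London", np.nan],
})

print(train.isnull().sum())   # per-column null counts
print(train.shape)            # (rows, columns)
```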
Data Cleaning:
This project works with a text dataset, and the cleaning has to be thorough to prevent errors downstream. I imported the libraries required for the cleaning.
I removed stop words along with hashtags, URLs, mentions, digits and other noise, and applied these changes to both the training and test sets. The analysis is based on the Tweet and Sentiment columns, so those were kept and the others dropped. I then visualized the updated dataset.
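A self-contained sketch of the cleaning step; the stop-word list here is a small illustrative subset, not the full NLTK list the notebook presumably uses:

```python
import re

# Illustrative stop-word subset (an assumption, not the notebook's real list).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "of", "in"}

def clean_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"\d+", " ", text)                    # digits
    text = re.sub(r"[^a-z\s]", " ", text)               # leftover punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("Check https://example.com @user #covid19 99 cases ARE rising!"))
# → "check cases rising"
```

In the notebook, a function like this would be mapped over the Tweet column of both the training and test sets.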
I lemmatized both the training and the test sets.
Splitting and vectorization:
I imported the scikit-learn modules that handle the splitting and vectorization of the datasets.
I split the datasets and vectorized the text. The vectorizer I used was the TF-IDF vectorizer, because it weights each word by its frequency in a tweet relative to its frequency across the whole corpus. I visualized the features, their weights and their frequencies of occurrence.
Building and Training:
Logistic Regression Model:
I imported the libraries required to do the work
Putting the classification report to work, I got the following result on the training dataset.
I also tested on the validation dataset with two parameter configurations. The first, penalty = 'l1' with solver = 'lbfgs', gave lower accuracy. I then switched to the second, penalty = 'l2' with solver = 'saga', and got higher accuracy, shown below.
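A hedged sketch of fitting the classifier with explicit penalty/solver settings (the toy texts and labels are invented). Note that in scikit-learn, solver='lbfgs' supports only the l2 penalty; an l1 penalty requires solver='saga' or 'liblinear':

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Toy data; the real features come from the TF-IDF step above.
texts = ["stores empty", "stay safe", "panic buying",
         "feeling hopeful", "prices rising", "community support"]
labels = ["Negative", "Positive", "Negative",
          "Positive", "Negative", "Positive"]

X = TfidfVectorizer().fit_transform(texts)

# saga supports both l1 and l2 penalties, so either can be compared here.
clf = LogisticRegression(penalty="l2", solver="saga", max_iter=1000)
clf.fit(X, labels)

print(classification_report(labels, clf.predict(X)))
```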
Before checking the report on the test set, I had to vectorize it with the already-fitted vectorizer. Making predictions with the model gave the following report.
Model visualization:
I used seaborn and a confusion matrix to visualize the results.
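The confusion-matrix heatmap can be sketched as follows; the true and predicted labels here are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted labels standing in for model output.
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
y_pred = ["Positive", "Negative", "Positive", "Positive", "Neutral"]

labels = ["Negative", "Neutral", "Positive"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.savefig("confusion_matrix.png")
```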
LSTM Model:
This is the second model on the same dataset; here I used a tokenizer to convert the tweets into integer sequences. Below are the required libraries.
I also padded the sequences so they all have equal length.
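Tokenizing and padding can be sketched with the Keras preprocessing utilities; the vocabulary cap and maximum length here are arbitrary small values, not the notebook's settings:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["stay home stay safe", "stores running empty"]

# num_words caps the vocabulary; oov_token handles unseen words at test time.
tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=6, padding="post")

print(padded)  # every row now has length 6
```

The same fitted tokenizer would be reused on the test tweets so both sets share one vocabulary.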
Building LSTM model:
I built the neural network with the Sequential model and compiled it with sparse categorical cross-entropy loss, the Adam optimizer and the 'accuracy' metric. Summary of the model:
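A minimal sketch of such a model; the vocabulary size, embedding width, layer sizes and sequence length are assumptions, with three output classes for Positive/Negative/Neutral:

```python
from tensorflow.keras.layers import LSTM, Dense, Embedding, Input
from tensorflow.keras.models import Sequential

VOCAB_SIZE = 1000   # assumed vocabulary cap from the tokenizer
MAX_LEN = 6         # assumed padded sequence length
NUM_CLASSES = 3     # Positive / Negative / Neutral

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, 64),   # token ids -> dense vectors
    LSTM(64),                    # sequence -> single hidden state
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
model.summary()
```

Sparse categorical cross-entropy fits here because the labels are integer class ids rather than one-hot vectors.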
I split the dataset and trained the model for two epochs.
Prediction:
I made predictions on the test dataset after padding it and got roughly 84% accuracy (0.8399).
Visualization:
Using a confusion matrix and seaborn, I visualized the model's predictions.
Conclusion:
This was a take-home assignment from an interview earlier this week. The notebook can be found on GitHub and Kaggle. For suggestions or corrections, I can be reached through LinkedIn.
Thank you for reading.