GitHub - uxa/CS410-final-project: UIUC MCS-DS Fall 2019 final project

CS410 Final Project, Fall 2019

Team Wolfram

Pramav Velamakanni (pranavv2@illinois.edu), Tarik Koric (koric1@illinois.edu)

Introduction

The aim of this project is to design a system that provides live sentiment analysis on a stream of tweets from Twitter. This is achieved by training three models (Logistic Regression, Naive Bayes, and a neural network) on a dataset of 1.6 million tweets.

The CSV file containing the tweets can be downloaded from https://www.kaggle.com/kazanova/sentiment140/download

To retain the state of the trained models and to provide predictions quickly, the trained models and the TF-IDF vectorizer are saved to binary files using Pickle. They are loaded into memory when the application starts. Live tweets are then streamed from Twitter and passed to the models to predict their sentiment. The application supports several arguments, which are discussed below.

The final prediction is based on the individual predictions of the three models. Models with higher accuracy are given a higher weight in the final prediction probability.
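The weighting can be pictured as a weighted vote. The sketch below is an illustration, not the code in app.py: the 40/35/25 weights are an assumption inferred from the sample output in the Examples section (for instance, Naive Bayes and the neural network out-voting Logistic Regression gives 60.0%), and the names `WEIGHTS` and `weighted_vote` are hypothetical.

```python
# Hypothetical weights (percent of the vote), inferred from the sample output.
# The more accurate a model is, the larger its share.
WEIGHTS = {"LogisticRegression": 40, "NaiveBayes": 35, "NeuralNetwork": 25}


def weighted_vote(votes, weights=WEIGHTS):
    """Combine per-model labels into one label and a confidence in percent."""
    total = sum(weights[name] for name in votes)
    positive = sum(
        weights[name] for name, label in votes.items() if label == "Positive"
    )
    negative = total - positive
    if positive >= negative:
        return "Positive", round(100 * positive / total, 1)
    return "Negative", round(100 * negative / total, 1)


label, confidence = weighted_vote(
    {
        "LogisticRegression": "Negative",
        "NaiveBayes": "Positive",
        "NeuralNetwork": "Positive",
    }
)
print(f"Prediction: {label} with a probability of {confidence}%")
```

With these assumed weights, the reported probability is simply the combined weight of the models that agree with the winning label.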

Tools used in this project

Python 3.7
Jupyter - notebooks used to train and test the models
Pickle - used to save the trained models and vectors as binary files
Pandas, NumPy - load and manipulate data using DataFrames
NLTK - used in data pre-processing and cleaning
Scikit-learn - machine learning algorithm toolkit
Tweepy - Python client for the Twitter API, used to stream live tweets
Matplotlib - tool to visualize the results

Set up the environment to use this application

Method 1 (pip)

Please ensure you have Python 3 installed.

The following command can be run to install all the dependencies (using pip) needed for this app to run.

pip install --upgrade jupyter pandas numpy nltk scikit-learn tweepy matplotlib

Method 2 (Conda environment)

If you have Anaconda installed, the dependencies can be installed into a custom environment (ideal if you have other projects using different versions of the libraries).

Create the environment: conda create -n TeamWolfram python=3.7

Activate the environment: conda activate TeamWolfram

From this project workspace execute: pip install --requirement requirements.txt

Download stopwords package

From the terminal, run: python -c "import nltk; nltk.download('stopwords')"

Files in this workspace

app.py - Main application file that interacts with the tweets and the models
TrainModel.ipynb - This notebook contains the pre-processing and model training
requirements.txt - File containing the Python requirements for this project
Test/ directory (Misc: tests performed while testing and tweaking the application)
- Test.ipynb - Notebook containing test code to unpack and load the model for predictions
- twitter_analysis.py - Initial tests using the Twitter API and the trained models
- twitter_api.py - Initial tests setting up the Twitter API
- TweetStreamAnalysis.txt - Test file containing tweets saved after running a stream
- TweetSummaryPlot.png - Test pie chart generated from the stream predictions
Pickled data/ directory
- LR.pickle - Pickled trained Logistic regression model
- naive-bayes.pickle - Pickled trained Naive Bayes model
- nn.pickle - Pickled trained Neural Network model
- vector.pickle - Pickled TF-IDF vector to transform the data
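As a rough sketch of how these four files might be loaded at start-up: the file names come from the list above, while the helper name `load_artifacts`, the short keys, and the default directory path are assumptions made for illustration rather than the code in app.py.

```python
import pickle
from pathlib import Path

# File names as listed above; the short keys are hypothetical.
ARTIFACTS = {
    "lr": "LR.pickle",
    "nb": "naive-bayes.pickle",
    "nn": "nn.pickle",
    "vectorizer": "vector.pickle",
}


def load_artifacts(directory="Pickled data"):
    """Unpickle every artifact in `directory` and return them keyed by short name."""
    loaded = {}
    for key, filename in ARTIFACTS.items():
        with open(Path(directory) / filename, "rb") as handle:
            loaded[key] = pickle.load(handle)
    return loaded
```

Assuming the pickles hold scikit-learn objects, a prediction would then look like `artifacts["lr"].predict(artifacts["vectorizer"].transform([cleaned_tweet]))`. Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code.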

Models and achieved accuracy

Logistic Regression - 77%
Naive Bayes - 76%
Neural Network - 71%

Data and pre-processing

1.6 million individual tweets, each labeled 1 (Positive) or 0 (Negative)
Data cleaning involved the following steps:
- Convert the tweet to lowercase, remove stopwords
- Remove the hashtag symbol (#)
- Remove @ mentions, websites
- Perform stemming
TF-IDF vectorizer with the following settings:
- 10,000 max features
- Unigrams and bigrams (n-gram range 1-2)
- L2 normalization
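The cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the order of the steps, the tiny stand-in stopword set, the whitespace tokenization, and the name `clean_tweet` are all hypothetical, and stemming is left pluggable so the example runs without NLTK.

```python
import re

# Tiny stand-in stopword list so the sketch runs without downloads;
# the project uses NLTK's stopwords package instead.
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "on", "at"}


def clean_tweet(text, stopwords=STOPWORDS, stem=None):
    """Lowercase, strip websites, @ mentions and '#', drop stopwords, optionally stem."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove websites
    text = re.sub(r"@\w+", " ", text)  # remove @ mentions
    text = text.replace("#", "")  # keep the hashtag word, drop the symbol
    tokens = [token for token in text.split() if token not in stopwords]
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    return " ".join(tokens)


print(clean_tweet("@Ford The #cybertruck is AMAZING https://t.co/abc"))
```

With NLTK installed and the stopwords package downloaded, one could pass `set(nltk.corpus.stopwords.words("english"))` as `stopwords` and, for example, `nltk.stem.PorterStemmer().stem` as `stem`. The listed TF-IDF settings correspond to scikit-learn's `TfidfVectorizer(max_features=10000, ngram_range=(1, 2), norm="l2")`.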

Modules used

Data processing: NumPy, Pandas, NLTK
Analysis: Scikit-learn
Model packaging: Pickle
Twitter API: Tweepy

How to use the app

app.py is a command-line app that supports the following arguments:
- Tweets from a specific user
  - --user or -u: username to fetch tweets from, without the @ (example: elonmusk)
  - --count or -c: number of tweets to fetch and analyze (example: 5; defaults to 10)
- Stream tweets for a list of topics
  - --stream: list of topics to stream live tweets for and analyze (example: "trump" "Tesla" "Penguins")
  - --time: total duration of the stream in seconds (example: 10; defaults to 20)
  - --file: save the tweets and their analysis to a file named TweetStreamAnalysis.txt in the current workspace
  - --visualize: visualize the predictions as a pie chart; the chart is saved to a file when the --file flag is also used
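For readers who want to see how such an interface fits together, here is a minimal `argparse` sketch that mirrors the documented flags and defaults. It is not the parser from app.py; the function name `build_parser`, the help strings, and the integer types are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (a sketch, not app.py itself)."""
    parser = argparse.ArgumentParser(description="Tweet sentiment analysis")
    parser.add_argument("-u", "--user", help="username to fetch tweets from, without the @")
    parser.add_argument("-c", "--count", type=int, default=10, help="number of tweets to analyze")
    parser.add_argument("--stream", nargs="+", metavar="TOPIC", help="topics to stream")
    parser.add_argument("--time", type=int, default=20, help="stream duration in seconds")
    parser.add_argument("--file", action="store_true", help="save tweets and analysis to a file")
    parser.add_argument("--visualize", action="store_true", help="plot predictions as a pie chart")
    return parser


# Parse the first command from the Examples section.
args = build_parser().parse_args(
    ["--stream", "trump", "Pittsburgh Penguins", "NHL", "--time", "10", "--file"]
)
print(args.stream, args.time, args.file)
```

Using `nargs="+"` is what lets `--stream` accept several quoted topics in one invocation, including multi-word ones such as "Pittsburgh Penguins".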

Please note: this app ships with default API access keys so that it can be tested out of the box. For extensive usage, however, it is recommended to replace them with your own keys in app.py. Instructions to generate new keys can be found here.

Examples

Using Streams with multiple topics for a period of 10 seconds

❯ python app.py --stream "trump" "Pittsburgh Penguins" "NHL" --time 10 --file
Tweet: RT @EgSophie: For reference: this is Jovi Val, on the left posing in front of a swastika and giving the Nazi salute, and on the right in hi…
LogisticRegression: Positive
NaiveBayes: Positive
NeuralNetwork: Positive
Prediction: Positive with a probability of 100.0%

Tweet: A year long physical 🤔  maybe plastic surgery 😅😂🤣 President Trump began phase one of his annual physical at Walter… https://t.co/HKpuHODmLV
LogisticRegression: Negative
NaiveBayes: Positive
NeuralNetwork: Positive
Prediction: Positive with a probability of 60.0%

Tweet: RT @michellemalkin: The EB-5 racket is a ghastly selling out of US citizenship to the highest foreign bidders. John Miano & I exposed the s…
LogisticRegression: Positive
NaiveBayes: Positive
NeuralNetwork: Negative
Prediction: Positive with a probability of 75.0%

Tweet: RT @ThePlumLineGS: Important exchange here: https://t.co/WxhLeO2AWx
LogisticRegression: Positive
NaiveBayes: Positive
NeuralNetwork: Positive
Prediction: Positive with a probability of 100.0%

Tweet: RT @maggie_pdx: @brianklaas Every day I wonder what other 'favors' Trump has attempted collection on in service of his 2020 reelection.
LogisticRegression: Positive
NaiveBayes: Negative
NeuralNetwork: Positive
Prediction: Positive with a probability of 65.0%

Tweet: RT @PressSec: Very well said! If the dems had the votes they wouldn’t be prolonging this charade. They’re just working with their partners…
LogisticRegression: Positive
NaiveBayes: Positive
NeuralNetwork: Positive
Prediction: Positive with a probability of 100.0%

Picking a specific user and fetching the last 20 tweets

❯ python app.py --user elonmusk --count 20
Tweet: @farrxy @Ford I’d be way too embarrassed to put that on a Tesla. It’s like a kid’s drawing.
LogisticRegression: Negative
NaiveBayes: Positive
NeuralNetwork: Negative
Prediction: Negative with a probability of 65.0%

Tweet: @Ford Congratulations on the Mach E! Sustainable/electric cars are the future!! Excited to see this announcement fr… https://t.co/vlFHJeb7Mt
LogisticRegression: Positive
NaiveBayes: Negative
NeuralNetwork: Positive
Prediction: Positive with a probability of 65.0%

Tweet: @flcnhvy Exactly! Well said.
LogisticRegression: Positive
NaiveBayes: Negative
NeuralNetwork: Negative
Prediction: Negative with a probability of 60.0%

Tweet: @cleantechnica Surprisingly common
LogisticRegression: Positive
NaiveBayes: Negative
NeuralNetwork: Positive
Prediction: Positive with a probability of 65.0%

Streaming topics and visualizing the results

❯ python app.py --stream "tesla" "elon musk" "cybertruck" --time 15 --visualize
Tweet: AHAHAHAHAHAHAHAHA but real talk, the #cybertruck is my favorite new car to be released since probably the ND miata.
LogisticRegression: Positive
NaiveBayes: Positive
NeuralNetwork: Negative
Prediction: Positive with a probability of 75.0%

Tweet: Metro Boomin turn this hoe into a mosh pit, Tesla build got my flying like a cockpit
LogisticRegression: Negative
NaiveBayes: Positive
NeuralNetwork: Positive
Prediction: Positive with a probability of 60.0%

Tweet: RT @jfagone: His company SpaceX could have minimized the interference by simply painting the satellites black. But they didn’t do that. htt…
LogisticRegression: Positive
NaiveBayes: Negative
NeuralNetwork: Negative
Prediction: Negative with a probability of 60.0%

Tweet: RT @CNN: A Ford executive, reacting to a video of Tesla's all-electric Cybertruck winning a tug-of-war against a Ford F-150, challenged Tes…
LogisticRegression: Positive
NaiveBayes: Positive
NeuralNetwork: Positive
Prediction: Positive with a probability of 100.0%
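The pie chart produced by --visualize boils down to the share of each predicted label. A small sketch of that summary, using the four predictions shown above, might look like this; the function names and the output file name are assumptions, and Matplotlib is imported lazily so the tally works without it.

```python
from collections import Counter


def summarize(labels):
    """Return each label's share of the predictions, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


def save_pie(shares, path="TweetSummaryPlot.png"):
    """Draw the shares as a pie chart and save it; requires Matplotlib."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file without opening a window
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.pie(list(shares.values()), labels=list(shares.keys()), autopct="%1.1f%%")
    ax.set_title("Predicted sentiment")
    fig.savefig(path)
    plt.close(fig)


# Final predictions from the stream above: three Positive, one Negative.
shares = summarize(["Positive", "Positive", "Negative", "Positive"])
print(shares)
```

Calling `save_pie(shares)` would then write the chart to disk.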
