
ML Project 2: Twitter Sentiment Classification

Repository for the second project in Machine Learning, EPFL 2016/2017.

Project: Text classification
Authors: Emanuele Bugliarello, Manik Garg, Zander Harteveld
Team: Memory Error

Overview

This repository contains the material submitted on December 22 and consists of the following folders:

  • data: contains the Twitter data files from Kaggle
  • code: contains the Python files used to train the model and generate new predictions. Details about the files are available in the README inside the code folder.
  • report: contains the submitted report and the files used to generate it.

Dependencies

The code is written in Python 3, which you can download from the official Python website (we recommend a distribution such as Anaconda, which already ships with many of the required libraries).
The required libraries are:

  • NumPy (>= 1.6.1): you can install it by typing pip install -U numpy in a terminal (it is included with Anaconda).

  • NLTK (3.0): you can install it by typing pip install -U nltk in a terminal.

  • NLTK packages: you can download all the NLTK packages by typing python in a terminal and then running:

    import nltk
    nltk.download('all')

    This automatically installs all the NLTK packages. Note that downloading the panlex_lite package takes a long time; you can safely stop the execution at that point, since the packages needed by our scripts will already have been installed.

  • SciPy (>= 0.9): you can install it by typing pip install -U scipy in a terminal (it is included with Anaconda).

  • scikit-learn (0.18.1): you can install it by typing pip install -U scikit-learn (or conda install scikit-learn if you use Anaconda) in a terminal.

Methodology

The final model consists of a Logistic Regression classifier.
We apply the following pre-processing steps before feeding the data into the classifier (a sketch of the resulting tokenizer follows the list):

  1. Remove the pound sign (#) in front of words
  2. Stem words (by using EnglishStemmer from nltk.stem.snowball)
  3. Replace two or more consecutive repetitions of a letter with two of the same
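
A minimal sketch of such a tokenizer (the function name tokenize, the regular expression and the exact order of the steps are our own illustration; the actual implementation is in the code folder):

    import re
    from nltk.stem.snowball import EnglishStemmer

    stemmer = EnglishStemmer()

    def tokenize(text):
        # Drop the pound sign in front of hashtag words
        text = text.replace('#', '')
        # Collapse two or more consecutive repetitions of a letter into two
        # (e.g. "coooool" -> "cool")
        text = re.sub(r'(\w)\1+', r'\1\1', text)
        # Split on whitespace and stem each token
        return [stemmer.stem(token) for token in text.split()]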

We then convert the collection of text documents to a matrix of token counts. We do this with CountVectorizer from sklearn.feature_extraction.text, with the following hyperparameters:

  • analyzer = 'word'
  • tokenizer = tokenize (function that tokenizes the text by applying the pre-processing steps described above)
  • lowercase = True
  • ngram_range = (1,3)
  • max_df = 0.9261187281287935
  • min_df = 4

After that, we transform the count matrix to a normalized tf-idf representation with TfidfTransformer from sklearn.feature_extraction.text.

Finally, we feed this representation into the LogisticRegression classifier from sklearn.linear_model, parameterized with the following value for the inverse of the regularization strength (a combined sketch of the whole model follows):

  • C = 3.41
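
Putting this together, a minimal sketch of the full model (reusing the tokenize function sketched above; tweets and labels are placeholders for the training tweets and their sentiment labels, e.g. +1/-1):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression

    # Token counts with the hyperparameters listed above
    vectorizer = CountVectorizer(analyzer='word',
                                 tokenizer=tokenize,
                                 lowercase=True,
                                 ngram_range=(1, 3),
                                 max_df=0.9261187281287935,
                                 min_df=4)
    counts = vectorizer.fit_transform(tweets)        # tweets: list of raw tweet strings

    # Normalized tf-idf representation of the counts
    features = TfidfTransformer().fit_transform(counts)

    # Logistic Regression with inverse regularization strength C = 3.41
    classifier = LogisticRegression(C=3.41)
    classifier.fit(features, labels)                 # labels: one sentiment label per tweet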

Kaggle result reproduction

In order to generate the top Kaggle submission, please ensure all Python requirements are installed and then run:

cd code
python run.py

This makes use of the pre-trained classifier available in the code/models folder to predict labels for new tweets and store them in a .csv file in the code/results folder. The default test data file is data/test_data.txt (the one provided for the Kaggle competition) but it can be easily changed in code/run.py.
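
For reference, the prediction step follows this general pattern (a rough, hypothetical sketch: the model file name, the use of joblib for persistence, the test file format and the submission column names are assumptions and may differ from what code/run.py actually does):

    import csv
    from sklearn.externals import joblib

    # Assumes the persisted object is a full pipeline (tokenizer + vectorizer + classifier)
    model = joblib.load('models/model.pkl')                # hypothetical file name

    ids, tweets = [], []
    with open('../data/test_data.txt', encoding='utf-8') as f:
        for line in f:
            tweet_id, tweet = line.split(',', 1)           # assuming "<id>,<tweet>" lines
            ids.append(tweet_id)
            tweets.append(tweet)

    predictions = model.predict(tweets)

    with open('results/predictions.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['Id', 'Prediction'])              # assumed submission header
        for tweet_id, label in zip(ids, predictions):
            writer.writerow([tweet_id, label])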

Training from scratch

You can also train from scratch the classifier that we use for the top Kaggle submission. To do so:

  1. Ensure all Python requirements are installed
  2. Ensure the Twitter data files from Kaggle are in the data/ folder.
  3. Run:

     cd code
     python train.py

This script uses data/train_pos_full.txt and data/train_neg_full.txt (the data files from the Kaggle competition) as training sets and creates a model in the code/models folder. Running it takes between 50 and 60 minutes, depending on your machine: around 50 minutes for pre-processing and around 10 minutes for fitting the classifier.
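
In outline, the training step can be sketched as follows (again a hypothetical sketch: it reuses the tokenize function from the Methodology section, wraps the same hyperparameters in a scikit-learn Pipeline, and assumes the fitted model is persisted with joblib under a made-up file name):

    from sklearn.externals import joblib
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression

    def read_tweets(path):
        with open(path, encoding='utf-8') as f:
            return f.read().splitlines()

    pos = read_tweets('../data/train_pos_full.txt')
    neg = read_tweets('../data/train_neg_full.txt')
    tweets = pos + neg
    labels = [1] * len(pos) + [-1] * len(neg)

    # Same hyperparameters as in the Methodology section, wrapped in a single pipeline
    model = Pipeline([
        ('counts', CountVectorizer(analyzer='word', tokenizer=tokenize, lowercase=True,
                                   ngram_range=(1, 3), max_df=0.9261187281287935, min_df=4)),
        ('tfidf', TfidfTransformer()),
        ('clf', LogisticRegression(C=3.41)),
    ])
    model.fit(tweets, labels)
    joblib.dump(model, 'models/model.pkl')                 # hypothetical file name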

You can then predict labels for new data as described in the previous section.