Repository for the second project in Machine Learning, EPFL 2016/2017.
Project: Text classification
Authors: Emanuele Bugliarello, Manik Garg, Zander Harteveld
Team: Memory Error
_For an easier examination by the EPFL TAs, this content is also available on [GitHub](http://github.com/e-bug/pcml-project2 "GitHub repository for Memory Error's code") (publicly after the deadline)._
## Overview

This repository contains the material shipped on December 22 and consists of the following folders:

- `data`: contains the [Twitter data files from Kaggle](https://inclass.kaggle.com/c/epfml-text/data)
- `code`: contains the Python files used to train the model and generate new predictions. Details about the files are available in the [README](code/README.md) inside the `code` folder.
- `report`: contains the submitted report and the files used to generate it.
## Requirements

The code is written in Python 3, which you can download from [python.org](https://www.python.org) (we recommend installing a distribution such as Anaconda, which already comes with many of the required libraries).

The required libraries are:

- NumPy (>= 1.6.1): you can install it by typing `pip install -U numpy` on the terminal (it is included with Anaconda).
- NLTK (3.0): you can install it by typing `pip install -U nltk` on the terminal.
- NLTK packages: you can download all the NLTK packages by typing `python` on the terminal and then running:

  ```python
  import nltk
  nltk.download('all')
  ```

  This automatically installs all the NLTK packages. Note that downloading the `panlex_lite` package takes a long time, but you can stop the execution at that point: the packages needed by our scripts will already have been installed.
- SciPy (>= 0.9): you can install it by typing `pip install -U scipy` on the terminal (it is included with Anaconda).
- scikit-learn (0.18.1): you can install it by typing `pip install -U scikit-learn` on the terminal, or `conda install scikit-learn` if you use Anaconda.
## Model

The final model consists of a Logistic Regression classifier.

We apply the following pre-processing steps before feeding the data into the classifier (a minimal sketch of the resulting tokenizer follows this list):

- Remove the pound sign (#) in front of words
- Stem words (using `EnglishStemmer` from `nltk.stem.snowball`)
- Replace two or more consecutive repetitions of a letter with two of the same (e.g. "coooool" becomes "cool")
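For concreteness, here is a minimal sketch of such a tokenizer, assuming plain whitespace splitting; the actual `tokenize` function in the `code` folder may differ in its details:

```python
import re
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()

def tokenize(text):
    # Remove the pound sign in front of words
    text = re.sub(r'#(\w+)', r'\1', text)
    # Replace two or more consecutive repetitions of a letter with two
    # of the same, e.g. "coooool" -> "cool"
    text = re.sub(r'(\w)\1{2,}', r'\1\1', text)
    # Split on whitespace and stem each token
    return [stemmer.stem(token) for token in text.split()]
```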
We then convert the collection of text documents to a matrix of token counts. We do this with `CountVectorizer` from `sklearn.feature_extraction.text`, with the following hyperparameters (combined in the sketch after this list):

- `analyzer = 'word'`
- `tokenizer = tokenize` (function that tokenizes the text by applying the pre-processing steps described above)
- `lowercase = True`
- `ngram_range = (1, 3)`
- `max_df = 0.9261187281287935`
- `min_df = 4`
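Put together, and reusing the `tokenize` sketch above, the vectorization step looks roughly as follows (`tweets` is assumed to be the list of raw training tweets):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             tokenizer=tokenize,  # pre-processing sketched above
                             lowercase=True,
                             ngram_range=(1, 3),
                             max_df=0.9261187281287935,
                             min_df=4)
counts = vectorizer.fit_transform(tweets)  # sparse matrix of token counts
```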
After that, we transform the count matrix to a normalized tf-idf representation with `TfidfTransformer` from `sklearn.feature_extraction.text`.
Finally, we feed this representation into the Logistic Regression classifier from `sklearn.linear_model`, parameterized with the following value of the inverse of regularization strength:

- `C = 3.41`
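Continuing the sketch from the `counts` matrix above, these last two steps amount to the following (`labels` is assumed to hold the sentiment label of each tweet):

```python
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Transform raw counts into a normalized tf-idf representation
tfidf = TfidfTransformer()
features = tfidf.fit_transform(counts)

# Logistic Regression with the tuned inverse regularization strength
classifier = LogisticRegression(C=3.41)
classifier.fit(features, labels)
```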
## Generating the top Kaggle submission

To generate the top Kaggle submission, please ensure all Python requirements are installed and then run:

```
cd code
python run.py
```

This makes use of the pre-trained classifier available in the `code/models` folder to predict labels for new tweets and store them in a `.csv` file in the `code/results` folder. The default test data file is `data/test_data.txt` (the one provided for the Kaggle competition), but it can be easily changed in `code/run.py`.
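For illustration, the prediction step boils down to something like the sketch below; the model file name, its serialization format, and the test-file layout are assumptions here, so refer to `code/run.py` for the actual code:

```python
import csv
import pickle

# Load the pre-trained model (assumed to be a pickled pipeline that
# accepts raw tweet strings)
with open('models/model.pkl', 'rb') as f:
    model = pickle.load(f)

# Assumed test-file layout: one "<id>,<tweet>" pair per line
with open('../data/test_data.txt', encoding='utf-8') as f:
    ids, tweets = zip(*(line.rstrip('\n').split(',', 1) for line in f))

predictions = model.predict(tweets)

# Write the predictions in the "Id,Prediction" format used by Kaggle
with open('results/submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id', 'Prediction'])
    writer.writerows(zip(ids, predictions))
```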
## Training the classifier

You can train the classifier that we use for the top Kaggle submission. To do so:

- Ensure all Python requirements are installed.
- Ensure the Twitter data files from Kaggle are in the `data/` folder.
- Run:

  ```
  cd code
  python train.py
  ```

This script makes use of `data/train_pos_full.txt` and `data/train_neg_full.txt` (data files from the Kaggle competition) as the training sets and creates a model in the `code/models` folder; a condensed sketch is shown below.
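For reference, here is a condensed sketch of the training procedure, reusing the `tokenize` function sketched earlier and chaining the steps with scikit-learn's `make_pipeline`; the actual `train.py` may be structured differently, and the output file name is an assumption:

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def read_tweets(path):
    with open(path, encoding='utf-8') as f:
        return f.read().splitlines()

# Training sets from the Kaggle competition
pos = read_tweets('../data/train_pos_full.txt')
neg = read_tweets('../data/train_neg_full.txt')
tweets = pos + neg
labels = [1] * len(pos) + [-1] * len(neg)

# Count vectorization, tf-idf weighting and classification, as described above
model = make_pipeline(
    CountVectorizer(analyzer='word', tokenizer=tokenize, lowercase=True,
                    ngram_range=(1, 3), max_df=0.9261187281287935, min_df=4),
    TfidfTransformer(),
    LogisticRegression(C=3.41),
)
model.fit(tweets, labels)

# Persist the fitted model (file name is an assumption)
with open('models/model.pkl', 'wb') as f:
    pickle.dump(model, f)
```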
Running `train.py` takes between 50 and 60 minutes: around 50 minutes for pre-processing and around 10 minutes for fitting the classifier, depending on your machine.
You can then predict labels for new data as described in the previous section.