This is an entry to Kaggle's Sentiment Analysis on Movie Reviews (SAMR) competition.
It's written for Python 3.3 and it's based on scikit-learn and nltk.
Quoting from Kaggle's description page:
This competition presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive.
Some examples:
- 4 (positive): "They works spectacularly well... A shiver-inducing, nerve-rattling ride."
- 3 (somewhat positive): "rooted in a sincere performance by the title character undergoing midlife crisis"
- 2 (neutral): "Its everything you would expect -- but nothing more."
- 1 (somewhat negative): "But it does not leave you with much."
- 0 (negative): "The movies progression into rambling incoherence gives new meaning to the phrase fatal script error."
So the goal of the competition is to produce an algorithm to classify phrases
into these categories. And that's what samr does.
After installing just run:
generate_kaggle_submission.py samr/data/model2.json > submission.csv
And that will generate a Kaggle submission file that scores near 0.65844 on the
leaderboard (should take 3 minutes, and as of 2014-07-22 that score is in 2nd place).
The model2.json argument above is a configuration file for samr that
determines how the scikit-learn pipeline is going to be built, along with other
hyperparameters. Here is how it looks:
{
"classifier":"randomforest",
"classifier_args":{"n_estimators": 100, "min_samples_leaf":10, "n_jobs":-1},
"lowercase":"true",
"map_to_synsets":"true",
"map_to_lex":"true",
"duplicates":"true"
}
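Note that the boolean options in this file are written as the strings "true" rather than as JSON booleans. As a minimal sketch of how such a file could be read, assuming nothing about how samr itself parses it, the hypothetical helper `load_config` below (not part of samr) loads the JSON and turns "true"/"false" strings into Python booleans:

```python
import json


def load_config(text):
    """Hypothetical helper (not part of samr): parse a configuration like the
    one above and convert "true"/"false" strings into Python booleans."""
    config = json.loads(text)
    parsed = {}
    for key, value in config.items():
        if isinstance(value, str) and value.lower() in ("true", "false"):
            parsed[key] = value.lower() == "true"
        else:
            parsed[key] = value
    return parsed


example = """
{
 "classifier": "randomforest",
 "classifier_args": {"n_estimators": 100, "min_samples_leaf": 10, "n_jobs": -1},
 "lowercase": "true",
 "map_to_synsets": "true",
 "map_to_lex": "true",
 "duplicates": "true"
}
"""

config = load_config(example)
print(config["classifier"], config["classifier_args"], config["lowercase"])
```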
You can try samr with different configuration files of your own (as long as the
options are implemented), yielding different and perhaps even better scores.
In particular, model2.json feeds a random forest classifier
with a concatenation of 3 kinds of features:
- The decision functions of a set of vanilla SGDClassifiers trained in a one-versus-others scheme using bag-of-words as features. It's a classifier inside a classifier, yo dawg!
- The decision functions of a set of vanilla SGDClassifiers trained in a one-versus-others scheme using bag-of-words on the wordnet synsets of the words in a phrase.
- The number of "positive" and "negative" words in a phrase as dictated by the Harvard Inquirer sentiment lexicon.
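To give a feel for the third kind of feature, here is a rough sketch of counting lexicon hits in a phrase. The tiny `POSITIVE` and `NEGATIVE` sets and the `lexicon_counts` function are made up for illustration; they are not the Harvard Inquirer lexicon, and samr's own tokenization and lookup may well differ:

```python
# Tiny stand-in lexicon (an assumption for illustration only); the real
# Harvard Inquirer lexicon is much larger.
POSITIVE = {"good", "sincere", "spectacularly", "well"}
NEGATIVE = {"bad", "rambling", "incoherence", "fatal", "error"}


def lexicon_counts(phrase):
    """Return (positive_count, negative_count) for a phrase, using simple
    lowercasing and whitespace tokenization."""
    tokens = phrase.lower().split()
    positive = sum(1 for token in tokens if token in POSITIVE)
    negative = sum(1 for token in tokens if token in NEGATIVE)
    return positive, negative


print(lexicon_counts("rooted in a sincere performance"))  # (1, 0)
print(lexicon_counts("rambling incoherence and a fatal script error"))  # (0, 4)
```

The two counts would then be appended to the other features before they reach the random forest.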
During prediction, it also checks for duplicates between the training set and the test set (there are quite a few).
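The duplicate check can be pictured as a lookup table built from the training phrases. This is only a sketch under assumptions: `predict_with_duplicates` is a hypothetical name, matching is done on the exact phrase string, and the `fallback` callable stands in for the trained classifier. The actual logic in samr may differ:

```python
def predict_with_duplicates(train_phrases, train_labels, test_phrases, fallback):
    """Reuse the training label when a test phrase also appears in the
    training data; otherwise ask the fallback classifier."""
    known = dict(zip(train_phrases, train_labels))
    return [known[p] if p in known else fallback(p) for p in test_phrases]


train_phrases = ["a good movie", "fatal script error"]
train_labels = [3, 0]
test_phrases = ["fatal script error", "never seen before"]

# The fallback here always answers 2 (neutral); a real one would be a model.
predictions = predict_with_duplicates(
    train_phrases, train_labels, test_phrases, lambda phrase: 2
)
print(predictions)  # [0, 2]
```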
And that's it! Want more details? See the code! It's only 350 lines.
If you know the drill, this should be enough:
git clone https://github.com/rafacarrascosa/samr.git
pip install -e samr -r samr/docs/setup/requirements-dev.txt
download_3rdparty_data.py
Then you will need to manually download train.tsv and test.tsv from the
competition's data folder and unzip them into the samr/data folder. You may be
asked to join Kaggle and/or accept the competition rules before downloading the data.
Even though samr is written for Python 3.3, it may also work with Python 2.7
(and the last time I checked it did), but this is not supported and it may
break in the future.
If the short instructions are not enough, read on.
These instructions will install the development version of samr inside a
Python 3.3 virtualenv. They were written for a blank, vanilla Ubuntu 14.04 and
tested using Docker (awesome tool btw). They should
work more or less unchanged with other Ubuntu versions and Debian-based OSs.
Open a console and 'cd' into an empty folder of your choice. Now, execute the following commands:
Install Python 3.3 and the compilation requirements for numpy and scipy:
sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:fkrull/deadsnakes
sudo apt-get update
sudo apt-get install -y python3.3 python3.3-dev python-scipy gfortran libopenblas-dev liblapack-dev git wget
Create the virtualenv, bootstrap pip and bootstrap numpy:
python3.3 -m venv venv
source venv/bin/activate
wget https://bootstrap.pypa.io/get-pip.py
python3.3 get-pip.py
echo 'PATH="$VIRTUAL_ENV/local/bin:$PATH"; export PATH' >> venv/bin/activate
source venv/bin/activate
pip install numpy==1.8.1
Clone and install samr:
git clone https://github.com/rafacarrascosa/samr.git
pip install -e samr -r samr/docs/setup/requirements-dev.txt
download_3rdparty_data.py
Optionally run the tests:
nosetests samr/tests
Lastly, you will need to manually download train.tsv and test.tsv from the
competition's data folder and unzip them into the samr/data folder. You may be
asked to join Kaggle and/or accept the competition rules before downloading the data.
The installation is self-contained (within the folder you chose at the start) with two exceptions:
- Lines starting with sudo apt-get made system-wide changes; to uninstall those you will need to use sudo apt-get remove.
- nltk downloads data to ~/nltk_data; once you no longer use nltk it's safe to erase that folder.
This project is open-source and BSD licensed, see the LICENSE file for details.
This license basically allows you to do anything, but in case you're wondering:
I'm ok if you use samr to beat my score at the competition, just share back
what you've learned!
This project was developed by Rafael Carrascosa, you can contact me at rafacarrascosa@gmail.com.