revisited-baselines

Improved baselines for sentence and document representations

This mini-project was undertaken as part of COMP-551 at McGill University.

The goal of this project was to revisit the claims made by Le and Mikolov about the performance of Paragraph Vectors in natural language processing applications. The authors claimed that Paragraph Vectors achieved state-of-the-art results on text classification and sentiment analysis tasks. To verify this claim, the best baselines referenced in their paper were reproduced. All comparisons were made on the IMDB sentiment dataset. An NB-SVM baseline was reproduced and then improved. The improved model achieved an accuracy of 92.096% on the test set, which is 0.876 percentage points above the baseline reported in the original article.
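For context, NB-SVM (Wang and Manning, 2012) scales binarized n-gram features by a Naive Bayes log-count ratio before fitting a linear classifier. The sketch below shows only that ratio, on a toy corpus in plain Python. The function names, the smoothing value alpha=1.0, and the toy reviews are illustrative assumptions; none of this is taken from pipeline.py.

```python
import math
from collections import Counter


def log_count_ratio(docs, labels, alpha=1.0):
    """Return {token: r} with r = log((p / |p|_1) / (q / |q|_1)).

    p and q are smoothed counts of positive and negative documents
    that contain each token (binarized features).
    """
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    pos, neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        target = pos if label == 1 else neg
        target.update(set(doc.split()))  # count each token once per document
    p = {t: alpha + pos[t] for t in vocab}
    q = {t: alpha + neg[t] for t in vocab}
    p_norm, q_norm = sum(p.values()), sum(q.values())
    return {t: math.log((p[t] / p_norm) / (q[t] / q_norm)) for t in vocab}


def scale_features(doc, ratios):
    """Binarized features of one document, scaled by the log-count ratio."""
    return {t: ratios[t] for t in set(doc.split()) if t in ratios}


# Toy data (hypothetical, not from the IMDB dataset)
docs = ["great movie", "great acting", "boring movie", "boring plot"]
labels = [1, 1, 0, 0]
ratios = log_count_ratio(docs, labels)
features = scale_features("great plot", ratios)
```

Tokens that appear mostly in positive reviews get a positive ratio, and tokens that appear mostly in negative reviews get a negative one. In a full NB-SVM, the scaled features would then be passed to a linear SVM or logistic regression; that step is omitted here.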

The following scripts were used:

data_load.py: loads the review comments

textprocessing.py: removes special characters and stop words, lemmatizes or stems words, etc.

pipeline.py: main file, used to generate predictions
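As a rough illustration of the kind of cleaning textprocessing.py is described as doing, here is a minimal sketch in plain Python. It is not the code from this repository: the stop-word list is a tiny hypothetical one, naive_stem is a crude stand-in for a real stemmer or lemmatizer, and the function names are made up for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "it", "this"}


def naive_stem(word):
    """Crude suffix stripper, used only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop HTML tags and special characters, remove stop words, stem."""
    text = re.sub(r"<[^>]+>", " ", text.lower())  # strip tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)         # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


tokens = preprocess("This movie was AMAZING!<br />The actors played it well.")
# tokens == ["movie", "amaz", "actor", "play", "well"]
```

The tag-stripping step is included because raw IMDB reviews contain <br /> markup. Whether stemming or lemmatization helps the final classifier is an empirical question; see writeup.pdf for what was actually used.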

See writeup.pdf for details on the methodology and results.

JSGrondin/revisited-baselines

Folders and files

Latest commit

History

Repository files navigation

revisited-baselines

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages