fasterText is a library for fast and accurate word embeddings. While it originated from fastText, it may not be fully compatible with it.
As prerequisites you will need:
- Intel Integrated Performance Primitives
- Boost tokenizer
- a modern C++ compiler (with good C++11 support)
$ wget https://github.com/eli-osherovich/fasterText/archive/master.zip
$ unzip master.zip
$ cd fasterText-master
$ make
This will produce object files for all the classes as well as the main binary fastertext.
If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).
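For example, a Makefile configured for a non-default compiler and a custom IPP location might start like this (the paths below are illustrative, not taken from the repository):

# use a locally installed compiler instead of the system-wide one
CC = /usr/local/bin/g++
# headers for IPP and Boost (adjust to your installation)
INCLUDES = -I/opt/intel/ipp/include -I/usr/local/include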
In order to learn word vectors, do:
$ ./fastertext skipgram -input data.txt -output model
where data.txt is a training file containing UTF-8 encoded text.
By default the word vectors will take into account character n-grams from 3 to 6 characters.
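Assuming fasterText keeps fastText's -minn and -maxn options, this range can be changed at training time, for example to use 2- to 5-character n-grams:

$ ./fastertext skipgram -input data.txt -output model -minn 2 -maxn 5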
At the end of optimization the program will save two files: model.bin and model.vec.
model.vec is a text file containing the word vectors, one per line.
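Assuming fasterText keeps fastText's output format, model.vec follows the common word2vec text convention: the first line gives the vocabulary size and the embedding dimensionality, and each subsequent line gives a word followed by its vector (the values below are made up):

218316 100
the 0.0172 -0.1431 0.2296 ...
of -0.0561 0.0934 -0.1102 ...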
model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
The binary file can be used later to compute word vectors or to restart the optimization.
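For example, assuming fasterText retains fastText's print-word-vectors subcommand, vectors can be queried from the binary model, including for out-of-vocabulary words (reconstructed from their character n-grams):

$ echo "someword" | ./fastertext print-word-vectors model.bin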
The following results were obtained (*) with the latest versions of fastText and fasterText (**) on a full dump of the English Wikipedia (4.5B words), on two popular benchmarks: WS353 and RW.
Framework | Time (mins) | Embedding Dimensionality | Epochs | WS353 (OOV%) | RW (OOV%) | RW common (OOV%) |
---|---|---|---|---|---|---|
FastText | 122 | 100 | 5 | 71 (0%) | 44 (5%) | 44 (3%) |
FasterText | 62 | 100 | 5 | 72 (0%) | 43 (3%) | 43 (3%) |
FastText | 813 | 300 | 10 | 74 (0%) | 49 (5%) | 49 (3%) |
FasterText | 255 | 300 | 10 | 74 (0%) | 47 (3%) | 48 (3%) |
(*) The results above were obtained on an Amazon EC2 m5.24xlarge instance using 96 threads.
(**) Commits used: fasterText 790c791d30c060a4a8c738c303ba859aed9eb766, fastText 3e64bf0f5b916532b34be6706c161d7d0a4957a4.