
eli-osherovich/fasterText

 
 


Introduction

fasterText is a library for fast and accurate word embeddings. Although it originated from fastText, it may not be fully compatible with it.

Requirements

As a pre-requisite you will need:

Building fasterText

$ wget https://github.com/eli-osherovich/fasterText/archive/master.zip
$ unzip master.zip
$ cd fasterText-master
$ make

This will produce object files for all the classes as well as the main binary fastertext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Word representation learning

To learn word vectors, run:

$ ./fastertext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default, the word vectors take into account character n-grams of 3 to 6 characters. At the end of optimization, the program saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the model parameters along with the dictionary and all hyperparameters. The binary file can be used later to compute word vectors or to restart the optimization.
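As an illustration of working with the text output, here is a minimal Python sketch that reads a .vec file and compares two words. It assumes the file follows the fastText text layout (an optional "<count> <dim>" header line, then one "word v1 v2 ... vd" line per word); this README does not confirm that layout for fasterText. The helper names load_vec and cosine and the tiny sample file are hypothetical and stand in for a real model.vec.

```python
import math
import os
import tempfile


def load_vec(path):
    """Read a .vec text file into {word: [floats]}.

    Assumption: fastText-style layout, i.e. an optional "<count> <dim>"
    header followed by one "word v1 v2 ... vd" line per word.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            parts = line.split()
            # Skip a first line made of exactly two integers (the header).
            if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            if len(parts) < 2:
                continue  # blank or malformed line
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 2-dimensional vectors standing in for a real model.vec.
sample = "3 2\nking 1.0 0.0\nqueen 0.8 0.6\napple 0.0 1.0\n"
sample_path = os.path.join(tempfile.mkdtemp(), "model.vec")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write(sample)

vecs = load_vec(sample_path)
print(round(cosine(vecs["king"], vecs["queen"]), 3))
```

To try this on a trained model, point load_vec at the model.vec produced by the command above; a full Wikipedia vocabulary may need a more memory-efficient container than a dict of lists.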

Comparison with fastText

The following results were obtained (*) with the latest versions of fastText (**) and fasterText (**) on a full dump of the English Wikipedia (4.5B words), evaluated on two popular benchmarks: WS353 and RW.

| Framework  | Time (mins) | Embedding dimensionality | Epochs | WS353 (OOV%) | RW (OOV%) | RW common (OOV%) |
|------------|-------------|--------------------------|--------|--------------|-----------|------------------|
| FastText   | 122         | 100                      | 5      | 71 (0%)      | 44 (5%)   | 44 (3%)          |
| FasterText | 62          | 100                      | 5      | 72 (0%)      | 43 (3%)   | 43 (3%)          |
| FastText   | 813         | 300                      | 10     | 74 (0%)      | 49 (5%)   | 49 (3%)          |
| FasterText | 255         | 300                      | 10     | 74 (0%)      | 47 (3%)   | 48 (3%)          |

(*) The results above were obtained on an Amazon EC2 m5.24xlarge instance using 96 threads.

(**) fasterText: 790c791d30c060a4a8c738c303ba859aed9eb766, fastText: 3e64bf0f5b916532b34be6706c161d7d0a4957a4
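The training times reported above imply the following speedups. This is a small Python sketch of the arithmetic only; the timings are copied from the comparison, and the variable names are illustrative.

```python
# Training time in minutes, keyed by embedding dimensionality:
# (fastText, fasterText), as reported in the comparison above.
timings = {100: (122, 62), 300: (813, 255)}

# Speedup = fastText time divided by fasterText time.
speedups = {dim: ft / faster for dim, (ft, faster) in timings.items()}

for dim, ratio in speedups.items():
    print(f"dim={dim}: {ratio:.2f}x faster")
```

On these two runs fasterText is roughly 2x faster at 100 dimensions and roughly 3.2x faster at 300 dimensions, with benchmark scores within two points of fastText.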
