
Robust Cross-lingual Embeddings from Parallel Sentences


epfml/Bi-Sent2Vec


Bi-Sent2Vec

TLDR: This library provides cross-lingual numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task with applications geared towards cross-lingual word translation, cross-lingual sentence retrieval as well as cross-lingual downstream NLP tasks. The library is a cross-lingual extension of Sent2Vec.

Bi-Sent2Vec vectors are also well suited to monolingual tasks, as indicated by a marked improvement in the monolingual quality of the word embeddings (see the paper for details).


Setup and Requirements

Our code builds upon Facebook's FastText library.

To compile the library, simply run the make command.

Using the model

To generate cross-lingual word and sentence representations, we introduce our Bi-Sent2Vec method and provide code and models.

The method uses a simple but efficient objective to train distributed representations of sentences. The algorithm outperforms the current state-of-the-art bag-of-words based models on most of the benchmark tasks and is competitive with deep models on some of them, which highlights the robustness of the produced word and sentence embeddings. See the paper for more details.

Downloading Bi-Sent2Vec pre-trained vectors

Models trained and tested in the Bi-Sent2Vec paper can be downloaded from the following links. Users are encouraged to add more bilingual models to the list, provided they have been benchmarked properly.

Unigram

EN-DE, EN-ES, EN-FI, EN-FR, EN-HU, EN-IT

Bigram

EN-DE, EN-ES, EN-FI, EN-FR, EN-HU, EN-IT

Train a New Bi-Sent2Vec Model

Tokenizing and data format

Bi-Sent2Vec requires parallel sentences (sentences which are translations of each other) for training. We use the spaCy tokenizer to tokenize the text.

The required data format is one sentence pair per line. The two parallel sentences are separated by a <<split>> token, and each word has its language code attached to it as a suffix (for example, train_en). Here is a snapshot of a valid English-French dataset:

the_en train_en is_en arriving_en ._en <<split>> le_fr train_fr arrive_fr ._fr
france_en won_en the_en world_en cup_en ._en <<split>> la_fr france_fr a_fr gagné_fr la_fr coupe_fr du_fr monde_fr ._fr
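
As an illustration of the format, the following minimal sketch builds one training line from two sentences that have already been tokenized. The helper names (`tag_tokens`, `format_pair`) are hypothetical and not part of this repository. Lowercasing is an assumption based on the lowercase example lines, and tokenization (spaCy in our pipeline) is left to the caller.

```python
SPLIT_TOKEN = "<<split>>"


def tag_tokens(tokens, lang):
    """Attach the language code to every token, e.g. 'train' -> 'train_en'."""
    # Lowercasing is an assumption taken from the lowercase example lines.
    return [f"{tok.lower()}_{lang}" for tok in tokens]


def format_pair(src_tokens, tgt_tokens, src_lang, tgt_lang):
    """Build one training line from two already-tokenized parallel sentences."""
    left = " ".join(tag_tokens(src_tokens, src_lang))
    right = " ".join(tag_tokens(tgt_tokens, tgt_lang))
    return f"{left} {SPLIT_TOKEN} {right}"


line = format_pair(
    ["The", "train", "is", "arriving", "."],
    ["Le", "train", "arrive", "."],
    "en",
    "fr",
)
print(line)
```

Writing one such line per sentence pair to a text file gives a corpus in the layout shown above.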

Training

Assuming en-fr_sentences.txt is the pre-processed training corpus, here is an example of a command to train a Bi-Sent2Vec model:

./fasttext bisent2vec -input en-fr_sentences.txt -output model-en-fr -dim 300 -lr 0.2 -neg 10 -bucket 2000000 -maxVocabSize 750000 -thread 30 -t 0.000005 -epoch 5 -minCount 8 -dropoutK 4 -loss ns -wordNgrams 2 -numCheckPoints 5

Here is a description of all available arguments:

The following arguments are mandatory:
  -input              training file path
  -output             output file path (model is stored in the .bin file and the vectors in .vec file)

The following arguments are optional:
  -lr                 learning rate [0.2]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                dimension of word and sentence vectors [100]
  -epoch              number of epochs [5]
  -minCount           minimal number of word occurrences [5]
  -minCountLabel      minimal number of label occurrences [0]
  -neg                number of negatives sampled [10]
  -wordNgrams         max length of word ngram [2]
  -loss               loss function {ns, hs, softmax} [ns]
  -bucket             number of hash buckets for vocabulary [2000000]
  -thread             number of threads [2]
  -t                  sampling threshold [0.0001]
  -dropoutK           number of ngrams dropped when training a Bi-Sent2Vec model [2]
  -verbose            verbosity level [2]
  -maxVocabSize       vocabulary exceeding this size will be truncated [None]
  -numCheckPoints     number of intermediary checkpoints to save when training [1]
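
For scripted experiments, the training command can also be assembled programmatically. The sketch below is one possible way to do that, using only the flags documented above. The helper `build_train_command` is hypothetical and not shipped with this repository, and the sketch assumes the compiled `./fasttext` binary sits in the working directory.

```python
import subprocess


def build_train_command(input_path, output_prefix, binary="./fasttext", **options):
    """Return the argument list for a `fasttext bisent2vec` training run."""
    cmd = [binary, "bisent2vec", "-input", input_path, "-output", output_prefix]
    for flag, value in options.items():
        # Each keyword argument becomes "-flag value", e.g. dim=300 -> "-dim 300".
        cmd += [f"-{flag}", str(value)]
    return cmd


cmd = build_train_command(
    "en-fr_sentences.txt",
    "model-en-fr",
    dim=300,
    lr=0.2,
    neg=10,
    epoch=5,
    minCount=8,
    dropoutK=4,
    wordNgrams=2,
    loss="ns",
)
print(" ".join(cmd))

# Uncomment to launch training (requires the compiled ./fasttext binary):
# subprocess.run(cmd, check=True)
```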

Post Processing

Use vectors_by_lang.py to separate the vectors for the two languages. For example:

python vectors_by_lang.py model-en-fr.vec en fr

This command creates two files, model-en-fr_en.vec and model-en-fr_fr.vec, in word2vec format, containing the vectors for English and French respectively.
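
To make the idea concrete, here is a rough sketch of how such a split could work. It is not the code of vectors_by_lang.py. It assumes the .vec file is in word2vec text format (a "count dim" header followed by one "word v1 v2 ..." row per line) and that each word ends in its language code. Stripping that code from the output words is also an assumption. The function name `split_by_language` is hypothetical.

```python
def split_by_language(vec_lines, langs):
    """Group rows of a word2vec-format text file by the language suffix on each word.

    vec_lines: iterable of lines, header ("<count> <dim>") first.
    Returns {lang: [header, row, row, ...]} with the suffix stripped from each word.
    """
    lines = iter(vec_lines)
    _, dim = next(lines).split()
    rows = {lang: [] for lang in langs}
    for line in lines:
        word, _, values = line.rstrip("\n").partition(" ")
        # Split on the last underscore so words containing "_" stay intact.
        stem, sep, lang = word.rpartition("_")
        if sep and lang in rows:
            rows[lang].append(f"{stem} {values}")
    # Each output gets its own header with the per-language row count.
    return {lang: [f"{len(r)} {dim}"] + r for lang, r in rows.items()}


# Made-up toy rows, for illustration only.
demo = [
    "3 2",
    "train_en 0.1 0.2",
    "train_fr 0.3 0.4",
    "._en 0.5 0.6",
]
out = split_by_language(demo, ["en", "fr"])
print(out["en"])
print(out["fr"])
```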

Evaluation

Our models are evaluated using the standard evaluation tool in the MUSE repository by Facebook AI Research.
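
For a quick sanity check before running the full evaluation, cross-lingual word translation can be approximated by a cosine nearest-neighbour lookup between the two per-language vector files. The sketch below is not the MUSE tool and does not reproduce its metrics. The function names are hypothetical and the vectors are made-up toy values.

```python
import math


def load_vec(lines):
    """Parse word2vec text format (header, then 'word v1 v2 ...') into a dict."""
    lines = iter(lines)
    next(lines)  # skip the "<count> <dim>" header
    vectors = {}
    for line in lines:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def translate(word, src_vectors, tgt_vectors, k=1):
    """Return the k target-language words closest to `word` by cosine similarity."""
    query = src_vectors[word]
    ranked = sorted(
        tgt_vectors,
        key=lambda w: cosine(query, tgt_vectors[w]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors for illustration; real ones come from the per-language .vec files.
en = load_vec(["2 2", "train 1.0 0.1", "cup 0.1 1.0"])
fr = load_vec(["2 2", "train 0.9 0.2", "coupe 0.2 0.9"])
print(translate("cup", en, fr))
```

This linear scan is only practical for small vocabularies. Full-size models call for a vectorized implementation or the MUSE tool itself.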

References

When using this code or some of our pretrained vectors for your application, please cite the following paper:

Ali Sabet, Prakhar Gupta, Jean-Baptiste Cordonnier, Robert West, Martin Jaggi. Robust Cross-lingual Embeddings from Parallel Sentences.

@article{Sabet2019RobustCE,
  title={Robust Cross-lingual Embeddings from Parallel Sentences},
  author={Ali Sabet and Prakhar Gupta and Jean-Baptiste Cordonnier and Robert West and Martin Jaggi},
  journal={ArXiv 1912.12481},
  year={2020},
}