Sentence encoder and training code for the paper Learning Universal Sentence Representations with Mean-Max Attention Autoencoder.
This code is written in Python. To use it, you will need:
- Python 2.7
- TensorFlow
- NumPy
- NLTK
- SciPy
- Ray (for parallel evaluation on transfer tasks)
The pre-processed Toronto BookCorpus we used for training our model is available here.
To download the GloVe vectors:
curl -Lo data/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip data/glove.840B.300d.zip -d data/
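The unzipped GloVe file is plain text, with one token per line followed by its 300 vector components. As a rough illustration of that format (not the loader used by this repository), the sketch below reads only the vectors for a given set of words. The function name `load_glove_subset` and its parameters are hypothetical.

```python
import io

import numpy as np


def load_glove_subset(path, wanted, dim=300):
    """Return {word: vector} for the words in `wanted` found in a GloVe text file.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    wanted = set(wanted)
    vectors = {}
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            # Split from the right so that tokens containing spaces stay intact.
            parts = line.rstrip("\n").rsplit(" ", dim)
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            word = parts[0]
            if word in wanted:
                vectors[word] = np.array([float(x) for x in parts[1:]],
                                         dtype=np.float32)
    return vectors


# Example call, assuming the archive was unzipped into data/ as shown above:
# vectors = load_glove_subset("data/glove.840B.300d.txt", ["sentence", "encoder"])
```

Reading line by line and keeping only the requested words avoids holding the full embedding table in memory.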
To get all the transfer task datasets, run the following from data/:
./get_transfer_data.bash
This will automatically download and preprocess the transfer task datasets, and store them in data/.
We provide a simple interface to encode English sentences. Get started with the following steps:
1) Download our Mean-Max AAE models, put the archive in the SentEncoding directory, and decompress it:
unzip models.zip
2) Make sure you have the NLTK tokenizer by running the following once:
import nltk
nltk.download('punkt')
3) Load our pre-trained model:
import os
# Select the GPU before TensorFlow initializes its devices
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import master
m = master.Master('conf.json')
m.creat_graph()
m.prepare()
4) Build the vocabulary of word vectors (i.e., keep only those needed):
vocab = m.build_vocab(sentences, tokenize=True)
m.build_emb(vocab)
where sentences is your list of n sentences.
5) Encode your sentences:
embeddings = m.encode(sentences, tokenize=True)
This returns a NumPy array of n vectors, each of dimension 4096.
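One common use of such an array is comparing sentences by cosine similarity. The sketch below assumes only that the embeddings form an (n, 4096) NumPy array as described above. The function `cosine_similarity_matrix` is a hypothetical helper, and random vectors stand in for real encoder output.

```python
import numpy as np


def cosine_similarity_matrix(embeddings, eps=1e-8):
    """Return the (n, n) matrix of pairwise cosine similarities.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # L2 norm of each row, guarded against division by zero.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, eps)
    return unit.dot(unit.T)


# Stand-in for the encoder output: random vectors with the same shape.
demo = np.random.RandomState(0).randn(3, 4096)
sims = cosine_similarity_matrix(demo)
```

With real embeddings, `sims[i, j]` would be the similarity between sentence i and sentence j, and each diagonal entry would be 1.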
If you find this code useful, please cite the following paper:
@inproceedings{zhang2018learning,
title={Learning Universal Sentence Representations with Mean-Max Attention Autoencoder},
author={Zhang, Minghua and Wu, Yunfang and Li, Weikang and Li, Wei},
booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
pages={4514--4523},
year={2018}
}