PyTorch implementation of my improved version of Hash Embedding for efficient Representation (NIPS 2017). Submission to the NIPS Implementation Challenge (Featured Winner).
The use of this directory is two-fold:
- Implementing an improved hash embedding layer in PyTorch. This can be found in the
./hashembed
folder. It works both for Python 2 and 3. - Implementing an easy pipeline in PyTorch to evaluate a new NLP classification algorithm / a new type of embedding. This can be found in the
./evaluate
folder. This has only been tested on python 3.
Hash Embedding are a generalization of the hashing trick in order to get a larger vocabulary with the same amount of parameters, or in other words it can be used to approximate the hashing trick using less parameters. The hashing trick (in the NLP context) is a popular technique where you use a hash table rather than a dictionnary for the word embeddings, which enables online learning (as the table's size is fixed with respect to the vocabulary size) and often helps against overfitting.
See my explanation of hashembeddings below for more details about the layer.
# clone repo
pip install -r requirements.txt
# install pytorch : http://pytorch.org/
If you only want to use the hashembedding:
from hashembed import HashEmbedding
- Download and untar the data in the
data
folder. If the link check Xiang Zhang's Crepe directory on github
- use
python evaluate/main <param>
to run a single experiment. If you want perfect replicabiility use, define the python hash seed withPYTHONHASHSEED=0 python evaluate/main <param>
.
usage: main.py [-h]
[-x {custom,std-embed-dict,hash-embed-dict,hash-embed-nodict,std-embed-nodict,ensemble-hash-embed-dict}]
[-d {ag,amazon,amazon-polarity,dbpedia,sogou,yahoo,yelp,yelp-polarity}]
[--no-shuffle] [--no-checkpoint] [--val-loss-callback]
[-e EPOCHS] [-b BATCH_SIZE] [-v VALIDATION_SIZE] [-s SEED]
[-p PATIENCE] [-V VERBOSE] [-P [PATIENCE [FACTOR ...]]]
[--no-cuda] [-w NUM_WORKERS] [--dictionnary]
[-g [MIN_NGRAM [MAX_NGRAM ...]]]
[-f [MIN_FATURES [MAX_FATURES ...]]] [--no-hashembed]
[--no-append-weight] [--old-hashembed] [-D DIM]
[-B NUM_BUCKETS] [-N NUM_EMBEDING] [-H NUM_HASH]
[-m {embed-softmax,embed-3L-softmax,ensemble-embed-3L-softmax}]
PyTorch implementation and evaluation of HashEmbeddings, which uses multiple
hashes to efficiently approximate an Embedding layer.
optional arguments:
-h, --help show this help message and exit
Predefined experiments:
-x {custom,std-embed-dict,hash-embed-dict,hash-embed-nodict,std-embed-nodict,ensemble-hash-embed-dict}, --experiment {custom,std-embed-dict,hash-embed-dict,hash-embed-nodict,std-embed-nodict,ensemble-hash-embed-dict}
Predefined experiments to run. If different than
`custom` then only the dataset argument will be
considered. (default: custom)
Dataset options:
-d {ag,amazon,amazon-polarity,dbpedia,sogou,yahoo,yelp,yelp-polarity}, --dataset {ag,amazon,amazon-polarity,dbpedia,sogou,yahoo,yelp,yelp-polarity}
path to training data csv. (default: ag)
Learning options:
--no-shuffle Disables shuffling batches when training. (default:
False)
--no-checkpoint Disables model checkpoint. I.e saving best model based
on validation loss. (default: False)
--val-loss-callback Whether should monitor the callbacks (early stopping ?
decrease LR on plateau/ ... on the loss rather than
accuracy on validation set. (default: False)
-e EPOCHS, --epochs EPOCHS
Maximum number of epochs to run for. (default: 300)
-b BATCH_SIZE, --batch-size BATCH_SIZE
Batch size for training. (default: 64)
-v VALIDATION_SIZE, --validation-size VALIDATION_SIZE
Percentage of training set to use as validation.
(default: 0.05)
-s SEED, --seed SEED Random seed. (default: 1234)
-p PATIENCE, --patience PATIENCE
Patience if early stopping. None means no early
stopping. (default: 10)
-V VERBOSE, --verbose VERBOSE
Verbosity in [0,3]. (default: 3)
-P [PATIENCE [FACTOR ...]], --plateau-reduce-lr [PATIENCE [FACTOR ...]]
If specified, if loss did not improve since PATIENCE
epochs then multiply lr by FACTOR. [None,None] means
no reducing of lr on plateau. (default: [4, 0.5])
Device options:
--no-cuda Disables CUDA training, even when have one. (default:
False)
-w NUM_WORKERS, --num-workers NUM_WORKERS
Number of subprocesses used for data loading.
(default: 0)
Featurizing options:
--dictionnary Uses a dictionnary. (default: False)
-g [MIN_NGRAM [MAX_NGRAM ...]], --ngrams-range [MIN_NGRAM [MAX_NGRAM ...]]
Range of ngrams to generate. ngrams in
[minNgram,maxNgram[. (default: [1, 3])
-f [MIN_FATURES [MAX_FATURES ...]], --num-features-range [MIN_FATURES [MAX_FATURES ...]]
If specified, during training each phrase will have a
random number of features in range
[minFeatures,maxFeatures[. None if take all. (default:
[4, 100])
Embedding options:
--no-hashembed Uses the default embedding. (default: False)
--no-append-weight Whether to append the importance parameters. (default:
False)
--old-hashembed Uses the paper version of hash embeddings rather than
the improved one. (default: False)
-D DIM, --dim DIM Dimension of word vectors. Higher improves downstream
task for fixed vocabulary size. (default: 20)
-B NUM_BUCKETS, --num-buckets NUM_BUCKETS
Number of buckets in the shared embedding table.
Higher improves approximation quality. (default:
1000000)
-N NUM_EMBEDING, --num-embeding NUM_EMBEDING
Number of rows in the importance matrix. Approximate
the number of rows in a usual embedding. Higher will
increase possible vocabulary size. (default: 10000000)
-H NUM_HASH, --num-hash NUM_HASH
Number of different hashes to use. Higher improves
approximation quality. (default: 2)
Model options:
-m {embed-softmax,embed-3L-softmax,ensemble-embed-3L-softmax}, --model {embed-softmax,embed-3L-softmax,ensemble-embed-3L-softmax}
which model to use. Default is a simple softmax.
(default: embed-softmax)
To get the same results as me, run all the following experiments (Warning: computationally very intensive.):
bin/no_dict_all.sh
: runs the experiments whithout dictionnary on all datasets.bin/dict_all.sh
: runs the experiments whith dictionnary on all datasets. Note that as explained in the results section, I haven't ran it completely.bin/accuracy_distribution.sh
: runs the expirements whithout dictionnary on the 2 smallest datasets with 10 different random seed to see robustness of the model and results to the random seed.
In order to understand the advantages of Hash Embeddings and how they work it is probably a good idea to review and compare to a usual dictionnary and to the hashing trick.
Notation: scalar a, matrix M, ith row in a matrix: m=M[i], row vector v = (v^1, ..., v^i, ...) , function f().
-- General --
- w_i : word or token i.
- d : dimension of each word embedding.
- E : table containing all the word embeddings. Size: n*d.
- e_w : vector embedding of word w (word embedding).
- M[i] : looks up index i in matrix M, returns the value in the table.
- h(x) : hash function returns the associated index with x.
-- Hash Embeddings --
- u_i(x) : universal hash function i (i.e "sampled hash function") returns the associated index with x.
- k: number of hash functions.
- E : table containing all the word embeddings. Size: b*d.
- P: matrix containing the weight for doing the weighted average of the k embeddings. Size: n*k.
- C_w: matrix containing the k different word embeddings for word w. Size: k*d.
Nota Bene: In the paper n for hash embeddings is called K but to differentiate with k and for consistencz with teh hashing trick I use n.
- Usual Dictionnary:
- Pretraining step: Loop through the corpus once and count how many times you see each words. Keep the n most common words. Initialize E.
- Training step: loop over word w in corpus and update associated embedding in E: e_w = E[w] (size 1*d).
- Online Training: hard.
- Number of trainable parameters
- Hashing Trick:
- Pretraining step: initalize E.
- Training step: loop over word w in corpus and update associated embedding in E : e_w = E[H(w)] (size 1*d).
- Online Training: trivially continue training.
- Hashing Trick:
- Pretraining step: initalize E and P.
- Training step: loop over word w in corpus and update P and E such that : p_w = P[u_i(w)] (size 1*k) and C_w = (E[u_1(w)],...,E[u_k(w)]) (size k*d) and e_w = p_w * C_w (size 1*d).
- Online Training: trivially continue training.
As we say, a picture is worth a million words. Let's save both of us some time :) :
I have improved the papers hash embeddings by 3 modification:
In the paper (when without dictionnary) they first use a hash function D_1 (which I call h) with a range [1,n] and then they use the output of D_1 both for the index of P (like me) and as input to D_2 (which has the same output as my universal hash u). This means that everytime there is a collision in D_1 the final word embedding would be the same. In my implementation, you need to have a collision both in h and u in order to end up with the same word embedding (i.e same component embeddings and same weights). Let's quantify the improvement:
Theorem (Theorem 4.1 of the paper, birthday problem in disguise) Let h be a hash function with |in| input and |out| outputs. Then the probability p_col that w_0 ∈ Tcollides with one or more other tokens is given by (approximation for large |out|):
The number of collision is trivially : nCol = p_col * |in|
In the papers implementation, 2 words would have the same word embedding (collision) if either D_1 or D_2 collides. I.e p_col = p_col_D1 + (1-p_col_D1) * p_col_D2 (only the tokens that did not colide in first layer can colide in second). For D1: |in| = |vocab|, |out| = n. For D2: |in| = n, |out| = b^k. So p_col:
In my implementation 2 words would have the same word embedding (collision) only if both h and u collided. I.e p_col = p_col_h * p_col_u (independant). For h: |in| = |vocab|, |out| = n. For u: |in| = |vocab|, |out| = b^k. So p_col:
This is a huge improvement. For example if we run the the experiment whithout dictionnary and limit the number of toke to 10^9, i.e |vocab| = 10^9, n=10^7, B=10^6, k=2 the p_col_old≈1 while p_col_new≈0.01.
Nota Bene: The authors mentionned this in the paper but they talked about adding a new hash function D_3 to do so, while I do it whithout adding a new hash function. The reason I think they didn't implement as I did is because this required a universal hashing function which I believe they weren't using (see the next improvement)
From the authors keras implmentation, it seems that they implemented D_2 as a table filled with randint % b. This explains also why they needed to use D_2 on the output of an other hash D_1 as they had to look up in the hashing table. This means that they had to store in memory and lookup k times per token in a table of hash of dimension n*k (i.e same as P).
I removed thisby using a universal family of hashing function, to generate k independent hashes.
In the paper the authors would first multiply each component vector C_w[i] by p_w then sum all of these (e_w = p_w * C_w). This makes a lot of sense as it inutitively makes a weighted average of the different component vectors. I extended this by giving the possibility of concatenating the weighted vectors or taking the median of them (note that mean should do the same as sum as the the weights are learnable so they could learn to divide them selves by k to make a weighted average rather than a weighted sum).
- Datasets : from Table 1 in the paper.
Dataset | #Train | #Test | #Classes | Task |
---|---|---|---|---|
AG’s news | 120k | 7.6k | 4 | English news categorization |
DBPedia | 450k | 70k | 14 | Ontology classification |
Yelp Review Polarity | 560k | 38k | 2 | Sentiment analysis |
Yelp Review Full | 560k | 50k | 5 | Sentiment analysis |
Yahoo! Answers | 650k | 60k | 10 | Topic classification |
Amazon Review Full | 3000k | 650k | 5 | Sentiment analysis |
Amazon Review Polarity | 3600k | 400k | 2 | Sentiment analysis |
- Preprocessing :
- Remove punctuation, lowercase, remove single characters. Note that I used the default Keras preprocessing as it's the papers framework, but I now realized it was written that they only remove punctuation, I didn't have the time to rerun everything.
- Converts to n-grams.
- Select randomly between 4-100 n-grams as input (~dropout).
- Training:
- Sum n-gram embedding to make phrase emebdding.
- Adam optimizer.
- Learning rate = 0.001.
- Early stooping with patience = 10.
- 5% of train set used for validation.
In order to compare to the results in the paper I ran the same experiments:
- Without a dictionnary: embedding followed by a softmax.
- Standard Embeddings :
- Hyper parameters: n = 10^7, d = 20, ngram range : [1,3[.
- Number of trainable parameters : 200 * 10^6
- Hash Embeddings :
- Hyper parameters: n = 10^7, k = 2, b = 10^6, d = 20, ngram range : [1,3[.
- Number of trainable parameters : 40 * 10^6.
- Standard Embeddings :
- With a dictionnary: embedding followed by 3 fully connected layers with 1000 hidden units and ReLu activation. Then ends in softmax layer. With batch normalization.
- Standard Embeddings :
- Hyper parameters: n = cross-validate([10K, 25K, 50K, 300K, 500K, 1M]), d = 200, ngram range : [1,10[.
- Number of trainable parameters : ...
- Hash Embeddings :
- Hyper parameters: n = 10^6, k = 2, b = cross-validate([500, 10K, 50K, 100K, 150K]), d = 200, ngram range : [1,10[.
- Number of trainable parameters : ...
- Standard Embeddings :
Please note that currently I only ran all the experimenths without dictionnary (although I ran with a dictionnary on 2 datasets, and all the code is in the package). I decided to do so because:
- Using hashing rather than dictionnary is probably the most useful in practice (better results and enables online learning).
- The results without dictionnary are the ones where hash embeddings seem to do a lot better than.
- I didn't have the computational power to run cross validation on neural networks for large datsets (using cv on real problems also seems less likely).
- I wanted to show that the hash embeddings were working, and then spend some time on improving them.
- I was able to replicate the results with a dictionnary for 2 datasets, so though that it wouldn't add so much to test on the others.
No Dict. & Hash Emb. |
No Dict. & Std Emb. |
|
---|---|---|
# of Parameters | 40M | 200M |
AG’s news (#train: 120k) | 92.1 | 91.9 |
Amazon Review Full (#train: 450k) | 59.1 | 58.8 |
DBPedia (#train: 450k) | 98.7 | 98.5 |
Yahoo! Answers (#train: 650k) | 72.9 | 73.1 |
Yelp Review Full (#train: 560k) | 62.5 | 62.1 |
Amazon Review Polarity (#train: 3000k) | 94.3 | 94.2 |
Yelp Review Polarity (#train: 560k) | 95.8 | 95.6 |
The difference between hashembeddings and standard embeddings seems consistent with the papers result (Hashembedding is always better besides 1 dataset. In our case Yahoo, in the paper DBPedia). It seems that the average accuracy is slighly lower for both than in the paper, this might be because:
- I only used a single seed, no cherry picking.
- I didn't do any hyperparameters optimization.
- I used either Pytorch defaults initialization / defaults or the ones from Keras. I often used Keras defaults as the authors used this frameowrk and I was hoping they kept the default parameters for better replicability.
In order to investigate the effects of my improvements and of the hyperparameter "append weight" (the optional step of appending the importance weights p to e_w), I ran a few experiments. Because the effects of each components might be less important than the variabilty due to the seed, I ran the experiment multiple times to make a more robust conclusion. This might also give some idea about the statistical significance of some of the papers and my results. As this nice paper reminds us: looking at the distribution matters! Unfortunately I don't have the computational power to run large experiments multiple times so I decided to run only for smaller datasets. Because I chose the smaller datasets of the ones we have above, I divided both b and n by 5 in order to understand if some methods need less parameters.
Nota Bene: yhe evaluation on the training set is the evaluation during training phase, i.e I still sample randomlyn-grams from the text (the accuracy would be around 1 if not)
From the plots we see that the old and the new hash-embeddings are significantly different on the test set, although it seems that the greater number of collision in the standard hash embeddings might have a regularization effect (at least on this small dataset). Indeed the results on the training set seems significantly higher with the improved hash embeddings.
The first plot also seems to indicate that the "Ag News" dataset should have been used to evaluate a model with less parameters, indeed it seems that the large number of parameters in standard embeddings only makes it overfit.
The second plot is interesting as it shows that the hash-embeddings do indeed work better when there are the same number of parameters. Interestingly it seems to indicate that hash-embeddings work just as well as using a dictionary with the same amount of parameters. I.e there's no more reason (besides simplicity) to use a dictionary instead of hash embeddings.
From this plot we see that the default hyperparameters seem to be the best (sum and append weights). Although appending weights doesn't seem to give a statistically significant result difference in our case.
from Table 2 in the paper:
No Dict. & Hash Emb. |
No Dict. & Std Emb. |
Dict. & Hash Emb. |
Dict. & Std Emb. |
Dict. & Ensemble Hash Emb. |
|
---|---|---|---|---|---|
# of Parameters | 40M | 200M | Unspecified | Unspecified | |
AG’s news (#train: 120k) | 92.4 | 92.0 | 91.5 | 91.7 | 92.0 |
Amazon Review Full (#train: 450k) | 60.0 | 58.3 | 59.4 | 58.5 | 60.5 |
DBPedia (#train: 560k) | 98.5 | 98.6 | 98.7 | 98.6 | 98.8 |
Yahoo! Answers (#train: 560k) | 72.3 | 72.3 | 71.3 | 65.8 | 72.9 |
Yelp Review Full (#train: 650k) | 63.8 | 62.6 | 62.6 | 61.4 | 62.9 |
Amazon Review Polarity (#train: 3000k) | 94.4 | 94.2 | 94.7 | 93.6 | 94.7 |
Yelp Review Polarity (#train: 3600k) | 95.9 | 95.5 | 95.8 | 95.0 | 95.7 |