We present implementations of the following latent variable models, suitable for large-scale deployment:

- CoverTree - Fast nearest neighbour search
- KMeans - Simple, fast, and distributed clustering with a choice of initializations
- GMM - Fast and distributed inference for Gaussian Mixture Models with diagonal covariance matrices
- LDA - Fast and distributed inference for Latent Dirichlet Allocation
- GLDA - Fast and distributed inference for Gaussian LDA with diagonal covariance matrices
- HDP - Fast inference for Hierarchical Dirichlet Process
Under active development.

- All code is under `src` within the respective folder
- Dependencies are provided under the `lib` folder
- Python wrapper classes reside in the `fastlvm` folder
- An example script for running the different models is provided under `scripts`
- `data` is a placeholder folder in which to put the data
- `build` and `dist` folders will be created to hold the executables
- gcc >= 5.0 or Intel® C++ Compiler 2017 for using C++14 features
- Python 3.5+
There are two ways to utilize the package: using the Python wrapper or directly in C++.
Just use `python setup.py install` and then in Python you can `import fastlvm`. Example and test code is in `test.py`.
The Python API details are provided in `API.pdf`, but all of the models utilise the following structure:
```
class LVM:
    init(self, # hyperparameters)
        return model

    fit(self, X, ...):
        return validation score

    predict(self, X):
        return prediction on each test example

    evaluate(self, X):
        return test score
```
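To make the interface concrete, here is a toy model that follows the same structure; the class, its hyperparameters, and the one-dimensional k-means logic are hypothetical illustrations of the pattern, not part of fastlvm:

```python
import random

class ToyKMeans:
    """Toy 1-D k-means following the LVM structure above.
    Hypothetical illustration only -- not the fastlvm KMeans class."""

    def __init__(self, k=2, iters=10):  # hyperparameters
        self.k, self.iters = k, iters
        self.centers = []

    def fit(self, X):
        # initialize centers from the data, then alternate assign/update
        self.centers = random.sample(X, self.k)
        for _ in range(self.iters):
            groups = [[] for _ in range(self.k)]
            for x in X:
                groups[self.predict([x])[0]].append(x)
            self.centers = [sum(g) / len(g) if g else c
                            for g, c in zip(groups, self.centers)]
        return self.evaluate(X)  # validation score

    def predict(self, X):
        # index of the nearest center for each point
        return [min(range(self.k), key=lambda j: abs(x - self.centers[j]))
                for x in X]

    def evaluate(self, X):
        # negative total distance to assigned centers (higher is better)
        return -sum(abs(x - self.centers[c])
                    for x, c in zip(X, self.predict(X)))
```

The real classes follow the same `fit`/`predict`/`evaluate` contract; consult `API.pdf` for the actual constructor arguments of each model.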
If you do not have root privileges, install with `python setup.py install --user` and make sure to have the install folder on your path.
We will show how to compile our package and run, for example, nearest neighbour search using cover trees on a single machine with a synthetic dataset.

- First of all, compile by running `make`
- Generate a synthetic dataset: `python data/generateData.py`
- Run the cover tree: `dist/cover_tree data/train_100d_1000k_1000.dat data/test_100d_1000k_10.dat`
The makefile has some useful features:

- if you have the Intel® C++ Compiler, you can instead run `make intel`
- or if you want to use the Intel® C++ Compiler's cross-file optimization (ipo), run `make inteltogether`
- you can also selectively compile individual modules by specifying `make <module-name>`
- or clean individual modules with `make clean-<module-name>`
To run the tests:

```
cd tests
python -m unittest discover   # requires Python 3.2 or newer for test discovery
```
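As a sketch of what discovery picks up, a minimal test module might look like this; the file name and test case are hypothetical illustrations, not the package's actual tests:

```python
import unittest

class TestDiscoveryExample(unittest.TestCase):
    """Minimal test in the shape unittest discovery collects from
    files named test_*.py (hypothetical, not the package's tests)."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)
```

Saved as, say, `tests/test_example.py`, it is found and run automatically by `python -m unittest discover`.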
We use a distributed and parallel extension and implementation of the Cover Tree data structure for nearest neighbour search. The data structure was originally presented, and later improved, in:
- Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
- Mike Izbicki and Christian Shelton. "Faster cover trees." Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.
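For reference, the query a cover tree answers is plain nearest-neighbour search; the naive O(n)-per-query baseline below (a hypothetical helper, not part of the fastlvm API) computes the same answer that a cover tree returns in roughly logarithmic time per query:

```python
import math

def nearest_neighbour(query, points):
    """Naive linear-scan nearest-neighbour search under Euclidean
    distance: the baseline a cover tree accelerates.
    (Hypothetical helper for illustration, not the fastlvm API.)"""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(points, key=lambda p: dist(query, p))
```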
We implement a modified inference for Gaussian LDA. The original model was presented in:
- Rajarshi Das, Manzil Zaheer, Chris Dyer. "Gaussian LDA for Topic Models with Word Embeddings." Proceedings of ACL (pp. 795-804) 2015.
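Restricting to diagonal covariance matrices, as the GMM and GLDA implementations here do, lets the Gaussian log-density factorize across dimensions; a minimal sketch of that computation (illustrative only, not the package's code):

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of N(mean, diag(var)) at x. With a diagonal
    covariance the density factorizes into one 1-D Gaussian per
    dimension, which is what keeps inference fast.
    (Illustrative sketch, not fastlvm code.)"""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))
```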
We implement a modified inference for Hierarchical Dirichlet Process. The original model and inference methods were presented in:
- Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.
- C. Chen, L. Du, and W.L. Buntine. Sampling table configurations for the hierarchical Poisson-Dirichlet process. In European Conference on Machine Learning, pages 296-311. Springer, 2011.
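For intuition, the Dirichlet process at the heart of the HDP can be simulated with the Chinese restaurant process; a minimal sketch (the function is a hypothetical illustration, not the package's sampler):

```python
import random

def crp_tables(n, alpha, seed=0):
    """Sample table assignments for n customers from a Chinese
    restaurant process with concentration alpha -- the seating
    metaphor behind the Dirichlet process that the HDP stacks
    hierarchically. (Illustrative sketch, not fastlvm's sampler.)"""
    rng = random.Random(seed)
    counts = []   # customers seated at each table
    seats = []    # table index chosen by each customer
    for i in range(n):
        # new table with probability alpha / (i + alpha),
        # existing table t with probability counts[t] / (i + alpha)
        r = rng.random() * (i + alpha)
        if r < alpha:
            seats.append(len(counts))
            counts.append(1)
        else:
            acc = alpha
            for t, c in enumerate(counts):
                acc += c
                if r < acc:
                    counts[t] += 1
                    seats.append(t)
                    break
    return seats
```

The number of occupied tables grows only logarithmically with n, which is why DP-based models can infer the number of clusters or topics from data.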
If the build fails and throws an error like "instruction not found", then most probably the system does not support AVX2 instruction sets. To solve this issue, in `setup.py` and `src/cover_tree/makefile` please change `march=core-avx2` to `march=corei7`.