Light-weight installation without UMAP and HDBSCAN #2289
What does this PR do?
Adjust the code base such that BERTopic can be installed without needing to install `sentence-transformers`, `umap-learn`, or `hdbscan`, making it more light-weight and modular:
Ideally, I would want to have a `pip install bertopic[light]` option, but I believe that is currently supported by neither pip nor uv. That said, I would love to know if anybody has ideas other than the above (which is not ideal).
NOTE: I also removed the notebook, which was outdated, since the Google Colab notebooks are the primary source for tutorials (and can also be downloaded for offline use).
Lightweight installation
The default embedding model in BERTopic is one of the amazing sentence-transformers models, namely `"all-MiniLM-L6-v2"`. Although this model performs well out of the box, it typically needs a GPU to transform the documents into embeddings in a reasonable time. Moreover, the installation requires `pytorch`, which often results in a rather large environment, memory-wise.
Fortunately, it is possible to install BERTopic without `sentence-transformers`, `UMAP`, and/or `HDBSCAN`. This can be useful for reducing the size of your Docker images for inference, or when you do not use `pytorch` but, for instance, Model2Vec instead. The installation can be done as follows:
This installs a bare-bones version of BERTopic. If you want to use UMAP and Model2Vec, for instance, you'll first need to install them:
pip install model2vec umap-learn
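Because a bare-bones install leaves the optional packages out, it can help to confirm which of them are actually present before building a model. The following is a minimal sketch using only the standard library; the helper name `optional_dependency_report` and the package list are illustrative assumptions, not part of BERTopic.

```python
from importlib.util import find_spec

# Mapping of pip package name -> importable top-level module name.
# Assumption: this list is illustrative and not taken from BERTopic itself.
OPTIONAL_MODULES = {
    "sentence-transformers": "sentence_transformers",
    "umap-learn": "umap",
    "hdbscan": "hdbscan",
    "model2vec": "model2vec",
    "torch": "torch",
}


def optional_dependency_report(modules=None):
    """Return {pip_name: True/False} depending on whether each module is importable."""
    if modules is None:
        modules = OPTIONAL_MODULES
    # find_spec returns None for a top-level module that cannot be located,
    # and it does not import the module itself.
    return {
        pip_name: find_spec(module_name) is not None
        for pip_name, module_name in modules.items()
    }


if __name__ == "__main__":
    for pip_name, installed in optional_dependency_report().items():
        status = "installed" if installed else "missing"
        print(f"{pip_name}: {status}")
```

Since `find_spec` only locates modules, the check stays cheap even when a heavy package such as `torch` happens to be installed.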
Then, you can use BERTopic without needing a GPU:
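A minimal sketch of such a setup is shown here. It assumes a BERTopic version that includes this change, that BERTopic accepts a Model2Vec `StaticModel` passed as `embedding_model`, and that scikit-learn 1.3 or later is available for its built-in `HDBSCAN`. The function name `build_light_topic_model`, the model name `"minishlab/potion-base-8M"`, and the parameter values are illustrative choices rather than settings prescribed by this PR.

```python
def build_light_topic_model(min_cluster_size=10):
    """Build a BERTopic model that avoids sentence-transformers and PyTorch (sketch)."""
    # Imports are deferred so this module can be loaded even when the
    # optional packages are not installed.
    from bertopic import BERTopic
    from model2vec import StaticModel
    from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3
    from umap import UMAP

    # Assumption: any Model2Vec static model can be used; this name is an example.
    embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")

    # Dimensionality reduction from the separately installed umap-learn package.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")

    # Clustering with scikit-learn's HDBSCAN, so the hdbscan package is not needed.
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size)

    return BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
    )
```

With a list of strings `docs`, the model would then be fitted in the usual way: `topics, probs = build_light_topic_model().fit_transform(docs)`.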
As a result, the entire package and resulting model can be run quickly on the CPU and no GPU is necessary!
Before submitting