Light-weight installation without UMAP and HDBSCAN #2289

Open. MaartenGr (Owner) wants to merge 1 commit into master.

@MaartenGr MaartenGr commented Feb 18, 2025

What does this PR do?

Adjusts the code base so that BERTopic can be installed without sentence-transformers, umap-learn, or hdbscan, making it more lightweight and modular:

pip install --no-deps bertopic
pip install --upgrade numpy pandas scikit-learn tqdm plotly pyyaml

Ideally, I would want to have a pip install bertopic[light] option, but I believe that is currently not supported by either pip or uv. That said, I would love to know if anybody has ideas other than the above (which is not ideal).

NOTE: I also removed the notebook, which was outdated, since the Google Colab notebooks are the primary source for tutorials (and can also be downloaded for offline use).

Lightweight installation

The default embedding model in BERTopic is one of the amazing sentence-transformers models, namely "all-MiniLM-L6-v2". Although this model performs well out of the box, it typically needs a GPU to transform the documents into embeddings in a reasonable time. Moreover, the installation requires pytorch, which often results in a rather large environment.

Fortunately, it is possible to install BERTopic without sentence-transformers, UMAP, and/or HDBSCAN. This can be useful for reducing the size of your Docker images for inference, or when you do not use pytorch but, for instance, Model2Vec instead. The installation can be done as follows:

pip install --no-deps bertopic
pip install --upgrade numpy pandas scikit-learn tqdm plotly pyyaml

This installs a bare-bones version of BERTopic. If you want to use UMAP and Model2Vec, for instance, you'll first need to install them:

pip install model2vec umap-learn
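
Since `--no-deps` skips dependency resolution, it can help to confirm which optional packages actually ended up in the environment. The following is a minimal sketch, not part of BERTopic: `optional_dependency_status` and `OPTIONAL_MODULES` are hypothetical names introduced here, and the check relies only on the standard library's `importlib.util.find_spec`.

```python
from importlib.util import find_spec

# Import names (not pip names) of the packages this section treats as optional.
# umap-learn is imported as "umap", sentence-transformers as "sentence_transformers".
OPTIONAL_MODULES = ["sentence_transformers", "umap", "hdbscan", "model2vec"]


def optional_dependency_status(module_names):
    """Return {module_name: bool} indicating whether each module can be found."""
    # find_spec returns None for a top-level module that is not installed,
    # and it does not import the package, so the check stays cheap.
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    for name, available in optional_dependency_status(OPTIONAL_MODULES).items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

In a bare-bones environment, any package reported as missing has to be installed separately before BERTopic can use it.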

Then, you can use BERTopic without needing a GPU:

from bertopic import BERTopic
from model2vec import StaticModel

# Model2Vec
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# BERTopic
topic_model = BERTopic(embedding_model=embedding_model)

As a result, the entire package and resulting model can be run quickly on the CPU and no GPU is necessary!
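
If UMAP and HDBSCAN are left out as well, one option is to pass scikit-learn models in their place, since scikit-learn is already part of the bare-bones install. The sketch below assumes that BERTopic's `umap_model` and `hdbscan_model` arguments accept scikit-learn's `PCA` and `KMeans`, as described in the BERTopic documentation on dimensionality reduction and clustering; it has not been verified against this PR. `build_light_topic_model` and its default values are hypothetical and chosen only for illustration.

```python
def build_light_topic_model(n_topics=20, n_components=5):
    """Hypothetical helper: BERTopic with scikit-learn stand-ins for UMAP and HDBSCAN."""
    # Imports are deferred so this file still loads when optional packages are absent.
    from bertopic import BERTopic
    from model2vec import StaticModel
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # Static embeddings that run on the CPU (same model as in the example above).
    embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")

    # Assumption: PCA stands in for UMAP, KMeans stands in for HDBSCAN.
    dim_model = PCA(n_components=n_components)
    cluster_model = KMeans(n_clusters=n_topics, random_state=42)

    return BERTopic(
        embedding_model=embedding_model,
        umap_model=dim_model,
        hdbscan_model=cluster_model,
    )
```

With a list of strings `docs`, usage would look like `topics, probs = build_light_topic_model().fit_transform(docs)`. Note that k-means assigns every document to a cluster, so unlike HDBSCAN it does not produce an outlier topic, and the number of topics has to be chosen up front.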

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?
