Update embeddings tools for 2025-01-30 LTS #1354
Merged
30 commits
30d98c7 Update README.md for census-embeddings-indexer (mlin)
f42a27e Merge branch 'main' into mlin/update-indexer-readme (ivirshup)
fc9e71b update Geneformer upstream version and associated defaults (mlin)
acb492b tools/models/scvi docs adjustments (mlin)
7f54520 docker updates (mlin)
96f16ed bump Geneformer (mlin)
c05adc8 Update tools/models/scvi/scvi-init.sh (mlin)
b7abbce Merge remote-tracking branch 'origin/main' into mlin/geneformer-dec2024 (mlin)
06252f3 Update scvi-create-latent-update.py (ebezzi)
de3fc54 update census_embeddings_indexer (mlin)
b21dc63 advance geneformer EMBEDDINGS_TILEDBSOMA_VERSION (mlin)
754267d pin torchdata (mlin)
eefeeaa pin torchdata (mlin)
1b3f744 fix CellDatasetBuilder typing (mlin)
bfc9a70 Merge remote-tracking branch 'origin/ebezzi/fix-scvi-idx' into mlin/g… (mlin)
5002f56 geneformer generate_embeddings.wdl: upsize workers (mlin)
99082c9 update geneformer_tokenizer docstring (mlin)
9da86d9 scvi-init.sh: strike update-alternatives python2 (errors on fresh 20.04) (mlin)
040637b census_contrib: update tiledbsoma & cellxgene-census versions (mlin)
0ab84d0 census_embeddings_indexer: monkey patch TileDB-Vector-Search to work … (mlin)
b0f810f census_contrib: bump tiledbsoma (mlin)
70e2271 Merge remote-tracking branch 'origin/main' into mlin/geneformer-dec2024 (mlin)
45f165e Merge remote-tracking branch 'origin/main' into mlin/geneformer-dec2024 (mlin)
cd9da26 census_embeddings_indexer: vacuum index arrays (mlin)
55fa46e Merge remote-tracking branch 'origin/mlin/update-indexer-readme' into… (mlin)
81e15bd update census_embeddings_indexer readme (mlin)
b26e245 update census_embeddings_indexer readme (mlin)
b355603 Merge remote-tracking branch 'origin/mlin/scvi-readme' into mlin/gene… (mlin)
dd46610 docs (mlin)
53f0f96 one more deprecation (mlin)
@@ -1,9 +1,22 @@
 FROM ubuntu:22.04
-# TILEDB_VECTOR_SEARCH_VERSION should be the newest that doesn't need a newer version of tiledb
-# than the client tiledbsoma: https://github.com/TileDB-Inc/TileDB-Vector-Search/blob/0.2.2/pyproject.toml
-ARG TILEDB_VECTOR_SEARCH_VERSION=0.2.2

+# TILEDB_PY_VERSION should be set such that the TileDB Embedded version will match that used by
+# tiledbsoma in cellxgene_census_builder and census_contrib.
+# https://github.com/single-cell-data/TileDB-SOMA/blob/1.15.3/libtiledbsoma/cmake/Modules/FindTileDB_EP.cmake#L93 (2.27.0)
+# ==
+# https://github.com/TileDB-Inc/TileDB-Py/blob/0.33.3/CMakeLists.txt#L49 (2.27.0)
+ARG TILEDB_PY_VERSION=0.33.3
+# TILEDB_VECTOR_SEARCH_VERSION should be the newest compatible with TILEDB_PY_VERSION.
+# https://github.com/TileDB-Inc/TileDB-Vector-Search/blob/0.11.0/pyproject.toml#L23 (tiledb-py>=0.32.0)
+ARG TILEDB_VECTOR_SEARCH_VERSION=0.11.0

 RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
 python3-pip
 RUN pip3 install \
 cellxgene_census \
+tiledb==$TILEDB_PY_VERSION \
 tiledb-vector-search==$TILEDB_VECTOR_SEARCH_VERSION

+# FIXME: monkey patch tiledb-vector-search 0.11 for https://github.com/TileDB-Inc/TileDB-Vector-Search/issues/564
+# This should be removed when we update to a new version addressing that issue.
+ADD ingestion.py.patch /tmp
+RUN patch /usr/local/lib/python3.10/dist-packages/tiledb/vector_search/ingestion.py /tmp/ingestion.py.patch
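The Dockerfile comments describe a manual compatibility check: the chosen TileDB-Vector-Search release must accept the pinned TileDB-Py version (`tiledb-py>=0.32.0` according to the linked `pyproject.toml`). A minimal sketch of that comparison follows, assuming plain dotted numeric versions. `parse_version` and `satisfies_minimum` are hypothetical helpers for illustration, not part of this repository, and the sketch does not cover the separate TileDB Embedded match (2.27.0), which the comments verify by reading the linked CMake files.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '0.33.3' into (0, 33, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum` (numeric comparison)."""
    return parse_version(installed) >= parse_version(minimum)


# Values copied from the Dockerfile ARG and the pyproject.toml comment above.
TILEDB_PY_VERSION = "0.33.3"
VECTOR_SEARCH_MIN_TILEDB_PY = "0.32.0"  # from "tiledb-py>=0.32.0"

if __name__ == "__main__":
    ok = satisfies_minimum(TILEDB_PY_VERSION, VECTOR_SEARCH_MIN_TILEDB_PY)
    print(f"tiledb-py {TILEDB_PY_VERSION} >= {VECTOR_SEARCH_MIN_TILEDB_PY}: {ok}")
```

Pre-release or local version suffixes (for example `0.33.3rc1`) would need a real version parser; this sketch deliberately handles only the numeric form used in the Dockerfile.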
@@ -1,33 +1,31 @@
 # census_embeddings_indexer

-This is a Docker+WDL pipeline to build [TileDB-Vector-Search](https://github.com/TileDB-Inc/TileDB-Vector-Search) indexes for Census cell embeddings, supporting cell similarity search in embedding space. It's meant to run on the AWS HealthOmics workflow service using the [miniwdl-omics-run](https://github.com/miniwdl-ext/miniwdl-omics-run) launcher (assuming account setup documented there).
+This is a Docker+WDL pipeline to build [TileDB-Vector-Search](https://github.com/TileDB-Inc/TileDB-Vector-Search) indexes for Census cell embeddings, supporting cell similarity search in embedding space. It's meant to run on the AWS HealthOmics workflow service using the [miniwdl-omics-run](https://github.com/miniwdl-ext/miniwdl-omics-run) launcher (`pip3 install miniwdl-omics-run`; one-time account setup steps documented there are probably already done in the relevant CZI AWS account).

-The pipeline consumes one or more of the existing TileDB arrays for hosted and contributed [Census embeddings](https://cellxgene.cziscience.com/census-models) stored on S3. The resulting indexes are themselves TileDB groups to be stored on S3.
+The pipeline consumes one or more of the existing TileDB arrays for [Census embeddings](https://cellxgene.cziscience.com/census-models) stored on S3. The resulting indexes are themselves TileDB groups to be stored on S3.

 ```bash
 export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
 export AWS_DEFAULT_REGION=$(aws configure get region)
 export ECR_ENDPT=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com
-export WDL_OUTPUT_BUCKET=mlin-census-screatch
+export WDL_OUTPUT_BUCKET=mlin-census-scratch

-docker build -t ${ECR_ENDPT}/omics:census_embeddings_indexer .
+docker build --platform linux/amd64 -t ${ECR_ENDPT}/omics:census_embeddings_indexer .
 aws ecr get-login-password | docker login --username AWS --password-stdin "$ECR_ENDPT"
 docker push ${ECR_ENDPT}/omics:census_embeddings_indexer

 miniwdl-omics-run census_embeddings_indexer.wdl \
-embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-czi-1 \
-embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-czi-4 \
-embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-czi-5 \
-embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-contrib-1 \
-embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-contrib-2 \
-embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-contrib-3 \
-census_version=2023-12-15 \
+embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-czi-6 \
+embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-czi-7 \
+embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-czi-8 \
+embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-contrib-7 \
+census_version=2024-07-01 \
 s3_region=$AWS_DEFAULT_REGION \
 docker=${ECR_ENDPT}/omics:census_embeddings_indexer \
 --output-uri s3://${WDL_OUTPUT_BUCKET}/census_embeddings_indexer/out/ \
---role poweromics
+--role poweromics --storage-capacity 4800
 ```

 (The `embeddings_s3_uris=s3_//...` with `s3_//` instead of `s3://` is a workaround for an AWS-side existence check that doesn't seem to work right on public buckets.)

-The Dockerfile has an argument for the TileDB-Vector-Search version to use. We should use the newest version that doesn't need a newer version of TileDB than the intended client tiledbsoma/cellxgene_census.
+The [Dockerfile](Dockerfile) has arguments for the TileDB-Py and TileDB-Vector-Search versions to use; see comments there for guidance on setting them.
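The `s3_//` workaround in the README implies that each masked URI has to be turned back into a real `s3://` URI before the embeddings array is opened. How the WDL task does this is not shown in this diff, so the following is only a sketch of the round trip; `mask_s3_uri` and `unmask_s3_uri` are hypothetical names introduced for illustration.

```python
MASKED_PREFIX = "s3_//"
REAL_PREFIX = "s3://"


def mask_s3_uri(uri: str) -> str:
    """Rewrite 's3://bucket/key' as 's3_//bucket/key' for the launcher command line."""
    if not uri.startswith(REAL_PREFIX):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return MASKED_PREFIX + uri[len(REAL_PREFIX):]


def unmask_s3_uri(uri: str) -> str:
    """Restore the real scheme; URIs that already start with 's3://' pass through."""
    if uri.startswith(MASKED_PREFIX):
        return REAL_PREFIX + uri[len(MASKED_PREFIX):]
    return uri


if __name__ == "__main__":
    real = "s3://cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-czi-6"
    masked = mask_s3_uri(real)
    print(masked)
    print(unmask_s3_uri(masked) == real)
```

Only the leading scheme is rewritten, so an `s3_//` sequence appearing later in an object key would be left untouched.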
@@ -0,0 +1,5 @@
+56a57
+> dimensions_override: int = -1,
+3144a3146,3147
+> if dimensions_override >= 0:
+> dimensions = min(dimensions, dimensions_override)
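The patch above adds a `dimensions_override` parameter (default `-1`) and, when it is non-negative, caps `dimensions` at that value. The logic can be read in isolation as the sketch below; `effective_dimensions` is a hypothetical standalone function written for illustration, not a function in tiledb-vector-search, and how `dimensions` is computed inside `ingestion.py` is not shown in this diff.

```python
def effective_dimensions(dimensions: int, dimensions_override: int = -1) -> int:
    """Mirror the patched logic: a non-negative override caps the dimension count."""
    if dimensions_override >= 0:
        dimensions = min(dimensions, dimensions_override)
    return dimensions


if __name__ == "__main__":
    print(effective_dimensions(512))        # no override: unchanged
    print(effective_dimensions(512, 256))   # override smaller: capped
    print(effective_dimensions(512, 1024))  # override larger: unchanged
```

Because the patch uses `min`, the override can only reduce the detected dimension count, never raise it.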
This is very minor, but numpy has comparison and assertion functions under `np.testing` that give helpful error messages. E.g.:
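The snippet that followed this comment is not preserved in this capture. As a hedged illustration of the kind of check the comment points to, the sketch below uses `np.testing.assert_allclose` and `np.testing.assert_array_equal`; the arrays and tolerances are made-up values, not taken from the PR.

```python
import numpy as np

expected = np.array([0.1, 0.2, 0.3])
actual = expected + 1e-9  # simulate tiny floating-point noise

# Passes: the values agree within the given relative tolerance.
np.testing.assert_allclose(actual, expected, rtol=1e-6, atol=0)

# Exact comparison, suited to integer ids or labels.
np.testing.assert_array_equal(np.array([1, 2, 3]), np.array([1, 2, 3]))

# On a mismatch, the AssertionError message describes the difference.
try:
    np.testing.assert_allclose(np.array([1.0, 2.0]), np.array([1.0, 2.5]), rtol=1e-6)
except AssertionError as err:
    mismatch_report = str(err)
    print(mismatch_report)
```

Compared with a bare `assert (a == b).all()`, these helpers include the compared arrays and tolerance details in the failure message, which is the benefit the comment describes.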