Update embeddings tools for 2025-01-30 LTS (#1354)
A variety of minor changes and README updates for our embedding preparation tooling, motivated by exercising it for the new LTS.
Showing 17 changed files with 118 additions and 77 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,22 @@ | ||
FROM ubuntu:22.04 | ||
# TILEDB_VECTOR_SEARCH_VERSION should be the newest that doesn't need a newer version of tiledb | ||
# than the client tiledbsoma: https://github.com/TileDB-Inc/TileDB-Vector-Search/blob/0.2.2/pyproject.toml | ||
ARG TILEDB_VECTOR_SEARCH_VERSION=0.2.2 | ||
|
||
# TILEDB_PY_VERSION should be set such that the TileDB Embedded version will match that used by | ||
# tiledbsoma in cellxgene_census_builder and census_contrib. | ||
# https://github.com/single-cell-data/TileDB-SOMA/blob/1.15.3/libtiledbsoma/cmake/Modules/FindTileDB_EP.cmake#L93 (2.27.0) | ||
# == | ||
# https://github.com/TileDB-Inc/TileDB-Py/blob/0.33.3/CMakeLists.txt#L49 (2.27.0) | ||
ARG TILEDB_PY_VERSION=0.33.3 | ||
# TILEDB_VECTOR_SEARCH_VERSION should be the newest compatible with TILEDB_PY_VERSION. | ||
# https://github.com/TileDB-Inc/TileDB-Vector-Search/blob/0.11.0/pyproject.toml#L23 (tiledb-py>=0.32.0) | ||
ARG TILEDB_VECTOR_SEARCH_VERSION=0.11.0 | ||
|
||
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \ | ||
python3-pip | ||
RUN pip3 install \ | ||
cellxgene_census \ | ||
tiledb==$TILEDB_PY_VERSION \ | ||
tiledb-vector-search==$TILEDB_VECTOR_SEARCH_VERSION | ||
|
||
# FIXME: monkey patch tiledb-vector-search 0.11 for https://github.com/TileDB-Inc/TileDB-Vector-Search/issues/564 | ||
# This should be removed when we update to a new version addressing that issue. | ||
ADD ingestion.py.patch /tmp | ||
RUN patch /usr/local/lib/python3.10/dist-packages/tiledb/vector_search/ingestion.py /tmp/ingestion.py.patch |
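The version constraint described in the Dockerfile comments (TileDB-Vector-Search 0.11.0 requiring `tiledb-py>=0.32.0`) can be sanity-checked with a few lines of Python before building the image. This is a minimal sketch that assumes plain dotted numeric version strings; the helper names `parse_version` and `satisfies_minimum` are hypothetical and not part of this repository or of TileDB.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '0.33.3' into (0, 33, 3)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


# Values copied from the Dockerfile ARG and its comment above.
TILEDB_PY_VERSION = "0.33.3"
VECTOR_SEARCH_MIN_TILEDB_PY = "0.32.0"

if __name__ == "__main__":
    ok = satisfies_minimum(TILEDB_PY_VERSION, VECTOR_SEARCH_MIN_TILEDB_PY)
    print("tiledb-py pin is compatible:", ok)
```

The sketch only covers the lower bound quoted in the comment; it does not check that the TileDB Embedded versions match, which still has to be confirmed from the linked CMake files.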
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,33 +1,31 @@ | ||
# census_embeddings_indexer | ||
|
||
This is a Docker+WDL pipeline to build [TileDB-Vector-Search](https://github.com/TileDB-Inc/TileDB-Vector-Search) indexes for Census cell embeddings, supporting cell similarity search in embedding space. It's meant to run on the AWS HealthOmics workflow service using the [miniwdl-omics-run](https://github.com/miniwdl-ext/miniwdl-omics-run) launcher (assuming account setup documented there). | ||
This is a Docker+WDL pipeline to build [TileDB-Vector-Search](https://github.com/TileDB-Inc/TileDB-Vector-Search) indexes for Census cell embeddings, supporting cell similarity search in embedding space. It's meant to run on the AWS HealthOmics workflow service using the [miniwdl-omics-run](https://github.com/miniwdl-ext/miniwdl-omics-run) launcher (`pip3 install miniwdl-omics-run`; one-time account setup steps documented there are probably already done in the relevant CZI AWS account). | ||
|
||
The pipeline consumes one or more of the existing TileDB arrays for hosted and contributed [Census embeddings](https://cellxgene.cziscience.com/census-models) stored on S3. The resulting indexes are themselves TileDB groups to be stored on S3. | ||
The pipeline consumes one or more of the existing TileDB arrays for [Census embeddings](https://cellxgene.cziscience.com/census-models) stored on S3. The resulting indexes are themselves TileDB groups to be stored on S3. | ||
|
||
```bash | ||
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) | ||
export AWS_DEFAULT_REGION=$(aws configure get region) | ||
export ECR_ENDPT=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com | ||
export WDL_OUTPUT_BUCKET=mlin-census-screatch | ||
export WDL_OUTPUT_BUCKET=mlin-census-scratch | ||
|
||
docker build -t ${ECR_ENDPT}/omics:census_embeddings_indexer . | ||
docker build --platform linux/amd64 -t ${ECR_ENDPT}/omics:census_embeddings_indexer . | ||
aws ecr get-login-password | docker login --username AWS --password-stdin "$ECR_ENDPT" | ||
docker push ${ECR_ENDPT}/omics:census_embeddings_indexer | ||
|
||
miniwdl-omics-run census_embeddings_indexer.wdl \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-czi-1 \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-czi-4 \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-czi-5 \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-contrib-1 \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-contrib-2 \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2023-12-15/CxG-contrib-3 \ | ||
census_version=2023-12-15 \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-czi-6 \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-czi-7 \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-czi-8 \ | ||
embeddings_s3_uris=s3_//cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-contrib-7 \ | ||
census_version=2024-07-01 \ | ||
s3_region=$AWS_DEFAULT_REGION \ | ||
docker=${ECR_ENDPT}/omics:census_embeddings_indexer \ | ||
--output-uri s3://${WDL_OUTPUT_BUCKET}/census_embeddings_indexer/out/ \ | ||
--role poweromics | ||
--role poweromics --storage-capacity 4800 | ||
``` | ||
|
||
(The `embeddings_s3_uris=s3_//...` with `s3_//` instead of `s3://` is a workaround for an AWS-side existence check that doesn't seem to work right on public buckets.) | ||
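Because the inputs arrive with the `s3_//` placeholder, something downstream has to restore the real scheme before the arrays are opened. How the WDL task does this is not shown in this diff; the following is only a sketch of the idea, with a hypothetical helper name `restore_s3_uri`.

```python
def restore_s3_uri(uri: str) -> str:
    """Map the 's3_//' workaround prefix back to a real 's3://' URI.

    URIs that do not start with the placeholder are returned unchanged.
    """
    placeholder = "s3_//"
    if uri.startswith(placeholder):
        return "s3://" + uri[len(placeholder):]
    return uri


if __name__ == "__main__":
    example = "s3_//cellxgene-contrib-public/contrib/cell-census/soma/2024-07-01/CxG-czi-6"
    print(restore_s3_uri(example))
```

Only the leading prefix is rewritten, so a key that happens to contain `s3_//` later in the path is left alone.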
|
||
The Dockerfile has an argument for the TileDB-Vector-Search version to use. We should use the newest version that doesn't need a newer version of TileDB than the intended client tiledbsoma/cellxgene_census. | ||
The [Dockerfile](Dockerfile) has arguments for the TileDB-Py and TileDB-Vector-Search versions to use; see comments there for guidance on setting them. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
56a57 | ||
> dimensions_override: int = -1, | ||
3144a3146,3147 | ||
> if dimensions_override >= 0: | ||
> dimensions = min(dimensions, dimensions_override) |
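The patch above is in normal `diff` format: `56a57` adds a `dimensions_override` parameter (default `-1`) after line 56 of `ingestion.py`, and `3144a3146,3147` adds the two lines that apply it. The effect of those two lines can be sketched in isolation as follows; the function name `effective_dimensions` is hypothetical, and the surrounding function in TileDB-Vector-Search is not reproduced here.

```python
def effective_dimensions(dimensions: int, dimensions_override: int = -1) -> int:
    """Mirror the patched logic: a non-negative override caps the dimensions."""
    if dimensions_override >= 0:
        dimensions = min(dimensions, dimensions_override)
    return dimensions


if __name__ == "__main__":
    print(effective_dimensions(768))        # no override given
    print(effective_dimensions(768, 512))   # override smaller than detected
    print(effective_dimensions(256, 512))   # override larger than detected
```

Note that the override can only reduce the value, never raise it, because of the `min`; the default of `-1` leaves the detected dimensions untouched.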