NCBI Index Compression #262

Merged
merged 211 commits into from
May 8, 2024

@morsecodist (Collaborator) commented Jun 5, 2023

NCBI Index NT and NR compression code

Changes:

  • Rust code (ncbi-compress package) to:

    1. split NT and NR by taxid, based on the provided NCBI mapping files
    2. sort the sequences within each taxid, longest first
    3. compress each taxid based on containment above a specified similarity threshold
    4. combine the individual taxids back into compressed versions of NT and NR
    5. shuffle the resulting compressed NT and NR (to avoid large blocks of SC2 accessions landing in a single chunk of the diamond or minimap chunked index)
  • Download NT/NR

    1. NCBI is moving BLAST DBs off of the FTP site, so we needed to switch to downloading NT and NR with NCBI-supported CLI tools.
  • Debugging Jupyter notebooks:

    1. notebook to query NCBI (helpful for spot-checking the 'all taxa with neither family nor genus' classification in CZID, to verify that all accessions in that bucket indeed do not have genus or species level classifications)
    2. notebook to easily query marisa trie files (helpful for finding out whether a particular accession ID is included in NT or NR)
    3. notebook to compare diamond and minimap alignment times between two projects (helpful for determining the run time difference between samples run on two different index versions)
    4. notebook to compare NCBI compress runs with multiple different parameters / input data
    5. notebook to create a taxon-lineage changelog for comp bio QA purposes
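
For readers who want a feel for steps 2 and 3 of the Rust pipeline (sort longest first, then drop sequences by containment), here is a minimal sketch. It is not the ncbi-compress implementation: the `Record` type, the function names, the use of exact k-mer sets, and the greedy comparison against every kept sequence are all assumptions made for illustration. The PR description only states that compression is based on containment above a specified similarity threshold.

```rust
use std::collections::HashSet;

/// Hypothetical record type (not from the PR): an accession ID and its sequence.
#[derive(Debug, Clone)]
struct Record {
    accession: String,
    sequence: String,
}

/// Set of distinct k-mers in a sequence.
/// Assumption: exact k-mers over raw bytes; `k` must be greater than 0.
fn kmers(seq: &str, k: usize) -> HashSet<&[u8]> {
    seq.as_bytes().windows(k).collect()
}

/// Fraction of the query's k-mers that also occur in the reference.
/// An empty query (sequence shorter than k) yields 0.0, so it is never dropped.
fn containment(query: &HashSet<&[u8]>, reference: &HashSet<&[u8]>) -> f64 {
    if query.is_empty() {
        return 0.0;
    }
    let shared = query.iter().filter(|km| reference.contains(*km)).count();
    shared as f64 / query.len() as f64
}

/// Greedy compression of the records belonging to one taxid:
/// sort longest first, then keep a record only if no already-kept
/// record contains it at or above `threshold`.
fn compress_taxid(mut records: Vec<Record>, k: usize, threshold: f64) -> Vec<Record> {
    // Stable sort, so records of equal length keep their input order.
    records.sort_by(|a, b| b.sequence.len().cmp(&a.sequence.len()));

    let mut kept: Vec<Record> = Vec::new();
    for rec in records {
        let redundant = {
            let query = kmers(&rec.sequence, k);
            kept.iter()
                .any(|r| containment(&query, &kmers(&r.sequence, k)) >= threshold)
        };
        if !redundant {
            kept.push(rec);
        }
    }
    kept
}
```

As a usage example, calling `compress_taxid` with k = 4 and a threshold of 0.9 on three records with sequences `ACGTACGTTGCA` (A1), `ACGTACGT` (A2) and `GGGGCCCCAAAA` (A3) keeps A1 and A3 and drops A2, because every 4-mer of A2 also occurs in A1. Sorting longest first matters here: it ensures the shorter, contained sequence is compared against the longer one rather than the other way around. Recomputing k-mer sets for every comparison is deliberately naive and would not scale to NT or NR.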

@morsecodist (Collaborator, Author) left a comment

Looks great overall! Just left two tiny comments. It was great working with you on all of this.

@phoenixAja marked this pull request as ready for review April 23, 2024 21:22
@phoenixAja self-requested a review April 30, 2024 20:31
@phoenixAja merged commit 6d4a2be into main May 8, 2024
17 checks passed
@phoenixAja deleted the tmorse-nt-compression branch May 8, 2024 22:33
2 participants