Sourmash in MGnify #1577

Open
gustavo-salazar opened this issue Jun 9, 2021 · 9 comments

@gustavo-salazar

Hello there,
We want to use sourmash to power the search in the MGnify genome catalog.

Right now we are only at the prototyping stage, but I would like to share our high-level plan here and ask you to let us know if you see any red flags in our approach.

  1. Calculate the signatures for our catalog. This is already done; we've used the sourmash Python package to sketch and index them (a rough sketch of this step follows after this list).
  2. When a user wants to query our catalog, we will create signature(s) for their query using the WASM + Rust client and send them to our API.
  3. The API will queue the request, run gather for the query against our catalog using the Python sourmash package, and return the results.
  4. Parse and display the results in our web client.
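For reference, the sketching part of step 1 boils down to something like the snippet below. This is only a rough sketch with the sourmash Python API; the k-mer size, scaled value and file names are illustrative rather than what we actually use for the catalog.

```python
import screed  # sequence reader that ships as a sourmash dependency
import sourmash
from sourmash import MinHash, SourmashSignature

# Illustrative parameters only; the real catalog may use different values.
KSIZE = 31
SCALED = 1000

def sketch_genome(fasta_path, name):
    """Build a scaled MinHash sketch for a single catalog genome."""
    mh = MinHash(n=0, ksize=KSIZE, scaled=SCALED)
    for record in screed.open(fasta_path):
        # force=True skips k-mers containing characters other than ACGT
        mh.add_sequence(record.sequence, force=True)
    return SourmashSignature(mh, name=name)

# Placeholder file names, for illustration only.
sig = sketch_genome("MGYG000000001.fna", "MGYG000000001")
with open("MGYG000000001.sig", "wt") as fp:
    sourmash.save_signatures([sig], fp)
```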

Please let me know if you see any issue with this high-level picture.

Thanks for your help,

Gustavo.

@luizirber
Member

Hi @gustavo-salazar!

I don't see any issues with this picture, that's pretty much what I did in greyhound.sourmash.bio (code), but using a Rust backend for step 3. As is, greyhound will not scale easily to MGnify-levels, but some things to consider for a more scalable solution:

Part 2: creating the query signature

  • The way I implemented it in greyhound ends up doing all the work in the main thread, which... pretty much kills the browser for large metagenomes 😅. The proper solution is running the sketching in Web Workers, so the browser keeps working properly.
  • The sourmash NPM package is not very ergonomic, so we can work on improving that based on your needs...
  • (my dream was to make a web component and make it easy to use sourmash with any frontend framework, but writing something framework-agnostic is complicated =])

Part 3: gather in the backend

  • Something we realised during greyhound/prefetch was that we can split the process into two steps: first build a counter (preferably with access to a fast index, like LCA/RevIndex), and then use the counter to generate the gather results/CSV. This way you can build a service using one worker holding the index for the first part, and other workers with access to the signatures (possibly in a shared drive/location/bucket) for the second part.
  • The first step is more expensive memory-wise (if using revindex/LCA), but can calculate counters quickly. The second step needs access to the signatures, but doesn't require much memory. I also suspect the first step is way faster, so having more workers for step 2 (all feeding from one worker for step 1) is probably feasible.
  • In practical terms, the first step needs a sourmash instance holding an opened Index in memory and accepting query signatures. Holding the index open is especially advantageous with revindex/LCA, since they have non-trivial loading overhead and benefit from long-running processes.
  • The second step can be started on demand and load the signatures requested in the counter as needed (see the sketch after this list).
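In rough Python terms the split could look like the sketch below. This is only a sketch: the "counter" is simplified to a plain list of overlapping catalog entries, `load_file_as_index()`/`prefetch()` are taken from recent sourmash versions and the exact calls may differ, and the queue between the two workers plus all error handling are omitted.

```python
import subprocess
import sourmash

# --- worker 1: long-running process that keeps the memory-hungry index open ---
catalog = sourmash.load_file_as_index("catalog.lca.json.gz")  # loaded once at startup

def find_overlapping_members(query_sig, threshold_bp=50_000):
    """Return the locations of catalog members sharing >= threshold_bp with the query."""
    # This is the "build a counter" step: a cheap containment scan over the fast index.
    return [result.location for result in catalog.prefetch(query_sig, threshold_bp)]

# --- worker 2: started on demand, only needs access to the matched signatures ---
def gather_against_matches(query_sig_path, match_paths, csv_path):
    """Run full gather restricted to the signatures identified by worker 1."""
    subprocess.run(
        ["sourmash", "gather", query_sig_path, *match_paths, "-o", csv_path],
        check=True,
    )
```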

Please keep asking questions, and I can also help with PRs if the code is public.

@gustavo-salazar
Author

Hey @luizirber
Thanks so much for your answer.

Part 2

Indeed, a web worker makes a lot of sense here; I'll consider it once I get to that part. I have some experience with web components, so I'll definitely give creating this as one a try. And TBH, I don't think I'll be using any advanced features of the sourmash NPM package, and will probably follow logic similar to what you have in your blog post. But of course I will let you know if I have any feedback.

Part 3

This is where I have just started prototyping. To be honest, my plan was either to copy most of the code you have in the gather command, mostly to format its output for our purposes, or to mock the args object, call that function directly, take the CSV, and return it as is.
I guess for now I'm interested in having a working version on our side, and later we can focus on optimizations like the one you suggest of keeping the index preloaded.
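Concretely, I expect the first version of the queued job to be little more than the sketch below. Everything in it is a placeholder (the broker URL, catalog path and task name are made up), and we may end up calling the Python API instead of shelling out to the CLI.

```python
# Hypothetical Celery task; broker URL, paths and names are placeholders.
import csv
import subprocess
from celery import Celery

app = Celery("mgnify_sourmash",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

CATALOG_INDEX = "/shared/catalogues/human-gut-v1-0.sbt.zip"  # placeholder path

@app.task
def run_gather(query_sig_path, ksize=31):
    """Run `sourmash gather` for an uploaded query signature and return the CSV rows."""
    csv_path = query_sig_path + ".gather.csv"
    subprocess.run(
        ["sourmash", "gather", "-k", str(ksize),
         query_sig_path, CATALOG_INDEX, "-o", csv_path],
        check=True,
    )
    with open(csv_path, newline="") as fh:
        return list(csv.DictReader(fh))
```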


I don't have a repo yet; for now I'm just playing around to define our architecture for it, but once I have something I will share the links here. Thanks again for your help.

@gustavo-salazar
Author

Hey y'all,

After some tweaking and tuning with webpack 5, I managed to generate my first set of signatures in the client. I'm basically copying what @luizirber did for his blog here, and so far I'm only processing uncompressed FASTA files.

Here is the repo if anyone is interested: https://github.com/EBI-Metagenomics/mgnify-sourmash-component

@luizirber
Member

> After some tweaking and tuning with webpack 5, I managed to generate my first set of signatures in the client. I'm basically copying what @luizirber did for his blog here, and so far I'm only processing uncompressed FASTA files.

This is wonderful 🤩

Something I did in greyhound, and I think is doable in #1625, is moving the FASTA/Q parsing (including .gz-compressed input) into Rust (using needletail for the parsing). You already did the hard work (loading everything into a separate worker), and this would also avoid the dependencies on fasta-parser, fastqstream, filestream, peek-stream, pumpify, stream and through2, which are a bit wonky to make work correctly (every time I tried to touch/change them, something broke 😂).

The function body in Rust would be something like this, but taking the File or FileReadStream from https://github.com/EBI-Metagenomics/mgnify-sourmash-component/blob/ab7492263a5c319804358d2a71a018517d698e8d/src/sketcher.worker.ts#L68 instead. The function should also provide a callback for tracking progress.

(the added benefit is that the parsing will be much faster...)

@gustavo-salazar
Author

Oh that's excellent news!🎉

The problem with those dependencies is that they all make their own assumptions about the Web Streams API, but that standard is still under development and some of its updates are not backwards compatible. This change will save me quite a few headaches. 😅
The other "issue" is that webpack 5 decided not to emulate Node functionality anymore, so things like zlib or even buffer needed manual setup; nothing too complicated, but the config was growing and the dependencies kept increasing.

I'll keep the FASTA parser until you can release #1625, but I won't keep trying to make .gz or .fastq work, which I was already suffering with! Do you have a timeline for it? I don't want to put extra pressure on you, I just want to plan accordingly on my side.

@gustavo-salazar
Author

Hey @luizirber @ctb

I just completed a first prototype of the system, which you can play with HERE.
Bear in mind this is just a prototype and is under active development, but it at least gives a view of all the pieces running together. Please let me know about any issues or comments you have on it.

If you are interested in the code, it is split into 4 components:

  1. The web component that creates the signature: Github Repo
  2. A queuing system using celery to execute gather in the server: Github Repo
  3. New endpoints on our API to trigger gather jobs or retrieve results (roughly the submit/poll pattern sketched after this list): Branch changes
  4. The inclusion of the web component in the MGnify web client and orchestration of the API calls: Branch changes
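For component 3, the new endpoints are essentially a submit/poll pair around the Celery task. The sketch below is purely illustrative: our real API is Django-based, and the routes, the `save_uploaded_signature` helper and the `worker` module are placeholders.

```python
# Illustrative submit/poll endpoints; every name here (routes, helper, module) is a placeholder.
from celery.result import AsyncResult
from flask import Flask, jsonify, request

from worker import app as celery_app, run_gather  # hypothetical module holding the Celery task

api = Flask(__name__)

@api.post("/genomes-search/gather")
def submit_gather():
    # In the real flow the signature JSON comes from the web component (component 1).
    query_sig_path = save_uploaded_signature(request.get_json())  # hypothetical helper
    job = run_gather.delay(query_sig_path)
    return jsonify({"job_id": job.id}), 202

@api.get("/genomes-search/gather/<job_id>")
def poll_gather(job_id):
    result = AsyncResult(job_id, app=celery_app)
    if not result.ready():
        return jsonify({"status": result.state}), 202
    return jsonify({"status": "SUCCESS", "results": result.get()})
```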

@luizirber
Member

This is awesome!

@gustavo-salazar
Author

Hey @luizirber @ctb

We have iterated a bit on the feature that uses sourmash to search our catalogues, and we are close to releasing it live.
You can see it in our DEV environment: https://wwwdev.ebi.ac.uk/metagenomics/genome-catalogues/human-gut-v1-0

Among other changes, it now includes the sourmash logo; I just wanted to make sure you are OK with that.

BTW @luizirber, have you made any progress on including the .fastq and .gz parsing in the Rust code? No pressure, but it would be cool to support those formats in the component.

@luizirber
Member

@gustavo-salazar #3047 implements sequence parsing in WASM, and EBI-Metagenomics/mgnify-sourmash-component#4 adds it to the MGnify component.

luizirber added a commit that referenced this issue Mar 23, 2024
Address #1577 (comment)

This PR implements `Read` for `File` in browsers, which allows using `niffler` + `needletail` to parse FASTA/Q files, `.gz`-compressed or not.

I also added error handling, so the browser can print nicer error
messages instead of something cryptic to `console.log`.