Sourmash in MGnify #1577

Open
gustavo-salazar opened this issue Jun 9, 2021 · 9 comments

@gustavo-salazar

Hello there,
We want to use sourmash to power the search in the MGnify genome catalog.

Right now we are only at the prototyping stage, but I would like to share our high-level plan here and ask you to let us know if you see any red flags in our approach.

  1. Calculate the signatures for our catalog. This is already done; we've used the sourmash Python package to sketch and index them (a rough sketch of this step follows after this list).
  2. When a user wants to query our catalog, we will create signature(s) for their query using the WASM + Rust client and send them to our API.
  3. The API will queue the request, run gather for the query against our catalog using the Python sourmash package, and return the results.
  4. Parse and display the results in our web client.
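For reference, the sketching part of step 1 boils down to something like the snippet below. This is only a rough sketch with the sourmash Python API; the k-mer size, scaled value and file names are illustrative rather than what we actually use for the catalog.

```python
import screed  # sequence reader that ships as a sourmash dependency
import sourmash
from sourmash import MinHash, SourmashSignature

# Illustrative parameters only; the real catalog may use different values.
KSIZE = 31
SCALED = 1000

def sketch_genome(fasta_path, name):
    """Build a scaled MinHash sketch for a single catalog genome."""
    mh = MinHash(n=0, ksize=KSIZE, scaled=SCALED)
    for record in screed.open(fasta_path):
        # force=True skips k-mers containing characters other than ACGT
        mh.add_sequence(record.sequence, force=True)
    return SourmashSignature(mh, name=name)

# Placeholder file names, for illustration only.
sig = sketch_genome("MGYG000000001.fna", "MGYG000000001")
with open("MGYG000000001.sig", "wt") as fp:
    sourmash.save_signatures([sig], fp)
```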

Please let me know if you see any issue with this high-level picture.

Thanks for your help,

Gustavo.

@luizirber
Member

Hi @gustavo-salazar!

I don't see any issues with this picture, that's pretty much what I did in greyhound.sourmash.bio (code), but using a Rust backend for step 3. As is, greyhound will not scale easily to MGnify-levels, but some things to consider for a more scalable solution:

Part 2: creating the query signature

  • The way I implemented it in greyhound ends up doing all the work in the main thread, which... pretty much kills the browser for large metagenomes 😅. The proper solution is running the sketching in Web Workers, so the browser keeps working properly.
  • The sourmash NPM package is not very ergonomic, so we can work on improving that based on your needs...
  • (my dream was to make a web component and make it easy to use sourmash with any frontend framework, but writing something framework-agnostic is complicated =])

Part 3: gather in the backend

  • Something we realised during greyhound/prefetch was that we can split the process into two steps: first build a counter (preferably with access to a fast index, like LCA/RevIndex), and then use the counter to generate the gather results/CSV. This way you can build a service using one worker holding the index for the first part, and other workers with access to the signatures (possibly in a shared drive/location/bucket) for the second part.
  • The first step is more expensive memory-wise (if using revindex/LCA), but can calculate counters quickly. The second step needs access to the signatures, but doesn't require much memory. I also suspect the first step is way faster, so having more workers for step 2 (all feeding from one worker for step 1) is probably feasible.
  • In practical terms, the first step needs a sourmash instance holding an opened Index in memory and accepting query signatures. Holding the index open is especially advantageous with revindex/LCA, since they have non-trivial loading overhead and benefit from long-running processes.
  • The second step can be started on demand and load the signatures requested in the counter as needed (see the sketch after this list).
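In rough Python terms the split could look like the sketch below. This is only a sketch: the "counter" is simplified to a plain list of overlapping catalog entries, `load_file_as_index()`/`prefetch()` are taken from recent sourmash versions and the exact calls may differ, and the queue between the two workers plus all error handling are omitted.

```python
import subprocess
import sourmash

# --- worker 1: long-running process that keeps the memory-hungry index open ---
catalog = sourmash.load_file_as_index("catalog.lca.json.gz")  # loaded once at startup

def find_overlapping_members(query_sig, threshold_bp=50_000):
    """Return the locations of catalog members sharing >= threshold_bp with the query."""
    # This is the "build a counter" step: a cheap containment scan over the fast index.
    return [result.location for result in catalog.prefetch(query_sig, threshold_bp)]

# --- worker 2: started on demand, only needs access to the matched signatures ---
def gather_against_matches(query_sig_path, match_paths, csv_path):
    """Run full gather restricted to the signatures identified by worker 1."""
    subprocess.run(
        ["sourmash", "gather", query_sig_path, *match_paths, "-o", csv_path],
        check=True,
    )
```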

Please keep asking questions, and I can also help with PRs if the code is public.

@gustavo-salazar
Author

Hey @luizirber
Thanks so much for your answer.

Part 2

Indeed, a web worker makes a lot of sense here; I'll consider it once I get to that part. I have some experience with web components, so I'll definitely give creating this as one a try. And TBH, I don't think I'll be using any advanced features of the sourmash NPM package, and will probably follow logic similar to what you have in your blog post. But of course I will let you know if I have any feedback.

Part 3

This is where I have just started prototyping. To be honest, my plan was either to copy most of the code you have in the gather command, mostly to format its output for our purposes, or to mock the args object, call that function directly, take the CSV, and return it as is.
I guess for now I'm interested in having a working version on our side, and later we can focus on optimizations like the one you suggest of keeping the index preloaded.
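Concretely, I expect the first version of the queued job to be little more than the sketch below. Everything in it is a placeholder (the broker URL, catalog path and task name are made up), and we may end up calling the Python API instead of shelling out to the CLI.

```python
# Hypothetical Celery task; broker URL, paths and names are placeholders.
import csv
import subprocess
from celery import Celery

app = Celery("mgnify_sourmash",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

CATALOG_INDEX = "/shared/catalogues/human-gut-v1-0.sbt.zip"  # placeholder path

@app.task
def run_gather(query_sig_path, ksize=31):
    """Run `sourmash gather` for an uploaded query signature and return the CSV rows."""
    csv_path = query_sig_path + ".gather.csv"
    subprocess.run(
        ["sourmash", "gather", "-k", str(ksize),
         query_sig_path, CATALOG_INDEX, "-o", csv_path],
        check=True,
    )
    with open(csv_path, newline="") as fh:
        return list(csv.DictReader(fh))
```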


I don't have a repo yet; for now I'm just playing around to define our architecture for it, but once I have something I will share the links here. Thanks again for your help.

@gustavo-salazar
Author

Hey y'all,

After some tweaking and tuning with webpack 5, I managed to generate my first set of signatures in the client. I'm basically copying what @luizirber did for his blog here, and so far I'm only processing uncompressed FASTA files.

Here is the repo if anyone is interested: https://github.com/EBI-Metagenomics/mgnify-sourmash-component

@luizirber
Member

> After some tweaking and tuning with webpack 5, I managed to generate my first set of signatures in the client. I'm basically copying what @luizirber did for his blog here, and so far I'm only processing uncompressed FASTA files.

This is wonderful 🤩

Something I did in greyhound, and I think is doable in #1625, is moving the FASTA/Q parsing (including .gz-compressed input) into Rust (using needletail for the parsing). You already did the hard work (loading everything into a separate worker), and this would also avoid the dependencies on fasta-parser, fastqstream, filestream, peek-stream, pumpify, stream and through2, which are a bit wonky to make work correctly (every time I tried to touch/change them, something broke 😂).

The function body in Rust would be something like this, but taking the File or FileReadStream from https://github.com/EBI-Metagenomics/mgnify-sourmash-component/blob/ab7492263a5c319804358d2a71a018517d698e8d/src/sketcher.worker.ts#L68 instead. The function should also provide a callback for tracking progress.

(the added benefit is that the parsing will be much faster...)

@gustavo-salazar
Author

Oh that's excellent news!🎉

The problem with those dependencies is that they all make their own assumptions about the Web Streams API, but that standard is still under development and some of its updates are not backwards compatible. This change will save me quite a few headaches. 😅
The other "issue" is that webpack 5 decided not to emulate Node functionality anymore, so things like zlib or even buffer needed manual setup; nothing too complicated, but the config was growing and the dependencies kept increasing.

I'll keep the FASTA parser until you can release #1625, but I won't keep trying to make .gz or .fastq work, which I was already suffering with! Do you have a timeline for it? I don't want to put extra pressure on you, I just want to plan accordingly on my side.

@gustavo-salazar
Author

Hey @luizirber @ctb

I just completed a first prototype of the system, which you can play with HERE.
Bear in mind this is just a prototype and is under active development, but it at least gives a view of all the pieces running together. Please let me know about any issues or comments you have on it.

If you are interested in the code, it is split into 4 components:

  1. The web component that creates the signature: Github Repo
  2. A queuing system using celery to execute gather in the server: Github Repo
  3. New endpoints on our API to trigger gather jobs or retrieve results (roughly the submit/poll pattern sketched after this list): Branch changes
  4. The inclusion of the web component in the MGnify web client and orchestration of the API calls: Branch changes
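For component 3, the new endpoints are essentially a submit/poll pair around the Celery task. The sketch below is purely illustrative: our real API is Django-based, and the routes, the `save_uploaded_signature` helper and the `worker` module are placeholders.

```python
# Illustrative submit/poll endpoints; every name here (routes, helper, module) is a placeholder.
from celery.result import AsyncResult
from flask import Flask, jsonify, request

from worker import app as celery_app, run_gather  # hypothetical module holding the Celery task

api = Flask(__name__)

@api.post("/genomes-search/gather")
def submit_gather():
    # In the real flow the signature JSON comes from the web component (component 1).
    query_sig_path = save_uploaded_signature(request.get_json())  # hypothetical helper
    job = run_gather.delay(query_sig_path)
    return jsonify({"job_id": job.id}), 202

@api.get("/genomes-search/gather/<job_id>")
def poll_gather(job_id):
    result = AsyncResult(job_id, app=celery_app)
    if not result.ready():
        return jsonify({"status": result.state}), 202
    return jsonify({"status": "SUCCESS", "results": result.get()})
```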

@luizirber
Member

This is awesome!

@gustavo-salazar
Author

Hey @luizirber @ctb

We have iterated a bit on the feature that uses sourmash to search our catalogues, and we are close to releasing it live.
You can see it in our DEV environment: https://wwwdev.ebi.ac.uk/metagenomics/genome-catalogues/human-gut-v1-0

Among other changes, it now includes the sourmash logo; I just wanted to make sure you are OK with that.

BTW @luizirber, have you made any progress on including the .fastq and .gz parsing in the Rust code? No pressure, but it would be cool to support those formats in the component.

@luizirber
Member

@gustavo-salazar #3047 implements sequence parsing in WASM, and EBI-Metagenomics/mgnify-sourmash-component#4 adds it to the MGnify component.

luizirber added a commit that referenced this issue Mar 23, 2024
Address #1577 (comment)

This PR implements `Read` for `File` in browsers, which allows using `niffler` + `needletail` to parse FASTA/Q files, `.gz`-compressed or not.

I also added error handling, so the browser can print nicer error
messages instead of something cryptic to `console.log`.