Alternative data sources #13

joicy · 2021-11-18T10:55:13Z

Hi there! I'm trying to set up Pyro-cov using GISAID but I don't have a data feed yet. Would it be possible to use the pipeline with other data sources, like FASTA files I have downloaded? It would be a pleasure to contribute new features.

dpark01 · 2021-11-18T11:34:56Z

Really ideal would be to use this data source which is pre-ETLed, cleaned up, easy to digest, and updated daily. Already comes with masked alignments and such. Folks use this routinely for tree building on global data sets.

Also a bonus: it’s in the exact same format as what GISAID data would look like if you used nextstrain’s ETL scripts on it.

fritzo · 2021-11-18T20:32:57Z

Hi @joicy, the easiest way to start work on this feature would be to update scripts/preprocess_gisaid.py and scripts/preprocess_nextclade.py to read from files other than results/gisaid.json. The least amount of code change would involve creating files in the gisaid.json format: a newline-delimited list of json dictionaries that look like:

{"covv_virus_name": ..., "covv_accession_id": ..., "covv_collection_date": ..., "covv_location": ..., "covv_add_location": ..., "covv_lineage": ..., "sequence": ...}
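As a rough illustration of the "create files in the gisaid.json format" route, here is a minimal sketch that turns a FASTA file plus a metadata lookup into newline-delimited JSON with the field names listed above. The helper names (read_fasta, fasta_to_gisaid_records, write_ndjson), the shape of the metadata lookup, and the choice of empty strings for missing fields are all assumptions made for illustration; they are not part of the repository, and the preprocessing scripts may require non-empty values.

```python
import json
from typing import Dict, Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_gisaid_records(
    fasta: TextIO, metadata: Dict[str, dict]
) -> Iterator[dict]:
    """Build one gisaid.json-style dict per sequence.

    `metadata` is a hypothetical lookup from FASTA header to per-sample
    fields; any field that is missing is written as an empty string.
    """
    for name, sequence in read_fasta(fasta):
        meta = metadata.get(name, {})
        yield {
            "covv_virus_name": name,
            # Fall back to the FASTA header when no accession is known.
            "covv_accession_id": meta.get("accession", name),
            "covv_collection_date": meta.get("date", ""),
            "covv_location": meta.get("location", ""),
            "covv_add_location": meta.get("add_location", ""),
            "covv_lineage": meta.get("lineage", ""),
            "sequence": sequence,
        }


def write_ndjson(records: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line; return how many were written."""
    count = 0
    for record in records:
        out.write(json.dumps(record) + "\n")
        count += 1
    return count
```

Usage would be along the lines of opening your FASTA file for reading and results/gisaid.json for writing, then calling write_ndjson(fasta_to_gisaid_records(fasta, metadata), out).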

If instead you want to support another format more natively, you could replace scripts/preprocess_gisaid.py with some other script, say scripts/preprocess_ncov.py, and make slight modifications to scripts/preprocess_nextclade.py.

The two intermediate files that are constructed from gisaid data are

results/columns.pkl, which is a column-oriented dict-of-lists with columns like day and lineage (but beware: this code is moving around quickly and the columns have been changing a lot; see the code for the source of truth).
results/aligndb, which is a place to cache the results of alignment and of running usher/pangolin to classify lineages. This is populated by preprocess_nextclade.py, which only needs the sequence and covv_accession_id fields.
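To make "column-oriented dict-of-lists" concrete, here is a minimal sketch that pivots row-style records into columns and pickles the result. The START_DATE epoch, the rows_to_columns helper, and the restriction to just the day and lineage columns are assumptions for illustration only; the real column set and date handling live in the preprocessing code and may differ.

```python
import pickle
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List

# Hypothetical reference date; the pipeline defines its own.
START_DATE = date(2019, 12, 1)


def rows_to_columns(rows: Iterable[dict]) -> Dict[str, List]:
    """Pivot row dicts into a dict of equal-length lists, one per column."""
    columns = defaultdict(list)
    for row in rows:
        # Assumes full YYYY-MM-DD dates; partial dates would need filtering.
        collected = date.fromisoformat(row["covv_collection_date"])
        columns["day"].append((collected - START_DATE).days)
        columns["lineage"].append(row["covv_lineage"])
    return dict(columns)


def save_columns(columns: Dict[str, List], path: str) -> None:
    """Pickle the dict-of-lists to disk."""
    with open(path, "wb") as handle:
        pickle.dump(columns, handle)
```

Row i of the dataset is then spread across columns["day"][i] and columns["lineage"][i], which keeps each column cheap to convert to an array later.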

Re: contributing, I'm happy to review PRs and help with writing unit tests and answer questions about code.

joicy · 2021-11-19T07:17:53Z

Hi @fritzo, thank you for the explanation. We just received the application form for the GISAID data feed, so this may not be necessary. However, I would like to try these changes in parallel. Could you provide me with an example results/gisaid.json file?

joicy · 2021-11-19T07:21:52Z

> Really ideal would be to use this data source which is pre-ETLed, cleaned up, easy to digest, and updated daily. Already comes with masked alignments and such. Folks use this routinely for tree building on global data sets.
>
> Also a bonus: it’s in the exact same format as what GISAID data would look like if you used nextstrain’s ETL scripts on it.

@dpark01 It is a really good resource, but our data is stored only in GISAID.

fritzo · 2021-11-22T14:48:26Z

Hi @joicy, realistically I won't have time to help out with this task until say January at earliest. We are focusing on just being able to run the model ourselves, after months of data drift changed model requirements (e.g. PANGO lineages are increasingly conflated, dates and locations are increasingly erroneous).

fritzo · 2021-12-05T17:32:04Z

@dpark01 can you provide any other links to the Nextstrain GenBank data feed you recommended? The docs are broken, so maybe I can follow one of your existing pipelines.

EDIT Aha, I see there are docs about remote inputs, which point to buckets on AWS and GCP:

$ gsutil ls gs://nextstrain-data/files/ncov/open/

gs://nextstrain-data/files/ncov/open/aligned.fasta.xz
gs://nextstrain-data/files/ncov/open/biosample.ndjson.gz
gs://nextstrain-data/files/ncov/open/biosample.ndjson.xz
gs://nextstrain-data/files/ncov/open/biosample.tsv.gz
gs://nextstrain-data/files/ncov/open/duplicate_biosample.txt.gz
gs://nextstrain-data/files/ncov/open/filtered.fasta.xz
gs://nextstrain-data/files/ncov/open/genbank.ndjson.xz
gs://nextstrain-data/files/ncov/open/masked.fasta.xz
gs://nextstrain-data/files/ncov/open/metadata.tsv.gz
gs://nextstrain-data/files/ncov/open/mutation-summary.tsv.xz
gs://nextstrain-data/files/ncov/open/nextclade.aligned.fasta.xz
gs://nextstrain-data/files/ncov/open/nextclade.tsv.gz
gs://nextstrain-data/files/ncov/open/sequences.fasta.xz
gs://nextstrain-data/files/ncov/open/africa/
gs://nextstrain-data/files/ncov/open/asia/
gs://nextstrain-data/files/ncov/open/europe/
gs://nextstrain-data/files/ncov/open/global/
gs://nextstrain-data/files/ncov/open/nextclade-full-run-2021-11-19--02-34-23--UTC/
gs://nextstrain-data/files/ncov/open/nextclade-full-run-2021-11-23--04-19-18--UTC/
gs://nextstrain-data/files/ncov/open/nextclade-full-run-2021-11-28--12-44-58--UTC/
gs://nextstrain-data/files/ncov/open/nextclade-full-run-2021-11-28--19-54-41--UTC/
gs://nextstrain-data/files/ncov/open/nextclade-full-run-2021-11-29--01-46-57--UTC/
gs://nextstrain-data/files/ncov/open/nextclade-full-run-2021-12-02--12-31-04--UTC/
gs://nextstrain-data/files/ncov/open/north-america/
gs://nextstrain-data/files/ncov/open/oceania/
gs://nextstrain-data/files/ncov/open/south-america/

dpark01 · 2021-12-06T14:19:53Z

@fritzo sorry for the delayed response, but yes, you've found the public GCS bucket we use to mirror their S3 bucket. Files that you may find useful:

aligned.fasta.xz, masked.fasta.xz: multiple sequence alignment (in fasta format), either masked for problematic sites or not.
metadata.tsv.gz - really nicely washed and cleaned metadata file, ready for nextstrain ingest, same format as emitted by their gisaid ETL scripts as well
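For anyone wanting to poke at these two kinds of files after downloading them, here is a minimal sketch that streams a gzip-compressed TSV and an xz-compressed FASTA using only the Python standard library. The function names are hypothetical, and no particular metadata column names are assumed; rows come back keyed by whatever header the file has.

```python
import csv
import gzip
import lzma
from typing import Iterator, Tuple


def read_metadata(path: str) -> Iterator[dict]:
    """Stream rows of a gzip-compressed TSV as dicts keyed by its header."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def read_xz_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Stream (header, sequence) pairs from an xz-compressed FASTA file."""
    with lzma.open(path, "rt") as handle:
        header, chunks = None, []
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)
```

Both functions are generators, so the global files can be processed one record at a time rather than loaded into memory.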

The regional build files (named by continents) are probably not of interest to you. The json-formatted metadata might be interesting, I haven't tried--I don't know if that's a more raw version of what NCBI Virus/BioSample provides than the metadata.tsv.

Our mirror syncs with their S3 bucket every morning (usually after their daily builds complete), and they tend to update their files every weekday from a continental Europe time zone.

A lot of our analyses (that desire really up-to-date global inputs) directly ingest from this bucket.

fritzo · 2022-02-04T20:33:00Z

Hi @joicy, we've revised our preprocessing pipeline to input an UShER tree built from open GenBank data, rather than a GISAID tree, so hopefully you should be able to run the model more easily. (We still support GISAID input, but you would need to build your own UShER tree for that usage.)

joicy · 2022-02-10T09:48:26Z

Hi @fritzo thank you for letting me know and for providing this feature. I'm gonna try to use it soon.

fritzo added the "enhancement" and "help wanted" labels Nov 18, 2021

cwhittaker1000 mentioned this issue Jul 28, 2022

alternative data sources/sharing replication dataset? #30

Open
