diff --git a/README.md b/README.md index d3b7463..7a5b974 100644 --- a/README.md +++ b/README.md @@ -9,6 +9,22 @@ It starts by downloading the [NCBI nr database in FASTA format](https://ftp.ncbi After clustering this file at 90% length and 90% identity, it then determines the lowest common ancestor for each cluster using the [prot.accession2taxid.FULL files](https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) (12Gb in March 2023) and the [taxdump files](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/). The final output includes the representative sequences in FASTA format, a TSV file that reports cluster representatives and members, and an SQLite DB with representative sequence names and their taxonomic lineages (as taxid and as names). +## Outputs & Downloads + +**The database and associated taxonomy files are available for download on [OSF](https://osf.io/tejwd/).** + +Description of output files: +* `nr_rep_seq.fasta.gz` (59GB): FASTA file of representative sequences output by mmseqs2 `easy-linclust`. +* `nr_cluster.tsv` (13.2GB): TSV file documenting cluster membership. The first column records the representative sequence identifier, while the second column records the sequence identifiers for member sequences of the cluster. +* `nr_cluster_taxid_formatted_final.tsv.gz` (1.4GB): TSV file recording the representative sequence for a cluster, the lowest common ancestor taxomony ID, the named lineage of the lowest common ancestor, and taxonomy ID lineage of the lowest common ancestor. A snippet of the file is presented below. +``` +rep taxid lca_taxid lca_lineage_named lca_lineage_taxid +0310191A 2517390 2517390 Eukaryota;Metazoa;Chordata;Amphibia;Anura;Hyperoliidae;Kassina;Kassina cochranae;unclassified Kassina cochranae subspecies/strain 2759;33208;7711;8292;8342;8412;8413;2517390; +0311203A 9031 9031 Eukaryota;Metazoa;Chordata;Aves;Galliformes;Phasianidae;Gallus;Gallus gallus;unclassified Gallus gallus subspecies/strain 2759;33208;7711;8782;8976;9005;9030;9031; +0311203B 9940 9940 Eukaryota;Metazoa;Chordata;Mammalia;Artiodactyla;Bovidae;Ovis;Ovis aries;unclassified Ovis aries subspecies/strain 2759;33208;7711;40674;91561;9895;9935;9940; +``` +* `nr_cluster_taxid_formatted_final.sqlite` (66GB): An SQLite database of the `nr_cluster_taxid_formatted_final.tsv.gz` TSV file. The name of the database was recorded as `nr_cluster_taxid_table` (see [this script](./scripts/make_sqlite_db.R)). For an example of how to use the database to assign lineages to BLAST results, see [this script](https://github.com/Arcadia-Science/2023-rehgt/blob/main/bin/blastp_add_taxonomy_info.R). + ## Getting started with this repository This repository uses snakemake to run the pipeline and conda to manage software environments and installations. @@ -37,3 +53,8 @@ snakemake -j 1 --use-conda --rerun-incomplete -k -n ``` where `-j` specifies the number of threads to run with, `--use-conda` uses conda to manage software environments, `--rerun-incomplete` re-runs incomplete files, `-k` tells the pipeline to continue with independent steps when one step fails, and `-n` signifies to run a dry run first. + +## Citation & contributing + +You can read more about this project in [this pub](https://doi.org/10.57844/arcadia-w8xt-pc81). +See [this guide](https://github.com/Arcadia-Science/arcadia-software-handbook/blob/main/guides-and-standards/guide-credit-for-contributions.md) to see how we recognize feedback and contributions on our code.