From 441a8dab85aa44d2dd97adf65a48cf2e887e809a Mon Sep 17 00:00:00 2001 From: Taylor Reiter Date: Wed, 31 May 2023 12:58:34 -0400 Subject: [PATCH 1/4] start add db and pub details to readme --- README.md | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/README.md b/README.md index d3b7463..56d733e 100644 --- a/README.md +++ b/README.md @@ -9,6 +9,22 @@ It starts by downloading the [NCBI nr database in FASTA format](https://ftp.ncbi After clustering this file at 90% length and 90% identity, it then determines the lowest common ancestor for each cluster using the [prot.accession2taxid.FULL files](https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) (12Gb in March 2023) and the [taxdump files](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/). The final output includes the representative sequences in FASTA format, a TSV file that reports cluster representatives and members, and an SQLite DB with representative sequence names and their taxonomic lineages (as taxid and as names). +## Outputs & Downloads + +**The database and associated taxonomy files are available for download on [OSF](https://osf.io/tejwd/).** + +Description of output files: +* `nr_rep_seq.fasta.gz`: FASTA file of representative sequences output by mmseqs2 `easy-linclust`. +* `nr_cluster.tsv`: TSV file documenting cluster membership. The first column records the representative sequence identifier, while the second column records the sequence identifiers for member sequences of the cluster. +* `nr_cluster_taxid_formatted_final.tsv.gz`: TSV file recording the representative sequence for a cluster, the lowest common ancestor taxomony ID, the named lineage of the lowest common ancestor, and taxonomy ID lineage of the lowest common ancestor. A snippet of the file is presented below. 
+``` +rep taxid lca_taxid lca_lineage_named lca_lineage_taxid +0310191A 2517390 2517390 Eukaryota;Metazoa;Chordata;Amphibia;Anura;Hyperoliidae;Kassina;Kassina cochranae;unclassified Kassina cochranae subspecies/strain 2759;33208;7711;8292;8342;8412;8413;2517390; +0311203A 9031 9031 Eukaryota;Metazoa;Chordata;Aves;Galliformes;Phasianidae;Gallus;Gallus gallus;unclassified Gallus gallus subspecies/strain 2759;33208;7711;8782;8976;9005;9030;9031; +0311203B 9940 9940 Eukaryota;Metazoa;Chordata;Mammalia;Artiodactyla;Bovidae;Ovis;Ovis aries;unclassified Ovis aries subspecies/strain 2759;33208;7711;40674;91561;9895;9935;9940; +``` +* `nr_cluster_taxid_formatted_final.sqlite`: An SQLite database of the `nr_cluster_taxid_formatted_final.tsv.gz` TSV file. The name of the database was recorded as `nr_cluster_taxid_table` (see [this script](./scripts/make_sqlite_db.R)). For an example of how to use the database to assign lineages to BLAST results, see [this script](https://github.com/Arcadia-Science/2023-rehgt/blob/main/bin/blastp_add_taxonomy_info.R). + ## Getting started with this repository This repository uses snakemake to run the pipeline and conda to manage software environments and installations. @@ -37,3 +53,8 @@ snakemake -j 1 --use-conda --rerun-incomplete -k -n ``` where `-j` specifies the number of threads to run with, `--use-conda` uses conda to manage software environments, `--rerun-incomplete` re-runs incomplete files, `-k` tells the pipeline to continue with independent steps when one step fails, and `-n` signifies to run a dry run first. + +## Citation + +This repository is associated with [this pub](). +You can read more about the project therein. 
From f493a2f1ff4b05ac280e02149f19d6d29c790bb9 Mon Sep 17 00:00:00 2001 From: Taylor Reiter Date: Fri, 2 Jun 2023 08:37:23 -0400 Subject: [PATCH 2/4] update readme --- README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/README.md b/README.md index 56d733e..017c898 100644 --- a/README.md +++ b/README.md @@ -56,5 +56,4 @@ where `-j` specifies the number of threads to run with, `--use-conda` uses conda ## Citation -This repository is associated with [this pub](). -You can read more about the project therein. +You can read more about this project in [this pub](https://doi.org/10.57844/arcadia-w8xt-pc81). From 271775b0c5a9b778f06b800598902bd1ce3c19cd Mon Sep 17 00:00:00 2001 From: Taylor Reiter Date: Fri, 2 Jun 2023 11:49:20 -0400 Subject: [PATCH 3/4] add file sizes --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 017c898..35be0a1 100644 --- a/README.md +++ b/README.md @@ -14,16 +14,16 @@ The final output includes the representative sequences in FASTA format, a TSV fi **The database and associated taxonomy files are available for download on [OSF](https://osf.io/tejwd/).** Description of output files: -* `nr_rep_seq.fasta.gz`: FASTA file of representative sequences output by mmseqs2 `easy-linclust`. -* `nr_cluster.tsv`: TSV file documenting cluster membership. The first column records the representative sequence identifier, while the second column records the sequence identifiers for member sequences of the cluster. -* `nr_cluster_taxid_formatted_final.tsv.gz`: TSV file recording the representative sequence for a cluster, the lowest common ancestor taxomony ID, the named lineage of the lowest common ancestor, and taxonomy ID lineage of the lowest common ancestor. A snippet of the file is presented below. +* `nr_rep_seq.fasta.gz` (59GB): FASTA file of representative sequences output by mmseqs2 `easy-linclust`. +* `nr_cluster.tsv` (13.2GB): TSV file documenting cluster membership. 
The first column records the representative sequence identifier, while the second column records the sequence identifiers for member sequences of the cluster. +* `nr_cluster_taxid_formatted_final.tsv.gz` (1.4GB): TSV file recording the representative sequence for a cluster, the lowest common ancestor taxonomy ID, the named lineage of the lowest common ancestor, and taxonomy ID lineage of the lowest common ancestor. A snippet of the file is presented below. ``` rep taxid lca_taxid lca_lineage_named lca_lineage_taxid 0310191A 2517390 2517390 Eukaryota;Metazoa;Chordata;Amphibia;Anura;Hyperoliidae;Kassina;Kassina cochranae;unclassified Kassina cochranae subspecies/strain 2759;33208;7711;8292;8342;8412;8413;2517390; 0311203A 9031 9031 Eukaryota;Metazoa;Chordata;Aves;Galliformes;Phasianidae;Gallus;Gallus gallus;unclassified Gallus gallus subspecies/strain 2759;33208;7711;8782;8976;9005;9030;9031; 0311203B 9940 9940 Eukaryota;Metazoa;Chordata;Mammalia;Artiodactyla;Bovidae;Ovis;Ovis aries;unclassified Ovis aries subspecies/strain 2759;33208;7711;40674;91561;9895;9935;9940; ``` -* `nr_cluster_taxid_formatted_final.sqlite`: An SQLite database of the `nr_cluster_taxid_formatted_final.tsv.gz` TSV file. The name of the database was recorded as `nr_cluster_taxid_table` (see [this script](./scripts/make_sqlite_db.R)). For an example of how to use the database to assign lineages to BLAST results, see [this script](https://github.com/Arcadia-Science/2023-rehgt/blob/main/bin/blastp_add_taxonomy_info.R). +* `nr_cluster_taxid_formatted_final.sqlite` (66GB): An SQLite database of the `nr_cluster_taxid_formatted_final.tsv.gz` TSV file. The name of the database was recorded as `nr_cluster_taxid_table` (see [this script](./scripts/make_sqlite_db.R)). For an example of how to use the database to assign lineages to BLAST results, see [this script](https://github.com/Arcadia-Science/2023-rehgt/blob/main/bin/blastp_add_taxonomy_info.R).
 ## Getting started with this repository From 1a49dc8ddde9add4afc0f8ab7a91e7a3364ac044 Mon Sep 17 00:00:00 2001 From: Taylor Reiter Date: Fri, 2 Jun 2023 11:50:55 -0400 Subject: [PATCH 4/4] add contributing link --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 35be0a1..7a5b974 100644 --- a/README.md +++ b/README.md @@ -54,6 +54,7 @@ snakemake -j 1 --use-conda --rerun-incomplete -k -n where `-j` specifies the number of threads to run with, `--use-conda` uses conda to manage software environments, `--rerun-incomplete` re-runs incomplete files, `-k` tells the pipeline to continue with independent steps when one step fails, and `-n` signifies to run a dry run first. -## Citation +## Citation & contributing You can read more about this project in [this pub](https://doi.org/10.57844/arcadia-w8xt-pc81). +See [this guide](https://github.com/Arcadia-Science/arcadia-software-handbook/blob/main/guides-and-standards/guide-credit-for-contributions.md) for how we recognize feedback and contributions on our code.
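
The README text added in this series describes the SQLite output but only links to an R script for querying it. As a rough illustration, the sketch below looks up lineages for a few representative sequence identifiers using Python's standard `sqlite3` module. It assumes that `nr_cluster_taxid_table` is the name of the table inside the database and that its columns match the TSV header shown above (`rep`, `taxid`, `lca_taxid`, `lca_lineage_named`, `lca_lineage_taxid`); neither is confirmed by the patch itself. The helper names `lookup_lineages` and `build_demo_db` are hypothetical, and the demo uses a tiny in-memory stand-in rather than the real 66GB file.

```python
import sqlite3

# Table name quoted from the README; treating it as the table (not the file)
# name, and reusing the TSV header as column names, are both assumptions.
TABLE = "nr_cluster_taxid_table"


def lookup_lineages(conn, rep_ids):
    """Return {rep: (lca_taxid, lca_lineage_named)} for the given representative IDs."""
    rep_ids = list(rep_ids)
    if not rep_ids:
        return {}
    # One "?" placeholder per ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in rep_ids)
    query = (
        f"SELECT rep, lca_taxid, lca_lineage_named FROM {TABLE} "
        f"WHERE rep IN ({placeholders})"
    )
    return {rep: (taxid, lineage) for rep, taxid, lineage in conn.execute(query, rep_ids)}


def build_demo_db():
    """Create a small in-memory database mimicking the assumed schema (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} (rep TEXT, taxid INTEGER, lca_taxid INTEGER, "
        "lca_lineage_named TEXT, lca_lineage_taxid TEXT)"
    )
    # Rows copied from the README snippet.
    rows = [
        ("0311203A", 9031, 9031,
         "Eukaryota;Metazoa;Chordata;Aves;Galliformes;Phasianidae;Gallus;"
         "Gallus gallus;unclassified Gallus gallus subspecies/strain",
         "2759;33208;7711;8782;8976;9005;9030;9031;"),
        ("0311203B", 9940, 9940,
         "Eukaryota;Metazoa;Chordata;Mammalia;Artiodactyla;Bovidae;Ovis;"
         "Ovis aries;unclassified Ovis aries subspecies/strain",
         "2759;33208;7711;40674;91561;9895;9935;9940;"),
    ]
    conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for rep, (taxid, lineage) in lookup_lineages(conn, ["0311203A"]).items():
        print(rep, taxid, lineage)
```

To run this against the downloaded file, `build_demo_db()` would be replaced with `sqlite3.connect("nr_cluster_taxid_formatted_final.sqlite")`. For long ID lists (for example, all hits from a BLAST run), it is probably safer to query in batches, since SQLite limits the number of bound parameters per statement.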