Merge pull request #7 from Arcadia-Science/ter/readme

Add database download links and pub details to readme
Arcadia-Science · Jun 2, 2023 · 8e64581
2 parents cdd2df9 + 1a49dc8
commit 8e64581
Showing 1 changed file with 21 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -9,6 +9,22 @@ It starts by downloading the [NCBI nr database in FASTA format](https://ftp.ncbi
 After clustering this file at 90% length and 90% identity, it then determines the lowest common ancestor for each cluster using the [prot.accession2taxid.FULL files](https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) (12Gb in March 2023) and the [taxdump files](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/).
 The final output includes the representative sequences in FASTA format, a TSV file that reports cluster representatives and members, and an SQLite DB with representative sequence names and their taxonomic lineages (as taxid and as names).
 
+## Outputs & Downloads
+
+**The database and associated taxonomy files are available for download on [OSF](https://osf.io/tejwd/).**
+
+Description of output files:
+* `nr_rep_seq.fasta.gz` (59GB): FASTA file of representative sequences produced by MMseqs2 `easy-linclust`.
+* `nr_cluster.tsv` (13.2GB): TSV file documenting cluster membership. The first column records the representative sequence identifier, while the second column records the sequence identifiers for member sequences of the cluster.
+* `nr_cluster_taxid_formatted_final.tsv.gz` (1.4GB): TSV file recording, for each cluster, the representative sequence, the lowest common ancestor (LCA) taxonomy ID, the named lineage of the LCA, and the taxonomy ID lineage of the LCA. A snippet of the file is shown below.
+```
+rep	taxid	lca_taxid	lca_lineage_named	lca_lineage_taxid
+0310191A	2517390	2517390	Eukaryota;Metazoa;Chordata;Amphibia;Anura;Hyperoliidae;Kassina;Kassina cochranae;unclassified Kassina cochranae subspecies/strain	2759;33208;7711;8292;8342;8412;8413;2517390;
+0311203A	9031	9031	Eukaryota;Metazoa;Chordata;Aves;Galliformes;Phasianidae;Gallus;Gallus gallus;unclassified Gallus gallus subspecies/strain	2759;33208;7711;8782;8976;9005;9030;9031;
+0311203B	9940	9940	Eukaryota;Metazoa;Chordata;Mammalia;Artiodactyla;Bovidae;Ovis;Ovis aries;unclassified Ovis aries subspecies/strain	2759;33208;7711;40674;91561;9895;9935;9940;
+```
+* `nr_cluster_taxid_formatted_final.sqlite` (66GB): An SQLite database of the `nr_cluster_taxid_formatted_final.tsv.gz` TSV file. The name of the database was recorded as `nr_cluster_taxid_table` (see [this script](./scripts/make_sqlite_db.R)). For an example of how to use the database to assign lineages to BLAST results, see [this script](https://github.com/Arcadia-Science/2023-rehgt/blob/main/bin/blastp_add_taxonomy_info.R).
+
 ## Getting started with this repository
 
 This repository uses snakemake to run the pipeline and conda to manage software environments and installations.
@@ -37,3 +53,8 @@ snakemake -j 1 --use-conda --rerun-incomplete -k -n
 ```
 
 where `-j` sets the maximum number of cores (parallel jobs) to use, `--use-conda` has snakemake manage software environments with conda, `--rerun-incomplete` re-runs jobs whose output is incomplete, `-k` continues with independent jobs when one job fails, and `-n` performs a dry run without executing anything. Remove `-n` to run the pipeline.
+
+## Citation & contributing
+
+You can read more about this project in [this pub](https://doi.org/10.57844/arcadia-w8xt-pc81).
+See [this guide](https://github.com/Arcadia-Science/arcadia-software-handbook/blob/main/guides-and-standards/guide-credit-for-contributions.md) for how we recognize feedback and contributions to our code.
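
The linked example for the SQLite output is an R script. For readers working in Python, here is a minimal sketch of the same kind of lookup using only the standard library. It is not the authors' method. It assumes that `nr_cluster_taxid_table` is the name of the table inside the SQLite file (the README calls it the name of the database) and that the table's columns match the TSV header in the snippet (`rep`, `taxid`, `lca_taxid`, `lca_lineage_named`, `lca_lineage_taxid`). The helpers `split_lineage`, `lookup_lineages`, and `build_demo_db` are hypothetical names introduced for illustration, and the demo database is a small in-memory stand-in filled with two rows copied from the snippet.

```python
import sqlite3

# Assumption: the README gives `nr_cluster_taxid_table`; treating it as the
# table name and reusing the TSV header as column names is a guess.
TABLE = "nr_cluster_taxid_table"


def split_lineage(lineage):
    """Split a ';'-delimited lineage string, dropping the empty trailing field."""
    return [part for part in lineage.split(";") if part]


def lookup_lineages(conn, rep_ids):
    """Map each representative ID found in the table to (lca_taxid, named lineage)."""
    rep_ids = list(rep_ids)
    if not rep_ids:
        return {}
    # One "?" placeholder per ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in rep_ids)
    query = (
        f"SELECT rep, lca_taxid, lca_lineage_named FROM {TABLE} "
        f"WHERE rep IN ({placeholders})"
    )
    return {rep: (taxid, named) for rep, taxid, named in conn.execute(query, rep_ids)}


def build_demo_db():
    """Create a tiny in-memory stand-in holding two rows from the README snippet."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} (rep TEXT, taxid INTEGER, lca_taxid INTEGER, "
        "lca_lineage_named TEXT, lca_lineage_taxid TEXT)"
    )
    rows = [
        ("0311203A", 9031, 9031,
         "Eukaryota;Metazoa;Chordata;Aves;Galliformes;Phasianidae;Gallus;"
         "Gallus gallus;unclassified Gallus gallus subspecies/strain",
         "2759;33208;7711;8782;8976;9005;9030;9031;"),
        ("0311203B", 9940, 9940,
         "Eukaryota;Metazoa;Chordata;Mammalia;Artiodactyla;Bovidae;Ovis;"
         "Ovis aries;unclassified Ovis aries subspecies/strain",
         "2759;33208;7711;40674;91561;9895;9935;9940;"),
    ]
    conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    # IDs would normally come from the subject column of a BLAST result table.
    for rep, (taxid, named) in lookup_lineages(demo, ["0311203A", "0311203B"]).items():
        print(rep, taxid, split_lineage(named)[:3])
```

To run this against the real download, replace `build_demo_db()` with `sqlite3.connect("nr_cluster_taxid_formatted_final.sqlite")`, after confirming the actual table and column names (for example with `SELECT name FROM sqlite_master`). For very long ID lists, querying in batches avoids SQLite's limit on the number of bound parameters.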