Merge pull request #7 from Arcadia-Science/ter/readme
Add database download links and pub details to readme
taylorreiter authored Jun 2, 2023
2 parents cdd2df9 + 1a49dc8 commit 8e64581
Showing 1 changed file with 21 additions and 0 deletions.
21 changes: 21 additions & 0 deletions README.md
@@ -9,6 +9,22 @@ It starts by downloading the [NCBI nr database in FASTA format](https://ftp.ncbi
After clustering this file at 90% length and 90% identity, it then determines the lowest common ancestor for each cluster using the [prot.accession2taxid.FULL files](https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) (12GB in March 2023) and the [taxdump files](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/).
The final output includes the representative sequences in FASTA format, a TSV file that reports cluster representatives and members, and an SQLite DB with representative sequence names and their taxonomic lineages (as taxid and as names).
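
As a rough illustration of the lowest common ancestor step, the sketch below takes the root-to-leaf taxid lineages of a cluster's members and keeps their shared prefix. This is one common way to compute an LCA and is an assumption, not necessarily how the pipeline implements it. The function name `lowest_common_ancestor` and the member names `seqA` and `seqB` are hypothetical; the two lineages are copied from the example rows in this README.

```python
from typing import List, Optional, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[int]]) -> List[int]:
    """Return the shared root-to-node prefix of several taxid lineages."""
    shared: List[int] = []
    # zip(*lineages) walks all lineages rank by rank, starting at the root.
    for taxids_at_rank in zip(*lineages):
        if len(set(taxids_at_rank)) != 1:
            break  # the lineages diverge here, so the previous rank is the LCA
        shared.append(taxids_at_rank[0])
    return shared


# Hypothetical two-member cluster (lineages for Gallus gallus and Ovis aries).
members = {
    "seqA": [2759, 33208, 7711, 8782, 8976, 9005, 9030, 9031],
    "seqB": [2759, 33208, 7711, 40674, 91561, 9895, 9935, 9940],
}

lca_lineage = lowest_common_ancestor(list(members.values()))
lca_taxid: Optional[int] = lca_lineage[-1] if lca_lineage else None
print(lca_taxid, ";".join(str(t) for t in lca_lineage))
```

Here the two lineages agree down to taxid 7711 (Chordata) and diverge afterwards, so the cluster would be assigned 7711 under this scheme.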

## Outputs & Downloads

**The database and associated taxonomy files are available for download on [OSF](https://osf.io/tejwd/).**

Description of output files:
* `nr_rep_seq.fasta.gz` (59GB): FASTA file of representative sequences output by MMseqs2 `easy-linclust`.
* `nr_cluster.tsv` (13.2GB): TSV file documenting cluster membership. The first column records the representative sequence identifier, while the second column records the sequence identifiers for member sequences of the cluster.
* `nr_cluster_taxid_formatted_final.tsv.gz` (1.4GB): TSV file recording the representative sequence for a cluster, the lowest common ancestor taxonomy ID, the named lineage of the lowest common ancestor, and the taxonomy ID lineage of the lowest common ancestor. A snippet of the file is presented below.
```
rep	taxid	lca_taxid	lca_lineage_named	lca_lineage_taxid
0310191A	2517390	2517390	Eukaryota;Metazoa;Chordata;Amphibia;Anura;Hyperoliidae;Kassina;Kassina cochranae;unclassified Kassina cochranae subspecies/strain	2759;33208;7711;8292;8342;8412;8413;2517390;
0311203A	9031	9031	Eukaryota;Metazoa;Chordata;Aves;Galliformes;Phasianidae;Gallus;Gallus gallus;unclassified Gallus gallus subspecies/strain	2759;33208;7711;8782;8976;9005;9030;9031;
0311203B	9940	9940	Eukaryota;Metazoa;Chordata;Mammalia;Artiodactyla;Bovidae;Ovis;Ovis aries;unclassified Ovis aries subspecies/strain	2759;33208;7711;40674;91561;9895;9935;9940;
```
* `nr_cluster_taxid_formatted_final.sqlite` (66GB): An SQLite database of the `nr_cluster_taxid_formatted_final.tsv.gz` TSV file. The name of the database was recorded as `nr_cluster_taxid_table` (see [this script](./scripts/make_sqlite_db.R)). For an example of how to use the database to assign lineages to BLAST results, see [this script](https://github.com/Arcadia-Science/2023-rehgt/blob/main/bin/blastp_add_taxonomy_info.R).
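
The linked example script is written in R. For readers working in Python, the following is a minimal sketch of the same kind of lookup using the standard-library `sqlite3` module. It assumes the table is named `nr_cluster_taxid_table` and that its columns match the TSV header shown above; neither has been checked against the published file. To keep the example self-contained it builds a small in-memory database from two of the rows in the snippet. The helper `lineage_for` is hypothetical.

```python
import sqlite3
from typing import Dict, Iterable, Tuple

# In-memory stand-in for the real file. To query the downloaded database,
# connect to "nr_cluster_taxid_formatted_final.sqlite" instead and skip the
# CREATE/INSERT statements below.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE nr_cluster_taxid_table (
        rep TEXT,
        taxid INTEGER,
        lca_taxid INTEGER,
        lca_lineage_named TEXT,
        lca_lineage_taxid TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO nr_cluster_taxid_table VALUES (?, ?, ?, ?, ?)",
    [
        ("0311203A", 9031, 9031,
         "Eukaryota;Metazoa;Chordata;Aves;Galliformes;Phasianidae;Gallus;"
         "Gallus gallus;unclassified Gallus gallus subspecies/strain",
         "2759;33208;7711;8782;8976;9005;9030;9031;"),
        ("0311203B", 9940, 9940,
         "Eukaryota;Metazoa;Chordata;Mammalia;Artiodactyla;Bovidae;Ovis;"
         "Ovis aries;unclassified Ovis aries subspecies/strain",
         "2759;33208;7711;40674;91561;9895;9935;9940;"),
    ],
)


def lineage_for(db: sqlite3.Connection, rep_ids: Iterable[str]) -> Dict[str, Tuple[int, str]]:
    """Map representative sequence IDs (e.g. BLAST hit names) to (lca_taxid, named lineage)."""
    ids = list(rep_ids)
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)  # parameterized IN (...) list
    query = (
        "SELECT rep, lca_taxid, lca_lineage_named "
        f"FROM nr_cluster_taxid_table WHERE rep IN ({placeholders})"
    )
    return {rep: (lca, named) for rep, lca, named in db.execute(query, ids)}


hits = lineage_for(conn, ["0311203A", "not_in_db"])
for rep, (lca, named) in hits.items():
    print(rep, lca, named.split(";")[0])
```

IDs that are absent from the table are simply missing from the returned dictionary, which lets the caller decide how to treat unmatched hits. Given the size of the real database (66GB), an index on `rep` would likely be needed for lookups to be fast; whether the published file includes one is not stated here.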

## Getting started with this repository

This repository uses snakemake to run the pipeline and conda to manage software environments and installations.
@@ -37,3 +53,8 @@ snakemake -j 1 --use-conda --rerun-incomplete -k -n
```

where `-j` sets the maximum number of CPU cores (parallel jobs) to use, `--use-conda` uses conda to manage software environments, `--rerun-incomplete` re-runs any jobs whose output files are incomplete, `-k` tells the pipeline to continue with independent steps when one step fails, and `-n` performs a dry run that lists the planned jobs without executing them. Remove `-n` to run the pipeline.

## Citation & contributing

You can read more about this project in [this pub](https://doi.org/10.57844/arcadia-w8xt-pc81).
See [this guide](https://github.com/Arcadia-Science/arcadia-software-handbook/blob/main/guides-and-standards/guide-credit-for-contributions.md) for how we recognize feedback and contributions on our code.
