3: Assessing Taxonomy

Overview
Creating a Taxon List
Getting a Taxon List Directly from Fasta Files
- Fasta_Get_Taxa.py
Identifying Taxonomy Conflicts
- Taxa_Assessment.py
Resolving Taxonomy Conflicts
- Rename_Merge.py

Overview

SuperCRUNCH allows any taxonomy to be used in analyses. Ideally, this user-supplied taxonomy will match the taxon names present in the sequence records. However, for many reasons the user-supplied taxonomy could also strongly conflict with the sequence record labels. Given that taxon names are required for most initial steps in SuperCRUNCH, it is important to understand how compatible the user-supplied taxonomy is with the sequence records. This section deals with how to create a taxon list from external sources, obtain taxon names directly from sequence records, and how to identify and resolve taxonomy conflicts.

Creating a Taxon List

SuperCRUNCH requires a list of taxon names that is used for filtering sequences. One option for obtaining a set of taxon names is the NCBI Taxonomy Browser, which is a general use database. In many cases there are specific databases dedicated to major organismal groups, for example the Reptile Database, AmphibiaWeb, and Amphibian Species of the World, which usually contain up-to-date taxonomies in a downloadable format.

The required taxon list is a simple text file that contains one taxon name per line. The file can contain a mix of species (two-part) and subspecies (three-part) names, and the components of each name should be separated by a space (rather than underscores). Below are some example contents from suitable taxon list files.

Partial contents of a list containing species (two-part) names:

Lygodactylus regulus
Lygodactylus rex
Lygodactylus roavolana
Lygodactylus scheffleri
Lygodactylus scorteccii
Lygodactylus somalicus

Partial contents of a list containing species (two-part) and subspecies (three-part) names:

Leycesteria crocothyrsos
Leycesteria formosa
Linnaea borealis americana
Linnaea borealis borealis
Linnaea borealis longiflora

Note that names are not case-sensitive for the various search steps. All names supplied in the taxon list are converted to uppercase for these steps. However, it is good practice to write the names as above.

Note that the three-part name does not require an actual subspecies name (e.g., Hyperolius balfouri viridistriatus). It could also consist of a numerical code in the third position (e.g., Hyperolius balfouri CAS256943). Both types are considered valid three-part names, and using the subspecies options in various steps would allow you to search for either type. This is particularly useful for local sequences. The following is also a perfectly valid taxon list:

Afrixalus dorsalis MVZ244954
Afrixalus dorsalis UWBM5580
Afrixalus dorsalis ZMB56734
Arthroleptis poecilonotus
Cardioglossa elegans
Cardioglossa leucomystax
Kassina arboricola UWBM5746
Kassina arboricola UWBM5747
Kassina cassinoides JP0056
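A list in this format is easy to sanity-check before it is passed to any module. The following is a minimal sketch under stated assumptions: `check_taxon_list` is a hypothetical helper, not part of SuperCRUNCH, and it only checks the two rules described above (two or three space-separated parts, no underscores).

```python
def check_taxon_list(lines):
    """Return (line_number, text, problem) tuples for suspicious entries.

    Hypothetical helper for illustration; not part of SuperCRUNCH.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        name = raw.strip()
        if not name:
            continue  # blank lines are skipped, not reported
        if "_" in name:
            problems.append((number, name, "contains underscore"))
            continue
        parts = name.split()
        if len(parts) not in (2, 3):
            problems.append((number, name, "expected 2 or 3 parts"))
    return problems


if __name__ == "__main__":
    example = [
        "Afrixalus dorsalis MVZ244954",
        "Arthroleptis poecilonotus",
        "Kassina_arboricola",
        "Cardioglossa",
    ]
    for number, name, problem in check_taxon_list(example):
        print(f"line {number}: {name!r} -> {problem}")
```

To check a real file, the same function could be called with `open("Taxa_List.txt")` as its argument, since it only iterates over lines.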

SuperCRUNCH offers the option to exclude subspecies from searches. If this option is invoked, a three-part name is reduced to a two-part name during searches. Because of this option, a taxon list can contain a mix of species and subspecies names even if subspecies are not desired in the analysis; they can simply be ignored. The effects of this option with different types of taxon lists are explained in the Identifying Taxonomy Conflicts section.

Getting a Taxon List Directly from Fasta Files

In some cases it may be desirable to obtain a list of taxon names directly from a fasta file of sequence records. This option is available using the Fasta_Get_Taxa.py module, which is described below. This is most useful for files that contain a limited number of taxa, because the names will need to be carefully reviewed for potential errors.

Fasta_Get_Taxa.py

The goal of this module is to search through all fasta files in a directory and attempt to construct all possible species (two-part) and subspecies (three-part) names directly from the description lines. There are several filters in place to try to prevent spurious names from being produced, but this is an inherently imperfect process. Although the filters should work relatively well for constructing species (two-part) names, the subspecies (three-part) names are a much more difficult problem. The resulting subspecies list will likely contain some errors. In addition, if multiple names representing the same taxon are present (synonymies), they will all be written to the taxon list. For these reasons, the resulting taxon lists should be carefully inspected before using them for any other purpose. After inspection, names resulting from this module can be used to create a taxon list for downstream steps.

In some cases (such as population level data) it may be desirable to obtain a three-part name such as Genus species samplecode, rather than Genus species subspecies, where the samplecode component may be a museum code or other numerical identifier. This can be obtained by using the optional --numerical flag. When the --numerical flag is not used, a subspecies label will automatically be excluded if the third component contains any numbers. That is, a three-part name will only be constructed if each component is strictly alphabetical.
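The effect of the --numerical flag on name construction can be sketched as follows. This is a simplified illustration only: `build_names` is a hypothetical function, it assumes the first token of the description line is the accession number, and it does not reproduce the additional filters the real module applies.

```python
def build_names(description, numerical=False):
    """Return (two_part, three_part) names from a fasta description line.

    Either element may be None. Hypothetical sketch; the actual
    Fasta_Get_Taxa.py filters are more extensive.
    """
    # Drop the leading '>' and the accession number (first token).
    words = description.lstrip(">").split()[1:]
    if len(words) < 2 or not (words[0].isalpha() and words[1].isalpha()):
        return None, None
    two_part = f"{words[0]} {words[1]}"
    three_part = None
    if len(words) >= 3:
        third = words[2]
        # Without --numerical, the third component must be strictly alphabetical.
        if third.isalpha() or numerical:
            three_part = f"{two_part} {third}"
    return two_part, three_part


if __name__ == "__main__":
    line = ">CUMV15205.KIAA2013 Hyperolius ocellatus CUMV15205 KIAA2013 gene, partial cds"
    print(build_names(line))
    print(build_names(line, numerical=True))
```

Note that a purely alphabetical third word such as "voucher" would still pass this check, which is one reason the three-part output needs careful inspection.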

Basic Usage:

python Fasta_Get_Taxa.py -i <directory with fasta file(s)> -o <output directory>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory with fasta file(s). Input fasta files should be labeled as 'NAME.fasta' or 'NAME.fa'. The NAME portion should not contain any periods or spaces, but can contain underscores.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`--numerical`

Optional: Allows the third part of a name to also contain numbers and special characters, rather than just letters. Useful for samples of the same species with museum/field codes or other unique identifiers immediately following the species label.

Example Usage:

python Fasta_Get_Taxa.py -i bin/FastaSet/ -o bin/FastaSet/Output/

The above command will find all possible two-part and three-part names present across all the fasta files present in the bin/FastaSet/ directory. Outputs are written to bin/FastaSet/Output/.

python Fasta_Get_Taxa.py -i bin/FastaSet/ -o bin/FastaSet/Output/ --numerical

The above command will find all possible two-part and three-part names present across all the fasta files present in the bin/FastaSet/ directory. The three-part names will be constructed using third components that can also contain numbers. Outputs are written to bin/FastaSet/Output/.

Outputs:

Two output files are created in the specified output directory:

File Species_Names.txt: List of unique two-part names constructed from record descriptions. If records are labeled correctly this should correspond to the genus and species. This file should be inspected.
File Subspecies_Names.txt: List of unique three-part names constructed from record descriptions. If records actually contain subspecies labels they will be captured in this list. However, if the records only contain typical species (two-part) names, then spurious names may be produced. This file should be VERY carefully inspected. If the --numerical flag is used, the third component may also be a museum or field code, or another alphanumeric identifier.

Identifying Taxonomy Conflicts

Given the potential for discordance between the supplied taxonomy (the taxon list) and the actual sequence records, it is very important to assess how well the two match. The Taxa_Assessment.py module performs this task. It will identify all the taxon names in the sequence records that match names supplied in the taxon list. It will also identify all the taxon names in the sequence records that did not match a name in the supplied taxon list. This allows for major conflicts in taxonomy to be identified, which can be corrected downstream.

Taxa_Assessment.py

The goal of this module is to search through the records of a fasta file of nucleotide sequences (GenBank and/or local sequences) and identify valid taxon names. Names are considered valid if the taxon name present in the description line of a sequence record can be matched to a taxon name in the user-supplied taxon list. The taxon list can contain a mix of species (two-part name) and subspecies (three-part name) labels. Note that 'subspecies' refers to a three-part name, where the third part can be an actual subspecies label or a unique identifier (such as a field/museum code, or an alphanumeric code).

The decision to include or exclude subspecies labels is up to the user, and can be specified using the --no_subspecies flag. For a thorough explanation of how taxonomy searches are conducted and how this flag affects this step (and others), please see the section Taxonomy searches with and without subspecies below. For all searches both the sequence description lines and the supplied taxon names are converted to uppercase, so the list of taxon names is NOT case-sensitive.

Basic Usage:

python Taxa_Assessment.py -i <fasta file> -t <taxon file> -o <output directory>

Argument Explanations:

`-i <path-to-file>` or `--input <path-to-file>`

Required: The full path to a fasta file of sequence data.

`-t <path-to-file>` or `--taxa <path-to-file>`

Required: The full path to a text file containing all taxon names to cross-reference in the fasta file.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`--no_subspecies`

Optional: Ignore any subspecies labels in the taxon list and also during record searches (only search two-part names).

`--sql_db <full-path-to-sql-database>`

Optional: The full path to the sql database to use for searches. Assumes the database was created with this module for the input fasta file being used.

Example Usage:

python Taxa_Assessment.py -i bin/Analysis/Start_Seqs.fasta -t bin/Analysis/Taxa_List.txt -o bin/Analysis/Output/

The above command will search sequence records in Start_Seqs.fasta to find taxon names present in Taxa_List.txt. The output is written to the specified directory.

python Taxa_Assessment.py -i bin/Analysis/Start_Seqs.fasta -t bin/Analysis/Taxa_List.txt -o bin/Analysis/Output/ --no_subspecies

The above command will search sequence records in Start_Seqs.fasta to find taxon names present in Taxa_List.txt. The output is written to the specified directory. This search will exclude the subspecies component of taxon labels in both the taxon names file and the fasta file.

Outputs

Two output fasta files are written to the specified output directory:

File Matched_Taxa.fasta - A fasta file containing only records with matched taxon names.
File Unmatched_Taxa.fasta - A fasta file containing only records with invalid taxon names (those that couldn't be matched to a name in the taxon list).

In addition, the accession numbers of the records in each file are written to the following files:

File Matched_Records_Accession_Numbers.log
File Unmatched_Records_Accession_Numbers.log

Two log files are written which contain lists of the matched or unmatched names:

File Matched_Taxon_Names.log
File Unmatched_Taxon_Names.log

The Unmatched_Taxon_Names.log file can be used to create the replacement names file needed to relabel taxa with the Rename_Merge.py module.

The searches are conducted using SQL (via sqlite3), and a database is constructed from the sequence records. This file is called Taxa-Assessment.sql.db, and is also written to the output directory. If this module needs to be re-run using the same fasta file, this database can be used instead of building it again. The path to this database can be specified using the --sql_db flag.

Taxonomy searches using other types of names

The taxon names do not need to be strictly scientific, as in Genus species or Genus species subspecies. They can really be any combination of a two-part name or a three-part name. Therefore, you can run SuperCRUNCH for a set of sequences that are labeled as follows:

>XX4534 Sample L4657 exon-capture locus RAG-1
...
>XX4535 Sample L4657 exon-capture locus POMC
...
>XX4536 Sample L4658 exon-capture locus RAG-1
...
>XX4537 Sample L4658 exon-capture locus POMC
...

The taxon list in this case would contain the names used in the records above (Sample L4657 and Sample L4658), rather than scientific names.

Furthermore, if your sequences contain a species name and some type of identifier code following it, you can include these components in a three-part name. For example, take the following locally-generated sequences:

>CUMV15205.KIAA2013 Hyperolius ocellatus CUMV15205 KIAA2013 gene, partial cds
...
>FMNH274349.KIAA2013 Hyperolius reesi FMNH274349 KIAA2013 gene, partial cds
...
>FMNH274348.KIAA2013 Hyperolius reesi FMNH274348 KIAA2013 gene, partial cds 
...
>CUMV14908.KIAA2013 Hyperolius tuberculatus CUMV14908 KIAA2013 gene, partial cds
...

If the taxon list provided included the following:

Hyperolius ocellatus
Hyperolius reesi
Hyperolius tuberculatus

Then the records would be detected as typical species:

>CUMV15205.KIAA2013 Hyperolius ocellatus 
...
>FMNH274349.KIAA2013 Hyperolius reesi 
...
>FMNH274348.KIAA2013 Hyperolius reesi 
...
>CUMV14908.KIAA2013 Hyperolius tuberculatus 
...

However, if the taxon list provided included the following:

Hyperolius ocellatus CUMV15205
Hyperolius reesi FMNH274349
Hyperolius reesi FMNH274348
Hyperolius tuberculatus CUMV14908

Then the records would be detected as a species + identifier three-part name:

>CUMV15205.KIAA2013 Hyperolius ocellatus CUMV15205
...
>FMNH274349.KIAA2013 Hyperolius reesi FMNH274349
...
>FMNH274348.KIAA2013 Hyperolius reesi FMNH274348
...
>CUMV14908.KIAA2013 Hyperolius tuberculatus CUMV14908
...

Why is this important? In the first example, Hyperolius reesi contains two sequences for this gene. For downstream steps, these sequences are considered part of the same taxon, and one would ultimately be selected (for a final supermatrix). If the taxon name + identifier is used instead, Hyperolius reesi FMNH274349 and Hyperolius reesi FMNH274348 are considered to be two distinct 'taxa'. It will ensure that only sequences coming from that specific sample are labeled as such and selected during downstream steps. In other words, this particular usage can be used as an alternative to the voucher feature to create a vouchered dataset. It is most useful for locally-generated datasets, which are more likely to be labeled in this fashion.

Taxonomy searches with and without subspecies

To understand how the --no_subspecies flag can impact analyses, it is important to demonstrate how the taxon list is being parsed. Regardless of the type of names present in this file (species or subspecies), two lists are constructed. One is filled with species (two-part) names, and the other with subspecies (three-part) names.

Given the following taxon list:

Leycesteria crocothyrsos
Leycesteria formosa
Linnaea borealis americana
Linnaea borealis borealis
Linnaea borealis longiflora

The resulting parsed lists are:

Species:

Leycesteria crocothyrsos
Leycesteria formosa
Linnaea borealis

Subspecies:

Linnaea borealis americana
Linnaea borealis borealis
Linnaea borealis longiflora

Notice that even though there wasn't a species (two-part) name provided for Linnaea borealis, it was automatically generated based on the subspecies labels. This is true regardless of whether the --no_subspecies flag is included or not.
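The parsing behavior described above can be sketched in a few lines. This is an illustration under stated assumptions, not the module's actual code: `parse_taxon_list` is a hypothetical name, and names are stored in uppercase because the text states that searches are case-insensitive.

```python
def parse_taxon_list(names):
    """Split taxon names into sorted species and subspecies lists (uppercase).

    Hypothetical sketch of the parsing step; a two-part name is always
    derived from each three-part name.
    """
    species, subspecies = set(), set()
    for raw in names:
        parts = raw.strip().upper().split()
        if len(parts) == 2:
            species.add(" ".join(parts))
        elif len(parts) == 3:
            subspecies.add(" ".join(parts))
            # The species name is generated even if it was not listed itself.
            species.add(" ".join(parts[:2]))
    return sorted(species), sorted(subspecies)


if __name__ == "__main__":
    taxa = [
        "Leycesteria crocothyrsos",
        "Leycesteria formosa",
        "Linnaea borealis americana",
        "Linnaea borealis borealis",
        "Linnaea borealis longiflora",
    ]
    species, subspecies = parse_taxon_list(taxa)
    print(species)
    print(subspecies)
```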

How does the --no_subspecies flag impact searches?

I will use the above taxon list and the following example records to illustrate:

>FJ745393.1 Leycesteria crocothyrsos voucher N. Pyck 1992-1691 maturase K (matK) gene, partial cds; chloroplast
>KC474956.1 Linnaea borealis americana voucher Bennett_06-432_CAN maturase K (matK) gene, partial cds; chloroplast

By default, each sequence record is always parsed to construct a species (two-part) and subspecies (three-part) name. This would produce the following results:

Leycesteria crocothyrsos         - species
Leycesteria crocothyrsos voucher - subspecies

Linnaea borealis           - species
Linnaea borealis americana - subspecies

As you can see above, every subspecies label contains a species label. This allows a series of checks to be performed. If the --no_subspecies flag is not used (i.e., subspecies are allowed), the following checks are performed:

Is the reconstructed species name in the species list?
1. If no, the record is ignored.
2. If yes, the subspecies is examined.
Is the reconstructed subspecies name in the subspecies list?
1. If no, the species name will be used.
2. If yes, the subspecies name will be used instead.

In the example above, Leycesteria crocothyrsos is in the species list, but Leycesteria crocothyrsos voucher is an obviously incorrect name and is absent from the subspecies list. In this case, the species name Leycesteria crocothyrsos will be used for that record. In the other example, Linnaea borealis is in the species list, but Linnaea borealis americana is also present in the subspecies list, so Linnaea borealis americana will be used for that record.

If the --no_subspecies flag is used, the following checks are performed:

Is the reconstructed species name in the species list?
1. If no, the record is ignored.
2. If yes, the species name is used.

In the example above, Leycesteria crocothyrsos and Linnaea borealis would be the names used. Essentially, the --no_subspecies flag eliminates all the subspecies, and they are all lumped under the relevant species label.
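Both sets of checks can be expressed as one small function. The sketch below is a hedged illustration of the decision logic described above, not the module's SQL-based implementation; `assign_name` is a hypothetical name, and it assumes the first token of each description line is the accession number.

```python
def assign_name(description, species, subspecies, no_subspecies=False):
    """Return the taxon name used for a record, or None if it is ignored.

    `species` and `subspecies` are sets of uppercase names.
    Hypothetical sketch of the checks, not the actual search code.
    """
    words = description.lstrip(">").upper().split()[1:]  # skip the accession
    if len(words) < 2:
        return None
    sp_name = " ".join(words[:2])
    if sp_name not in species:
        return None  # species name not in the list: record is ignored
    if no_subspecies or len(words) < 3:
        return sp_name
    ssp_name = " ".join(words[:3])
    # Use the subspecies name only if it is a listed three-part name.
    return ssp_name if ssp_name in subspecies else sp_name


if __name__ == "__main__":
    species = {"LEYCESTERIA CROCOTHYRSOS", "LEYCESTERIA FORMOSA", "LINNAEA BOREALIS"}
    subspecies = {
        "LINNAEA BOREALIS AMERICANA",
        "LINNAEA BOREALIS BOREALIS",
        "LINNAEA BOREALIS LONGIFLORA",
    }
    records = [
        ">FJ745393.1 Leycesteria crocothyrsos voucher N. Pyck 1992-1691 maturase K (matK) gene",
        ">KC474956.1 Linnaea borealis americana voucher Bennett_06-432_CAN maturase K (matK) gene",
    ]
    for record in records:
        print(assign_name(record, species, subspecies),
              "|", assign_name(record, species, subspecies, no_subspecies=True))
```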

Let's use another example.

Here, the taxon list file contains only species (two-part) names:

Draco beccarii
Draco biaro
Draco bimaculatus
Draco blanfordii
Draco boschmai

If only species names are present, then the resulting species list will be populated and the resulting subspecies list will be empty:

Species:

Draco beccarii
Draco biaro
Draco bimaculatus
Draco blanfordii
Draco boschmai

Subspecies:

In this example the --no_subspecies flag will have no effect on the analysis. That is, regardless of whether the --no_subspecies flag is used or not, there aren't any subspecies to reference and the only possible outcome is to find species names.

There are also some special cases depending on combinations of the taxon list and sequence set.

Given the following taxon list:

Linnaea borealis americana
Linnaea borealis borealis
Linnaea borealis longiflora

And the following record description lines:

>KJ593010.1 Linnaea borealis voucher WAB_0132469163 maturase K (matK) gene, partial cds; chloroplast
>KC474956.1 Linnaea borealis americana voucher Bennett_06-432_CAN maturase K (matK) gene, partial cds; chloroplast
>KP297496.1 Linnaea borealis borealis isolate BOP012344 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
>KP297498.1 Linnaea borealis longiflora isolate BOP022790 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence

The following taxa would be detected and included from each record if the --no_subspecies flag is not used (i.e., subspecies are allowed):

KJ593010.1 -> Linnaea borealis
KC474956.1 -> Linnaea borealis americana
KP297496.1 -> Linnaea borealis borealis
KP297498.1 -> Linnaea borealis longiflora

Any record of Linnaea borealis missing a valid subspecies label will be lumped in with all other Linnaea borealis, whereas those containing valid subspecies labels will be assigned to the correct subspecies.

The following taxa would be detected and included from each record if the --no_subspecies flag is included:

KJ593010.1 -> Linnaea borealis
KC474956.1 -> Linnaea borealis
KP297496.1 -> Linnaea borealis
KP297498.1 -> Linnaea borealis

This effectively groups all the subspecies under the species name Linnaea borealis.

To summarize:

- If the taxon names list contains only species (two-part names), then searches for subspecies labels cannot occur, and the presence or absence of the --no_subspecies flag has no effect.
- If the taxon names list contains a mix of species (two-part) and subspecies (three-part) labels, then the --no_subspecies flag can substantially change the outcome.
- Using the --no_subspecies flag eliminates the third component of the subspecies label, which essentially converts it into a species (two-part) label. This is expected to result in fewer taxa recovered. Depending on your conceptual view of subspecies, you may find this to be an awesome choice, or you may find it to be a terrible choice. The choice is yours, and yours alone.
- Omitting the --no_subspecies flag will allow searches for both species and subspecies, and is expected to produce a greater number of taxa. However, this will only occur if valid subspecies labels are actually present in the sequence records.
- There is no downside to having subspecies labels in the taxon list file, because they can effectively be ignored while capturing all relevant species labels. You can always try both types of analyses.

Resolving Taxonomy Conflicts

The Taxa_Assessment.py module determines which names in the sequence records match the user-supplied taxon list, and which names do not. Often, the names that do not match are the result of spelling errors, changes in taxonomy (for example, being assigned to a new genus), or other correctable errors. These types of names can be relabeled using a new name. This allows them to be matched to a valid taxon name, and to pass various filtering steps in SuperCRUNCH. To relabel invalid taxon names in sequence records, the Rename_Merge.py module can be used.

Rename_Merge.py

This module can be used to relabel taxon names that did not match the user-supplied taxon list in the Taxa_Assessment.py step. These records will have been written to a file called Unmatched_Taxa.fasta, and the unmatched names will have been written to a file called Unmatched_Taxon_Names.log. Currently, two-part names (e.g., Genus species) and three-part names (e.g., Genus species subspecies) can be replaced using a substitute name of any length. That is, a species name can be replaced using a different species name or a subspecies name. The same is true for replacing subspecies labels. The replacement names file is a tab-delimited text file with two columns. The first column contains the unmatched name to replace, and the second column contains the replacement name.

All successfully relabeled records are written to a fasta file called Renamed.fasta. These records can also be joined with records from an additional fasta file using the -m flag. This is ideal for joining updated records with those from Matched_Taxa.fasta, and will produce an output fasta file called Merged.fasta.

Finally, a summary of the number of records relabeled for each name provided is written as Renaming_Summary.txt. All output files are written to the specified output directory (-o).

Basic Usage:

python Rename_Merge.py -i <fasta file> -r <taxon renaming file> -o <output directory>

Argument Explanations:

`-i <path-to-file>` or `--input <path-to-file>`

Required: The full path to a fasta file to replace names inside. If you have used the Taxa_Assessment module, this should be the file Unmatched_Taxa.fasta.

`-r <path-to-file>` or `--replace <path-to-file>`

Required: The full path to a text file containing the replacement name information.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`-m <full-path-to-file>` or `--merge <full-path-to-file>`

Optional: The full path to a fasta file to merge with the updated records. If you have used the Taxa_Assessment module, this should be the file Matched_Taxa.fasta.

`--sql_db <full-path-to-sql-database>`

Optional: The full path to the sql database to use for searches. Assumes the database was created with this module for the input fasta file being used.

`--quiet`

Optional: Show less output while running.

Example Usage:

python Rename_Merge.py -i bin/02-Taxon-assess/Unmatched_Taxa.fasta -r bin/03-Rename/Rename_List.txt -o bin/03-Rename/Output/

The above command will attempt to rename sequence records in Unmatched_Taxa.fasta following the file Rename_List.txt. The output is written to the specified directory.

python Rename_Merge.py -i bin/02-Taxon-assess/Unmatched_Taxa.fasta -r bin/03-Rename/Rename_List.txt -o bin/03-Rename/Output/ -m bin/02-Taxon-assess/Matched_Taxa.fasta --quiet

The above command will attempt to rename sequence records in Unmatched_Taxa.fasta following the file Rename_List.txt, and merge these relabeled records with those in Matched_Taxa.fasta. The output is written to the specified directory.

Outputs:

File Renamed.fasta - A fasta file with sequence records that have been successfully relabeled.
File Renaming_Summary.txt - A tab-delimited text file that summarizes the number of records that were relabeled for each name pair supplied.
File Rename-Merge.sql.db - The SQL database constructed from the input fasta file to perform the search and relabeling steps.
File Merged.fasta - A fasta file of relabeled sequence records and the records of the fasta file specified by the -m flag. This file is created only if the -m flag is used.

Examples of relabeling

In the replacement names file, the first column should be the name to replace, and the second column should be the replacement name. The columns must be separated by a tab character. There should not be any header (column labels) in this file. Any species (two-part) or subspecies (three-part) name can be replaced with either a species (two-part) or subspecies (three-part) name. Here are several examples of the contents of valid replacement names files.

Species names can be replaced with species names:

Chamaeleo hoehneli		Trioceros hoehnelii
Chamaeleo hoehnelii		Trioceros hoehnelii
Chamaeleo jacksonii		Trioceros jacksonii

Subspecies names can be replaced with subspecies names:

Hyperolius parallelus marginatus	Hyperolius viridiflavus marginatus
Hyperolius parallelus parallelus	Hyperolius viridiflavus parallelus
Hyperolius parallelus pyrrhodictyon	Hyperolius viridiflavus pyrrhodictyon

Species labels can also be replaced by subspecies labels, and vice versa:

Hyperolius parallelus marginatus	Hyperolius marginatus
Hyperolius parallelus			Hyperolius parallelus parallelus
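Before running the module, it can help to confirm that a replacement names file really has two tab-separated columns per line. The following is a minimal sketch using Python's standard `csv` module; `read_replacements` is a hypothetical helper, not part of SuperCRUNCH, and tolerating repeated tabs between columns is a design choice of this sketch rather than documented module behavior.

```python
import csv
import io


def read_replacements(handle):
    """Return {old_name: new_name} from a tab-delimited, header-free file.

    Hypothetical helper for checking a replacement names file.
    """
    replacements = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Drop empty fields so that accidental repeated tabs are tolerated.
        fields = [field.strip() for field in row if field.strip()]
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got {len(fields)}: {row}")
        old, new = fields
        replacements[old] = new
    return replacements


if __name__ == "__main__":
    text = (
        "Chamaeleo hoehneli\tTrioceros hoehnelii\n"
        "Hyperolius parallelus marginatus\tHyperolius marginatus\n"
    )
    for old, new in read_replacements(io.StringIO(text)).items():
        print(f"{old} -> {new}")
```

With a real file, the handle would come from `open(path, newline="")` instead of `io.StringIO`.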

To accomplish the name replacing, an SQL database is first constructed from the sequence records (via sqlite3). Then, searches for the 'invalid' names (in the first column) are carried out. If records with the invalid name are found, these records are written to a new fasta file using the updated name (in the second column). In these updated records, a *Relabeled* flag is inserted into the description line. Let's look at an actual example below.

Here are the starting sequence records:

>AB023750.1 Ptyctolaemus phuwuanensis mitochondrial DNA for 12S ribosomal RNA
GCCTTACCGTTAAACAAAAAATGCCAAAGAAGTACGAGCCCACTCACTTTAAACTTTAAGGACCTGGCGGTACTCTACATCACCCTA...

>AB023772.1 Ptyctolaemus phuwuanensis mitochondrial DNA for 16S ribosomal RNA
TGTCCTCCAAATAAGGACCAGTATGAATGGCAACATGAGAAAGAAACTGTCTCTTAAGGCCAGCCAATGAACCTGATCTGCTTGTAA...

Here is the renaming list:

Ptyctolaemus phuwuanensis	Mantheyus phuwuanensis

Here are the relabeled sequence records:

>AB023750.1 Mantheyus phuwuanensis *Relabeled* mitochondrial DNA for 12S ribosomal RNA
GCCTTACCGTTAAACAAAAAATGCCAAAGAAGTACGAGCCCACTCACTTTAAACTTTAAGGACCTGGCGGTACTCTACATCACCCTA...

>AB023772.1 Mantheyus phuwuanensis *Relabeled* mitochondrial DNA for 16S ribosomal RNA
TGTCCTCCAAATAAGGACCAGTATGAATGGCAACATGAGAAAGAAACTGTCTCTTAAGGCCAGCCAATGAACCTGATCTGCTTGTAA...

And this is what the contents of Renaming_Summary.txt would show:

Orig_Name	Replace_Name	Records_Relabeled
Ptyctolaemus phuwuanensis	Mantheyus phuwuanensis	2
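The relabeling of a single description line can be illustrated at the string level. This sketch is not how Rename_Merge.py works internally (the module uses an SQL database); `relabel_record` is a hypothetical function, and trying longer names first is an assumption made here so that a three-part name is not shadowed by its two-part prefix.

```python
def relabel_record(description, replacements):
    """Return the relabeled description line, or None if no name matches.

    `replacements` maps original names to new names. Matching is
    case-insensitive and anchored right after the accession number.
    Hypothetical sketch, not the module's SQL-based implementation.
    """
    accession, _, rest = description.partition(" ")
    rest_upper = rest.upper()
    # Try longer names first so a three-part name wins over its two-part prefix.
    for old in sorted(replacements, key=len, reverse=True):
        prefix = old.upper()
        if rest_upper == prefix or rest_upper.startswith(prefix + " "):
            remainder = rest[len(old):].strip()
            pieces = (accession, replacements[old], "*Relabeled*", remainder)
            return " ".join(piece for piece in pieces if piece)
    return None


if __name__ == "__main__":
    renames = {"Ptyctolaemus phuwuanensis": "Mantheyus phuwuanensis"}
    line = ">AB023750.1 Ptyctolaemus phuwuanensis mitochondrial DNA for 12S ribosomal RNA"
    print(relabel_record(line, renames))
```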

Although this example was very simple, in practice the Rename_Merge.py module can be used to relabel thousands of records with hundreds of replacement names.

Although the replacement step should help rescue many records, there are some labels in the Unmatched_Taxon_Names.log that simply can't be corrected. These include names like the following:

A.alutaceus mitochondrial
A.barbouri mitochondrial
Agama sp.
C.subcristatus tcs1
C.versicolor sox-4
Calotes sp.
Calumma aff.
Calumma cf.
Liolaemus kriegi/ceii
Tsa anolis
Unverified bradypodion
Unverified callisaurus

These records have been labeled improperly, or the identity of the organism is uncertain (sp., cf., aff.). These deserve to be discarded, as we can't identify the actual organism they belong to.
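When the unmatched list is long, a rough first pass can separate names like these from candidates worth relabeling. The heuristic below is only a sketch under stated assumptions: `looks_uncorrectable` and its rules are hypothetical, derived from the example names above, and it will miss cases such as "Tsa anolis" that need manual review.

```python
UNCERTAIN_MARKERS = {"SP.", "CF.", "AFF."}


def looks_uncorrectable(name):
    """Heuristically flag an unmatched name that probably cannot be rescued.

    Hypothetical triage rules for illustration; manual inspection of
    Unmatched_Taxon_Names.log is still required.
    """
    parts = name.split()
    if len(parts) < 2:
        return True
    genus, epithet = parts[0], parts[1]
    if epithet.upper() in UNCERTAIN_MARKERS:
        return True  # identity of the organism is uncertain
    if "." in genus or "/" in name:
        return True  # abbreviated genus or ambiguous double label
    if genus.upper() == "UNVERIFIED":
        return True
    return False


if __name__ == "__main__":
    names = ["Agama sp.", "A.alutaceus mitochondrial", "Chamaeleo hoehneli"]
    for name in names:
        status = "discard?" if looks_uncorrectable(name) else "review for relabeling"
        print(f"{name}: {status}")
```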

In other cases, taxon names may have been updated and now represent synonymies, or may have been accidentally misspelled. Using an organism-specific taxonomy browser, or tools developed specifically for this purpose (taxize, pytaxize, Global Names Resolver, etc.), can often help clarify these situations. Synonymies, misspellings, and name changes represent examples of records that are worth rescuing through relabeling, and using Rename_Merge.py to do so will result in higher quality data.

Last updated: October, 2019

For SuperCRUNCH v1.2
