Skip to content

5: Similarity Searches and Filtering

Daniel Portik edited this page Apr 13, 2020 · 24 revisions

Similarity Searches and Filtering


Overview

There are two main methods in SuperCRUNCH that can be used to filter sequences by similarity in the locus-specific fasta files. The use of each method depends on the type of sequence records you are working with, which can be broadly classified as 'simple' or 'complex' records. The 'simple' records generally contain a single gene region, with minor length variation. This can be due to the fact that the same primers or probes were used to generate the sequences. In contrast, the 'complex' records generally contain multiple gene regions (e.g., long mtDNA fragments or whole mitogenomes) and high sequence length heterogeneity. These records often contain a portion of the target region, but can also contain plenty of non-target sequence. Both of the similarity filtering methods rely on using nucleotide BLAST to perform searches. Results of these searches are then used to extract homologous sequences and remove non-target sequences. A brief visual overview of these two approaches is shown below:

SuperCrunch fig2

For more information on how the similarity filtering works, please see the relevant sections below.

In addition to filtering by similarity, an optional filter is available to identify and remove 'contaminated' sequences. This step requires a user-supplied set of sequences that represent the source of 'contamination'. For example, amphibian mtDNA sequences can be screened against human mtDNA sequences to make sure they are actually amphibian. Any set of reference sequences can be used, and the context depends on what the contamination source is expected to be. For more details on this step, please see the relevant section below.

After filtering is complete, the resulting fasta files can then be used for the Sequence Selection stage.


Filtering 'Simple' Sequence Records

The 'simple' sequence records generally contain a single gene region, with minor length variation. Similarity searches are performed using BLASTn, which requires a set of reference sequences. Because of the nature of 'simple' records (one gene, minor length variation), the reference sequences can be selected automatically from the set of records for a given gene. This is accomplished by initially clustering the sequences by similarity, and then designating the largest sequence cluster as the reference sequence set. All starting sequences (including those in the reference set) are then BLASTed to the reference sequences (excluding self-hits) to identify homologous regions. If necessary, sequences are trimmed to remove any non-target sequence from the record. Likewise, any records that fail to produce significant BLASTn scores are simply discarded. This is all performed using the Cluster_Blast_Extract.py module. A visual depiction of this approach is shown below:

SuperCrunch fig2

The automatic selection of reference sequences in Cluster_Blast_Extract.py is an important feature, as it allows thousands of loci to be filtered with limited guidance. For example, it can be used to filter sequence capture data sets (including UCEs), in which specific regions have been targeted for each locus. This approach also works well for traditional Sanger sequenced loci, such as commonly used nuclear genes. It is important to highlight that the success of Cluster_Blast_Extract.py relies on a majority of the sequence records being 'simple' records (same gene region, minor length variation). If a majority of the records are 'complex', or a particular locus has an even mix of 'simple' and 'complex' records, you should not use this method - see the Filtering 'Complex' Sequence Records section instead.

Back to top


Cluster_Blast_Extract.py

Cluster_Blast_Extract.py is one of two methods for filtering sequences based on similarity. It is recommended to use for loci composed of 'simple' sequence records.

Cluster_Blast_Extract.py is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores. For each locus-specific fasta file included, the following steps will be performed per file:

  • Cluster all sequences using cd-hit-est.

  • Identify the largest cluster and use it as the reference sequence set. This cluster is used to create a BLAST database with makeblastdb.

  • BLAST all sequences to the reference set using nucleotide BLAST. This includes the sequences present in the reference set.

  • For each sequence with significant matches - obtain BLAST coordinates, merge coordinates based on merge strategy selected by the user, and use the final coordinates to extract the sequence region. Any self-hits are excluded during these steps.

  • Write the extracted sequences to a new filtered fasta file. Any sequences failing the similarity filtering step are excluded and written to a separate fasta file.

Several important decisions need to be made concerning the BLASTn searches and how the BLAST results are used. This includes the choice of BLAST algorithm using the -b flag. The options include blastn, blastn-short, dc-megablast, and megablast. Recommendations are provided in the BLAST algorithm choice section below. For each sequence with a significant BLAST result, the coordinates of all BLAST hits (excluding self-hits) are merged. This action often results in a single interval, but non-overlapping coordinates can also be produced. The -m and --bp_bridge flags both control how coordinates are merged. The concept of coordinate merging and the different strategies available are discussed in greater detail below in the BLAST coordinates section.

Basic Usage:

python Cluster_Blast_Extract.py -i <fasta file directory> -o <output directory> -b <blast algorithm> -m <blast coordinate strategy>

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

Required: The full path to a directory containing the parsed locus-specific fasta files. Fasta files in the directory must have the extension '.fa' or '.fasta' to be read.

-o <path-to-directory> or --outdir <path-to-directory>

Required: The full path to an existing directory to write output files.

-b <choice> or --blast_task <choice>

Required: The BLAST algorithm to use. Choices = blastn, blastn-short, dc-megablast, megablast. Recommended for most datasets: dc-megablast.

--max_hits <integer>

Optional: The maximum number of blast matches allowed per input sequence. May want to set < 300 for large sequence sets. If omitted, no limit is set.

-m <choice> or --merge_strategy <choice>

Optional: The strategy for dealing with multiple non-overlapping blast coordinates. Choices = span, nospan, all. Default = span.

--bp_bridge <integer>

Optional: The base pair distance used to merge non-overlapping blast coordinates for -m span choice. Has no effect if -f nospan or -f all are used. Default = 100.

--threads <integer>

Optional: Specifies number of threads to use for BLASTn. Default = 1. Recommended: 4.

Example Usage:

python Cluster_Blast_Extract.py -i bin/cluster-blast/input/ -o bin/cluster-blast/output/ -b dc-megablast -m span --max_hits 300

Above command will perform BLASTn using dc-megablast (with a hit limit of 300 imposed), and use the span strategy to merge BLAST coordinates. This occurs for each unaligned locus-specific fasta file present in the bin/cluster-blast/input/ input directory. Outputs are written to bin/cluster-blast/output/.

python Cluster_Blast_Extract.py -i bin/cluster-blast/input/ -o bin/cluster-blast/output/ -b blastn -m nospan --max_hits 200 --threads 4

Above command will perform BLASTn using blastn (with a hit limit of 200 and 4 threads), and use the nospan strategy to merge BLAST coordinates. This occurs for each unaligned locus-specific fasta file present in the bin/cluster-blast/input/ input directory. Outputs are written to bin/cluster-blast/output/.

Outputs:

Several outputs are created in the output directory specified.

  • Directory /Clustering-Output-Files - For each locus, this directory contains the output files from cd-hit-est (ex., LOCUS_Out.clstr), which delimit the sequence clusters. There will also be a corresponding fasta file composed of the largest sequence cluster, labeled as LOCUS_clusterDB.fasta.

  • Directory /Blast-Output-Files - For each locus, this directory contains the BLASTn results in output format 6. The files are labeled as LOCUS_blast_results.txt.

  • Directory /Blast-Trimming-Log-Files - For each locus, this directory contains a summary file of the sequence length used for each record, including the starting length, trimmed length, and coordinates used. These files are labeled Log_File_LOCUS.txt. If any sequences failed to produce significant hits, those were discarded. For each locus, if there were any discarded sequences, they are written to a fasta file called Log_BadSeqs_LOCUS.fasta. These files can be inspected to ensure the discarded sequences are indeed non-target.

  • Directory /Filtered-Fasta-Files - For each locus, this directory contains a fasta file with the trimmed sequences that passed similarity filtering. These files should be used for the next step - sequence selection.

  • File Filtered_Fasta_File_List.txt - This file is a list of the fasta file names present in the /Filtered-Fasta-Files output directory. That is, these are the loci for which some number of sequences passed the filtering step.

BLAST algorithm choice

The required -b flag specifies the BLAST algorithm to use, which can greatly affect the filtering results. The blastn algorithm searches with a word size of 11, whereas megablast searches include a word size of 28, making blastn more appropriate for interspecies searches and megablast more appropriate for closely related or intraspecific searches. However, discontiguous megablast (dc-megablast) is better at producing non-fragmented hits for divergent sequences using similar word sizes as blastn, and as such it works well for interspecific and intraspecific searches. If the goal is to produce species level phylogenetic data sets then dc-megablast should be used, but if the focus is on population level phylogenetic data sets then megablast may be preferable. You can easily compare the effects of the different BLAST algorithms, as the coordinates used to extract sequences are readily available in the summary files produced (e.g., Log_File_LOCUS.txt).

BLAST coordinates

The optional -m flag specifies the strategy to use for merging BLAST coordinates. To explain these options, a little background is necessary. When a sequence is blasted, the sections of the query sequence that match the subject sequence (reference) are reported as start and stop coordinates. In general, many hits are produced for the query sequence and the result is many sets of coordinates. Often, these coordinates overlap and can be combined. Take for example the following set of coordinates for a given query sequence 'X':

[2, 435]
[27, 380]
[30, 500]

After BLAST searches are finished, the coordinates for each input sequence are merged. The above results for query sequence 'X' can be combined to one set of coordinates: [2, 500]. If sequence 'X' is 510 bp, this means position 1 and positions 501-510 did not match any regions of the reference sequences. As a result, positions 2-500 would be extracted from sequence 'X' and written to the filtered fasta file.

Often, merging coordinates is straightforward and results in a single continuous interval (as in the above example). However, sometimes merging coordinates can produce multiple non-overlapping coordinates, such as: [2, 70], [100, 500].

How can this happen?

The N scenario

One common reason is the sequence contains a stretch of N characters. These will never be matched. Take the following 35 bp sequence 'Y':

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

Even if the non-N characters are a perfect match to a reference sequence, sequence 'Y' will only produce the following coordinates after merging: [1, 10], [20, 35]. The N characters are always going to be ignored, and any sequence of N characters will cause a gap in the the resulting coordinate intervals.

The duplication scenario

Another more problematic reason is that duplicate sequence regions are found within a single record, and the coordinates of both duplicates are returned. This can happen in mitochondrial genomes, which sometimes have undergone gene duplications. In cases like this, the genes tend to be separated by quite a long distance.

Given a mitochondrial genome, the duplicate genes may return a set of coordinates that look like this: [300, 860], [4800, 5300].

The huge gap between these coordinates is a strong signal that duplicate genes are present. Another signal is that the combined length of the coordinate intervals is roughly double the length of the reference target region.

Merging strategies

Given these different scenarios, I designed three options to handle instances of non-overlapping coordinates. These are implemented using the -m flag, and include span, nospan, and all. A visual depiction of these strategies, and their effect on the above scenarios is shown in the figure below:

SuperCrunch coords

span

When -m span is used, if the distance between non-overlapping coordinate sets is less than or equal to X base pairs, they are merged.

By default, the value of X is set to 100 bp, but this can be set to any integer using the --bp_bridge flag. In the above N scenario example, this would produce the final coordinates of [1, 35] from [1, 10], [20, 35], and the resulting sequence will contain the original stretch of N characters:

Starting sequence:

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

Coordinates: 

1-35

Extracted sequence:

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

If the non-overlapping coordinate intervals are more than X base pairs apart, the coordinates interval with more base pairs is selected. If the --bp_bridge flag was set to 8, the coordinates [1, 10], [20, 35] would no longer be merged:

Starting sequence:

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

Coordinates: 

20-35

Extracted sequence:

CGAAAAATGATGCTG

In the duplication scenario example, using the default 100 bp would produce [300, 860] from [300, 860], [4800, 5300]. The span option can safely handle the gene duplication scenario, and also allow full length lower-quality sequences to be extracted - as long as they contain a reasonably small stretch of N's. The span method is the default option used if the -m flag is omitted.

nospan

When -m nospan is used, no attempt is made to merge the non-overlapping coordinate intervals.

Rather, the coordinate interval containing more base pairs is selected. In the above N scenario example, this would produce the final coordinates of [20, 35] from [1, 10], [20, 35]. This is different from span using the 100bp default. In the duplication scenario example, this would produce [300, 860] from [300, 860], [4800, 5300]. This is the same result obtained using span. The nospan method guarantees stretches of N's will not be present in the extracted sequences. For downstream sequence selection based on length, this can be thought of as a way to penalize lower-quality sequences by reducing their length. It will also correctly handle the gene duplication scenario by only using one of the coordinate intervals. This method should be viewed as a more conservative implementation of span.

all

When -m all is used, the non-overlapping coordinate intervals are simply concatenated and used as is.

In the N scenario example above, this would produce [1, 10], [20, 35], which would simply remove the stretch of N's from the sequence:

Starting sequence:

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

Coordinates: 

1-10, 20-35

Extracted sequence:

TCATGTTCCACGAAAAATGATGCTG

Although this seems like a desirable outcome, the same strategy will severely backfire for duplications. For the duplication scenario, this would produce [300, 860], [4800, 5300]. In other words, two homologous gene regions would be extracted and concatenated, producing a single sequence that is double the length of the reference sequences. For duplications, this is a very poor outcome because it is incorrect and will interfere with sequence alignment. However, this option can be used to detect duplications in mitogenomes. For example, when I ran this option using many reptile mitogenomes, I was able to find all examples of gene duplications in mitogenomes for the genus Heteronotia (a parthenogenic gecko) by looking at the resulting coordinate summary files. Beyond this use, I would caution against using this option unless you inspect the results very carefully. It is included mostly as a comparison, and not recommended for actual analyses.

Overall, you can easily compare the effects of the options for the -m flag, as the coordinates used to extract sequences are readily available in the log files produced (e.g., Log_File_LOCUS.txt).

Back to top


Filtering 'Complex' Sequence Records

The 'complex' records generally contain multiple gene regions (e.g., long mtDNA fragments or whole mitogenomes) and high sequence length heterogeneity. These records will contain the target region, but also plenty of non-target sequence. Similarity searches are performed using BLASTn, which requires a set of reference sequences. Because of the nature of 'complex' records, selecting references automatically from the set of starting sequences is unreliable. Instead, the best approach is to provide high quality reference sequences for the target region for the gene of interest. All starting sequences are then BLASTed to the reference sequences (excluding self-hits, if present) to identify homologous regions. If necessary, sequences are trimmed to remove any non-target sequence from the record. Likewise, any records that fail to produce significant BLAST scores are simply discarded. This is all performed using the Reference_Blast_Extract.py module. A visual depiction of this approach is shown below:

SuperCrunch fig2

The Reference_Blast_Extract.py module requires more input than the Cluster_Blast_Extract.py module, as reference sequences must be supplied by the user. However, this offers extremely fine-control over the filtering stage and outcome. For example, by supplying a reference sequence set for a particular mtDNA gene, that exact target region can be extracted from a set of records which include partial sequences for that gene, long mtDNA fragments, and entire mitogenomes. There are also several reference sequence sets available in the Data folder here. Many of these were created by extracting the annotated regions of the mitogenomes using Geneious, and a similar approach can be taken for your organism of interest.

Back to top


Reference_Blast_Extract.py

Reference_Blast_Extract.py is one of two methods for filtering sequences based on similarity. It is recommended to use for loci composed of 'complex' sequence records.

Reference_Blast_Extract.py is designed to run for two different scenarios:

  1. A directory that contains a single locus-specific fasta file and corresponding reference sequence fasta file. This option is invoked using the -e and -d flags.

  2. A directory that contains multiple locus-specific fasta files and corresponding reference sequence fasta files. This option is invoked using the --multisearch flag, and providing a map file. Details on both usages are provided below in the section called Run Options.

For each locus-specific fasta included, the following steps will be performed per file:

  • The reference sequence set designated by the user is used to create a BLAST database with makeblastdb.

  • Sequences from the locus-specific fasta file are blasted to the reference set using nucleotide BLAST.

  • For each sequence with significant matches - BLAST coordinates are obtained, coordinates are merged based on the merge strategy selected by the user, and the final coordinates are used to extract the sequence region. Any self-hits are excluded during these steps.

  • The extracted sequences are written to a new filtered fasta file. Any sequences failing the similarity filtering step are excluded and written to a separate fasta file.

Several important decisions need to be made concerning the BLASTn searches and how the BLAST results are used. This includes the choice of BLAST algorithm using the -b flag. The options include blastn, blastn-short, dc-megablast, and megablast. Recommendations are provided in the BLAST algorithm choice section below. For each sequence with a significant BLAST result, the coordinates of all BLAST hits (excluding self-hits) are merged. This action often results in a single interval, but non-overlapping coordinates can also be produced. The -m and --bp_bridge flags both control how coordinates are merged. The concept of coordinate merging and the different strategies available are discussed in greater detail below in the BLAST coordinates section.

Basic Usage:

For filtering a single locus:

python Reference_Blast_Extract.py -i <input directory> -o <output directory> -d <reference fasta name> -e <empirical fasta name> -b <blast algorithm> -m <blast coordinate strategy>

For filtering multiple loci:

python Reference_Blast_Extract.py -i <input directory> -o <output directory> --multisearch <path to map file> -b <blast algorithm> -m <blast coordinate strategy>

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

Required: The full path to a directory containing the reference(s) and the empirical fasta file(s).

-o <path-to-directory> or --outdir <path-to-directory>

Required: The full path to an existing directory to write output files.

-e <filename> or --emp_fasta <filename>

Required for OPTION 1: The name of an empirical fasta to blast to the database. Provide file name only, rather than full path. This file should be located in the input directory (-i), and it will identified using only the file name. DO NOT PROVIDE A FULL FILE PATH.

-d <filename> or --blast_db <filename>

Required for OPTION 1: The name of the reference fasta file that will be used to create the blast database. Provide file name only, rather than full path. This file should be located in the input directory (-i), and it will identified using only the file name. DO NOT PROVIDE A FULL FILE PATH.

--multisearch <path-to-file>

Required for OPTION 2: If multiple empirical fasta + blast database pairs are present in the -i (input) directory, provide the full path to a tab-delimited text file containing the pair information to allow all of them to run sequentially.

-b <choice> or --blast_task <choice>

Required: The BLAST algorithm to use. Choices = blastn, blastn-short, dc-megablast, megablast. Recommended for most datasets: dc-megablast.

--max_hits <integer>

Optional: The maximum number of blast matches allowed per input sequence. May want to set < 300 for large sequence sets. If omitted, no limit is set.

-m <choice> or --merge_strategy <choice>

Optional: The strategy for dealing with multiple non-overlapping blast coordinates. Choices = span, nospan, all. Default = span.

--bp_bridge <integer>

Optional: The base pair distance used to merge non-overlapping blast coordinates for -m span choice. Default = 100.

--threads <integer>

Optional: Specifies number of threads to use for BLASTn. Default = 1. Recommended: 4.

Example Usage:

python Reference_Blast_Extract.py -i bin/Ref-Blast/Input -o bin/Ref-Blast/Output -d ND2_references.fasta -e ND2.fasta -b dc-megablast -m span --max_hits 300 --threads 4

Above command will form a BLAST database from ND2_references.fasta and BLAST sequences from ND2.fasta to the database using the dc-megablast algorithm with 4 threads, the span strategy for BLAST coordinate merging, and a hit limit of 300. The outputs are written to the bin/Ref-Blast/Output directory.

Outputs:

Several outputs are created in the output directory specified.

  • Directory /Blast-Output-Files - For each locus, this directory contains the BLASTn results in output format 6. The files are labeled as LOCUS_blast_results.txt.

  • Directory /Blast-Trimming-Log-Files - For each locus, this directory contains a summary file of the sequence length used for each record, including the starting length, trimmed length, and coordinates used. These files are labeled Log_File_LOCUS.txt. If any sequences failed to produce significant hits, those were discarded. For each locus, if there were any discarded sequences, they are written to a fasta file called Log_BadSeqs_LOCUS.fasta. These files can be inspected to ensure the discarded sequences are indeed non-target.

  • Directory /Filtered-Fasta-Files - For each locus, this directory contains a fasta file with the trimmed sequences that passed similarity filtering. These files should be used for the next step - sequence selection.

Run Options

Option 1 - Single File

A directory contains a single locus-specific fasta file and corresponding reference sequence fasta file. This option is invoked using the -e and -d flags.

Let's assume you have a directory RefBlast/Inputs/ with the following files:

RefBlast
│
├── Inputs
│	├── ND2.fasta
│	└── Squamate-ND2-References.fasta

Here you would use the following command structure:

python Reference_Blast_Extract.py -i RefBlast/Inputs -o RefBlast/Outputs -d Squamate-ND2-References.fasta -e ND2.fasta -b dc-megablast -m span 

Note that the full path to the fasta files is NOT used, just their names. It is assumed they are within RefBlast/Inputs/ and an error will occur if they are not. This option can be used to execute Reference_Blast_Extract.py for a single pair of files (a locus-specific fasta file and reference fasta file).

Option 2 - Multiple Files

A directory contains multiple locus-specific fasta files and corresponding reference sequence fasta files. This option is invoked using the --multisearch flag, and providing a map file.

Let's assume you have three locus-specific fasta files with corresponding reference sequence sets. You could make a separate directory for each and run the single file option, but that is not very efficient. Instead, you can take advantage of the multisearch feature. Let's assume you have a directory RefBlast/Inputs/ containing the following files:

RefBlast
│
├── Inputs
│	├── ND2.fasta
│	├── Squamate-ND2-References.fasta
│	├── ND4.fasta
│	├── Squamate-ND4-References.fasta
│	├── 12S.fasta
│	└── Squamate-12S-References.fasta

To use the multisearch feature, you will need to create a map file. It is a two column (tab-delimited) text file that contains the name of the locus-specific fasta file (-e ) in column one and the name of the reference fasta file (-d ) in column two. Based on the above example, the contents of the file would look like this:

12S.fasta	Squamate-12S-References.fasta
ND2.fasta	Squamate-ND2-References.fasta
ND4.fasta	Squamate-ND4-References.fasta

Note that the full path to these files is not used. Similar to the single file option, these files are assumed to be in the input directory specified with the -i flag. If they are not, an error will be thrown. Let's say this file is called multisearch-key.txt and is present in the RefBlast/ directory:

RefBlast
├── multisearch-key.txt
├── Inputs
│	├── ND2.fasta
│	├── Squamate-ND2-References.fasta
│	├── ND4.fasta
│	├── Squamate-ND4-References.fasta
│	├── 12S.fasta
│	└── Squamate-12S-References.fasta

Here you would use the following command structure:

python Reference_Blast_Extract.py -i RefBlast/Inputs -o RefBlast/Outputs --multisearch RefBlast/multisearch-key.txt -b dc-megablast -m span 

This would run the Reference_Blast_Extract.py for 12s, ND2, and ND4 in succession, using the correct file pairs.

BLAST algorithm choice

The required -b flag specifies the BLAST algorithm to use, which can greatly affect the filtering results. The blastn algorithm searches with a word size of 11, whereas megablast searches include a word size of 28, making blastn more appropriate for interspecies searches and megablast more appropriate for closely related or intraspecific searches. However, discontiguous megablast (dc-megablast) is better at producing non-fragmented hits for divergent sequences using similar word sizes as blastn, and as such it works well for interspecific and intraspecific searches. If the goal is to produce species level phylogenetic data sets then dc-megablast should be used, but if the focus is on population level phylogenetic data sets then megablast may be preferable. You can easily compare the effects of the different BLAST algorithms, as the coordinates used to extract sequences are readily available in the summary files produced (e.g., Log_File_LOCUS.txt).

BLAST coordinates

The optional -m flag specifies the strategy to use for merging BLAST coordinates. To explain these options, a little background is necessary. When a sequence is blasted, the sections of the query sequence that match the subject sequence (reference) are reported as start and stop coordinates. In general, many hits are produced for the query sequence and the result is many sets of coordinates. Often, these coordinates overlap and can be combined. Take for example the following set of coordinates for a given query sequence 'X':

[2, 435]
[27, 380]
[30, 500]

After BLAST searches are finished, the coordinates for each input sequence are merged. The above results for query sequence 'X' can be combined to one set of coordinates: [2, 500]. If sequence 'X' is 510 bp, this means position 1 and positions 501-510 did not match any regions of the reference sequences. As a result, positions 2-500 would be extracted from sequence 'X' and written to the filtered fasta file.

Often, merging coordinates is straightforward and results in a single continuous interval (as in the above example). However, sometimes merging coordinates can produce multiple non-overlapping coordinates, such as: [2, 70], [100, 500].

How can this happen?

The N scenario

One common reason is the sequence contains a stretch of N characters. These will never be matched. Take the following 35 bp sequence 'Y':

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

Even if the non-N characters are a perfect match to a reference sequence, sequence 'Y' will only produce the following coordinates after merging: [1, 10], [20, 35]. The N characters are always going to be ignored, and any sequence of N characters will cause a gap in the the resulting coordinate intervals.

The duplication scenario

Another more problematic reason is that duplicate sequence regions are found within a single record, and the coordinates of both duplicates are returned. This can happen in mitochondrial genomes, which sometimes have undergone gene duplications. In cases like this, the genes tend to be separated by quite a long distance.

Given a mitochondrial genome, the duplicate genes may return a set of coordinates that look like this: [300, 860], [4800, 5300].

The huge gap between these coordinates is a strong signal that duplicate genes are present. Another signal is that the combined length of the coordinate intervals is roughly double the length of the reference target region.

Merging strategies

Given these different scenarios, I designed three options to handle instances of non-overlapping coordinates. These are implemented using the -m flag, and include span, nospan, and all. A visual depiction of these strategies, and their effect on the above scenarios is shown in the figure below:

SuperCrunch coords

span

When -m span is used, if the distance between non-overlapping coordinate sets is less than or equal to X base pairs, they are merged.

By default, the value of X is set to 100 bp, but this can be set to any integer using the --bp_bridge flag. In the above N scenario example, this would produce the final coordinates of [1, 35] from [1, 10], [20, 35], and the resulting sequence will contain the original stretch of N characters:

Starting sequence:

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

Coordinates: 

1-35

Extracted sequence:

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

If the non-overlapping coordinate intervals are more than X base pairs apart, the coordinates interval with more base pairs is selected. If the --bp_bridge flag was set to 8, the coordinates [1, 10], [20, 35] would no longer be merged:

Starting sequence:

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

Coordinates: 

20-35

Extracted sequence:

CGAAAAATGATGCTG

In the duplication scenario example, using the default 100 bp would produce [300, 860] from [300, 860], [4800, 5300]. The span option can safely handle the gene duplication scenario, and also allow full length lower-quality sequences to be extracted - as long as they contain a reasonably small stretch of N's. The span method is the default option used if the -m flag is omitted.

nospan

When -m nospan is used, no attempt is made to merge the non-overlapping coordinate intervals.

Rather, the coordinate interval containing more base pairs is selected. In the above N scenario example, this would produce the final coordinates of [20, 35] from [1, 10], [20, 35]. This is different from span using the 100bp default. In the duplication scenario example, this would produce [300, 860] from [300, 860], [4800, 5300]. This is the same result obtained using span. The nospan method guarantees stretches of N's will not be present in the extracted sequences. For downstream sequence selection based on length, this can be thought of as a way to penalize lower-quality sequences by reducing their length. It will also correctly handle the gene duplication scenario by only using one of the coordinate intervals. This method should be viewed as a more conservative implementation of span.

all

When -m all is used, the non-overlapping coordinate intervals are simply concatenated and used as is.

In the N scenario example above, this would produce [1, 10], [20, 35], which would simply remove the stretch of N's from the sequence:

Starting sequence:

TCATGTTCCANNNNNNNNNNCGAAAAATGATGCTG

Coordinates: 

1-10, 20-35

Extracted sequence:

TCATGTTCCACGAAAAATGATGCTG

Although this seems like a desirable outcome, the same strategy will severely backfire for duplications. For the duplication scenario, this would produce [300, 860], [4800, 5300]. In other words, two homologous gene regions would be extracted and concatenated, producing a single sequence that is double the length of the reference sequences. For duplications, this is a very poor outcome because it is incorrect and will interfere with sequence alignment. However, this option can be used to detect duplications in mitogenomes. For example, when I ran this option using many reptile mitogenomes, I was able to find all examples of gene duplications in mitogenomes for the genus Heteronotia (a parthenogenic gecko) by looking at the resulting coordinate summary files. Beyond this use, I would caution against using this option unless you inspect the results very carefully. It is included mostly as a comparison, and not recommended for actual analyses.

Overall, you can easily compare the effects of the options for the -m flag, as the coordinates used to extract sequences are readily available in the log files produced (e.g., Log_File_LOCUS.txt).

Back to top


Identifying 'Contaminated' Sequences

Similarity filtering will generally allow sequences containing homologous regions of the same gene to pass through the filters. One potential drawback is that these filters do not distinguish between organisms - they are based purely on sequence distance. In some cases, there can be 'contaminated' sequences present in datasets where the gene region was sequenced for a non-target organism. A common example is human DNA contamination. Given the conserved primer regions for many mtDNA loci, human-contaminated DNA for a given organism could produce a human 12S or 16S sequence rather than the sequence of the target organism. For example, a human-contaminated frog DNA extract could produce a human 12S sequence and this could pass through similarity filtering. As a side note, these types of sequences are definitely on GenBank - we identified several while working with Iguanian lizards for the SuperCRUNCH publication! To identify and remove this source of contamination, you can use the following Contamination_Filter.py module.

Back to top


Contamination_Filter.py

The Contamination_Filter.py module is an optional step for additional filtering, which can help identify and remove 'contaminated' sequences. The usage of the Contamination_Filter.py module is very similar to that of the Reference_Blast_Extract.py module. This includes the ability to run the module for a single file pair (locus-specific fasta file and a 'contamination' reference sequence file) or for multiple file pairs.

Running this filter requires a user-supplied set of sequences that represent the source of 'contamination'. For example, amphibian mtDNA sequences can be screened against human mtDNA sequences to make sure they are actually amphibian. Any reference sequences can be used, and the context depends on what the contamination source is expected to be. This step will remove all sequences scoring greater than 95% identity for a minimum of 100 continuous base pairs to the reference ‘contamination’ sequences. Although the same BLAST algorithms are available here, given the specific use it is strongly recommended that the megablast algorithm is used (-b megablast).

In general, it is easier to create one database of the 'contamination source' that contains all the genes you will want to BLAST against. For the Iguania data set, I used a contamination reference composed of human mtDNA, with one sequence for 12S, 16S, CO1, CYTB, ND1, ND2, and ND4. This file Contamination_Seqs_human_mtDNA.fasta is available in the data folder. This file was used in conjunction with the --multisearch option to screen the all the mtDNA fasta files in succession. Details on both usages with this specific use-case are provided below in the section called Run Options.

Basic Usage:

For filtering a single locus:

python Contamination_Filter.py -i <input directory> -o <output directory> -d <contamination fasta name> -e <empirical fasta name> -b <blast algorithm>

For filtering multiple loci:

python Contamination_Filter.py -i <input directory> -o <output directory> --multisearch <path to map file> -b <blast algorithm> 

Contamination_Filter.py

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

Required: The full path to a directory containing the reference(s) and the empirical fasta file(s).

-o <path-to-directory> or --outdir <path-to-directory>

Required: The full path to an existing directory to write output files.

-e <filename> or --emp_fasta <filename>

Required for OPTION 1: The name of an empirical fasta to blast to the database. Provide file name only, rather than full path. This file should be located in the input directory (-i), and it will identified using only the file name. DO NOT PROVIDE A FULL FILE PATH.

-d <filename> or --blast_db <filename>

Required for OPTION 1: The name of the reference fasta file that will be used to create the blast database. Provide file name only, rather than full path. This file should be located in the input directory (-i), and it will identified using only the file name. DO NOT PROVIDE A FULL FILE PATH.

--multisearch <path-to-file>

Required for OPTION 2: If multiple empirical fasta + blast database pairs are present in the -i (input) directory, provide the full path to a tab-delimited text file containing the pair information to allow all of them to run sequentially.

-b <choice> or --blast_task <choice>

Required: The BLAST algorithm to use. Choices = blastn, blastn-short, dc-megablast, megablast. Recommended for most datasets: megablast.

--max_hits <integer>

Optional: The maximum number of blast matches allowed per input sequence. May want to set < 300 for large sequence sets. If omitted, no limit is set.

Example Usage:

python Contamination_Filter.py -i bin/contamfilter/ND2/Input -o bin/contamfilter/ND2/Output -d Human_ND2.fasta -e ND2.fasta -b megablast

Above command will form a BLAST database from Human_ND2.fasta and BLAST sequences from ND2.fasta to the database using the megablast algorithm. Both files are in the bin/Ref-Blast/ directory, where output files are also written.

Outputs:

Several outputs are created in the output directory specified.

  • Directory /Blast-Output-Files - For each locus, this directory contains the BLASTn results in output format 6. The files are labeled as LOCUS_blast_results.txt. An additional file will be present called LOCUS_extracted_Contaminated_Seq_Summary.txt, which contains a summary of the number of sequences passing the filter, and those failing the filter (including the sequence names/descriptions).

  • Directory /Contaminated-Sequences - If any sequences failed the contamination filter and were discarded, they will be here. For each locus, if there were any discarded sequences, they are written to a fasta file called LOCUS_extracted_contaminated.fasta. These files can be inspected to ensure the discarded sequences are indeed non-target.

  • Directory /Filtered-Fasta-Files - For each locus, this directory contains a fasta file with the trimmed sequences that passed similarity filtering. These files should be used for the next step - sequence selection.

Run Options

Here are examples of how you can use a single 'contamination' reference sequence file to run the single and multisearch options.

Option 1 - Single File

A directory contains a single locus-specific fasta file and corresponding reference sequence fasta file. This option is invoked using the -e and -d flags.

Let's assume you have a directory Contamination/Inputs/ with the following files:

Contamination
│
├── Inputs
│	├── ND2.fasta
│	└── Contamination_Seqs_human_mtDNA.fasta

Here you would use the following command structure:

python Contamination_Filter.py -i Contamination/Inputs/ -o Contamination/Outputs/ -d Squamate-ND2-References.fasta -e ND2.fasta -b dc-megablast -m span 

Note that the full path to the fasta files is NOT used, just their names. It is assumed they are within Contamination/Inputs/ and an error will occur if they are not. This option can be used to execute Contamination_Filter.py for a single pair of files (a locus-specific fasta file and reference fasta file).

Option 2 - Multiple Files

A directory contains multiple locus-specific fasta files and corresponding reference sequence fasta files. This option is invoked using the --multisearch flag, and providing a map file.

Let's assume you have three locus-specific fasta files with corresponding reference sequence sets. You could make a separate directory for each and run the single file option, but that is not very efficient. Instead, you can take advantage of the multisearch feature. Let's assume you have a directory Contamination/Inputs/ containing the following files:

Contamination
│
├── Inputs
│	├── ND2.fasta
│	├── ND4.fasta
│	├── 12S.fasta
│	└── Contamination_Seqs_human_mtDNA.fasta

To use the multisearch feature, you will need to create a map file. It is a two column (tab-delimited) text file that contains the name of the locus-specific fasta file (-e ) in column one and the name of the reference fasta file (-d ) in column two. Based on the above example, the contents of the file would look like this:

12S.fasta	Contamination_Seqs_human_mtDNA.fasta
ND2.fasta	Contamination_Seqs_human_mtDNA.fasta
ND4.fasta	Contamination_Seqs_human_mtDNA.fasta

Note that the full path to these files is not used. Similar to the single file option, these files are assumed to be in the input directory specified with the -i flag. If they are not, an error will be thrown. Let's say this file is called multisearch-key.txt and is present in the Contamination/ directory:

Contamination
├── multisearch-key.txt
├── Inputs
│	├── ND2.fasta
│	├── ND4.fasta
│	├── 12S.fasta
│	└── Contamination_Seqs_human_mtDNA.fasta

Here you would use the following command structure:

python Contamination_Filter.py -i Contamination/Inputs -o Contamination/Outputs --multisearch Contamination/multisearch-key.txt -b megablast -m span 

This would run the Contamination_Filter.py for 12s, ND2, and ND4 in succession, using the contamination sequences present in the same reference file (Contamination_Seqs_human_mtDNA.fasta).

Back to top


Last updated: September, 2019

For SuperCRUNCH v1.2

Clone this wiki locally