03 Settings options

Settings in the config file

There is an example config file in ./data/localblast.config

General settings:
- num_threads: defines how many cores shall be used for the blasting and other parallelized parts
- mrca: NCBI taxonomy ID of the most recent common ancestor (MRCA) of the group of interest. This is optional, and several IDs can be provided (e.g. mrca = 1, 2, 3). Phylogenies often include outgroups, and you might not be interested in updating that part of the tree. Defining the MRCA avoids this. It requires the NCBI taxon identifier for the group of interest, which can be obtained from here or by using a provided script:
  
  python3 ./data/spn_to_taxid.py NAME_OF_MRCA
  
  Note that if the MRCA is a species name, the space between the two words must be replaced with an underscore ('_').
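
As an illustration of the two conventions above (a comma-separated list of IDs, and underscores in species names), here is a minimal Python sketch. The helper names `parse_mrca` and `mrca_name_for_script` are hypothetical and not part of the tool; how the program itself reads the value is not described here.

```python
def parse_mrca(value):
    """Turn a config value such as '1, 2, 3' into a list of integer taxon IDs.

    Hypothetical helper for illustration only; an empty value or 'None'
    is treated as 'no mrca given' (an assumption).
    """
    if value is None or value.strip() in ("", "None"):
        return []
    return [int(part) for part in value.split(",") if part.strip()]


def mrca_name_for_script(name):
    """Replace whitespace in a taxon name with '_' before passing it to the script."""
    return "_".join(name.split())


print(parse_mrca("1, 2, 3"))
print(mrca_name_for_script("Senecio vulgaris"))
```

The second result is the form expected as NAME_OF_MRCA in the command shown above.
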
Unpublished sequence settings:
- unpublished: True or False; set to True if an additional database of unpublished sequences shall be used
- unpubl_data: path to the folder with the unpublished sequence files; the folder must not contain any other files
- unpubl_names: path to the file with unique name and taxon name; it needs to be in a different folder than the unpublished sequences
- perpetual: True or False; False if the input sequences shall only be blasted against the unpublished database a single time
- blast_all: True or False; whether the following GenBank search BLASTs the original input plus the newly added sequences, or only the newly added ones
BLAST settings:
- blast_type: must be either 'Genbank' or 'own' (own is a folder with FASTA formatted sequences)
- taxid_map: only needed if option blast_type is set to own; provide path to file that contains a translation table (sequence name, species name)
- localblastdb: path to local blast database; path must end with a '/'
- e_value_thresh: This is the e-value cutoff applied to BLAST results; it limits the results to sequences that are similar to the query sequence. The e-value describes how many hits can be expected by chance when searching a database of similar size. Small e-values indicate a significant match. In general, shorter sequences have higher e-values, because shorter sequences have a higher probability of occurring in the database by chance. For more information please refer to the NCBI BLAST resources (e.g. https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs). We used an e-value of 0.001 for all example datasets.
- hitlist_size: This specifies the number of sequences returned from a BLAST search. If your phylogeny does not contain many nodes, it can be a low value, which will speed up the analysis. If the sampled lineage contains many sequences with low sequence divergence, it is better to increase it in order to retrieve all similar sequences. A very low hitlist size is not advised, as it influences the number of sequences that will be added to your alignment. A low hitlist size (e.g. 10) might return only the first 10 best matches, even though there might be more best matches in the database (Shah et al. 2018). Furthermore, if for example the hitlist size is 10 but the phylogeny to be updated is sparsely sampled, only those parts of the phylogeny that were present in the input phylogeny might get updated. Lineages that were not present might never be added, as the 10 best hits all belong to lineages already present.
- fix_blast_result_folder: True or False; True reuses the same BLAST folder across runs. Be careful: if the input sequences of different runs come from different loci, the results will not be updated, because existing files are reused. Furthermore, if any of the BLAST settings above are changed, all files in that folder need to be deleted.
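
To make the interplay of e_value_thresh and hitlist_size more concrete, here is a minimal sketch in Python. It is not the tool's implementation: the `Hit` record and `filter_hits` function are hypothetical, the comparison `<=` is an assumption, and in a real search the hit list is capped by BLAST itself rather than afterwards.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit (illustration only)."""
    accession: str
    evalue: float
    bitscore: float


def filter_hits(hits, e_value_thresh=0.001, hitlist_size=10):
    """Keep hits at or below the e-value cutoff, best bit-score first,
    and return at most hitlist_size of them."""
    kept = [h for h in hits if h.evalue <= e_value_thresh]
    kept.sort(key=lambda h: h.bitscore, reverse=True)
    return kept[:hitlist_size]


hits = [
    Hit("A", 1e-50, 200.0),
    Hit("B", 0.5, 20.0),    # fails the e-value cutoff
    Hit("C", 1e-10, 90.0),  # passes the cutoff but is cut off by the small hit list
    Hit("D", 1e-30, 150.0),
]
print([h.accession for h in filter_hits(hits, hitlist_size=2)])
```

Hit C shows the effect described for hitlist_size: a sequence can be similar enough to pass the e-value cutoff and still never be returned when the hit list is too short.
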
Alignment settings:
- min_len: minimum length a new sequence must have to be added to the alignment
- max_len: maximum length of sequences that are added to the alignment; must be greater than 1
- trim_perc: value that determines how many sequences need to be present at the beginning and end of the alignment; positions at either end where fewer sequences are present are trimmed
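
One possible reading of trim_perc is sketched below in Python: columns are removed from both ends of the alignment until the fraction of sequences with data in the outermost column reaches trim_perc. Treating trim_perc as a fraction between 0 and 1, the gap characters, and the function name `trim_alignment` are all assumptions for illustration, not the tool's documented behaviour.

```python
def trim_alignment(aln, trim_perc=0.75):
    """Trim both alignment ends until enough sequences have data.

    aln: dict mapping sequence name -> aligned sequence (all the same length,
    '-' or '?' marking missing data). Hypothetical helper.
    """
    n_seqs = len(aln)
    length = len(next(iter(aln.values())))

    def occupied(col):
        # Fraction of sequences with a real character in this column.
        present = sum(1 for seq in aln.values() if seq[col] not in "-?")
        return present / n_seqs >= trim_perc

    start = 0
    while start < length and not occupied(start):
        start += 1
    end = length
    while end > start and not occupied(end - 1):
        end -= 1
    return {name: seq[start:end] for name, seq in aln.items()}


aln = {
    "a": "--ACGT--",
    "b": "-TACGTA-",
    "c": "--ACGTAA",
    "d": "---CGT--",
}
print(trim_alignment(aln, trim_perc=0.75))
```

With four sequences and trim_perc = 0.75, a column is kept as the new alignment edge only once at least three sequences have data in it.
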
Filter Settings:
- filtertype: This defines how to filter the number of sequences per taxon.
  - blast: All sequences belonging to a taxon are used for a filtering BLAST search. A sequence already present in the phylogeny, or a randomly chosen sequence, is blasted against all other sequences from the locus with the same taxon name. The filtering criterion is that a sequence needs to be within the mean +/- standard deviation (SD) of sequence similarity in relation to the queried sequence; from the sequences that pass this criterion, representatives are selected at random. If the taxon is likely monophyletic, the distances will be similar and thus all sequences will fall within the mean +/- SD of sequence similarity. If there are only a few outlier sequences, which likely indicates a misidentification or mislabeling in GenBank, the outliers will not be added, as they most likely fall outside the allowed range of mean +/- SD. If the taxon is likely not monophyletic and the sequences diverge a lot from each other, the range defined by mean +/- SD will be wider, which allows sequences that represent the divergence to be picked at random. As the value for sequence similarity we use bit-scores. A score is a numerical value that describes the overall quality of an alignment (here, of the blasted sequence against the other available sequences); bit-scores are log-scaled scores that are rescaled so that they do not depend on database size. Higher numbers correspond to higher similarity. See https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html for more detail.
  - length: Instead of choosing randomly among the sequences that meet the criterion of the BLAST search, the longest sequences are selected.
- threshold: This defines the maximum number of sequences per taxon (e.g. species) to be retrieved. If your input dataset already contains more sequences, no additional sequences will be added, but none will be removed either.
- downtorank: This defines the rank which is used to determine the maximum number of sequences per taxon. It can be set to None, in which case up to the threshold number of sequences is retrieved for every taxon. If it is set to species, no more than the maximum number of sequences will be randomly chosen from all sequences available for all the subspecies. It can be set to any rank defined in the NCBI taxonomy browser.
- different_level: True or False. Makes hierarchical adding of sequences possible - see Asteroideae example.
- identical_seqs: True or False; set to True if identical sequences shall be kept
- preferred_taxa: True or False; set to True if you use the wrapper run_multiple() and want to match taxa across loci
  - allow_parent: True or False, set to True if preferred_taxa is set to True, but instead of mapping exact ncbi taxa, you want to allow for parent taxa as well - experimental and memory heavy
  - preferred_taxa_fn: Path to file where to store the ids of the preferred taxa.
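
The two filtertype options can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the tool's code: the function names are hypothetical, the population standard deviation is used (the text does not say which SD is meant), and the bit-scores are assumed to come from blasting one query sequence against the other sequences of the same taxon.

```python
import random
import statistics


def select_by_bitscore(bitscores, threshold, seed=None):
    """filtertype 'blast' (sketch): keep sequences whose bit-score lies within
    mean +/- SD, then pick at most `threshold` of them at random.

    bitscores: dict mapping sequence id -> bit-score against the query.
    """
    scores = list(bitscores.values())
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # assumption: population SD
    in_range = sorted(
        sid for sid, s in bitscores.items() if mean - sd <= s <= mean + sd
    )
    if len(in_range) <= threshold:
        return in_range
    return sorted(random.Random(seed).sample(in_range, threshold))


def select_by_length(lengths, threshold):
    """filtertype 'length' (sketch): keep the `threshold` longest sequences.

    lengths: dict mapping sequence id -> sequence length.
    """
    return sorted(lengths, key=lambda sid: lengths[sid], reverse=True)[:threshold]


scores = {"s1": 500, "s2": 510, "s3": 495, "s4": 505, "s5": 120}
print(select_by_bitscore(scores, threshold=2, seed=1))
print(select_by_length({"a": 300, "b": 900, "c": 650}, threshold=2))
```

In this example `s5` has a much lower bit-score than the rest and falls outside mean +/- SD, so it can never be selected, which mirrors the handling of likely mislabeled sequences described above.
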
Tree Calculation:
- update_tree: True or False. Set to True if you want to calculate an updated phylogeny from within the program. This is often not advisable, as high-performance clusters are set up in different ways and the tool might not make full use of the power of the cluster.
- backbone: True or False. Set to True if you only want to add new sequences to an existing tree, without recalculating the full tree.
- modeltest_criteria: must be BIC, AIC or AICc; defines which information criterion is used to select the substitution model for the alignment
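
For orientation, the three criteria accepted by modeltest_criteria follow the standard textbook definitions, sketched below in Python. The function name and the use of the alignment length as sample size are assumptions for illustration; the model-selection software called by the tool performs the actual calculation.

```python
import math


def information_criterion(log_likelihood, n_params, n_sites, criterion="BIC"):
    """Standard definitions of AIC, AICc and BIC; lower values are better.

    log_likelihood: maximized log-likelihood of the model
    n_params: number of free parameters (k)
    n_sites: sample size, here assumed to be the alignment length (n)
    """
    k, n = n_params, n_sites
    aic = 2 * k - 2 * log_likelihood
    if criterion == "AIC":
        return aic
    if criterion == "AICc":
        # Small-sample correction of AIC.
        return aic + (2 * k * (k + 1)) / (n - k - 1)
    if criterion == "BIC":
        return k * math.log(n) - 2 * log_likelihood
    raise ValueError("criterion must be 'BIC', 'AIC' or 'AICc'")


for crit in ("AIC", "AICc", "BIC"):
    print(crit, round(information_criterion(-1000.0, 10, 500, crit), 3))
```

BIC penalizes additional parameters more strongly than AIC once the alignment is longer than a handful of sites, so it tends to select simpler substitution models.
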
Internal settings:
- nodes_fn: path to nodes file from ncbi, is usually downloaded automatically and placed in the provided path
- names_fn: path to names file from ncbi, is usually downloaded automatically and placed in the provided path
