
This repo contains a Nextflow pipeline that reproduces the main results of the Phylo-IMD paper.


l-mansouri/Phylo-IMD


MULTISTRAP

Boosting phylogenetic bootstrap with structural information


Multistrap is a toolkit designed to calculate and combine phylogenetic bootstrap support values. It generates these support values using both sequence and structural data, and then combines them.

For more details, see the associated manuscript: "Boosting phylogenetic bootstrap with structural information".

Installation

Requirements

Multistrap requires Nextflow and a container engine (Docker or Singularity). To install Nextflow:

curl -s https://get.nextflow.io | bash
chmod +x nextflow
sudo mv nextflow /usr/local/bin

Note: if you use the docker profile, remember to start Docker before launching the pipeline.
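The requirements above can be checked before the first run. The following is a minimal sketch, not part of Multistrap: the helper name `have_cmd` and the variable `container_profile` are illustrative. It only tests that the tools are on the PATH and, when Docker is present, that the daemon answers.

```shell
#!/bin/sh
# Hypothetical helper (not part of Multistrap): succeed if a command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Nextflow is needed for every run.
if have_cmd nextflow; then
    echo "nextflow: found"
else
    echo "nextflow: missing (install it with the commands above)"
fi

# Pick the container profile that matches what is installed.
if have_cmd docker; then
    container_profile=docker
    # `docker info` fails when the daemon cannot be reached.
    docker info >/dev/null 2>&1 ||
        echo "docker is installed but the daemon does not seem to be running"
elif have_cmd singularity; then
    container_profile=singularity
else
    container_profile=none
fi
echo "container profile: $container_profile"
```

If the last line reports `none`, neither the docker nor the singularity profile can be used until one of the two engines is installed.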

Multistrap was tested on Scientific Linux release 7.2.

Get Multistrap

Multistrap is distributed as a Nextflow pipeline. To obtain the source code:

curl -L -o main.zip https://github.com/l-mansouri/Phylo-IMD/archive/refs/heads/main.zip 
unzip main.zip
cd Phylo-IMD-main

Alternatively, you can use wget:

wget https://github.com/l-mansouri/Phylo-IMD/archive/refs/heads/main.zip 
unzip main.zip
cd Phylo-IMD-main

On a typical desktop computer this step should take a few seconds. Now you are ready to run Multistrap!

Run Multistrap

By default, Multistrap will:

  • compute the mTMalign MSA
  • compute the sequence-based tree and corresponding bootstrap replicates (ME or ML tree)
  • compute the IMD tree and corresponding bootstrap replicates
  • return:
    • the ME (or ML) tree with:
      • the combined (multistrap) bootstrap support values
      • the sequence-based support values
      • the IMD support values

Please refer to the output section for a precise description of the output file naming.

On a test dataset

nextflow run main.nf -profile multistrap,test,docker

If you want to use singularity:

nextflow run main.nf -profile multistrap,test,singularity

This runs Multistrap on the test data. The test uses --seq_tree ME, because ML takes longer and this is meant to be just a basic test. replicatesNum is also set to 10 to speed up the run. On a typical desktop computer this should take a few minutes to complete.

On your dataset

To obtain the combined bootstrap support values for your own dataset, use the multistrap profile as shown below. To see how to prepare the input files, look at the example dataset in the data folder.

The command line:

nextflow run main.nf -profile multistrap --fasta <id.fasta> --templates <id.template> --pdbs "mypdbs/*" --seq_tree <ML|ME>

  • fasta is a fasta file with the sequences you want to build the tree on.
  • pdbs is the set of PDB files associated with the sequences in your fasta file. Quote the glob so that the shell does not expand it before Nextflow sees it.
  • templates is a file that explicitly maps each sequence in your fasta file to the PDB file you provide for it. The template file should follow the syntax of the aligner used (mTM-align or 3D-Coffee). You can find examples for both in the data folder.
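Before launching a run, the three inputs can be given a quick sanity check. This is a minimal sketch, not part of Multistrap: the function name `check_inputs` is hypothetical, and it assumes one structure file per sequence and a single directory holding the PDB files. It does not validate the template syntax.

```shell
#!/bin/sh
# Hypothetical pre-flight check (illustrative, not part of the pipeline).
# Usage: check_inputs <fasta> <template file> <pdb directory>
check_inputs() {
    fasta=$1
    templates=$2
    pdb_dir=$3

    # The FASTA and template files must exist and be non-empty.
    for f in "$fasta" "$templates"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty file: $f" >&2
            return 1
        fi
    done

    if [ ! -d "$pdb_dir" ]; then
        echo "not a directory: $pdb_dir" >&2
        return 1
    fi

    # Count FASTA records (header lines start with '>').
    n_seqs=$(grep -c '^>' "$fasta")
    # Count regular files directly inside the PDB directory.
    n_pdbs=$(find "$pdb_dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
    echo "sequences: $n_seqs, structure files: $n_pdbs"

    if [ "$n_seqs" -eq 0 ]; then
        echo "no sequences found in $fasta" >&2
        return 1
    fi
    # Assumption: every sequence needs its own structure file.
    if [ "$n_pdbs" -lt "$n_seqs" ]; then
        echo "fewer structure files than sequences" >&2
        return 1
    fi
    return 0
}
```

For example, `check_inputs id.fasta id.template mypdbs` prints the two counts and returns a non-zero status if something is obviously missing.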

Output files

  • results/dataset_id
    • msas/*.fa: alignment files.
    • trees_and_replicates/: the trees computed with your chosen sequence-based method (ME or ML; trees/<ME|ML> folder) and the IMD trees (trees/IMD folder). The tree replicates are in the replicates folder within the ME, ML, and IMD folders respectively.
    • tree_supports/: the bootstrap support values are stored as node labels in the trees in this folder. Here you will find one folder with the trees that have the <ME|ML> topology, annotated with:
      • the <ME|ML> support values (ID_ME|ML_tree_ME|ML_bs.nwk)
      • the IMD support values (ID_ME|ML_tree_IMD_bs.nwk)
      • the multistrap support values (ID_ME|ML_tree_multistrap_bs.nwk).
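The support trees can be located after a run with a short search over the output directory. This is a minimal sketch: the function name `list_support_trees` is hypothetical, the file-name suffixes are taken from the list above, and `results` is the documented default of the output parameter.

```shell
#!/bin/sh
# Hypothetical helper (illustrative): list the Newick trees that carry
# ME/ML, IMD, or multistrap support values under an output directory.
list_support_trees() {
    outdir=${1:-results}
    if [ ! -d "$outdir" ]; then
        echo "no such directory: $outdir" >&2
        return 1
    fi
    # Match the documented suffixes regardless of dataset id and folder depth.
    find "$outdir" -type f \( \
        -name '*_tree_ME_bs.nwk' -o \
        -name '*_tree_ML_bs.nwk' -o \
        -name '*_tree_IMD_bs.nwk' -o \
        -name '*_tree_multistrap_bs.nwk' \) | sort
}
```

Calling `list_support_trees` without an argument searches `./results`; pass another path if you changed the output location.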

Pipeline parameters

You can override the default pipeline parameters on the command line with --<parameter> <value>. The available parameters are listed below.

Parameters
  • Input parameters
    • fasta is a fasta file with the sequences you want to build the tree on.
    • pdbs is the set of PDB files associated with the sequences in your fasta file.
    • templates is a file that explicitly maps each sequence in your fasta file to the PDB file you provide for it. The template file should follow the syntax of the aligner used (mTM-align or 3D-Coffee). You can find examples for both in the data folder.
  • Parameters for tree computation:
    • seq_tree determines the type of sequence-based tree to compute: either ME or ML. Default: ML.
    • gammaRate determines the gamma rate for FastME tree reconstruction. Default: 1.0.
    • seedValue sets the random seed for FastME tree reconstruction. Default: 5.
    • replicatesNum determines the number of bootstrap replicates. Default: 100.
    • tree_mode determines the distance mode used to compute the IMD distance matrix. Default: 10.
  • Output parameter:
    • output determines where the pipeline stores the outputs it publishes. Default: ./results.
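As an illustration of overriding several defaults at once, the sketch below assembles a command line and prints it for review instead of running it. The parameter names are the ones documented above, and the double-dash form follows the --seq_tree usage mentioned for the test run. The input paths, the chosen values, and the variable `cmd_preview` are placeholders.

```shell
#!/bin/sh
# Assemble the command as positional parameters so that the quoted glob
# stays a single argument. All paths and values below are placeholders.
set -- nextflow run main.nf -profile multistrap \
    --fasta id.fasta \
    --templates id.template \
    --pdbs 'mypdbs/*' \
    --seq_tree ME \
    --replicatesNum 50 \
    --seedValue 42 \
    --output my_results

# Print the command for review; nothing is executed here.
cmd_preview="$*"
echo "$cmd_preview"

# To launch the run from this script, uncomment the next line:
# "$@"
```

Keeping the command in the positional parameters, rather than in a single string, avoids `eval` and keeps the glob quoted when the command is finally run.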

Overview of the repository

For a more detailed overview of the content of the repository, please refer to overview.

Analysis

In the paper we perform an extensive benchmark and produce accessory analyses to assess the robustness and validity of Multistrap. For more information on how to reproduce these, please refer to analysis.
