- Introduction
- Install the pipeline
- Running the pipeline
- Main arguments
- Mandatory arguments
- Generic arguments
- Data integrity
- Primer removal
- QC and feature table
- Merge ASV tables
- ASV clustering
- Taxonomic assignment
- Taxonomy filtering
- Sample decontamination
- Functional predictions
- Differential abundance
- Long reads
- Sample removal
- Statistics
- Final analysis report
- Job resources
- Other command line parameters
Nextflow handles job submissions on SLURM or other environments, and supervises the running jobs. Thus the Nextflow process must run until the pipeline is finished. We recommend running this process in the background through screen, tmux or a similar tool. Alternatively, you can run Nextflow within a cluster job submitted by your job scheduler.
It is recommended to limit the memory of the Nextflow Java virtual machine. We recommend adding the following line to your environment (typically in ~/.bashrc or ~/.bash_profile):
NXF_OPTS='-Xms1g -Xmx4g'
Make sure that Nextflow is installed on your system, along with either Docker or Singularity, to allow full reproducibility.
How to install samba:
git clone https://github.com/ifremer-bioinformatics/samba.git
To use samba on a computing cluster, you must provide a configuration file for your system. For some institutes, this file already exists and is referenced on nf-core/configs. If so, you can download your institute's custom config file and use -c <institute_config_file> in the samba launch command.
If your institute does not have a referenced config file, you can create one using the files from other infrastructures as templates.
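For example, assuming you have written a configuration file named my_cluster.config for your scheduler (the file name is a hypothetical placeholder), the launch command would look like:
# -c is a core Nextflow option; my_cluster.config is your own config file
nextflow run main.nf -profile custom,singularity -c my_cluster.config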
The simplest command for running the pipeline is as follows:
nextflow run main.nf -profile shortreadstest,<docker/singularity/conda>
This will launch the pipeline with the shortreadstest configuration profile, using either docker, singularity or conda. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:
work # Directory containing the nextflow working files
results # Finished results (configurable, see below)
.nextflow_log # Log file from Nextflow
# Other nextflow hidden files, e.g. history of pipeline runs and old logs.
When you run the above command, Nextflow runs the pipeline code from your git clone, even if the pipeline has been updated since you cloned it. To make sure that you're running the latest version of the pipeline, update your clone regularly:
cd samba
git pull
It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software is used. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the samba releases page and find the latest version number (e.g. v3.1.0). Then, you can configure your local samba installation to use your desired version as follows:
cd samba
git checkout v3.1.0
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Note that multiple profiles can be loaded, for example: -profile docker,shortreadstest.
If -profile is not specified, the pipeline will run locally and expects all software to be installed and available on the PATH.
- conda - A generic configuration profile to be used with Conda
- docker - A generic configuration profile to be used with Docker
- singularity - A generic configuration profile to be used with Singularity; pulls the samba software from DockerHub
Profiles are also available to configure the samba workflow and can be combined with execution profiles listed above.
- shortreadstest - A profile with a complete configuration for automated testing of short-read metabarcoding analysis. Includes a training dataset, so it needs no other parameters.
- longreadstest - A profile with a complete configuration for automated testing of long-read metabarcoding analysis. Includes a training dataset, so it needs no other parameters.
- custom - A profile to complete according to your dataset and experiment.
Path to an XLS file containing a sheet with the sample read file paths (like the manifest file) and another sheet with the sample metadata (like the metadata file).
/!\ OR /!\
Path to the input file with the project sample metadata (TSV format). Headers of the metadata file must follow the Qiime2 metadata requirements.
Path to the input file with the sample read file paths (TSV format). Headers of the manifest file must follow the Qiime2 manifest requirements. Please note that the input data must be in fastq.gz format.
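As an illustration, a minimal paired-end manifest could look like the following (the sample name and paths are placeholders; check the Qiime2 manifest documentation for the exact headers expected):
sample-id	forward-absolute-filepath	reverse-absolute-filepath
sampleA	/path/to/sampleA_R1.fastq.gz	/path/to/sampleA_R2.fastq.gz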
Set to true to specify that the inputs are single-end reads. Default is paired-end reads.
Set to true to specify that the inputs are long reads (Nanopore/PacBio) (default = false, for Illumina short reads).
Name of the project being analyzed.
This process is optional; it checks that the input datasets are correctly demultiplexed, that the primer ratio is high enough and that the metadata file is well-formed, and it creates a CSV report. Please note that the headers of your input data must contain the barcode, as in the following example: @M00176:65:000000000-A41FR:1:2114:9875:23134 1:N:0:CAACTAGA
Data integrity checking step. Set to false to deactivate this step. (default = true)
Primer removal process. Set to false to deactivate this step. (default = true)
Percentage of primers expected to be found in the raw reads (default: 70).
If you pass the --control_list argument (used by microDecon), the primer filter threshold is disabled ONLY for control samples.
Forward primer (to be used in Cutadapt cleaning step).
Reverse primer (to be used in Cutadapt cleaning step).
Cutadapt error rate allowed to match primers (default : 0.1).
Cutadapt overlap length between primer and read (default: 18 for the test dataset; must be changed for your own dataset).
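To make the error-rate and overlap parameters concrete, here is a sketch of an equivalent standalone Cutadapt call for paired-end reads (primer sequences and file names are placeholders; the pipeline builds the actual command for you):
# Remove 5' primers from paired-end reads, allowing a 10% error rate
# and requiring at least 18 bases of overlap with each primer
cutadapt -g FWD_PRIMER -G REV_PRIMER -e 0.1 -O 18 \
         --discard-untrimmed \
         -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
         reads_R1.fastq.gz reads_R2.fastq.gz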
This process is based on Qiime2/Dada2.
The number of nucleotides to remove from the start of each forward read (default : 0 = no trimming).
The number of nucleotides to remove from the start of each reverse read (default : 0 = no trimming).
Truncate forward reads after FtruncLen bases; reads shorter than this are discarded (default: 0 = no truncation).
Truncate reverse reads after RtruncLen bases; reads shorter than this are discarded (default: 0 = no truncation).
Forward reads with more than maxEE "expected errors" will be discarded (default = 2).
Reverse reads with more than maxEE "expected errors" will be discarded (default = 2).
Truncate reads at the first instance of a quality score less than or equal to minQ (default = 2).
Chimera detection method (default = "consensus"). Set to "pooled" if the samples in the sequence table are all pooled together for bimera identification.
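As an illustration only, a run tuned for 2x250 bp reads might pass these QC parameters as sketched below; the flag names are inferred from the parameter descriptions above and should be verified against the pipeline configuration before use:
# Flag names (--FtruncLen, --RtruncLen, --minQ, --chimeras) are inferred
# from the descriptions above; check your config for the authoritative names
nextflow run main.nf -profile custom,singularity \
    --FtruncLen 240 --RtruncLen 200 \
    --minQ 2 --chimeras consensus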
This process is optional and based on Qiime2/feature-table function. The workflow can begin at this step if you already have Dada2 ASV tables that you want to merge to perform the analysis.
Set to true to merge DADA2 ASV tables.
Path to the directory containing the ASV tables to merge (this directory must contain only the ASV tables to merge).
Path to the directory containing the representative sequences to merge (this directory must contain only the representative sequences to merge).
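A launch command for this entry point might look like the sketch below; the flag names are hypothetical illustrations of the three parameters just described:
# Hypothetical flag names; check the pipeline configuration for the real ones
nextflow run main.nf -profile custom,singularity \
    --dada2merge true \
    --merge_tabledir /path/to/asv_tables/ \
    --merge_repseqsdir /path/to/rep_seqs/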
This step, based on dbotu3, is optional; deactivate it if you do not want to cluster your ASV sequences.
ASV clustering step. Set to false to deactivate this step. (default = true)
dbotu3 Genetic criterion (default = 0.1).
dbotu3 Abundance criterion (default = 10).
dbotu3 P-value criterion (default = 0.0005).
This process is based on Qiime2/feature-classifier function.
Set to true to extract marker specific region (using the sequencing primers) from reference database (default = false).
Path to reference database (required if extract_db = true).
Path to taxonomic reference database (required if extract_db = true).
Path to preformatted QIIME2 format database (required if extract_db = false).
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments (default = 0.9).
This process is optional and based on the Qiime2/taxa plugin.
Set to true to filter the ASV table and sequences based on taxonomic assignment. (default = false)
List of taxa you want to exclude (comma-separated list).
List of taxa you want to include (comma-separated list).
This step is optional and based on the microDecon package.
Sample decontamination step. Set to true to activate this step. (default = false)
Comma-separated list of control samples (e.g. "sample1,sample4,sample7") (required if microDecon_enable = true).
Number of control samples listed (required if microDecon_enable = true).
Number of samples that are not control samples (required if microDecon_enable = true).
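For example, with three control samples out of twelve, decontamination could be enabled as sketched below; --control_list and microDecon_enable appear in the descriptions above, while the two count flags are hypothetical names to be checked against your configuration:
# --control_count and --sample_count are hypothetical flag names
nextflow run main.nf -profile custom,singularity \
    --microDecon_enable true \
    --control_list "sample1,sample4,sample7" \
    --control_count 3 --sample_count 9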
This step is optional and based on Qiime2/PICRUSt2.
Set to true to enable the functional prediction step. (default = false)
According to your metadata file, list the column names corresponding to the variables to group samples for functional predictions (comma-separated list).
HSP method of your choice (default = 'mp', the most accurate prediction method; 'pic' is a faster method).
Maximum NSTI value accepted (default = 2). An NSTI cut-off of 2 should eliminate junk sequences.
This step is based on Qiime2/Composition ANCOM.
According to your metadata file, list the column names corresponding to the variables to group samples for ANCOM analysis (comma-separated list).
Analysis based on mapping with Minimap2 and a Python script developed by the SeBiMER team, following the preprint "Freshwater monitoring by nanopore sequencing" for the taxonomic assignment.
Long-read technology: for PacBio, use [map-pb]; for Nanopore, use [map-ont].
Path to reference database indexed with Minimap2 (required).
Path to taxonomic reference file (required).
Minimal rank level to keep a hit as assigned (default: 5). 1 = Kingdom, 2 = Phylum, 3 = Class, 4 = Order, 5 = Family, 6 = Genus, 7 = Species.
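If you need to build the Minimap2 index yourself, a minimal sketch (file names are placeholders) is:
# Index a reference with the Nanopore preset; use -x map-pb for PacBio
minimap2 -x map-ont -d reference.mmi reference.fasta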
Set to true to enable sample removal. (default = false) This optional step allows you to remove any problematic samples.
Names of the samples you want to remove.
Alpha diversity, Beta diversity and descriptive comparison statistics can be enabled or disabled. The statistics steps can also be run alone (without the above bioinformatics steps); see below.
Set to false to deactivate Alpha diversity statistics step. (default = true)
Kingdom to be displayed in barplots (default = "Bacteria").
Number of top taxa to be displayed in barplots.
According to your metadata file, list the column names corresponding to the variables to group samples for Alpha diversity (comma-separated list).
Set to false to deactivate Beta diversity statistics steps. (default = true)
According to your metadata file, list the column names corresponding to the variables of interest for Beta diversity (comma-separated list).
Hierarchical clustering method (default = 'ward.D2').
Set to false to deactivate Descriptive comparisons steps (default = true).
According to your metadata file, list the column names corresponding to the variables of interest for descriptive comparisons graphs (comma-separated list).
Statistics steps can be run without running previous bioinformatics steps. Parameters below must be set to perform statistics only steps.
Perform only the statistical analysis (ASV table and Newick tree required). Set to true to activate (default = false).
If stats_only is activated, set the path to your own ASV table in TSV format.
If stats_only is activated, set the path to your own phylogenetic tree in Newick format.
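Putting it together, a statistics-only run might be launched as sketched below; --stats_only matches the parameter described above, while the two path flags are hypothetical names for the ASV-table and tree parameters:
# --inasv and --innewick are hypothetical flag names; check your config
nextflow run main.nf -profile custom,singularity \
    --stats_only true \
    --inasv /path/to/asv_table.tsv \
    --innewick /path/to/tree.nwk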
This step is optional and creates an HTML report of the samba analysis.
Set to false to deactivate report creation (default = true).
Path to HTML template to use for samba report.
Path to CSS style file to be used in samba report.
Path to samba workflow logo.
Path to samba workflow steps image.
Each step in the pipeline has a default set of requirements for the number of CPUs, memory and time. For most of the steps in the pipeline, if a job exits with an error code of 143 (exceeded requested resources), it is automatically resubmitted with higher requests (2x the original, then 3x the original). If it still fails after three attempts, the pipeline is stopped.
The output directory where the results will be published.
The temporary directory where intermediate data will be written.
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits.
Same as --email, except only send mail if the workflow is not successful.
Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.
NB: Single hyphen (core Nextflow option)
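For example, to list previous runs and resume the latest one:
nextflow log                       # list previous runs and their names
nextflow run main.nf -resume       # resume the most recent run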
Specify the path to a specific config file (this is a core Nextflow option).
NB: Single hyphen (core Nextflow option)
Note - you can use this to override pipeline defaults.
Use to set a top-limit for the default memory requirement for each process.
Should be a string in the format integer-unit, e.g. --max_memory '8.GB'
Use to set a top-limit for the default time requirement for each process.
Should be a string in the format integer-unit, e.g. --max_time '2.h'
Use to set a top-limit for the default CPU requirement for each process.
Should be an integer, e.g. --max_cpus 1
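For example, to cap resources on a small workstation using the documented flags:
nextflow run main.nf -profile custom,singularity \
    --max_memory '8.GB' --max_time '2.h' --max_cpus 4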
Set to receive plain-text e-mails instead of HTML formatted.
Set to disable colourful command line output and live life in monochrome.