Day2 Task3
Source: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/day2/task3
Data: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/data/mapping
Environment: day2
This is the Snakemake version of the script we wrote in Day1-Task2, which runs Bowtie2 to map reads, sorts the resulting BAM file with Samtools, and cleans up intermediate files.
We will start with a simple workflow that maps reads to a reference genome, and then we will rewrite the same workflow using Snakemake wrappers.
This workflow uses Snakemake to map reads with Bowtie2 and sort the resulting BAM files with Samtools. The workflow takes as input a list of sample names and the path to the Bowtie2 index file. For each sample, it maps the reads to the reference genome and generates a sorted BAM file as output.
Let's take a closer look at the workflow defined in the Snakemake file.
The first step in defining the workflow is to specify the input and output files. In this case, the input is a list of sample names and the path to the Bowtie2 index file:
```python
samples = config.get("samples")
index = (config.get("bowtie2_index"),)
```
The `config.get` function reads values from a configuration file or dictionary. In this case, we assume that the `samples` and `bowtie2_index` values are defined in the configuration file `config.yaml`.
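As an illustration, a minimal `config.yaml` for this workflow might look like the following; the sample names and index prefix here are placeholders, not the course's actual data:

```yaml
# Hypothetical config.yaml -- sample names and paths are placeholders
samples:
  - sample1
  - sample2
bowtie2_index: "data/reference/genome"  # prefix of the .bt2 index files
threads: 4
samtools_extra: ""
```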
The output of the workflow is defined using the `rule all` statement:
```python
rule all:
    input:
        expand("results/{sample}.sorted.bam", sample=samples),
```
This specifies that the output of the workflow is a set of sorted BAM files, one for each sample. The `expand` function is used to generate a list of output files, one for each sample name in the `samples` list.
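For example, with the two hypothetical sample names from the config sketch above, `expand` substitutes each name into the pattern:

```python
expand("results/{sample}.sorted.bam", sample=["sample1", "sample2"])
# -> ["results/sample1.sorted.bam", "results/sample2.sorted.bam"]
```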
The next step in defining the workflow is to specify the rules for generating the output files. In this case, we have one rule called `bowtie2` that performs the read mapping and sorting.
```python
rule bowtie2:
    input:
        idx=multiext(
            index[0],
            ".1.bt2",
            ".2.bt2",
            ".3.bt2",
            ".4.bt2",
            ".rev.1.bt2",
            ".rev.2.bt2",
        ),
        reads="data/{sample}.fq.gz",
    output:
        "results/{sample}.sorted.bam",
    conda:
        "envs/mapping.yaml"
    params:
        samtools_extra=config.get("samtools_extra", ""),
        index=index[0],
    threads: config.get("threads", 1)
    shell:
        """
        bowtie2 --threads {threads} -x {params.index} -U {input.reads} --no-unal \
        | samtools view -bS -@ {threads} {params.samtools_extra} \
        | samtools sort {params.samtools_extra} -@ {threads} -o {output}
        """
```
This rule defines the `input` and `output` files for the mapping and sorting steps, specifies any necessary parameters, and provides the command to run the steps.
The `input` section of the rule specifies the input files for the Bowtie2 and Samtools commands. The `multiext` function is used to generate the list of files that make up the Bowtie2 index.
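For instance, with the hypothetical index prefix used above, `multiext` simply appends each extension to the prefix (shown here with three of the six extensions for brevity):

```python
multiext("data/reference/genome", ".1.bt2", ".2.bt2", ".rev.1.bt2")
# -> ["data/reference/genome.1.bt2",
#     "data/reference/genome.2.bt2",
#     "data/reference/genome.rev.1.bt2"]
```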
The `output` section specifies the output file for the sorted BAM.
The `params` section specifies any additional parameters that are passed to the command. In this case, we specify the `samtools_extra` parameter, which allows us to pass additional command-line options to Samtools, and the `index` parameter, which is the path prefix of the Bowtie2 index files. The `threads` directive specifies the number of threads to use for the mapping and sorting steps.
The `conda` section specifies the conda environment to use for running the rule. This ensures that the necessary dependencies are installed and available.
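A minimal sketch of what `envs/mapping.yaml` might contain; the actual file in the course repository may pin specific versions:

```yaml
# Hypothetical envs/mapping.yaml -- the real file may pin exact versions
channels:
  - conda-forge
  - bioconda
dependencies:
  - bowtie2
  - samtools
```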
Finally, the `shell` section provides the command to run the mapping and sorting steps. This command pipes the output of `bowtie2` through `samtools view` and `samtools sort` to map the reads and generate a sorted BAM file.
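To make the wildcard substitution concrete: for a hypothetical sample named `sample1`, the index prefix from the config sketch above, and four threads, Snakemake would render the command roughly as:

```bash
bowtie2 --threads 4 -x data/reference/genome -U data/sample1.fq.gz --no-unal \
| samtools view -bS -@ 4 \
| samtools sort -@ 4 -o results/sample1.sorted.bam
```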
To run the workflow, you can use the `snakemake` command with the `-s` option to specify the path to the Snakefile:
```bash
cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day2/task3
snakemake -s Snakefile --profile profile --configfile config.yaml
```
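Before launching the real run, it can be useful to preview the planned jobs with Snakemake's standard `-n` (`--dry-run`) flag:

```bash
# List the jobs Snakemake would execute, without running them
snakemake -s Snakefile --profile profile --configfile config.yaml -n
```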
Now we are going to rewrite the workflow using Snakemake wrappers. This will allow us to use the same workflow on different types of data without having to change the workflow.
The workflow is very similar to the one above, but instead of using the `shell` section to run the commands, we use the `wrapper` section to specify the Snakemake wrapper to use for each step.
```python
rule bowtie2:
    input:
        sample=["data/{sample}.fq.gz"],
        idx=multiext(
            index[0],
            ".1.bt2",
            ".2.bt2",
            ".3.bt2",
            ".4.bt2",
            ".rev.1.bt2",
            ".rev.2.bt2",
        ),
    output:
        "results/mapped/{sample}.bam",
    log:
        "logs/map/{sample}.log",
    params:
        extra="--no-unal",  # optional parameters
    threads: config.get("threads", 1)
    wrapper:
        "v1.23.5/bio/bowtie2/align"


# Sort the BAM file produced by the bowtie2 rule with Samtools
rule samtools_sort:
    input:
        "results/mapped/{sample}.bam",
    output:
        "results/mapped/{sample}.sorted.bam",
    log:
        "logs/sort/{sample}.log",
    params:
        extra=config.get("samtools_sort_extra", ""),
    threads: config.get("threads", 1)
    wrapper:
        "v1.23.5/bio/samtools/sort"
```
The `input` and `output` sections work as in the previous workflow, except that mapping and sorting are now split into two rules, with an unsorted BAM as the intermediate file between them. The `params` section is also similar, except that we use the `extra` parameter to pass additional options (here `--no-unal`) to Bowtie2. This is a very neat way of implementing a workflow that can be used with different types of data.
We have also added the `log` section to specify the log files for each step. This is good practice, as it allows us to keep track of the progress of the workflow. The log files are generated by the Snakemake wrappers.
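If a job fails, the corresponding log file is the first place to look; for a hypothetical sample named `sample1`:

```bash
# Inspect the mapping and sorting logs for a hypothetical sample
cat logs/map/sample1.log
cat logs/sort/sample1.log
```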
We can run the workflow using the same command as before:
```bash
cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day2/task3
snakemake -s Snakefile-wrapper --profile profile --configfile config.yaml
```