Day2 Task3

genomewalker edited this page May 26, 2024 · 6 revisions

A workflow to map reads using Bowtie2

Source: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/day2/task3
Data: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/data/mapping
Environment: day2

This is the Snakemake version of the script from Day1-Task2 that runs Bowtie2 to map reads, sorts the resulting BAM file with Samtools, and cleans up intermediate files.

We will start with a simple workflow that maps reads to a reference genome, and then we will rewrite the same workflow using snakemake wrappers.

A simple mapping workflow

This workflow uses Snakemake to map reads with Bowtie2 and sort the resulting BAM files with Samtools. The workflow takes as input a list of sample names and the path to the Bowtie2 index file. For each sample, it maps the reads to the reference genome and generates a sorted BAM file as output.

Let's take a closer look at the workflow defined in the Snakemake file.

The first step in defining the workflow is to specify the input and output files. In this case, the input is a list of sample names and the path to the Bowtie2 index file:

samples = config.get("samples")
index = (config.get("bowtie2_index"),)

The config.get function reads values from a configuration file or dictionary. In this case, we assume that the samples and bowtie2_index values are defined in the configuration file config.yaml.
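A hypothetical sketch of what such a configuration file could contain (the actual config.yaml in the repository may differ; the values below are illustrative):

```yaml
samples:
  - sample1          # hypothetical sample names
  - sample2
bowtie2_index: data/genome   # prefix shared by the .bt2 index files
threads: 4
samtools_extra: ""           # extra command-line options for Samtools
```

Because config.get returns a default (or None) when a key is missing, the workflow still parses if optional keys such as samtools_extra are left out.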

The output of the workflow is defined using the rule all statement:

rule all:
    input:
        expand("results/{sample}.sorted.bam", sample=samples),

This specifies that the output of the workflow is a set of sorted BAM files, one for each sample. The expand function is used to generate a list of output files, one for each sample name in the samples list.
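Conceptually, expand applies string formatting over the list of sample names. Assuming hypothetical samples sample1 and sample2, it behaves roughly like this plain-Python comprehension (expand itself comes from snakemake.io):

```python
# Equivalent of expand("results/{sample}.sorted.bam", sample=samples)
samples = ["sample1", "sample2"]  # hypothetical sample names
targets = [f"results/{sample}.sorted.bam" for sample in samples]
print(targets)
# ['results/sample1.sorted.bam', 'results/sample2.sorted.bam']
```

Snakemake then works backwards from these target files to decide which rules to run.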

The next step in defining the workflow is to specify the rules for generating the output files. In this case, we have one rule called bowtie2 that performs the read mapping and sorting.

rule bowtie2:
    input:
        idx=multiext(
            index[0],
            ".1.bt2",
            ".2.bt2",
            ".3.bt2",
            ".4.bt2",
            ".rev.1.bt2",
            ".rev.2.bt2",
        ),
        reads="data/{sample}.fq.gz",
    output:
        "results/{sample}.sorted.bam",
    conda:
        "envs/mapping.yaml"
    params:
        samtools_extra=config.get("samtools_extra", ""),
        index=index[0],
    threads: config.get("threads", 1)
    shell:
        """
        bowtie2 --threads {threads} -x {params.index} -U {input.reads} --no-unal \
            | samtools view -bS -@ {threads} {params.samtools_extra} \
            | samtools sort {params.samtools_extra} -@ {threads} -o {output}
        """

This rule defines the input and output files for the mapping and sorting steps, specifies any necessary parameters, and provides the command to run the steps.

The input section of the rule specifies the input files for the Bowtie2 and Samtools commands. The multiext function is used to generate the list of files that make up the Bowtie2 index.
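multiext simply appends each extension to a common prefix. A minimal re-implementation (using a hypothetical prefix data/genome) makes its output easy to see:

```python
def multiext_sketch(prefix, *extensions):
    # Mimics Snakemake's multiext: one path per extension, sharing a prefix
    return [prefix + ext for ext in extensions]

idx = multiext_sketch("data/genome", ".1.bt2", ".2.bt2", ".3.bt2",
                      ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")
print(idx[0])   # data/genome.1.bt2
print(idx[-1])  # data/genome.rev.2.bt2
```

Listing all six index files as inputs means Snakemake re-runs the mapping if any part of the index changes.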

The output section specifies the output file for the sorted BAM.

The params section specifies any additional parameters that are passed to the command. In this case, we specify the samtools_extra parameter, which allows us to pass additional command-line options to Samtools, and the index parameter, which is the path to the Bowtie2 index file.

The threads parameter specifies the number of threads to use for the mapping and sorting steps.

The conda section specifies the conda environment to use for running the rule. This ensures that the necessary dependencies are installed and available.
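The file envs/mapping.yaml is not shown here; a minimal sketch would pin the two tools the rule needs (channel layout is the usual bioconda convention, and exact pinned versions may differ in the repository):

```yaml
channels:
  - conda-forge
  - bioconda
dependencies:
  - bowtie2
  - samtools
```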

Finally, the shell section provides the command to run the mapping and sorting steps. This command uses the bowtie2 and samtools commands to map the reads and generate a sorted BAM file.

To run the workflow, you can use the snakemake command with the -s option to specify the path to the Snakemake file:

cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day2/task3
snakemake -s Snakefile --profile profile --configfile config.yaml 

A simple mapping workflow using wrappers

Now we are going to rewrite the workflow using snakemake wrappers. This lets us reuse the same workflow on different types of data without changing it.

The workflow is very similar to the one above, but now instead of using the shell section to run the commands, we use the wrapper section to specify the snakemake wrapper to use for each step.

rule bowtie2:
    input:
        sample=["data/{sample}.fq.gz"],
        idx=multiext(
            index[0],
            ".1.bt2",
            ".2.bt2",
            ".3.bt2",
            ".4.bt2",
            ".rev.1.bt2",
            ".rev.2.bt2",
        ),
    output:
        "results/mapped/{sample}.bam",
    log:
        "logs/map/{sample}.log",
    params:
        extra="--no-unal",  # optional parameters
    threads: config.get("threads", 1)
    wrapper:
        "v1.23.5/bio/bowtie2/align"

# Define the rule for sorting the BAM file produced by the bowtie2 rule
rule samtools_sort:
    input:
        "results/mapped/{sample}.bam",
    output:
        "results/mapped/{sample}.sorted.bam",
    log:
        "logs/sort/{sample}.log",
    params:
        extra=config.get("samtools_sort_extra", ""),
    threads: config.get("threads", 1)
    wrapper:
        "v1.23.5/bio/samtools/sort"

The input and output sections are similar to the previous workflow, except that mapping and sorting are now split into two rules: bowtie2 writes an unsorted BAM to results/mapped/, and samtools_sort produces the final sorted BAM. In the params section, the extra field passes additional options to each wrapper (here --no-unal for Bowtie2). This is a very neat way of implementing a workflow that can be used with different types of data.

We have also added the log section to specify the log files for each step. This is a good practice, as it allows us to keep track of the progress of the workflow. The log files are generated by the snakemake wrappers.

We can run the workflow using the same command as before:

cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day2/task3
snakemake -s Snakefile-wrapper --profile profile --configfile config.yaml