Day2 Task1

genomewalker edited this page May 26, 2024 · 5 revisions

Writing a Simple Snakemake Workflow

Source: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/day2/task1
Data: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/day2/task/data
Environment: day2

A workflow in Snakemake is defined using a Snakefile, which is a text file that contains a set of rules. Each rule describes a step in the workflow, along with its input files, output files, and the commands necessary to run that step.

Here's an example Snakefile that defines a simple workflow that concatenates two input files:

rule concat:
    input:
        "file1.txt",
        "file2.txt"
    output:
        "concatenated.txt"
    shell:
        "cat {input} > {output}"

In this example, we've defined a single rule called concat. This rule takes two input files (file1.txt and file2.txt) and produces a single output file (concatenated.txt). The command to run this rule is specified using the shell directive, which tells Snakemake to execute the specified shell command.
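Conceptually, Snakemake substitutes the `{input}` and `{output}` placeholders before handing the command to the shell, much like Python's own string formatting. The sketch below is an illustration of that substitution, not Snakemake's actual internals:

```python
# Sketch: how Snakemake-style {input}/{output} placeholders resolve.
# Multiple input files are joined with spaces before substitution.
inputs = ["file1.txt", "file2.txt"]
output = "concatenated.txt"

command = "cat {input} > {output}".format(
    input=" ".join(inputs),  # "file1.txt file2.txt"
    output=output,
)
print(command)  # cat file1.txt file2.txt > concatenated.txt
```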

To run this workflow, let's follow these steps:

  1. Move to the folder and activate the day2 environment

    cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day2/task1
    conda activate day2
  2. Create a folder named my_first_wf in ~/course/wdir and move into it

    cd ~/course/wdir
    mkdir -p my_first_wf
    cd my_first_wf
  3. Let's create some sample text files:

    echo "This is file1" > file1.txt
    echo "This is file2" > file2.txt
  4. Save this code as a file called Snakefile in ~/course/wdir/my_first_wf:

    rule concat:
        input:
            "file1.txt",
            "file2.txt"
        output:
            "concatenated.txt"
        shell:
            "cat {input} > {output}"
  5. Run the following command:

    snakemake -c 1

Snakemake automatically determines the correct order in which to run the rules based on their input and output files, and executes the steps needed to produce the final output. You should now have a file named concatenated.txt containing the contents of file1.txt followed by file2.txt.
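What the concat rule does can also be expressed in plain Python, which makes it easy to check the expected result. The sketch below uses a hypothetical temporary directory in place of ~/course/wdir/my_first_wf:

```python
import pathlib
import tempfile

# Recreate the inputs and perform the rule's work by hand.
workdir = pathlib.Path(tempfile.mkdtemp())
(workdir / "file1.txt").write_text("This is file1\n")
(workdir / "file2.txt").write_text("This is file2\n")

# Equivalent of: cat file1.txt file2.txt > concatenated.txt
combined = (workdir / "file1.txt").read_text() + (workdir / "file2.txt").read_text()
(workdir / "concatenated.txt").write_text(combined)
print(combined)
```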

Adding Parameters and Wildcards

Snakemake allows you to parameterize your workflows by defining variables that can be used to customize the behavior of your rules. For example, you might want to run your workflow with different input files or with different settings for a particular tool.

One way to do this in Snakemake is to use wildcards, which are special placeholders that can be used in your rule definitions to represent variable parts of your input or output files. You can then specify the values of these wildcards when you run your workflow.

Here's an example of a more complex Snakefile that uses wildcards to allow for variable input files:

samples = ["sample1", "sample2", "sample3"]

rule all:
    input:
        expand("results/{sample}/output.txt", sample=samples)

rule process_sample:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}/output.txt"
    params:
        parameter1=config.get("parameter1", 5),
        parameter2=config.get("parameter2", "abc")
    shell:
        "process_data.py {input} {params.parameter1} {params.parameter2} > {output}"

In this example, we've defined two rules: all and process_sample. The all rule specifies that the final output of the workflow should be a set of output files for each sample defined in the samples list. We use the expand function to generate a list of output file paths based on the samples list.
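The effect of `expand` can be sketched as an ordinary list comprehension over the wildcard values (this is an illustration of its behavior, not its implementation):

```python
# Sketch: expand("results/{sample}/output.txt", sample=samples) yields
# one path per wildcard value, like a list comprehension.
samples = ["sample1", "sample2", "sample3"]
targets = ["results/{sample}/output.txt".format(sample=s) for s in samples]
print(targets)
```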

The process_sample rule takes a single input file from the data directory for each sample, and produces a corresponding output file in the results directory. The input file name is specified using the {sample} wildcard, which will be replaced with the actual sample name when the rule is executed.

The process_sample rule also includes two params that can be used to customize the behavior of the process_data.py script that it runs. These parameters are specified using the params directive, and can be referred to in the command using the {params.parameter1} and {params.parameter2} placeholders.
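Inside a Snakefile, `config` behaves like an ordinary Python dictionary, so `config.get(key, default)` returns the default unless a value was supplied via `--config` or `--configfile`. A minimal sketch of that fallback logic:

```python
# Sketch: config.get(key, default) falls back when no value was supplied.
config = {}                                  # nothing passed on the CLI
parameter1 = config.get("parameter1", 5)     # uses the default, 5

config = {"parameter1": 10}                  # as if: --config parameter1=10
overridden = config.get("parameter1", 5)     # the supplied value, 10, wins
```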

Handling Dependencies

Snakemake also allows you to specify dependencies between rules, so that one rule will only be executed after another rule has successfully completed. This can be useful for cases where the output of one step is required as input for another step.

To specify dependencies between rules, you can use the input directive in your rule definitions. For example, here's a modified version of the process_sample rule that depends on a separate preprocess rule:

rule preprocess:
    input:
        "data/{sample}.txt"
    output:
        "preprocessed/{sample}.txt"
    shell:
        "preprocess_data.py {input} > {output}"

rule process_sample:
    input:
        "preprocessed/{sample}.txt"
    output:
        "results/{sample}/output.txt"
    params:
        parameter1=config.get("parameter1", 5),
        parameter2=config.get("parameter2", "abc")
    shell:
        "process_data.py {input} {params.parameter1} {params.parameter2} > {output}"

In this example, the process_sample rule now depends on the preprocess rule, which produces a preprocessed input file for each sample. The input directive in the process_sample rule now refers to the output of the preprocess rule.

When you run this workflow, Snakemake will automatically determine the correct order in which to execute the rules based on their input and output files, and will only execute a rule after all of its dependencies have been successfully completed.
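This ordering can be sketched in a few lines of Python: match each requested file to the rule whose output produces it, then recurse on that rule's inputs. The following is a simplified illustration for the two-rule chain above, not how Snakemake is actually implemented:

```python
# Sketch: resolve job order by matching requested files to rule outputs.
rules = {
    "preprocess": {
        "input": "data/{sample}.txt",
        "output": "preprocessed/{sample}.txt",
    },
    "process_sample": {
        "input": "preprocessed/{sample}.txt",
        "output": "results/{sample}/output.txt",
    },
}

order = []
needed = "results/{sample}/output.txt"   # the requested target
while True:
    producer = next(
        (name for name, r in rules.items() if r["output"] == needed), None
    )
    if producer is None:   # no rule makes this file: it must exist on disk
        break
    order.append(producer)
    needed = rules[producer]["input"]
order.reverse()            # dependencies run first
print(order)
```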

To run this workflow with different input files or parameter values, you can specify them on the command line using the --config option. For example, to run the workflow with a different value for parameter1, you could run:

cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day2/task1
snakemake -c 1 --config parameter1=10

or use a config file:

snakemake -c 1 --configfile config.yaml

where the config.yaml contains:

parameter1: 10