Day2 Task1
Source: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/day2/task1
Data: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/day2/task/data
Environment: day2
A workflow in Snakemake is defined using a Snakefile, which is a text file that contains a set of rules. Each rule describes a step in the workflow, along with its input files, output files, and the commands necessary to run that step.
Here's an example Snakefile that defines a simple workflow that concatenates two input files:
```
rule concat:
    input:
        "file1.txt",
        "file2.txt"
    output:
        "concatenated.txt"
    shell:
        "cat {input} > {output}"
```
In this example, we've defined a single rule called concat. This rule takes two input files (file1.txt and file2.txt) and produces a single output file (concatenated.txt). The command to run this rule is specified using the shell directive, which tells Snakemake to execute the specified shell command.
To run this workflow, let's follow these steps:
- Move to the folder and activate the day2 environment:

  ```bash
  cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day2/task1
  conda activate day2
  ```
- Create a folder named my_first_wf in ~/course/wdir and move into it:

  ```bash
  cd ~/course/wdir
  mkdir -p my_first_wf
  cd my_first_wf
  ```
- Let's create some sample text files:

  ```bash
  echo "This is file1" > file1.txt
  echo "This is file2" > file2.txt
  ```
- Save this code as a file called Snakefile in ~/course/wdir/my_first_wf:

  ```
  rule concat:
      input:
          "file1.txt",
          "file2.txt"
      output:
          "concatenated.txt"
      shell:
          "cat {input} > {output}"
  ```
- Run the following command:

  ```bash
  snakemake -c 1
  ```
Snakemake will automatically determine the correct order in which to run the rules based on their input and output files, and will execute the necessary steps to produce the final output file. We should end up with a file named concatenated.txt containing the contents of file1.txt and file2.txt.
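If you want to check what Snakemake plans to do before it touches any files, you can do a dry run; the -n flag is a standard Snakemake option, and you can also name a target file explicitly instead of relying on the default:

```bash
# Show the planned jobs without executing them
snakemake -n

# Request a specific target file rather than the default target
snakemake -c 1 concatenated.txt
```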
Snakemake allows you to parameterize your workflows by defining variables that can be used to customize the behavior of your rules. For example, you might want to run your workflow with different input files or with different settings for a particular tool.
One way to do this in Snakemake is to use wildcards, which are special placeholders that can be used in your rule definitions to represent variable parts of your input or output files. You can then specify the values of these wildcards when you run your workflow.
Here's an example of a more complex Snakefile that uses wildcards to allow for variable input files:
```
samples = ["sample1", "sample2", "sample3"]

rule all:
    input:
        expand("results/{sample}/output.txt", sample=samples)

rule process_sample:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}/output.txt"
    params:
        parameter1=config.get("parameter1", 5),
        parameter2=config.get("parameter2", "abc")
    shell:
        "process_data.py {input} {params.parameter1} {params.parameter2} > {output}"
```
In this example, we've defined two rules: all and process_sample. The all rule specifies that the final output of the workflow should be a set of output files, one for each sample defined in the samples list. We use the expand function to generate a list of output file paths based on the samples list.
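The expand function is provided by Snakemake itself, so you can also try it in plain Python; this sketch assumes the snakemake package is installed:

```python
from snakemake.io import expand

samples = ["sample1", "sample2", "sample3"]

# expand substitutes each wildcard value into the pattern,
# yielding one path per sample
print(expand("results/{sample}/output.txt", sample=samples))
# ['results/sample1/output.txt',
#  'results/sample2/output.txt',
#  'results/sample3/output.txt']
```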
The process_sample rule takes a single input file from the data directory for each sample, and produces a corresponding output file in the results directory. The input file name is specified using the {sample} wildcard, which will be replaced with the actual sample name when the rule is executed.
The process_sample rule also includes two params that can be used to customize the behavior of the process_data.py script that it runs. These parameters are specified using the params directive, and can be referred to in the command using the {params.parameter1} and {params.parameter2} placeholders.
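The process_data.py script itself is not provided with this task; as a purely hypothetical sketch, it could be a small command-line script along these lines:

```python
#!/usr/bin/env python3
# Hypothetical sketch of process_data.py: it reads one input file and two
# parameters from the command line and writes its result to stdout
# (the Snakefile redirects stdout to the rule's output file).
import sys

infile = sys.argv[1]           # e.g. data/sample1.txt
parameter1 = int(sys.argv[2])  # e.g. 5
parameter2 = sys.argv[3]       # e.g. "abc"

with open(infile) as fh:
    for line in fh:
        # Placeholder "processing": tag each line with both parameters
        print(f"{parameter2}\t{parameter1}\t{line.rstrip()}")
```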
Snakemake also allows you to specify dependencies between rules, so that one rule will only be executed after another rule has successfully completed. This can be useful for cases where the output of one step is required as input for another step.
To specify dependencies between rules, you can use the input directive in your rule definitions. For example, here's a modified version of the process_sample rule that depends on a separate preprocess rule:
```
rule preprocess:
    input:
        "data/{sample}.txt"
    output:
        "preprocessed/{sample}.txt"
    shell:
        "preprocess_data.py {input} > {output}"

rule process_sample:
    input:
        "preprocessed/{sample}.txt"
    output:
        "results/{sample}/output.txt"
    params:
        parameter1=config.get("parameter1", 5),
        parameter2=config.get("parameter2", "abc")
    shell:
        "process_data.py {input} {params.parameter1} {params.parameter2} > {output}"
```
In this example, the process_sample rule now depends on the preprocess rule, which produces a preprocessed input file for each sample. The input directive in the process_sample rule now refers to the output of the preprocess rule.
When you run this workflow, Snakemake will automatically determine the correct order in which to execute the rules based on their input and output files, and will only execute a rule after all of its dependencies have been successfully completed.
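You can inspect the order Snakemake has worked out by printing the workflow's job graph; --dag is a standard Snakemake option, and rendering the graph as an image assumes Graphviz (the dot command) is installed:

```bash
# Print the job DAG in dot format and render it with Graphviz
snakemake --dag | dot -Tpng > dag.png
```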
To run this workflow with different input files or parameter values, you can specify them on the command line using the --config option. For example, to run the workflow with a different value for parameter1, you could run:
```bash
cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day2/task1
snakemake -c 1 --config parameter1=10
```
or use a config file:

```bash
snakemake -c 1 --configfile config.yaml
```
where the config.yaml contains (note the YAML key: value syntax, not key=value):

```yaml
parameter1: 10
```
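A config file can hold several keys at once; in this sketch the value given for parameter2 is just an illustration:

```yaml
parameter1: 10
parameter2: "xyz"
```

Keys that are absent simply fall back to the defaults given in the Snakefile via config.get, i.e. 5 for parameter1 and "abc" for parameter2.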