Day2 Task2

Writing a More Complex Workflow

Source: https://github.com/GeoGenetics/data-analysis-2024/reproducible-data-analysis/tree/main/day2/task2
Data: The workflow downloads the necessary data
Environment: snakemake

This Snakemake workflow is designed to process a CSV file containing population data for cities around the world, and create a histogram of population sizes for each country. The workflow is broken down into several rules, each of which performs a specific task, such as downloading the data, selecting data for a specific country, plotting a histogram, and converting the plot to a PDF format.

To begin, the download_data rule downloads the worldcitiespop.csv file containing the population data for cities around the world. The downloaded file is then used as input data for the rest of the pipeline.

Next, the select_by_country rule selects the population data for each country. It creates a separate CSV file per country by filtering the data on the two-letter country code. For example, it creates a file called fr.csv that contains population data for cities in France.

After the select_by_country rule has created a CSV file for each country, the plot_histogram rule is used to create a histogram of the population data for each country. The plot_histogram rule takes each CSV file created by the select_by_country rule as input and creates a histogram of the population data using a Python script.

Finally, the convert_to_pdf rule converts the resulting SVG files to PDF format. The plot_histogram rule creates an SVG file for each country, and the convert_to_pdf rule takes each SVG file as input and creates a PDF file with the same name.

The repository for this task contains the following files:

├── config.yaml
├── envs
│   ├── matplotlib.yaml
│   └── xsv.yaml
├── profile
│   └── config.yaml
├── scripts
│   └── plot-hist.py
└── Snakefile

The config.yaml file is used to specify the configuration parameters for the pipeline, such as the list of countries to process. The envs directory contains environment files that specify the software dependencies required for each rule.

The rule all section at the beginning of the Snakemake file specifies that the pipeline should create a histogram for each country listed in the config.yaml file. When the pipeline is run, Snakemake will execute each rule in the correct order to create the final output files.

Generating the Config and Environment Files

The config.yaml file specifies the configuration parameters for the pipeline. It contains a single key-value pair, where the key is countries and the value is a list of lowercase two-letter country codes. Here's an example config.yaml file:

countries:
  - fr
  - at
  - us

The envs directory contains environment files that specify the software dependencies required for each rule. For example, the xsv.yaml file contains the dependencies required for the select_by_country rule:

channels:
  - conda-forge
dependencies:
  - xsv=0.13

And the matplotlib.yaml file contains the dependencies required for the plot_histogram rule:

channels:
  - conda-forge
dependencies:
  - python=3.9
  - matplotlib
  - pandas

Snakemake does not detect dependencies on its own. When use-conda is enabled, it creates a Conda environment from the environment file named in each rule's conda directive and runs the rule inside that environment. This option, together with other command-line defaults, is set in the config.yaml file in the profile folder:

default-resources:
  - mem_mb=2000
  - time=480
jobs: 100
latency-wait: 60
cores: 1
restart-times: 1
max-jobs-per-second: 20
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: ilp
use-conda: True
conda-frontend: mamba
use-singularity: True

A brief description of each of these options is provided below:

default-resources: This is a list of default resources to be used by Snakemake. In this case, the resources are mem_mb (memory in megabytes) and time (time in minutes) for each job. This means that each rule will be allocated a default memory of 2000 MB and a default time limit of 480 minutes.
jobs: This specifies the maximum number of jobs that can be executed in parallel.
latency-wait: This specifies the number of seconds to wait for a job's output files to appear after the job has finished. This helps if your filesystem suffers from latency (the default is 5 seconds).
cores: This specifies the maximum number of CPU cores that Snakemake may use in parallel.
restart-times: This specifies the number of times that Snakemake will try to restart a job if it fails.
max-jobs-per-second: This specifies the maximum number of jobs that can be submitted per second.
keep-going: If set to True, Snakemake will continue executing the workflow even if some jobs fail.
rerun-incomplete: If set to True, Snakemake will re-run jobs whose output was left incomplete by a previous run.
printshellcmds: If set to True, Snakemake will print the shell commands that it executes.
scheduler: This specifies the scheduler to be used by Snakemake. In this case, the ilp scheduler aims to reduce runtime and disk usage by making the best possible use of resources.
use-conda: If set to True, Snakemake will use Conda to manage software dependencies.
conda-frontend: This specifies the Conda package manager to be used by Snakemake. In this case, mamba is used.
use-singularity: If set to True, Snakemake will use Singularity to execute jobs in containers.
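
Each key in the profile corresponds to a long-form command-line option of the same name, so the profile is essentially a saved set of flags. The sketch below illustrates that mapping for a subset of the options above. The helper profile_to_flags is hypothetical and not part of Snakemake, which performs this translation internally.

```python
def profile_to_flags(profile):
    """Translate profile entries into the equivalent command-line flags.

    Hypothetical helper for illustration only; Snakemake does this itself
    when it reads profile/config.yaml.
    """
    flags = []
    for key, value in profile.items():
        option = f"--{key}"
        if value is True:
            # Boolean switches appear without a value.
            flags.append(option)
        elif value is False:
            # A switch set to False is simply left out.
            continue
        elif isinstance(value, list):
            # List values follow the option as separate arguments.
            flags.append(option)
            flags.extend(str(item) for item in value)
        else:
            flags.extend([option, str(value)])
    return flags


# A subset of the profile shown above.
profile = {
    "default-resources": ["mem_mb=2000", "time=480"],
    "jobs": 100,
    "keep-going": True,
    "use-conda": True,
    "conda-frontend": "mamba",
}
print(" ".join(profile_to_flags(profile)))
```

Keeping these settings in a profile means the command used to start the workflow stays short and is the same for everyone who runs it.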

The Rule All

The rule all section at the beginning of the Snakemake file defines the final output files that should be generated by the workflow. The input directive specifies the files to be created, and the expand() function fills a file name template with a list of values. In this case, the template is "plots/{country}.hist.pdf" and the values are the countries listed in the config.yaml file. The {country} placeholder is replaced with each country code in turn, resulting in one output file name per country.

rule all:
    input:
        expand(
            "plots/{country}.hist.pdf",
            country=config.get("countries")
        )
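
To see what rule all actually requests, the following plain-Python sketch mimics the simplest use of expand(): one pattern and one list of values. The function expand_sketch is a stand-in written for this example, not Snakemake's implementation, and the config dictionary mirrors the example config.yaml shown earlier.

```python
def expand_sketch(pattern, **wildcards):
    """Mimic expand() for a single wildcard with a list of values."""
    # Exactly one keyword argument is expected, e.g. country=[...].
    (name, values), = wildcards.items()
    return [pattern.format(**{name: value}) for value in values]


# Stand-in for the dictionary Snakemake builds from config.yaml.
config = {"countries": ["fr", "at", "us"]}

targets = expand_sketch("plots/{country}.hist.pdf", country=config.get("countries"))
print(targets)
```

Note that config.get("countries") returns None instead of raising a KeyError when the key is missing, whereas config["countries"] would fail immediately in that case.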

When Snakemake runs, it checks whether all of the files specified in the input directive exist or can be created by the rules in the workflow. If any of the files are missing or out of date, Snakemake will execute the necessary rules to generate them.
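
The "missing or out of date" check can be pictured as a comparison of file modification times. The sketch below is a simplification under that assumption: the function needs_rerun is hypothetical, and Snakemake itself may also consider other triggers (for example changed rule code or parameters, depending on the version).

```python
import os
import tempfile


def needs_rerun(output_path, input_paths):
    """Return True if the output is missing or older than any input."""
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(path) > output_mtime for path in input_paths)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "fr.csv")
    target = os.path.join(workdir, "fr.hist.svg")

    with open(source, "w") as handle:
        handle.write("Country,City,Population\n")
    missing = needs_rerun(target, [source])  # target does not exist yet

    with open(target, "w") as handle:
        handle.write("<svg/>")
    # Set explicit timestamps so the comparison does not depend on timing.
    os.utime(source, (1000, 1000))
    os.utime(target, (2000, 2000))
    fresh = needs_rerun(target, [source])

    os.utime(source, (3000, 3000))  # input modified after the output
    stale = needs_rerun(target, [source])

print(missing, fresh, stale)
```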

The Rule download_data

The rule download_data section in the Snakemake file specifies how to download the data required for the workflow. This rule has no input files, and its only output is the worldcitiespop.csv file, which will be downloaded using the curl command.

rule download_data:
    output:
        "data/worldcitiespop.csv"
    shell:
        "curl -L https://burntsushi.net/stuff/worldcitiespop.csv > {output}"

The output directive specifies the file name and location of the output file that will be created by this rule. In this case, the worldcitiespop.csv file will be saved in the data directory.

The shell directive specifies the command that will be executed to download the file. The {output} placeholder will be replaced with the path to the output file specified in the output directive.

When this rule is executed by Snakemake, the curl command will download the file from the specified URL and save it to the output file path.

The Rule select_by_country

The rule select_by_country section in the Snakemake file specifies how to select the data for a specific country from the worldcitiespop.csv file. The selected data will be saved to a new file in the by-country directory.

rule select_by_country:
    input:
        "data/worldcitiespop.csv"
    output:
        "by-country/{country}.csv"
    conda:
        "envs/xsv.yaml"
    shell:
        "xsv search -s Country '{wildcards.country}' "
        "{input} > {output}"

The input directive specifies the file name and location of the input file that will be used by this rule. In this case, it is the worldcitiespop.csv file located in the data directory.

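The output directive specifies the file name and location of the output file that will be created by this rule. {country} is a wildcard: Snakemake fills it in by matching the pattern against a file that another rule (or the command line) requests. For example, a request for by-country/fr.csv sets country to fr. Because rule all ultimately requests one file per entry in config.yaml, the wildcard takes each configured country code in turn. The output file is saved in the by-country directory.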

The conda directive specifies the path to the environment file that describes the software needed to execute the rule. In this case, it is envs/xsv.yaml, from which Snakemake creates the Conda environment.

The shell directive specifies the command that will be executed to select the data for the specified country. The {wildcards.country} placeholder will be replaced with the country code matched by the {country} wildcard in the output file name. The {input} and {output} placeholders will be replaced with the paths given in the input and output directives, so the selected data is written to the output file.
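
Two details of this shell directive are easy to miss: the adjacent string literals are joined by Python into a single command (which is why the first one ends with a space), and the placeholders are filled in much like Python's str.format. The sketch below imitates this with plain Python. The SimpleNamespace object is only a stand-in for the wildcards object that Snakemake provides.

```python
from types import SimpleNamespace

# Adjacent string literals are concatenated by Python, exactly as in the rule.
command_template = (
    "xsv search -s Country '{wildcards.country}' "
    "{input} > {output}"
)

# Stand-in for the values Snakemake derives for the target by-country/fr.csv.
wildcards = SimpleNamespace(country="fr")
command = command_template.format(
    wildcards=wildcards,
    input="data/worldcitiespop.csv",
    output="by-country/fr.csv",
)
print(command)
```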

When this rule is executed by Snakemake, the xsv search command will select the data for the specified country from the input file and save it to the output file.
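
For readers unfamiliar with xsv, the selection step can be approximated in plain Python. This is a sketch, not what the workflow runs: select_by_country is a hypothetical function, the sample rows are made up for illustration, and it assumes a Country column holding lowercase country codes. Like xsv search, it treats the pattern as a regular expression and keeps the header row.

```python
import csv
import io
import re


def select_by_country(csv_text, pattern):
    """Keep rows whose Country field matches the regular expression."""
    regex = re.compile(pattern)
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if regex.search(row["Country"]):
            writer.writerow(row)
    return out.getvalue()


# Made-up sample rows, not taken from worldcitiespop.csv.
sample = (
    "Country,City,Population\n"
    "fr,city-a,1000\n"
    "at,city-b,2000\n"
    "fr,city-c,3000\n"
)
print(select_by_country(sample, "fr"))
```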

The Rules plot_histogram and convert_to_pdf

The last two rules in the Snakemake workflow are plot_histogram and convert_to_pdf.

plot_histogram creates a histogram for each country's population using the CSV file produced by select_by_country as input. The resulting SVG files are stored in the plots directory.

rule plot_histogram:
    input:
        "by-country/{country}.csv"
    output:
        "plots/{country}.hist.svg"
    conda:
        "envs/matplotlib.yaml"
    script:
        "scripts/plot-hist.py"

The rule defines an input file, which is a CSV file for a specific country generated by select_by_country. The output file is an SVG file that will be saved in the plots directory with the name of the corresponding country and the .hist.svg extension.

The conda directive specifies the environment required to run the script plot-hist.py, which is defined in the envs/matplotlib.yaml file.

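plot-hist.py is a Python script that takes the input CSV file, extracts the population data, and creates a histogram using the Matplotlib library. Inside a script run through the script directive, Snakemake provides a snakemake object that gives access to the rule's input and output paths, so the script does not need to parse any arguments. The resulting SVG file is saved in the plots directory. The script is defined in the scripts/plot-hist.py file and the code is shown below.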

import matplotlib.pyplot as plt
import pandas as pd

cities = pd.read_csv(snakemake.input[0])
plt.hist(cities["Population"], bins=50)
plt.savefig(snakemake.output[0])

convert_to_pdf takes the SVG files produced by plot_histogram and converts them to PDF format.

rule convert_to_pdf:
    input:
        "{prefix}.svg"
    output:
        "{prefix}.pdf"
    wrapper:
        "0.47.0/utils/cairosvg"

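The input files are SVG files produced by the plot_histogram rule. The output files are PDF files with the same names as the SVG files. The wrapper directive points to a reusable wrapper from the Snakemake wrappers repository, here utils/cairosvg, which uses CairoSVG to perform the conversion. The prefix 0.47.0 is the release of the wrappers repository to use, not the version of CairoSVG.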
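
Because convert_to_pdf uses the generic {prefix} wildcard, it can produce any requested .pdf file from an .svg file of the same name. The sketch below shows, with a deliberately simplified matcher, how a requested target determines the wildcard value and hence the input. The function match_pattern is hypothetical; Snakemake's own matching is more elaborate (for example, it supports wildcard constraints).

```python
import re


def match_pattern(pattern, path):
    """Return the wildcard values if path fits pattern, else None."""
    # Split on {name} placeholders; odd-indexed parts are wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2:
            regex += f"(?P<{part}>.+)"
        else:
            regex += re.escape(part)
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None


# The target requested by rule all for France.
pdf_wildcards = match_pattern("{prefix}.pdf", "plots/fr.hist.pdf")
svg_input = "{prefix}.svg".format(**pdf_wildcards)

# That input is in turn an output of plot_histogram.
svg_wildcards = match_pattern("plots/{country}.hist.svg", svg_input)
print(pdf_wildcards, svg_input, svg_wildcards)
```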

When executed, Snakemake will first run the download_data rule to download the required CSV file. Then, it will execute the select_by_country rule for each country specified in the config.yaml file, creating a CSV file for each country in the by-country directory. Next, it will execute the plot_histogram rule for each of these CSV files, creating a histogram for each country in the plots directory. Finally, it will execute the convert_to_pdf rule for each SVG file, converting them to PDF format in the same directory.
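
The execution order described above follows from the file dependencies alone. As a rough illustration, the sketch below writes out the dependencies for a single country by hand and lets Python's graphlib module (Python 3.9 or newer) compute an order that respects them. Snakemake derives this graph itself from the rules; the dictionary here is only an assumed, hand-written copy of it.

```python
from graphlib import TopologicalSorter

# Each file maps to the files it is built from (for the country "fr").
dependencies = {
    "plots/fr.hist.pdf": {"plots/fr.hist.svg"},
    "plots/fr.hist.svg": {"by-country/fr.csv"},
    "by-country/fr.csv": {"data/worldcitiespop.csv"},
    "data/worldcitiespop.csv": set(),
}

# static_order() yields every file after all of its prerequisites.
build_order = list(TopologicalSorter(dependencies).static_order())
print(build_order)
```

With several countries, the per-country chains share only the download step, which is why the remaining jobs can run in parallel once the data file exists.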

At the end of the workflow, the output will be a set of PDF files representing histograms of population data for each country specified in the config.yaml file.

You can run it as follows:

snakemake --configfile config.yaml --profile profile
