Day2 Task2
Source: https://github.com/GeoGenetics/data-analysis-2024/reproducible-data-analysis/tree/main/day2/task2
Data: The workflow downloads the necessary data
Environment: day2
This Snakemake workflow is designed to process a CSV file containing population data for cities around the world, and create a histogram of population sizes for each country. The workflow is broken down into several rules, each of which performs a specific task, such as downloading the data, selecting data for a specific country, plotting a histogram, and converting the plot to a PDF format.
To begin, the download_data rule downloads the worldcitiespop.csv file, which contains population data for cities around the world. The downloaded file is the input for the rest of the pipeline.
Next, the select_by_country rule selects the population data for each country. It creates a separate CSV file per country by filtering the data on the country code. For example, it creates a file called fr.csv that contains population data for cities in France.
After the select_by_country rule has created a CSV file for each country, the plot_histogram rule takes each of these files as input and creates a histogram of the population data using a Python script.
Finally, the convert_to_pdf rule converts the resulting SVG files to PDF format. The plot_histogram rule creates an SVG file for each country, and the convert_to_pdf rule takes each SVG file as input and creates a PDF file with the same name.
The repository for this task contains the following files:
├── config.yaml
├── envs
│   ├── matplotlib.yaml
│   └── xsv.yaml
├── profile
│   └── config.yaml
├── scripts
│   └── plot-hist.py
└── Snakefile
The config.yaml file is used to specify the configuration parameters for the pipeline, such as the list of countries to process. The envs directory contains environment files that specify the software dependencies required for each rule.
The rule all section at the beginning of the Snakemake file specifies that the pipeline should create a histogram for each country listed in the config.yaml file. When the pipeline is run, Snakemake will execute each rule in the correct order to create the final output files.
The config.yaml file contains a single key-value pair, where the key is countries and the value is a list of two-letter country codes. Here's an example config.yaml file:
countries:
  - fr
  - at
  - us
Each file in the envs directory lists the software dependencies of one rule. For example, the xsv.yaml file contains the dependencies required for the select_by_country rule:
channels:
  - conda-forge
dependencies:
  - xsv=0.13
And the matplotlib.yaml file contains the dependencies required for the plot_histogram rule:
channels:
  - conda-forge
dependencies:
  - python=3.9
  - matplotlib
  - pandas
Snakemake does not detect these dependencies by itself. When it runs with use-conda enabled, it creates a separate Conda environment for each rule from the environment file named in that rule's conda directive, and runs the rule inside it. This option and the other execution settings are defined in the config.yaml file in the profile folder:
default-resources:
  - mem_mb=2000
  - time=480
jobs: 100
latency-wait: 60
cores: 1
restart-times: 1
max-jobs-per-second: 20
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: ilp
use-conda: True
conda-frontend: mamba
use-singularity: True
A brief description of each of these options is provided below:
- default-resources: This is a list of default resources to be used by Snakemake. In this case, the resources are mem_mb (memory in megabytes) and time (time in minutes) for each job. This means that each rule will be allocated a default memory of 2000 MB and a default time limit of 480 minutes.
- jobs: This specifies the maximum number of jobs that can be executed in parallel.
- latency-wait: The number of seconds to wait for a job's output files to appear after the job has finished. This helps if your filesystem suffers from latency (the default is 5).
- cores: The maximum number of CPU cores to use in parallel, here 1.
- restart-times: This specifies the number of times that Snakemake will try to restart a job if it fails.
- max-jobs-per-second: This specifies the maximum number of jobs that can be submitted per second.
- keep-going: If set to True, Snakemake will continue with independent jobs even if some jobs fail.
- rerun-incomplete: If set to True, Snakemake will re-run jobs whose output from a previous run is incomplete.
- printshellcmds: If set to True, Snakemake will print the shell commands that it executes.
- scheduler: This specifies the scheduler to be used by Snakemake. In this case, the ilp scheduler selects jobs by solving an integer linear program, aiming to reduce runtime and disk usage by making the best possible use of resources.
- use-conda: If set to True, Snakemake will use Conda to manage software dependencies.
- conda-frontend: This specifies the Conda package manager to be used by Snakemake. In this case, mamba is used.
- use-singularity: If set to True, Snakemake will use Singularity to execute jobs in containers.
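Each key in a profile corresponds to the long command-line option of the same name, so the profile can be read as a saved set of snakemake arguments. The following is a minimal sketch of that correspondence; profile_to_args is a hypothetical helper written for illustration and is not part of Snakemake, which reads the profile itself.

```python
def profile_to_args(profile):
    """Translate profile keys into the equivalent long command-line options."""
    args = []
    for key, value in profile.items():
        flag = f"--{key}"
        if value is True:
            args.append(flag)                  # boolean switch, no value
        elif value is False or value is None:
            continue                           # switch left off
        elif isinstance(value, list):
            args.append(flag)                  # one flag, several values
            args.extend(str(item) for item in value)
        else:
            args.extend([flag, str(value)])
    return args


# A subset of the profile shown above, written as a Python dict.
profile = {
    "default-resources": ["mem_mb=2000", "time=480"],
    "jobs": 100,
    "latency-wait": 60,
    "keep-going": True,
    "conda-frontend": "mamba",
}

args = profile_to_args(profile)
print(" ".join(args))
```

Keeping these settings in a profile means the run command stays short and everyone in the course uses the same options.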
The rule all section at the beginning of the Snakefile defines the final output files that the workflow should generate. The input directive lists the files to be created, and the expand() function fills the file name pattern with each value supplied for its wildcards. Here, the {country} wildcard is replaced with each country code listed in config.yaml, resulting in one PDF file name per country.
rule all:
    input:
        expand(
            "plots/{country}.hist.pdf",
            country=config.get("countries")
        )
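To see what the expand() call produces, here is a simplified pure-Python stand-in. expand_sketch is a hypothetical name, and the config dict below is assumed to mirror the example config.yaml; inside a Snakefile you would use Snakemake's own expand() instead.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# Stand-in for the parsed example config.yaml.
config = {"countries": ["fr", "at", "us"]}

targets = expand_sketch("plots/{country}.hist.pdf", country=config.get("countries"))
print(targets)
```

The result is a plain list of file names, which is all that rule all needs as its input.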
When Snakemake runs, it checks whether all of the files specified in the input directive exist or can be created by the rules in the workflow. If any of the files are missing or out of date, Snakemake will execute the necessary rules to generate them.
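The "missing or out of date" test can be pictured as a comparison of file modification times. The sketch below is a deliberately simplified illustration under that assumption; needs_rebuild is a hypothetical helper, and Snakemake's actual checks are more involved (recent versions can also react to changes in code and parameters).

```python
import os
import tempfile


def needs_rebuild(output, inputs):
    """Return True if `output` is missing or older than any of `inputs`."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


with tempfile.TemporaryDirectory() as tmp:
    svg = os.path.join(tmp, "fr.hist.svg")
    pdf = os.path.join(tmp, "fr.hist.pdf")

    open(svg, "w").close()
    missing = needs_rebuild(pdf, [svg])   # the PDF does not exist yet

    open(pdf, "w").close()
    os.utime(svg, (1_000, 1_000))         # pretend the SVG is old
    os.utime(pdf, (2_000, 2_000))         # and the PDF is newer
    fresh = needs_rebuild(pdf, [svg])

    os.utime(svg, (3_000, 3_000))         # the SVG changed after the PDF
    stale = needs_rebuild(pdf, [svg])

print(missing, fresh, stale)
```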
The rule download_data section in the Snakefile specifies how to download the data required for the workflow. This rule has no input files, and its only output is the worldcitiespop.csv file, which is downloaded using the curl command.
rule download_data:
    output:
        "data/worldcitiespop.csv"
    shell:
        "curl -L https://burntsushi.net/stuff/worldcitiespop.csv > {output}"
The output directive specifies the name and location of the file created by this rule. In this case, the worldcitiespop.csv file is saved in the data directory.
The shell directive specifies the command that is executed to download the file. The {output} placeholder is replaced with the path given in the output directive.
When this rule is executed by Snakemake, the curl command will download the file from the specified URL and save it to the output file path.
The rule select_by_country section in the Snakefile specifies how to select the data for a specific country from the worldcitiespop.csv file. The selected data is saved to a new file in the by-country directory.
rule select_by_country:
    input:
        "data/worldcitiespop.csv"
    output:
        "by-country/{country}.csv"
    conda:
        "envs/xsv.yaml"
    shell:
        "xsv search -s Country '{wildcards.country}' "
        "{input} > {output}"
The input directive specifies the file name and location of the input file used by this rule. In this case, it is the worldcitiespop.csv file located in the data directory.
The output directive specifies the file name and location of the output file created by this rule. The {country} wildcard is filled in from the file that is requested: for a country listed in config.yaml, such as fr, the workflow needs by-country/fr.csv, so {country} takes the value fr.
The conda directive specifies the path to the environment file that defines the Conda environment in which the rule is executed. In this case, it is envs/xsv.yaml.
The shell directive specifies the command that selects the data for the requested country. The {wildcards.country} placeholder is replaced with the value of the country wildcard, and {input} and {output} are replaced with the paths given in the input and output directives. The -s Country option restricts the search to the Country column.
When this rule is executed by Snakemake, the xsv search command selects the rows for the requested country from the input file and writes them to the output file.
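For readers who do not know xsv, the selection step can be approximated in plain Python. This is only a sketch of the idea, not how the workflow does it: the function name is reused here for illustration, the sample rows are made up, and the City column is an assumption (the text above only confirms the Country and Population columns). Like xsv search, it treats the country code as a regular expression.

```python
import csv
import io
import re


def select_by_country(rows, country, column="Country"):
    """Yield rows whose `column` matches the regular expression `country`."""
    pattern = re.compile(country)
    for row in rows:
        if pattern.search(row[column]):
            yield row


# Made-up sample rows standing in for data/worldcitiespop.csv.
sample = io.StringIO(
    "Country,City,Population\n"
    "fr,city_a,1000\n"
    "at,city_b,2000\n"
    "fr,city_c,3000\n"
)

reader = csv.DictReader(sample)
selected = list(select_by_country(reader, "fr"))

# Write the selection back out as CSV, header included.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(selected)
print(out.getvalue(), end="")
```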
The last two rules in the Snakemake workflow are plot_histogram and convert_to_pdf.
plot_histogram creates a histogram of each country's city populations, using the CSV file produced by select_by_country as input. The resulting SVG files are stored in the plots directory.
rule plot_histogram:
    input:
        "by-country/{country}.csv"
    output:
        "plots/{country}.hist.svg"
    conda:
        "envs/matplotlib.yaml"
    script:
        "scripts/plot-hist.py"
The rule takes as input the CSV file for a specific country generated by select_by_country. The output is an SVG file saved in the plots directory, named after the corresponding country code with the .hist.svg extension.
The conda directive specifies the environment required to run the plot-hist.py script, which is defined in the envs/matplotlib.yaml file.
plot-hist.py is a Python script that reads the input CSV file, extracts the population data, and creates a histogram using the Matplotlib library. The script is stored in scripts/plot-hist.py, and Snakemake makes the rule's input and output paths available to it through the snakemake object. The code is shown below.
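import matplotlib.pyplot as plt
import pandas as pd

# The snakemake object is injected by Snakemake when the rule runs this script.
cities = pd.read_csv(snakemake.input[0])

# Histogram of city populations in 50 bins, saved to the rule's output path.
plt.hist(cities["Population"], bins=50)
plt.savefig(snakemake.output[0])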
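To make the bins=50 argument more concrete, the sketch below counts values in equal-width bins using only the standard library. It is an illustration of the binning idea, not a replacement for plt.hist: equal_width_counts is a hypothetical helper, the sample rows are made up, and edge cases may be handled differently by Matplotlib.

```python
import csv
import io


def equal_width_counts(values, bins=50):
    """Count values in `bins` equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in values:
        index = int((value - lo) / width) if width else 0
        counts[min(index, bins - 1)] += 1   # the maximum falls in the last bin
    return counts


# Made-up rows; an empty Population field stands for a missing value.
sample = io.StringIO(
    "Country,City,Population\n"
    "fr,city_a,0\n"
    "fr,city_b,1\n"
    "fr,city_c,\n"
    "fr,city_d,4\n"
    "fr,city_e,10\n"
)
populations = [
    float(row["Population"])
    for row in csv.DictReader(sample)
    if row["Population"]                    # skip rows without a population
]
counts = equal_width_counts(populations, bins=5)
print(counts)
```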
convert_to_pdf takes the SVG files produced by plot_histogram and converts them to PDF format.
rule convert_to_pdf:
    input:
        "{prefix}.svg"
    output:
        "{prefix}.pdf"
    wrapper:
        "0.47.0/utils/cairosvg"
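The input files are the SVG files produced by the plot_histogram rule, and the output files are PDF files with the same names. Because {prefix} matches the whole path up to the extension, a request for plots/fr.hist.pdf makes the rule look for plots/fr.hist.svg. The wrapper directive points to the utils/cairosvg wrapper from the Snakemake wrappers repository, which performs the conversion with CairoSVG. Note that 0.47.0 is the version of the wrappers repository, not of CairoSVG.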
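The way a requested file name determines the wildcard value can be sketched with a regular expression. This is a simplified illustration that assumes every wildcard matches one or more arbitrary characters; match_output is a hypothetical helper, not a Snakemake function.

```python
import re


def match_output(pattern, target):
    """Return the wildcard values if `target` fits `pattern`, else None."""
    regex = ""
    for part in re.split(r"(\{\w+\})", pattern):
        if part.startswith("{") and part.endswith("}"):
            regex += f"(?P<{part[1:-1]}>.+)"   # wildcard: capture by name
        else:
            regex += re.escape(part)           # literal text
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


# The requested PDF fixes the value of {prefix} ...
wildcards = match_output("{prefix}.pdf", "plots/fr.hist.pdf")
# ... which in turn fixes the SVG file the rule needs as input.
svg_input = "{prefix}.svg".format(**wildcards)
print(wildcards, svg_input)
```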
When executed, Snakemake first runs the download_data rule once to download the required CSV file. It then executes the select_by_country rule for each country listed in config.yaml, creating a CSV file per country in the by-country directory. Next, it executes the plot_histogram rule for each of these CSV files, creating an SVG histogram per country in the plots directory. Finally, it executes the convert_to_pdf rule for each SVG file, writing the PDF to the same directory.
At the end of the workflow, the output is a set of PDF files, one histogram of city populations for each country listed in config.yaml.
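The execution order follows from working backwards from the requested PDF files. The sketch below illustrates that idea with a tiny planner; it is not Snakemake's actual algorithm, the RULES table is transcribed by hand from the rules described above, and plan and to_regex are hypothetical helpers.

```python
import re

# Rule name -> (output pattern, input patterns).
RULES = {
    "download_data": ("data/worldcitiespop.csv", []),
    "select_by_country": ("by-country/{country}.csv", ["data/worldcitiespop.csv"]),
    "plot_histogram": ("plots/{country}.hist.svg", ["by-country/{country}.csv"]),
    "convert_to_pdf": ("{prefix}.pdf", ["{prefix}.svg"]),
}


def to_regex(pattern):
    """Turn an output pattern into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    # Even indices are literal text, odd indices are wildcard names.
    return "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)" for i, p in enumerate(parts)
    )


def plan(target, jobs=None):
    """Return the (rule, output) jobs needed for `target`, dependencies first."""
    jobs = [] if jobs is None else jobs
    for rule, (output, inputs) in RULES.items():
        found = re.fullmatch(to_regex(output), target)
        if found is None:
            continue
        for pattern in inputs:
            plan(pattern.format(**found.groupdict()), jobs)
        if (rule, target) not in jobs:      # shared jobs are scheduled once
            jobs.append((rule, target))
        return jobs
    raise ValueError(f"no rule produces {target}")


jobs = []
for country in ["fr", "at"]:
    plan(f"plots/{country}.hist.pdf", jobs)
for job in jobs:
    print(*job)
```

For two countries this yields seven jobs: one download, followed by a select, plot, and convert job for each country.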
You can run it as follows:
cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day2/task2
snakemake --configfile config.yaml --profile profile