Skip to content

PacificBiosciences/HiFi-human-assembly-WDL

Repository files navigation

PacBio Human Assembly Pipeline

Workflow for running de novo assembly using human PacBio whole genome sequencing (WGS) data. Written using Workflow Description Language (WDL).

  • Docker images used by these workflows are defined here.
  • Common tasks that may be reused within or between workflows are defined here.

Workflow

Workflow entrypoint: workflows/main.wdl

The assembly workflow performs de novo assembly on samples and trios.

De novo assembly workflow diagram

Setup

Clone a tagged version of the git repository. Use the --branch flag to pull the desired version, and the --recursive flag to pull code from any submodules.

git clone \
  --depth 1 --branch v1.0.2 \  # for reproducibility
  --recursive \                # to clone submodule
  https://github.com/PacificBiosciences/HiFi-human-assembly-WDL.git

Resource requirements

The workflow requires at minimum 48 cores and 288 GB of RAM. Ensure that the backend environment you're using has enough quota to run the workflow.

Reference datasets and associated workflow files

Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and template inputs files for each backend with paths to publicly hosted reference files filled out.

Running the workflow

  1. Select a backend environment
  2. Configure a workflow execution engine in the chosen environment
  3. Fill out the inputs JSON file for your cohort
  4. Run the workflow

Selecting a backend

The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will largely be determined by the location of your data.

For backend-specific configuration, see the relevant documentation:

Configuring a workflow engine and container runtime

An execution engine is required to run workflows. Two popular engines for running WDL-based workflows are miniwdl and Cromwell.

Because workflow dependencies are containerized, a container runtime is required. This workflow has been tested with Docker and Singularity container runtimes.

See backend-specific documentation for details on setting up an engine.

Engine Azure AWS GCP HPC
miniwdl Unsupported Supported via the Amazon Genomics CLI Unsupported (SLURM only) Supported via the miniwdl-slurm plugin
Cromwell Supported via Cromwell on Azure Supported via the Amazon Genomics CLI Supported via Google's Pipelines API Supported - Configuration varies depending on HPC infrastructure

Filling out the inputs JSON

The input to a workflow run is defined in JSON format. Template input files with reference dataset information filled out are available for each backend:

Using the appropriate inputs template file, fill in the cohort and sample information (see Workflow Inputs for more information on the input structure).

If using an HPC backend, you will need to download the reference bundle and replace the <local_path_prefix> in the input template file with the local path to the reference datasets on your HPC.

Running the workflow

Run the workflow using the engine and backend that you have configured (miniwdl, Cromwell).

Note that the calls to miniwdl and Cromwell assume you are accessing the engine directly on the machine on which it has been deployed. Depending on the backend you have configured, you may be able to submit workflows using different methods (e.g. using trigger files in Azure, or using the Amazon Genomics CLI in AWS).

Run directly using miniwdl

miniwdl run workflows/main.wdl -i <input_file_path.json>

Run directly using Cromwell

java -jar <cromwell_jar_path> run workflows/main.wdl -i <input_file_path.json>

If Cromwell is running in server mode, the workflow can be submitted using cURL. Fill in the values of CROMWELL_URL and INPUTS_JSON below, then from the root of the repository, run:

# The base URL (and port, if applicable) of your Cromwell server
CROMWELL_URL=
# The path to your inputs JSON file
INPUTS_JSON=

(cd workflows && zip -r dependencies.zip assembly_structs.wdl assemble_genome/ de_novo_assembly_sample/ de_novo_assembly_trio/ wdl-common/)
curl -X "POST" \
  "${CROMWELL_URL}/api/workflows/v1" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "workflowSource=@workflows/main.wdl" \
  -F "workflowInputs=@${INPUTS_JSON};type=application/json" \
  -F "workflowDependencies=@workflows/dependencies.zip;type=application/zip"

To specify workflow options, add the following to the request (assuming your options file is a file called options.json located in the pwd): -F "workflowOptions=@options.json;type=application/json".

Workflow inputs

This section describes the inputs required for a run of the workflow. Typically, only the de_novo_assembly.cohort and potentially run/backend-specific sections will be filled out by the user for each run of the workflow. Input templates with reference file locations filled out are provided for each backend.

A cohort can include one or more samples. Samples need not be related.

Type Name Description Notes
String cohort_id A unique name for the cohort; used to name outputs. Alphanumeric characters, underscore (_), and dash (-) are allowed.
Array[Sample] samples The set of samples for the cohort. At least one sample must be defined.
Boolean run_de_novo_assembly_trio Run trio binned de novo assembly. Cohort must contain at least one valid trio (child and both parents present in the cohort)

Sample information for each sample in the workflow run.

Type Name Description Notes
String sample_id A unique name for the sample; used to name outputs. Alphanumeric characters, underscore (_), and dash (-) are allowed
Array[IndexData] movie_bams The set of unaligned movie BAMs associated with this sample
String? father_id Paternal sample_id. Alphanumeric characters, underscore (_), and dash (-) are allowed.
String? mother_id Maternal sample_id. Alphanumeric characters, underscore (_), and dash (-) are allowed.
Boolean run_de_novo_assembly If true, run single-sample de novo assembly for this sample [true, false]

Array of references and their associated names and indices.

These files are hosted publicly in each of the cloud backends; see backends/${backend}/inputs.${backend}.json.

Type Name Description Notes
String name Reference name; used to name outputs (e.g., "GRCh38")
IndexData fasta Reference genome and associated index

Other inputs

Type Name Description Notes
String backend Backend where the workflow will be executed ["Azure", "AWS", "GCP", "HPC"]
String? zones Zones where compute will take place; required if backend is set to 'AWS' or 'GCP'.
String? aws_spot_queue_arn Queue ARN for the spot batch queue; required if backend is set to 'AWS' and preemptible is set to true Determining the AWS queue ARN
String? aws_on_demand_queue_arn Queue ARN for the on demand batch queue; required if backend is set to 'AWS' and preemptible is set to false Determining the AWS queue ARN
String? container_registry Container registry where workflow images are hosted. If left blank, PacBio's public Quay.io registry will be used.
Boolean preemptible If set to true, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC. [true, false]

Workflow outputs

De novo assembly - sample

These files will be output if cohort.samples[sample] is set to true for any sample.

Type Name Description Notes
Array[Array[File]?] zipped_assembly_fastas De novo dual assembly generated by hifiasm
Array[Array[File]?] assembly_noseq_gfas Assembly graphs in GFA format.
Array[Array[File]?] assembly_lowQ_beds Coordinates of low quality regions in BED format.
Array[Array[File]?] assembly_stats Assembly size and NG50 stats generated by calN50.
Array[Array[IndexData?]] asm_bam minimap2 alignment of assembly to reference.
Array[Array[IndexData?]] paftools_vcf calls variants from coordinate-sorted assembly-to-reference alignment. It calls variants from the cs tag and identifies confident/callable regions as those covered by exactly one contig paftools
Array[Array[File?]] paftools_vcf_stats bcftools stats summary statistics for paftools variant calls

De novo assembly - trio

These files will be output if cohort.de_novo_assembly_trio is set to true and there is at least one parent-parent-kid trio in the cohort.

Type Name Description Notes
Array[Array[File]]? trio_zipped_assembly_fastas Haplotype-resolved de novo assembly of the trio kid generated by hifiasm with trio binning
Array[Array[File]]? trio_assembly_noseq_gfas Assembly graphs in GFA format.
Array[Array[File]]? trio_assembly_lowQ_beds Coordinates of low quality regions in BED format.
Array[Array[File]]? trio_assembly_stats Assembly size and NG50 stats generated by calN50.
Array[Array[IndexData]?] trio_asm_bams minimap2 alignment of assembly to reference.
Array[Map[String, String]]? haplotype_key Indication of which haplotype (hap1/hap2) corresponds to which parent.

Tool versions and Docker images

Docker images definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio's quay.io. Docker images used in the workflow are pegged to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task. Images can be referenced in the following table by looking for the name after the final / character and before the @sha256:.... For example, the image referred to here is "align_hifiasm":

~{runtime_attributes.container_registry}/align_hifiasm@sha256:3968cb<...>b01f80fe

Image Major tool versions Links
align_hifiasm Dockerfile
bcftools Dockerfile
gfatools Dockerfile
hifiasm Dockerfile
htslib Dockerfile
paftools Dockerfile
pyyaml
  • python 3.8.10; custom scripts
Dockerfile
samtools Dockerfile
yak Dockerfile

DISCLAIMER

TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.