Workflow for running de novo assembly using human PacBio whole genome sequencing (WGS) data. Written using Workflow Description Language (WDL).
- Docker images used by these workflows are defined in the wdl-dockerfiles repository.
- Common tasks that may be reused within or between workflows are defined in wdl-common.
Workflow entrypoint: workflows/main.wdl
The assembly workflow performs de novo assembly on samples and trios.
Clone a tagged version of the git repository. Use the --branch flag to pull the desired version, and the --recursive flag to pull code from any submodules.
# --branch pins a tagged release for reproducibility; --recursive also clones submodules
git clone \
  --depth 1 --branch v1.0.2 \
  --recursive \
  https://github.com/PacificBiosciences/HiFi-human-assembly-WDL.git
The workflow requires at minimum 48 cores and 288 GB of RAM. Ensure that the backend environment you're using has enough quota to run the workflow.
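For a single Linux host (for example, an HPC node), the minimums above can be checked with a short script. This is only a sketch: the function name `meets_minimum` is hypothetical, the probe assumes `nproc` and `/proc/meminfo` are available, and it says nothing about cloud quotas, which must be checked in the provider's console or CLI.

```shell
# Minimums quoted in this README.
MIN_CORES=48
MIN_MEM_GB=288

# Succeeds (exit status 0) only if both values meet the minimums.
# Usage: meets_minimum <cores> <mem_gb>
meets_minimum() {
  [ "$1" -ge "$MIN_CORES" ] && [ "$2" -ge "$MIN_MEM_GB" ]
}

# Linux-only probe of the current host: nproc for cores,
# MemTotal (reported in kB) from /proc/meminfo for memory.
if command -v nproc >/dev/null 2>&1 && [ -r /proc/meminfo ]; then
  host_cores=$(nproc)
  host_mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
  if meets_minimum "$host_cores" "$host_mem_gb"; then
    echo "OK: ${host_cores} cores, ${host_mem_gb} GiB RAM"
  else
    echo "Below minimum: ${host_cores} cores, ${host_mem_gb} GiB RAM"
  fi
fi
```

MemTotal is typically a little lower than the installed RAM, so a node sold as 288 GB may report slightly less; treat a near miss as something to verify by hand rather than a hard failure.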
Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and the template input files for each backend, which have the paths to publicly hosted reference files filled in.
- Select a backend environment
- Configure a workflow execution engine in the chosen environment
- Fill out the inputs JSON file for your cohort
- Run the workflow
The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will largely be determined by the location of your data.
For backend-specific configuration, see the relevant documentation:
An execution engine is required to run workflows. Two popular engines for running WDL-based workflows are miniwdl and Cromwell.
Because workflow dependencies are containerized, a container runtime is required. This workflow has been tested with Docker and Singularity container runtimes.
See backend-specific documentation for details on setting up an engine.
Engine | Azure | AWS | GCP | HPC |
---|---|---|---|---|
miniwdl | Unsupported | Supported via the Amazon Genomics CLI | Unsupported | (SLURM only) Supported via the miniwdl-slurm plugin |
Cromwell | Supported via Cromwell on Azure | Supported via the Amazon Genomics CLI | Supported via Google's Pipelines API | Supported - Configuration varies depending on HPC infrastructure |
The input to a workflow run is defined in JSON format. Template input files with reference dataset information filled out are available for each backend at backends/${backend}/inputs.${backend}.json.
Using the appropriate inputs template file, fill in the cohort and sample information (see Workflow Inputs for more information on the input structure).
If using an HPC backend, you will need to download the reference bundle and replace the <local_path_prefix> in the input template file with the local path to the reference datasets on your HPC.
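The placeholder replacement can be done with `sed`. The sketch below writes a filled-in copy rather than editing the template in place; the function name `fill_local_prefix` is hypothetical, and it assumes the local path contains no `|`, `&`, or backslash characters, which `sed` would treat specially.

```shell
# Write a copy of the template with every <local_path_prefix> replaced.
# Usage: fill_local_prefix <template.json> <local_prefix> <output.json>
# '|' is used as the sed delimiter so slashes in the path need no escaping.
fill_local_prefix() {
  sed "s|<local_path_prefix>|$2|g" "$1" > "$3"
}
```

For example, `fill_local_prefix backends/hpc/inputs.hpc.json /path/to/reference_bundle inputs.filled.json`, where the template path is an assumption based on the backends/${backend}/inputs.${backend}.json pattern and the other two paths are placeholders.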
Run the workflow using the engine and backend that you have configured (miniwdl, Cromwell).
Note that the calls to miniwdl and Cromwell assume you are accessing the engine directly on the machine on which it has been deployed. Depending on the backend you have configured, you may be able to submit workflows using different methods (e.g. using trigger files in Azure, or using the Amazon Genomics CLI in AWS).
miniwdl run workflows/main.wdl -i <input_file_path.json>
java -jar <cromwell_jar_path> run workflows/main.wdl -i <input_file_path.json>
If Cromwell is running in server mode, the workflow can be submitted using cURL. Fill in the values of CROMWELL_URL and INPUTS_JSON below, then from the root of the repository, run:
# The base URL (and port, if applicable) of your Cromwell server
CROMWELL_URL=
# The path to your inputs JSON file
INPUTS_JSON=
(cd workflows && zip -r dependencies.zip assembly_structs.wdl assemble_genome/ de_novo_assembly_sample/ de_novo_assembly_trio/ wdl-common/)
curl -X "POST" \
"${CROMWELL_URL}/api/workflows/v1" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "workflowSource=@workflows/main.wdl" \
-F "workflowInputs=@${INPUTS_JSON};type=application/json" \
-F "workflowDependencies=@workflows/dependencies.zip;type=application/zip"
To specify workflow options, add the following to the request (assuming your options file is named options.json and is located in the current working directory): -F "workflowOptions=@options.json;type=application/json".
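Cromwell's response to the submission includes the workflow `id`, which can then be used to check on the run through the server's status endpoint. The helper names below (`cromwell_status_url`, `cromwell_status`) are hypothetical; the sketch assumes the same CROMWELL_URL as above and an unauthenticated server.

```shell
# Build the status endpoint URL for a workflow ID.
# A single trailing slash on the base URL is dropped to avoid '//' in the path.
# Usage: cromwell_status_url <cromwell_url> <workflow_id>
cromwell_status_url() {
  printf '%s/api/workflows/v1/%s/status\n' "${1%/}" "$2"
}

# Query the status endpoint and print the server's JSON response.
# Usage: cromwell_status <cromwell_url> <workflow_id>
cromwell_status() {
  curl -sS -X GET -H "accept: application/json" \
    "$(cromwell_status_url "$1" "$2")"
}
```

Call it as `cromwell_status "${CROMWELL_URL}" <workflow_id>`, substituting the `id` value returned by the POST request.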
This section describes the inputs required for a run of the workflow. Typically, only the de_novo_assembly.cohort and potentially run/backend-specific sections will be filled out by the user for each run of the workflow. Input templates with reference file locations filled out are provided for each backend.
A cohort can include one or more samples. Samples need not be related.
Type | Name | Description | Notes |
---|---|---|---|
String | cohort_id | A unique name for the cohort; used to name outputs. Alphanumeric characters, underscore (_), and dash (-) are allowed. | |
Array[Sample] | samples | The set of samples for the cohort. At least one sample must be defined. | |
Boolean | run_de_novo_assembly_trio | Run trio binned de novo assembly. | Cohort must contain at least one valid trio (child and both parents present in the cohort) |
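Because cohort_id (and the sample IDs described below) are restricted to alphanumeric characters, underscores, and dashes, it can be worth checking them before launching a long run. A minimal sketch using a POSIX shell pattern; the function name `is_valid_id` is hypothetical, and rejecting empty strings is an assumption rather than something the table states.

```shell
# Succeeds if the argument is non-empty and contains only
# letters, digits, underscore, or dash.
# Usage: is_valid_id <id>
is_valid_id() {
  case "$1" in
    "" | *[!A-Za-z0-9_-]*) return 1 ;;  # empty, or contains a disallowed character
    *) return 0 ;;
  esac
}
```

For example, `is_valid_id "my_cohort-01"` succeeds, while `is_valid_id "my cohort"` fails because of the space.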
Sample information for each sample in the workflow run.
Type | Name | Description | Notes |
---|---|---|---|
String | sample_id | A unique name for the sample; used to name outputs. Alphanumeric characters, underscore (_), and dash (-) are allowed. | |
Array[IndexData] | movie_bams | The set of unaligned movie BAMs associated with this sample. | |
String? | father_id | Paternal sample_id. Alphanumeric characters, underscore (_), and dash (-) are allowed. | |
String? | mother_id | Maternal sample_id. Alphanumeric characters, underscore (_), and dash (-) are allowed. | |
Boolean | run_de_novo_assembly | If true, run single-sample de novo assembly for this sample | [true, false] |
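The trio requirement (a child whose father_id and mother_id both refer to samples present in the cohort) can be checked ahead of time. The sketch below is not part of the workflow: the function name `has_valid_trio` is hypothetical, and it assumes the sample information has been written out as whitespace-separated lines of `sample_id father_id mother_id`, with `-` standing in for an unset parent.

```shell
# Reads "sample_id father_id mother_id" lines on stdin ('-' = not set).
# Succeeds if at least one sample has both parents present as samples.
has_valid_trio() {
  awk '
    { ids[$1] = 1; father[$1] = $2; mother[$1] = $3 }
    END {
      for (s in ids)
        if ((father[s] in ids) && (mother[s] in ids))
          found = 1
      exit(found ? 0 : 1)
    }'
}
```

For example, `printf 'kid dad mom\ndad - -\nmom - -\n' | has_valid_trio` succeeds, whereas the same cohort without the `mom` line fails.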
Array of references and their associated names and indices.
These files are hosted publicly in each of the cloud backends; see backends/${backend}/inputs.${backend}.json.
Type | Name | Description | Notes |
---|---|---|---|
String | name | Reference name; used to name outputs (e.g., "GRCh38") | |
IndexData | fasta | Reference genome and associated index | |
Type | Name | Description | Notes |
---|---|---|---|
String | backend | Backend where the workflow will be executed | ["Azure", "AWS", "GCP", "HPC"] |
String? | zones | Zones where compute will take place; required if backend is set to 'AWS' or 'GCP'. | |
String? | aws_spot_queue_arn | Queue ARN for the spot batch queue; required if backend is set to 'AWS' and preemptible is set to true | Determining the AWS queue ARN |
String? | aws_on_demand_queue_arn | Queue ARN for the on demand batch queue; required if backend is set to 'AWS' and preemptible is set to false | Determining the AWS queue ARN |
String? | container_registry | Container registry where workflow images are hosted. If left blank, PacBio's public Quay.io registry will be used. | |
Boolean | preemptible | If set to true, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC. | [true, false] |
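The conditional requirements in this table can be summarized as a small pre-flight check. This is an illustrative sketch, not code from the workflow: the function name `check_backend_inputs` and its positional-argument order are hypothetical, and an empty string stands in for an unset optional input.

```shell
# Usage: check_backend_inputs <backend> <preemptible> <zones> <spot_arn> <on_demand_arn>
# Returns 0 if the combination satisfies the rules in the table above.
check_backend_inputs() {
  backend=${1:-} preemptible=${2:-} zones=${3:-}
  spot_arn=${4:-} on_demand_arn=${5:-}

  case "$preemptible" in
    true | false) ;;
    *) echo "preemptible must be true or false" >&2; return 1 ;;
  esac

  case "$backend" in
    Azure | HPC) ;;
    AWS | GCP)
      # zones is required for both cloud backends listed here
      [ -n "$zones" ] || { echo "zones is required for $backend" >&2; return 1; }
      ;;
    *) echo "unknown backend: $backend" >&2; return 1 ;;
  esac

  if [ "$backend" = "AWS" ]; then
    if [ "$preemptible" = "true" ]; then
      [ -n "$spot_arn" ] || { echo "aws_spot_queue_arn is required" >&2; return 1; }
    else
      [ -n "$on_demand_arn" ] || { echo "aws_on_demand_queue_arn is required" >&2; return 1; }
    fi
  fi
  return 0
}
```

For example, `check_backend_inputs GCP true "us-central1-a" "" ""` succeeds (the zone name is only a placeholder), while `check_backend_inputs AWS true "us-east-1a" "" ""` fails because no spot queue ARN was given.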
These files will be output if cohort.samples[sample].run_de_novo_assembly is set to true for any sample.
Type | Name | Description | Notes |
---|---|---|---|
Array[Array[File]?] | zipped_assembly_fastas | De novo dual assembly generated by hifiasm | |
Array[Array[File]?] | assembly_noseq_gfas | Assembly graphs in GFA format. | |
Array[Array[File]?] | assembly_lowQ_beds | Coordinates of low quality regions in BED format. | |
Array[Array[File]?] | assembly_stats | Assembly size and NG50 stats generated by calN50. | |
Array[Array[IndexData?]] | asm_bam | minimap2 alignment of assembly to reference. | |
Array[Array[IndexData?]] | paftools_vcf | Variants called by paftools from the coordinate-sorted assembly-to-reference alignment. Variants are called from the cs tag; confident/callable regions are those covered by exactly one contig. | paftools |
Array[Array[File?]] | paftools_vcf_stats | bcftools stats summary statistics for paftools variant calls | |
These files will be output if cohort.run_de_novo_assembly_trio is set to true and there is at least one parent-parent-kid trio in the cohort.
Type | Name | Description | Notes |
---|---|---|---|
Array[Array[File]]? | trio_zipped_assembly_fastas | Haplotype-resolved de novo assembly of the trio kid generated by hifiasm with trio binning | |
Array[Array[File]]? | trio_assembly_noseq_gfas | Assembly graphs in GFA format. | |
Array[Array[File]]? | trio_assembly_lowQ_beds | Coordinates of low quality regions in BED format. | |
Array[Array[File]]? | trio_assembly_stats | Assembly size and NG50 stats generated by calN50. | |
Array[Array[IndexData]?] | trio_asm_bams | minimap2 alignment of assembly to reference. | |
Array[Map[String, String]]? | haplotype_key | Indication of which haplotype (hap1/hap2) corresponds to which parent. | |
Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio's quay.io. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.
The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task. Images can be referenced in the following table by looking for the name after the final / character and before the @sha256:... . For example, the image referred to here is "align_hifiasm":
~{runtime_attributes.container_registry}/align_hifiasm@sha256:3968cb<...>b01f80fe
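The same lookup can be done mechanically with shell parameter expansion. The function name `image_name` is hypothetical; the sketch simply drops everything up to the final `/` and everything from the first `@` onward.

```shell
# Print the image name from a docker reference of the form
# <registry>/<name>@sha256:<digest>.
# Usage: image_name <docker_reference>
image_name() {
  ref=${1##*/}                 # drop everything up to the final '/'
  printf '%s\n' "${ref%%@*}"   # drop '@sha256:...' and anything after it
}

image_name '~{runtime_attributes.container_registry}/align_hifiasm@sha256:3968cb<...>b01f80fe'
# prints: align_hifiasm
```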
Image | Major tool versions | Links |
---|---|---|
align_hifiasm | | Dockerfile |
bcftools | | Dockerfile |
gfatools | | Dockerfile |
hifiasm | | Dockerfile |
htslib | | Dockerfile |
paftools | | Dockerfile |
pyyaml | | Dockerfile |
samtools | | Dockerfile |
yak | | Dockerfile |
TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.