[TheiaCoV and TheiaMeta] Update hrrt (ncbi-scrub) to version 2.2.1 and optimise task #527

cimendes · 2024-07-02T10:44:17Z

This PR closes #127 and #528

🗑️ This dev branch should be deleted after merging to main.

🧠 Aim, Context and Functionality

~~Current work in progress! Will update soon!~~

This PR updates HRRT (also known as ncbi-scrub) to the latest stable version v2.2.1.

This update aims to correct a few issues:

Paired-read information was not preserved by the current processing
Reads were being masked with 'N' instead of removed; removing them could save a lot of compute resources downstream

🛠️ Impacted Workflows/Tasks & Changes Being Made

The following edits were made to the ncbi_scrub task:

docker has been updated to us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1
In the ncbi_scrub_pe task, the reads are now interleaved before processing with HRRT
In the ncbi_scrub_pe task, the option to remove spots instead of replacing them with 'N' is explicitly passed
In the ncbi_scrub_pe task, the option to mask both pairs and keep interleaved information is explicitly passed
In the ncbi_scrub_se task, the option to remove spots instead of replacing them with 'N' is explicitly passed
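As a rough illustration of the options listed above, the sketch below only builds and prints an HRRT command line; it does not run the tool. The helper name `build_scrub_cmd` is hypothetical, and the flags (`-x` to remove spots, `-s` for interleaved paired input, `-p`, `-i`, `-o`) are taken from my reading of the `scrub.sh` usage text, so confirm them with `scrub.sh -h` inside the 2.2.1 image before relying on them. It is not the exact command used in the task.

```shell
# Hypothetical helper: print (do not run) an HRRT command for SE or PE input.
# Assumed scrub.sh flags: -x remove spots instead of masking with 'N',
# -s input is interleaved paired-end, -p threads, -i input, -o output.
build_scrub_cmd() {
  mode="$1"; input="$2"; output="$3"; threads="${4:-4}"
  flags="-x"                      # both SE and PE remove spots
  if [ "$mode" = "pe" ]; then
    flags="$flags -s"             # PE only: treat input as interleaved pairs
  fi
  echo "scrub.sh $flags -p $threads -i $input -o $output"
}
```

For example, `build_scrub_cmd pe interleaved.fastq interleaved.fastq.clean 2` prints the paired-end form, and `build_scrub_cmd se reads.fastq reads.fastq.clean` prints the single-end form without `-s`.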

These updates are reflected on the following workflows:

TheiaCoV_Illumina_PE
TheiaCoV_ClearLabs
TheiaCoV_ONT
TheiaMeta_Illumina_PE

Additionally, the scrubbing task has been added to the following workflows for read-processing standardization:

TheiaCoV_Illumina_SE

Note: ncbi/sra-human-scrubber#30 clarifies the usability of HRRT on ONT data.

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version: Yes

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

📋 Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: sra-human-scrubber:1.0.2021-05-05 -> sra-human-scrubber:2.2.1

Databases or database versions changed: N/A

Data processing/commands changed: Added logic to interleave and split the reads with paste

File processing changed: Enabled read-scrubbing on TheiaCoV_Illumina_SE workflows, which impacts results downstream

Compute resources changed: N/A, although this may be a good opportunity to revisit them

➡️ Inputs

Nothing has changed

⬅️ Outputs

Nothing has changed

🧪 Testing

Test Dataset

Mock Illumina PE data was generated for each TheiaCoV target organism with ART:

flu: GCF_000865085
hiv: NC_001722
mpxv: NC_063383
rsva: NC_001803
sc2: NC_045512
wnv: NC_001563

Additionally, the same was done for a human reference:

human: GCF_000001405

The Illumina sequences are available at gs://benchmark_data_theiagen/Illumina/human_viral_mix_1000reads and gs://benchmark_data_theiagen/Illumina/human_viral_mix

Mixed Viral and Human ONT sequences are available at gs://benchmark_data_theiagen/ONT/mix_human_viral

Evaluation of interleaving FASTQ file

Using SC2 as a test case:

miniwdl run --task ncbi_scrub_pe task_ncbi_scrub.wdl read1=sc2_1000_1.fq.gz read2=sc2_1000_2.fq.gz samplename="sc2"

Interleaved read file:

Reads are correctly paired, with each /1 read being followed by the same id with /2
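One way to check this pairing automatically is sketched below. The helper name `check_interleaved` is hypothetical, and it assumes an uncompressed interleaved FASTQ whose headers end in /1 and /2, as in the ART-generated test data; other header conventions would need a different comparison.

```shell
# Hypothetical helper: succeed only if every /1 record is followed by the
# same id with /2, and the file holds a whole number of 8-line pairs.
check_interleaved() {
  awk '
    NR % 8 == 1 { id1 = $1; sub(/\/1$/, "", id1) }   # header of the forward read
    NR % 8 == 5 { id2 = $1; sub(/\/2$/, "", id2)     # header of the reverse read
                  if (id1 != id2) { bad = 1; exit } }
    END { if (bad || NR % 8 != 0) exit 1 }
  ' "$1"
}
```

Calling `check_interleaved interleaved.fastq` returns a zero exit status when the pairing holds and a non-zero status otherwise.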

Split read files:

The resulting files contain only forward reads in the first file and only reverse reads in the second file

Edge case scenario: the read_screen process lets through read files whose forward and reverse line counts differ slightly. In that case the paste command could append singletons to the bottom of the interleaved file, separated by newlines.

    paste <($cat_command ~{read1} | paste - - - -) <($cat_command ~{read2} | paste - - - -) | awk '{if (NF == 8) print $1"\n"$2"\n"$3"\n"$4"\n"$5"\n"$6"\n"$7"\n"$8}' | tr '\t' '\n' > interleaved.fastq

To avoid this, the interleaving block only prints a read pair (a block of 8 columns, 4 belonging to each read) if all fields have content (this is accomplished by using awk).

By discarding these reads, we ensure that the splitting command works as expected and reads don't get mixed up in the final files.
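The effect on uneven inputs can be reproduced on toy data with the sketch below. It is a simplified variation, not the task's command: the helper name `interleave_pairs` is hypothetical, it reads uncompressed files, it uses temporary files instead of process substitution, and it sets awk's field separator to a tab so that a header containing spaces still counts as one field (the command above relies on awk's default whitespace splitting).

```shell
# Hypothetical helper: interleave two uncompressed FASTQ files and keep
# only complete pairs (8 tab-separated fields per joined record).
interleave_pairs() {
  r1="$1"; r2="$2"
  tmp1=$(mktemp); tmp2=$(mktemp)
  # Collapse each 4-line FASTQ record onto one tab-separated line.
  paste - - - - < "$r1" > "$tmp1"
  paste - - - - < "$r2" > "$tmp2"
  # Join forward and reverse records side by side; a singleton yields
  # fewer than 8 fields and is skipped.
  paste "$tmp1" "$tmp2" \
    | awk -F'\t' 'NF == 8 { for (i = 1; i <= 8; i++) print $i }'
  rm -f "$tmp1" "$tmp2"
}
```

With a forward file of two reads and a reverse file of one read, `interleave_pairs r1.fq r2.fq` prints a single 8-line pair and drops the unmatched forward read.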

paste - - - - - - - - < interleaved.fastq.clean \
      | tee >(cut -f 1-4 | tr '\t' '\n' | gzip > ~{samplename}_R1_dehosted.fastq.gz) \
      | cut -f 5-8 | tr '\t' '\n' | gzip > ~{samplename}_R2_dehosted.fastq.gz
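For local experiments without process substitution, the same split can be approximated with a single awk pass. This is an alternative sketch, not the task's implementation: the helper name `split_interleaved` is hypothetical, and it writes uncompressed output, so gzip would need to be applied afterwards.

```shell
# Hypothetical helper: split an interleaved FASTQ into forward and reverse
# files. Lines 1-4 of each 8-line block go to out1, lines 5-8 to out2.
split_interleaved() {
  in="$1"; out1="$2"; out2="$3"
  : > "$out1"; : > "$out2"        # make sure both outputs exist, even if empty
  awk -v o1="$out1" -v o2="$out2" '
    { pos = (NR - 1) % 8 }        # position of this line within its pair
    pos < 4 { print > o1; next }  # forward read
    { print > o2 }                # reverse read
  ' "$in"
}
```

Running `split_interleaved interleaved.fastq.clean R1.fastq R2.fastq` should leave the same number of records in both output files, provided the input contains only complete pairs.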

Commandline Testing with MiniWDL or Cromwell (optional)

For each read-pair, the following command was run

miniwdl run --task ncbi_scrub_pe task_ncbi_scrub.wdl read1=<forward_read> read2=<reverse_read> samplename="<sample_name>"

Sample	reference used	spots removed
flu	GCF_000865085	0
hiv	NC_001722	0
human	GCF_000001405	693504
mpxv	NC_063383	0
rsva	NC_001803	0
sc2	NC_045512	0
wnv	NC_001563	0

Terra Testing

TheiaCoV Illumina PE:
- ✅ 9 in silico read sets mixed with human reads for several taxa: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/43837945-6202-444b-ad9e-6e60dcd23477
TheiaCoV Illumina SE:
- ✅ 34 samples belonging to the Validation dataset: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/dfa92755-77cc-44ff-9556-a197e9dade44
TheiaCoV ONT:
- ✅ 9 mock ONT samples containing both viral and human reads (50% mixture) for several taxa: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/34480674-63f8-4ac8-a91d-1ffb2af565a4
TheiaCoV ClearLabs:
- ✅ 25 SC2 samples from Validation workspace: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/347b0b33-ad77-4cda-b008-a0481003a86f
TheiaMeta Illumina PE:
- ✅ 10 metagenomic samples collected from various human sources: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/4a930c75-2500-457f-aee1-3e2b4da67f23

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

🔬 Final Developer Checklist

The workflow/task has been tested locally and results, including file contents, are as anticipated
The workflow/task has been tested on Terra and results, including file contents, are as anticipated
The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
Code changes follow the style guide

🎯 Reviewer Checklist

All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
All reviewer-suggested scenarios have been tested and any additional
All changed results have been confirmed to be accurate
All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
All code adheres to the style guide
MD5 sums have been updated
The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

Relevant documentation on the Public Health Resources "PHB Main" has been updated
Workflow diagrams have been updated to reflect changes

cimendes · 2024-07-16T10:48:59Z

Will update docs after merging!

cimendes · 2024-07-16T11:17:30Z

Only TheiaMeta remains but it's a slow workflow. Setting this as ready for review! @jrotieno enjoy!

cimendes · 2024-07-16T14:43:29Z

Uneven read files will be mixed!

cimendes · 2024-07-16T14:59:16Z

Todo PE:

Add the number of reads check
Check with reads from two different samples
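The first to-do item could look something like the sketch below. The helper names `count_reads` and `same_read_count` are hypothetical, and the sketch assumes uncompressed, well-formed FASTQ files with exactly four lines per record; it is not the check that was eventually committed.

```shell
# Hypothetical helper: number of reads in an uncompressed FASTQ file,
# assuming exactly 4 lines per record.
count_reads() {
  lines=$(wc -l < "$1")
  echo $((lines / 4))
}

# Hypothetical helper: succeed only if both files hold the same number of reads.
same_read_count() {
  [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}
```

A task could then warn or fail early, for example with `same_read_count read1.fastq read2.fastq || echo "read counts differ" >&2`.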

jrotieno

The code looks good. I examined the results and the output is as required.

The only thing I am wondering about is that the read_screen process lets reads through when the number of lines differs slightly between the forward and reverse files. Was this because tests showed that the unmatched reads were still useful in assembly and/or characterization, in which case discarding them here would be disadvantageous?

update hrrt to version 2.2.1 and optimise task;

3835bd8

on pe task, fix paired reads not being both removed

cimendes linked an issue Jul 2, 2024 that may be closed by this pull request

[TheiaCoV] task_ncbi_scrub.wdl CPU and singleton bugs #127

Closed

cimendes added 2 commits July 2, 2024 10:54

adjust PE sub-workflow for new scrub task

bd1e349

rename variable

2b65dbf

cimendes linked an issue Jul 3, 2024 that may be closed by this pull request

[TheiaCoV SE] Implement human scrubbing of reads with HRRT to mirror TheiaCoV PE #528

Closed

cimendes added 4 commits July 3, 2024 10:43

add ncbi_scrub to theiacov SE

bcf4af3

update CI to changes in TheiaCoV SE readQC subworkflow

e4b2acc

increase verbosity to allow easier debugging

4b8927d

cleanup

b310e79

cimendes requested a review from jrotieno July 16, 2024 09:44

cimendes marked this pull request as ready for review July 16, 2024 11:18

cimendes marked this pull request as draft July 16, 2024 14:43

Add check of number of reads in each file to see if they match. Add a…

08e51a8

…wk conditional to paste block to only print a read that is actually paired (exists in both read1 and read2).

cimendes marked this pull request as ready for review July 17, 2024 15:10

jrotieno approved these changes Jul 18, 2024

jrotieno merged commit 6d406b6 into main Jul 18, 2024
11 checks passed

cimendes mentioned this pull request Jul 23, 2024

[TheiaCoV and TheiaMeta - HRRT] Patch bug by removing unneeded awk verification #550

Merged

13 tasks

kapsakcj mentioned this pull request Jul 29, 2024

[theiacov] Add additional vadr output files & tarball; upgrade VADR docker #556

Merged

10 tasks

sage-wright deleted the im-hrrt-uptade-dev branch August 30, 2024 13:41

cimendes mentioned this pull request Sep 9, 2024

[NCBI Scrub Standalone Workflows] Correct output declarations for the number of spots removed #610

Merged

10 tasks
