Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TheiaCoV and TheiaMeta] Update hrrt (ncbi-scrub) to version 2.2.1 and optimise task #527

Merged
merged 8 commits into from
Jul 18, 2024

Conversation

cimendes
Copy link
Member

@cimendes cimendes commented Jul 2, 2024

This PR closes #127 and #528

🗑️ This dev branch should be deleted after merging to main.

🧠 Aim, Context and Functionality

Current work in progress! Will update soon!

This PR updates HRRT (also known as ncbi-scrub) to the latest stable version v2.2.1.

This update aims to correct a few issues:

  • The paired-information was not being kept with the current processing
  • The reads were being masked with 'N' instead of removed, which could save a lot of compute resources downstream

🛠️ Impacted Workflows/Tasks & Changes Being Made

The following editions were done to the ncbi_scrub task:

  • docker has been updated to us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1
  • In the ncbi_scrub_pe task, the reads are now interleaved before processing with HRRT
  • In the ncbi_scrub_pe task, the option to remove spots instead of replacing them with 'N' is explicitly passed
  • In the ncbi_scrub_pe task, the option to mask both pairs and keep interleaved information is explicitly passed
  • In the ncbi_scrub_se task, the option to remove spots instead of replacing them with 'N' is explicitly passed

These updates are reflected on the following workflows:

  • TheiaCoV_Illumina_PE
  • TheiaCoV_ClearLabs
  • TheiaCoV_ONT
  • TheiaMeta_Illumina_PE

Additionally, the scrubbing task has been added to the following workflows for read-processing standardization:

  • TheiaCoV_Illumina_SE

Note: ncbi/sra-human-scrubber#30 clarifies the usability of HRRT on ONT data.

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

📋 Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: sra-human-scrubber:1.0.2021-05-05 -> sra-human-scrubber:2.2.1

Databases or database versions changed: N/A

Data processing/commands changed: Added logic to interleave and split the reads with paste

File processing changed: Enabled read-scrubbing on TheiaCoV_Illumina_SE workflows, which impacts results downstream

Compute resources changed: N/A but maybe it's a good opportunity to do so

➡️ Inputs

Nothing has changed

⬅️ Outputs

Nothing has changed

🧪 Testing

Test Dataset

Mock Illumina PE data was generated for each TheiaCoV target organism with ART:

  • flu: GCF_000865085
  • hiv: NC_001722
  • mpxv: NC_063383
  • rsva: NC_001803
  • sc2: NC_045512
  • wnv: NC_001563

Additionally, the same was done for a human reference:

  • human: GCF_000001405

The Illumina sequences are available at gs://benchmark_data_theiagen/Illumina/human_viral_mix_1000reads and gs://benchmark_data_theiagen/Illumina/human_viral_mix

Mixed Viral and Human ONT sequences are available at gs://benchmark_data_theiagen/ONT/mix_human_viral

Evaluation of interleaving FASTQ file

Using SC2 as a test case:

miniwdl run --task ncbi_scrub_pe task_ncbi_scrub.wdl read1= sc2_1000_1.fq.gz read2= sc2_1000_2.fq.gz samplename="sc2"

Interleaved read file:
image

  • Reads are correctly paired, wich each /1 read being followed by the same id with /2

Split read files:
image
image

  • Resulting file contain only forward reads in the first file, and only reverse reads in the second file

Edge case scenario: In the read_screen process, we allow reads that have a slight mismatch on the number of lines between forward and reverse reads to pass. This could cause paste command to append singletons to the interleaved file at the bottom separated by newlines.

    paste <($cat_command ~{read1} | paste - - - -) <($cat_command ~{read2} | paste - - - -) | awk '{if (NF == 8) print $1"\n"$2"\n"$3"\n"$4"\n"$5"\n"$6"\n"$7"\n"$8}' | tr '\t' '\n' > interleaved.fastq

image
To avoid this, the interleaving block only prints a read pair (a block of 8 columns, 4 belonging to each read) if all fields have content (this is accomplished by using awk).

By discarding these reads, we ensure that the splitting command works as expected and reads don't get mixed up in the final files.

paste - - - - - - - - < interleaved.fastq.clean \
      | tee >(cut -f 1-4 | tr '\t' '\n' | gzip > ~{samplename}_R1_dehosted.fastq.gz) \
      | cut -f 5-8 | tr '\t' '\n' | gzip > ~{samplename}_R2_dehosted.fastq.gz

Commandline Testing with MiniWDL or Cromwell (optional)

For each read-pair, the following command was run

miniwdl run --task ncbi_scrub_pe task_ncbi_scrub.wdl read1= <forward_read> read2= <reverse_read> samplename="<sample_name>"
Sample reference used spots removed
flu GCF_000865085 0
hiv NC_001722 0
human GCF_000001405 693504
mpxv NC_063383 0
rsva NC_001803 0
sc2 NC_045512 0
wnv NC_001563 0

Terra Testing

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

🔬 Final Developer Checklist

  • The workflow/task has been tested locally and results, including file contents, are as anticipated
  • The workflow/task has been tested on Terra and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
  • Code changes follow the style guide

🎯 Reviewer Checklist

  • All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
  • All reviewer-suggested scenarios have been tested and any additional
  • All changed results have been confirmed to be accurate
  • All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

  • Relevant documentation on the Public Health Resources "PHB Main" has been updated
  • Workflow diagrams have been updated to reflect changes

on pe task, fix paired reads not being both removed
@cimendes cimendes linked an issue Jul 2, 2024 that may be closed by this pull request
@cimendes cimendes requested a review from jrotieno July 16, 2024 09:44
@cimendes
Copy link
Member Author

Will update docs after merging!

@cimendes
Copy link
Member Author

Only TheiaMeta remains but it's a slow workflow. Setting this as ready for review! @jrotieno enjoy!

@cimendes cimendes marked this pull request as ready for review July 16, 2024 11:18
@cimendes
Copy link
Member Author

Unneven-read files will be mixed!

@cimendes cimendes marked this pull request as draft July 16, 2024 14:43
@cimendes
Copy link
Member Author

Todo PE:

  • Add the number of reads check
  • Check with reads from two different samples

…wk conditional to paste block to only print a read that is actually paired (exists in borth read1 and read2).
@cimendes cimendes marked this pull request as ready for review July 17, 2024 15:10
Copy link
Contributor

@jrotieno jrotieno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, and examined the results and the output is as required.

The only thing I am thinking about is read_screen process allowing reads that have a slight mismatch on the number of lines between forward and reverse reads to pass. Was this because we did some tests and found that the unmatched reads were still useful in assembly and/or characterization in which case discarding them here will be disadvantageous?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants