[TheiaCoV and TheiaMeta] Update hrrt (ncbi-scrub) to version 2.2.1 and optimise task #527
on pe task, fix paired reads not being both removed
Will update docs after merging!
Only TheiaMeta remains, but it's a slow workflow. Setting this as ready for review! @jrotieno enjoy!
Files with uneven read counts will be mixed!
Todo PE:
…wk conditional to paste block to only print a read that is actually paired (exists in both read1 and read2).
The code looks good; I examined the results and the output is as required.
The only thing I am wondering about is the read_screen process allowing reads with a slight mismatch in the number of lines between forward and reverse reads to pass. Was this because tests showed that the unmatched reads were still useful in assembly and/or characterization, in which case discarding them here would be disadvantageous?
This PR closes #127 and #528
🗑️ This dev branch should be deleted after merging to main.
🧠 Aim, Context and Functionality
Current work in progress! Will update soon!
This PR updates HRRT (also known as ncbi-scrub) to the latest stable version, v2.2.1.
This update aims to correct a few issues:
🛠️ Impacted Workflows/Tasks & Changes Being Made
The following edits were made to the ncbi_scrub task:
- Docker image: us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1
- In the ncbi_scrub_pe task, the reads are now interleaved before processing with HRRT
- In the ncbi_scrub_pe task, the option to remove spots instead of replacing them with 'N' is explicitly passed
- In the ncbi_scrub_pe task, the option to mask both pairs and keep interleaved information is explicitly passed
- In the ncbi_scrub_se task, the option to remove spots instead of replacing them with 'N' is explicitly passed
These updates are reflected on the following workflows:
Additionally, the scrubbing task has been added to the following workflows for read-processing standardization:
Note: ncbi/sra-human-scrubber#30 clarifies the usability of HRRT on ONT data.
This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes
Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No
📋 Workflow/Task Step Changes
🔄 Data Processing
Docker/software or software versions changed:
sra-human-scrubber:1.0.2021-05-05 -> sra-human-scrubber:2.2.1
Databases or database versions changed: N/A
Data processing/commands changed: Added logic to interleave and split the reads with paste
File processing changed: Enabled read-scrubbing on TheiaCoV_Illumina_SE workflows, which impacts results downstream
Compute resources changed: N/A but maybe it's a good opportunity to do so
➡️ Inputs
Nothing has changed
⬅️ Outputs
Nothing has changed
🧪 Testing
Test Dataset
Mock Illumina PE data was generated for each TheiaCoV target organism with ART:
Additionally, the same was done for a human reference:
The Illumina sequences are available at gs://benchmark_data_theiagen/Illumina/human_viral_mix_1000reads and gs://benchmark_data_theiagen/Illumina/human_viral_mix
Mixed Viral and Human ONT sequences are available at gs://benchmark_data_theiagen/ONT/mix_human_viral
Evaluation of interleaving FASTQ file
Using SC2 as a test case:
Interleaved read file:

Each /1 read is followed by the read with the same id and a /2 suffix.
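The ordering check on the interleaved file can also be sketched in Python. This is an illustration only, not part of the task: the function name `is_properly_interleaved` is hypothetical, and the sketch assumes plain 4-line FASTQ records whose headers end in `/1` and `/2`.

```python
def is_properly_interleaved(lines):
    """Return True if every /1 record is immediately followed by its /2 mate.

    Assumes 4-line FASTQ records and headers ending in /1 or /2
    (an assumption made for this sketch).
    """
    # Every 4th line, starting at the first, is a record header.
    headers = [line.strip() for line in lines[0::4]]
    if len(headers) % 2 != 0:
        return False  # an odd number of records means at least one singleton
    for first, second in zip(headers[0::2], headers[1::2]):
        if not (first.endswith("/1") and second.endswith("/2")):
            return False
        if first[:-2] != second[:-2]:
            return False  # mates must share the same id
    return True


good = ["@r1/1", "ACGT", "+", "IIII", "@r1/2", "TTAA", "+", "IIII"]
bad = ["@r1/1", "ACGT", "+", "IIII", "@r2/2", "TTAA", "+", "IIII"]
print(is_properly_interleaved(good), is_properly_interleaved(bad))
```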
Split read files:


Edge case scenario: In the read_screen process, we allow reads that have a slight mismatch in the number of lines between forward and reverse reads to pass. This could cause the paste command to append singletons to the bottom of the interleaved file, separated by newlines.
To avoid this, the interleaving block only prints a read pair (a block of 8 columns, 4 belonging to each read) if all fields have content (this is accomplished by using awk).
By discarding these reads, we ensure that the splitting command works as expected and reads don't get mixed up in the final files.
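The pairing rule described above can be sketched in Python. The task itself uses paste and awk, so this is only an illustration of the logic under stated assumptions: the function names are hypothetical, records are plain 4-line FASTQ, and mates are paired by position (as paste does), not by read id.

```python
from itertools import zip_longest


def read_records(lines):
    """Group FASTQ lines into 4-line records (header, sequence, '+', quality)."""
    stripped = [line.rstrip("\n") for line in lines]
    return [stripped[i:i + 4] for i in range(0, len(stripped), 4)]


def interleave_complete_pairs(r1_lines, r2_lines):
    """Interleave two read files, keeping only pairs where all 8 fields have content."""
    interleaved = []
    pairs = zip_longest(read_records(r1_lines), read_records(r2_lines))
    for rec1, rec2 in pairs:
        fields = (rec1 or []) + (rec2 or [])
        # A singleton or truncated record leaves fewer than 8 non-empty fields.
        if len(fields) == 8 and all(fields):
            interleaved.extend(fields)
    return interleaved


def split_interleaved(lines):
    """Split an interleaved list back into read1 and read2 line lists."""
    r1, r2 = [], []
    for i in range(0, len(lines), 8):
        r1.extend(lines[i:i + 4])
        r2.extend(lines[i + 4:i + 8])
    return r1, r2


# read1 has one more record than read2, mimicking an uneven pair of files.
r1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "GGCC", "+", "IIII"]
r2 = ["@a/2", "TTAA", "+", "IIII"]
clean_r1, clean_r2 = split_interleaved(interleave_complete_pairs(r1, r2))
print(clean_r1)
print(clean_r2)
```

Because the unpaired record is dropped before splitting, both output lists always contain the same number of records, which is the property the splitting step relies on.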
Commandline Testing with MiniWDL or Cromwell (optional)
For each read-pair, the following command was run
Terra Testing
Suggested Scenarios for Reviewer to Test
Theiagen Version Release Testing (optional)
🔬 Final Developer Checklist
🎯 Reviewer Checklist
🗂️ Associated Documentation (to be completed by Theiagen developer)