Set trimmomatic_args "-phred33" as default #389
Merged
This PR closes #332.
🗑️ This dev branch should be deleted after merging to main.
🧠 Aim, Context and Functionality
This PR sets the default `trimmomatic_args` input parameter to `"-phred33"` at the WDL task level. This will allow SRA-Lite formatted FASTQ files (in which all quality scores are Q30) to pass the trimmomatic task. These FASTQ files currently cause trimmomatic to throw an error because it is unable to determine the phred score encoding.
It's my understanding that Illumina has been using phred33 encoding for 10+ years (since the Illumina 1.8 pipeline), so it's safe to pass this option to trimmomatic. It should not impact users analyzing FASTQs with real quality scores.
NOTE: this will allow TheiaProk to run successfully on SRA-Lite formatted FASTQs, but users should be aware that these false Q scores can impact downstream assembly and analysis, so they should interpret results with caution. Users will be warned of this in a future version of the SRA_Fetch workflow after PR #387 is merged (which will warn users if SRA-Lite formatted FASTQs have been downloaded from NCBI SRA/ENA/DDBJ/etc.).
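To illustrate why uniform Q30 scores are hard to auto-detect, here is a minimal Python sketch of an offset-guessing heuristic. It is not trimmomatic's actual code: the function name and both thresholds are assumptions chosen for illustration, based on the common conventions that Solexa+64 qualities start at `;` (ASCII 59) and Phred+33 qualities rarely exceed `J` (ASCII 74). Q30 in Phred+33 is `?` (ASCII 63), which sits inside the range that both encodings can produce.

```python
from typing import Iterable, Optional

# Illustrative thresholds (assumptions, not copied from trimmomatic):
# codes below 59 (';') cannot come from the common +64 encodings,
# codes above 74 ('J', Q41 in Phred+33) are not expected in Phred+33 data.
PHRED33_ONLY_BELOW = 59
PHRED64_ONLY_ABOVE = 74


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Return 33 or 64 when the quality characters are decisive, else None."""
    votes_33 = 0
    votes_64 = 0
    for quality in quality_strings:
        for char in quality:
            code = ord(char)
            if code < PHRED33_ONLY_BELOW:
                votes_33 += 1
            elif code > PHRED64_ONLY_ABOVE:
                votes_64 += 1
    if votes_33 > votes_64:
        return 33
    if votes_64 > votes_33:
        return 64
    # No decisive characters (or a tie): the encoding is ambiguous.
    return None


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#III"]))  # contains '#', decisive for Phred+33
    print(guess_phred_offset(["????????"]))  # all Q30 in Phred+33, ambiguous
```

Under these assumed thresholds, a file whose quality lines contain only `?` gives the heuristic nothing to vote on, which is why stating the encoding explicitly with `-phred33` sidesteps the problem.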
🛠️ Impacted Workflows/Tasks & Changes Being Made
This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version: No
Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing: No
📋 Workflow/Task Step Changes
🔄 Data Processing
Docker/software or software versions changed: No
Databases or database versions changed: No
Data processing/commands changed: Yes, the `-phred33` option is passed to the trimmomatic command, but it should not impact functionality. Normally, trimmomatic auto-detects the phred score encoding, but it struggles to detect the encoding with SRA-Lite formatted FASTQ files.
File processing changed: No
Compute resources changed: No
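The command change described above can be sketched in Python as follows. This is a hypothetical helper, not the WDL task itself: the function name, the file names, and the threads, window, and minimum-length values are assumptions for illustration. Only the trimmomatic options used (`PE`, `-threads`, `-phred33`, `-baseout`, `SLIDINGWINDOW`, `MINLEN`) are real.

```python
import shlex
from typing import List


def build_trimmomatic_pe_command(
    read1: str,
    read2: str,
    samplename: str,
    trimmomatic_args: str = "-phred33",  # the new default from this PR
    threads: int = 4,                    # hypothetical value
    window_size: int = 4,                # hypothetical value
    window_quality: int = 30,            # hypothetical value
    min_length: int = 75,                # hypothetical value
) -> List[str]:
    """Assemble a paired-end trimmomatic command as an argument list."""
    command = ["trimmomatic", "PE", "-threads", str(threads)]
    # User-supplied extra arguments are split like a shell would split them.
    command += shlex.split(trimmomatic_args)
    command += [
        read1,
        read2,
        "-baseout",
        f"{samplename}.fastq.gz",
        f"SLIDINGWINDOW:{window_size}:{window_quality}",
        f"MINLEN:{min_length}",
    ]
    return command


if __name__ == "__main__":
    print(" ".join(build_trimmomatic_pe_command("r1.fastq.gz", "r2.fastq.gz", "sample01")))
```

Passing a different `trimmomatic_args` value (for example `"-phred64"`, or an empty string to restore auto-detection) replaces the default, mirroring how the input stays user-modifiable.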
➡️ Inputs
Set `trimmomatic_args` to `"-phred33"` as default at the task level. This is still user-modifiable at the workflow level, as it is exposed in both the read_QC_trim_pe and read_QC_trim_se workflows.
⬅️ Outputs
No
🧪 Testing
Test Dataset
Samples from @cimendes that were used to test PR #387
20 samples. According to tests from PR #387, 1 sample has normal FASTQ quality scores, and the remaining 19 samples are SRA-Lite formatted.
Commandline Testing with MiniWDL or Cromwell (optional)
Not shown since it's a simple code change.
Terra Testing
❌ FAILURE (for SRA-lite formatted files): v1.3.0 of TheiaProk_Illumina_PE_PHB on test data described above: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/61273d6f-ed86-4ad0-986e-18f3b8eaf152 . 3 succeeded. 1 of these 3 has the original/normal Qscores, but the other 2 were flagged as SRA-Lite formatted FASTQs, so IDK why trimmomatic did not fail on those 🤷
✅ SUCCESS (for all FASTQ files in test set): used this dev branch which allowed trimmomatic to succeed and the rest of the workflow to finish running: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/b5b692cb-8576-4ca5-90ab-8a78e9d34d3d
~~TODO check outputs to ensure they actually succeeded~~ Everything ran successfully and outputs look as expected. It should be obvious to the user when SRA-Lite FASTQ files are encountered:
Suggested Scenarios for Reviewer to Test
I would recommend testing with FASTQ files that are known to be SRA-Lite formatted as well as normal FASTQ files (with original Q scores) to ensure they are not impacted.
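When picking test files, a quick check like the Python sketch below may help sort FASTQs into "looks SRA-Lite" and "looks normal". It is only a heuristic under stated assumptions: it treats a file as SRA-Lite formatted when every sampled quality line uses a single character, it assumes strict 4-line FASTQ records, and the function names are hypothetical. It is not the detection logic from PR #387.

```python
import gzip
from itertools import islice
from typing import Set


def quality_characters(fastq_path: str, max_records: int = 10000) -> Set[str]:
    """Collect the distinct quality characters in the first max_records reads."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    seen: Set[str] = set()
    with opener(fastq_path, "rt") as handle:
        # Assumes 4-line records: header, sequence, '+', quality.
        for line_number, line in enumerate(islice(handle, max_records * 4)):
            if line_number % 4 == 3:
                seen.update(line.rstrip("\r\n"))
    return seen


def looks_sra_lite(fastq_path: str, max_records: int = 10000) -> bool:
    """Heuristic: exactly one distinct quality character suggests SRA-Lite."""
    return len(quality_characters(fastq_path, max_records)) == 1
```

For example, `looks_sra_lite("sample_R1.fastq.gz")` (a hypothetical file name) would return `True` for a file whose quality lines consist only of `?`.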
Workflows affected by this change:
Theiagen Version Release Testing (optional)
🔬 Final Developer Checklist
🎯 Reviewer Checklist
🗂️ Associated Documentation (to be completed by Theiagen developer)