Set trimmomatic_args "-phred33" as default #389

kapsakcj · 2024-03-21T17:12:17Z

This PR closes #332 .

🗑️ This dev branch should be deleted after merging to main.

🧠 Aim, Context and Functionality

This PR sets the default trimmomatic_args input parameter to "-phred33" at the WDL task level.

This will allow for SRALite-formatted FASTQ files (Qscores are all Q30) to pass the trimmomatic task as these FASTQ files cause trimmomatic to throw an error due to it being unable to determine the phred score encoding.

It's my understanding that Illumina has been using phred33 encoding for a long time (10+ years or something?) so it's safe to pass this option into trimmomatic. It should not impact users analyzing FASTQs with real/actual quality scores

NOTE: this will allow TheiaProk to run successfully on SRA-lite formatted FASTQs, but users should be aware that these false Q scores can impact downstream assembly and analysis so they should interpret results with caution. Users will be warned of this in future version of the SRA_Fetch workflow after PR #387 is merged (which will warn users if SRA-lite formatted FASTQs have been downloaded from NCBI SRA/ENA/DDBJ/etc.

🛠️ Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

📋 Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: No

Databases or database versions changed: No

Data processing/commands changed: Yes, the -phred33 option is passed to the trimmomatic command, but it should not impact functionality at all. Normally, trimmomatic auto-detects phred score encoding, but struggles to detect the encoding with SRA_Lite formatted FASTQ files

File processing changed: No

Compute resources changed: No

➡️ Inputs

Set trimmomatic_args to "-phred33" as default at the task level. This is still user-modifiable at the workflow level as it is exposed in both read_QC_trim_pe and read_QC_trim_se workflows

⬅️ Outputs

No

🧪 Testing

Test Dataset

Samples from @cimendes that were used to test PR #387

20 samples. According to tests from PR387, 1 sample has normal FASTQ quality scores, and the remaining 19 samples are SRA-lite formatted.

Commandline Testing with MiniWDL or Cromwell (optional)

Not shown since it's a simple code change.

Terra Testing

❌ FAILURE (for SRA-lite formatted files): v1.3.0 of TheiaProk_Illumina_PE_PHB on test data described above: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/61273d6f-ed86-4ad0-986e-18f3b8eaf152 . 3 succeeded. 1 of these 3 has the original/normal Qscores, but the other 2 were flagged as SRA-Lite formatted FASTQs, so IDK why trimmomatic did not fail on those 🤷

✅ SUCCESS (for all FASTQ files in test set): used this dev branch which allowed trimmomatic to succeed and the rest of the workflow to finish running: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/b5b692cb-8576-4ca5-90ab-8a78e9d34d3d - ~~TODO check outputs to ensure they actually succeeded~~

Everything ran successfully and outputs look as expected. It should be obvious to the user when SRA_Lite FASTQ files are encountered:

Suggested Scenarios for Reviewer to Test

I would recommend testing with FASTQ files that are known to be SRA-Lite formatted as well as normal FASTQ files (with original Q scores) to ensure they are not impacted.

Workflows affected by this change:

theiaprok_illumina_pe_phb
theiaprok_illumina_pe_phb
theiacov_illumina_pe_phb
theiacov_illumina_se_phb
TheiaMeta_illumina_pe_PHB
Freyja_FASTQ_PHB
TheiaEuk_Illumina_PE_PHB

Theiagen Version Release Testing (optional)

Will changes require functional or validation testing (checking outputs etc) during the release? functional
Do new samples need to be added to validation datasets? If so, upload these to the appropriate validation workspace Google bucket (). Please describe the new samples here and why these have been chosen. No
Are there any output files that should be checked after running the version release testing? cleaned reads

🔬 Final Developer Checklist

The workflow/task has been tested locally and results, including file contents, are as anticipated
The workflow/task has been tested on Terra and results, including file contents, are as anticipated
The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
Code changes follow the style guide

🎯 Reviewer Checklist

All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
All reviewer-suggested scenarios have been tested and any additional
All changed results have been confirmed to be accurate
All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
All code adheres to the style guide
MD5 sums have been updated
The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

Relevant documentation on the Public Health Resources "PHB Main" has been updated
Workflow diagrams have been updated to reflect changes

…C_trim_pe and read_QC_trim_se subworkflows

…and se subworkflows

cimendes · 2024-03-22T09:14:57Z

Testing on a set of SRA-lite detected FASTQs here: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/006022d0-70ed-4de5-9bba-ad77ebb4d9ab

kapsakcj added 4 commits March 14, 2024 15:29

set trimmomatic_args input param default to "-phred33" in both read_Q…

7315f9e

…C_trim_pe and read_QC_trim_se subworkflows

move trimmomatic_args default to WDL task instead of read_QC_trim_pe …

98f275f

…and se subworkflows

update trimmomatic_se task to also use default arg for phred33 encoding

f360f5a

update CI

6c50bd8

cimendes self-requested a review March 22, 2024 09:10

kapsakcj marked this pull request as ready for review March 26, 2024 14:13

cimendes approved these changes Mar 26, 2024

View reviewed changes

cimendes merged commit d2aca05 into main Mar 26, 2024
7 checks passed

cimendes deleted the cjk-trimmomatic-phred33-default branch March 26, 2024 14:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set trimmomatic_args "-phred33" as default #389

Set trimmomatic_args "-phred33" as default #389

kapsakcj commented Mar 21, 2024 •

edited

Loading

cimendes commented Mar 22, 2024

Set trimmomatic_args "-phred33" as default #389

Set trimmomatic_args "-phred33" as default #389

Conversation

kapsakcj commented Mar 21, 2024 • edited Loading

🧠 Aim, Context and Functionality

🛠️ Impacted Workflows/Tasks & Changes Being Made

📋 Workflow/Task Step Changes

🔄 Data Processing

➡️ Inputs

⬅️ Outputs

🧪 Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

Terra Testing

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

🔬 Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

cimendes commented Mar 22, 2024

kapsakcj commented Mar 21, 2024 •

edited

Loading