Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set trimmomatic_args "-phred33" as default #389

Merged
merged 4 commits into from
Mar 26, 2024

Conversation

kapsakcj
Copy link
Contributor

@kapsakcj kapsakcj commented Mar 21, 2024

This PR closes #332 .

🗑️ This dev branch should be deleted after merging to main.

🧠 Aim, Context and Functionality

This PR sets the default trimmomatic_args input parameter to "-phred33" at the WDL task level.

This will allow for SRALite-formatted FASTQ files (Qscores are all Q30) to pass the trimmomatic task as these FASTQ files cause trimmomatic to throw an error due to it being unable to determine the phred score encoding.

It's my understanding that Illumina has been using phred33 encoding for a long time (10+ years or something?) so it's safe to pass this option into trimmomatic. It should not impact users analyzing FASTQs with real/actual quality scores

NOTE: this will allow TheiaProk to run successfully on SRA-lite formatted FASTQs, but users should be aware that these false Q scores can impact downstream assembly and analysis so they should interpret results with caution. Users will be warned of this in future version of the SRA_Fetch workflow after PR #387 is merged (which will warn users if SRA-lite formatted FASTQs have been downloaded from NCBI SRA/ENA/DDBJ/etc.

🛠️ Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

📋 Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: No

Databases or database versions changed: No

Data processing/commands changed: Yes, the -phred33 option is passed to the trimmomatic command, but it should not impact functionality at all. Normally, trimmomatic auto-detects phred score encoding, but struggles to detect the encoding with SRA_Lite formatted FASTQ files

File processing changed: No

Compute resources changed: No

➡️ Inputs

Set trimmomatic_args to "-phred33" as default at the task level. This is still user-modifiable at the workflow level as it is exposed in both read_QC_trim_pe and read_QC_trim_se workflows

⬅️ Outputs

No

🧪 Testing

Test Dataset

Samples from @cimendes that were used to test PR #387

20 samples. According to tests from PR387, 1 sample has normal FASTQ quality scores, and the remaining 19 samples are SRA-lite formatted.

Commandline Testing with MiniWDL or Cromwell (optional)

Not shown since it's a simple code change.

Terra Testing

❌ FAILURE (for SRA-lite formatted files): v1.3.0 of TheiaProk_Illumina_PE_PHB on test data described above: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/61273d6f-ed86-4ad0-986e-18f3b8eaf152 . 3 succeeded. 1 of these 3 has the original/normal Qscores, but the other 2 were flagged as SRA-Lite formatted FASTQs, so IDK why trimmomatic did not fail on those 🤷

✅ SUCCESS (for all FASTQ files in test set): used this dev branch which allowed trimmomatic to succeed and the rest of the workflow to finish running: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/b5b692cb-8576-4ca5-90ab-8a78e9d34d3d - TODO check outputs to ensure they actually succeeded

Everything ran successfully and outputs look as expected. It should be obvious to the user when SRA_Lite FASTQ files are encountered:

image

Suggested Scenarios for Reviewer to Test

I would recommend testing with FASTQ files that are known to be SRA-Lite formatted as well as normal FASTQ files (with original Q scores) to ensure they are not impacted.

Workflows affected by this change:

  • theiaprok_illumina_pe_phb
  • theiaprok_illumina_pe_phb
  • theiacov_illumina_pe_phb
  • theiacov_illumina_se_phb
  • TheiaMeta_illumina_pe_PHB
  • Freyja_FASTQ_PHB
  • TheiaEuk_Illumina_PE_PHB

Theiagen Version Release Testing (optional)

  • Will changes require functional or validation testing (checking outputs etc) during the release? functional
  • Do new samples need to be added to validation datasets? If so, upload these to the appropriate validation workspace Google bucket (). Please describe the new samples here and why these have been chosen. No
  • Are there any output files that should be checked after running the version release testing? cleaned reads

🔬 Final Developer Checklist

  • The workflow/task has been tested locally and results, including file contents, are as anticipated
  • The workflow/task has been tested on Terra and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
  • Code changes follow the style guide

🎯 Reviewer Checklist

  • All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
  • All reviewer-suggested scenarios have been tested and any additional
  • All changed results have been confirmed to be accurate
  • All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

  • Relevant documentation on the Public Health Resources "PHB Main" has been updated
  • Workflow diagrams have been updated to reflect changes

@cimendes cimendes self-requested a review March 22, 2024 09:10
@cimendes
Copy link
Member

@kapsakcj kapsakcj marked this pull request as ready for review March 26, 2024 14:13
@cimendes cimendes merged commit d2aca05 into main Mar 26, 2024
7 checks passed
@cimendes cimendes deleted the cjk-trimmomatic-phred33-default branch March 26, 2024 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set phred to 33 on trimmomatic task
2 participants