-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring gCNV WDL #5176
Refactoring gCNV WDL #5176
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5176 +/- ##
==============================================
+ Coverage 86.798% 87.02% +0.223%
- Complexity 29711 31801 +2090
==============================================
Files 1820 1834 +14
Lines 137406 143956 +6550
Branches 15160 16642 +1482
==============================================
+ Hits 119265 125271 +6006
- Misses 12637 13037 +400
- Partials 5504 5648 +144
|
@samuelklee Could you please review? Also @mwalker174 can you see if this conflicts with your changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor typos and WDL style issues. Otherwise, looks good to me---thanks for making these changes! Perhaps @mwalker174 can take a run through?
scripts/cnv_wdl/germline/README.md
Outdated
- Calling a cohort of samples and building a model for denoising further case samples: ``cnv_germline_cohort_workflow.wdl`` | ||
- Calling a case sample using a previously built model for denoising: ``cnv_germline_case_workflow.wdl`` | ||
- Cohort WDL: Calling a cohort of samples and building a model for denoising further case samples: ``cnv_germline_cohort_workflow.wdl`` | ||
- Case WDL: calling case samples using a previously built model for denoising: ``cnv_germline_case_workflow.wdl`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
calling -> Calling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
scripts/cnv_wdl/germline/README.md
Outdated
- Calling a case sample using a previously built model for denoising: ``cnv_germline_case_workflow.wdl`` | ||
- Cohort WDL: Calling a cohort of samples and building a model for denoising further case samples: ``cnv_germline_cohort_workflow.wdl`` | ||
- Case WDL: calling case samples using a previously built model for denoising: ``cnv_germline_case_workflow.wdl`` | ||
- Scattered case WDL(recommended): functionally equivalent to case WDL, written for reducing cloud compute cost (see below) and wall-clock time ``cnv_germline_case_scattered_workflow.wdl`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDL(recommended): functionally -> WDL (recommended): Functionally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
scripts/cnv_wdl/germline/README.md
Outdated
|
||
- ``CNVGermlineCaseScatteredWorkflow.num_samples_per_scatter_block`` -- (recommended WES value=25) number of samples to process in a single block; blocks of this size will be sent to germline case workflow and processed in a batch; | ||
- ``CNVGermlineCaseScatteredWorkflow.preemptible_attempts`` -- (recommended value=5) this allows to reduce cost by using preemptible instances | ||
- ``CNVGermlineCaseScatteredWorkflow.mem_gb_for_determine_germline_contig_ploidy`` -- amount of memory alloted for ploidy determination tasks (the lower the cheaper) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alloted -> allotted (here and below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
scripts/cnv_wdl/germline/README.md
Outdated
|
||
#### Required parameters in the scattered germline case workflow | ||
|
||
Same required parameters as in germline case workflow, however in order to reduce wall-clock time and compute cost it is recommended to optimize for the following parameters: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same required parameters as in germline case workflow, however in order to reduce wall-clock time and compute cost it is recommended to optimize for the following parameters: -> Same required parameters as in the germline case workflow. However, in order to reduce wall-clock time and compute cost, it is recommended to optimize for the following parameters:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
scripts/cnv_wdl/germline/README.md
Outdated
|
||
Same required parameters as in germline case workflow, however in order to reduce wall-clock time and compute cost it is recommended to optimize for the following parameters: | ||
|
||
- ``CNVGermlineCaseScatteredWorkflow.num_samples_per_scatter_block`` -- (recommended WES value=25) number of samples to process in a single block; blocks of this size will be sent to germline case workflow and processed in a batch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to germline case -> to the germline case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
File ref_fasta_fai | ||
File ref_fasta | ||
String gatk_docker | ||
Int num_samples_per_scatter_block = 25 #recommended value for WES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I've lost track of how we are supposed to specify default parameter values in WDLs. Having these default values here seems idiosyncratic compared to the other CNV WDLs. Can you not push this down to the task level? Perhaps @LeeTL1220 can comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems more consistent to keep num_samples_per_scatter_block
as required with no default (since this parameter is pretty fundamental to this WDL), and specify the default for preemptible_attempts
in the task using select_first
as we do elsewhere. You've given the defaults for these parameters in the documentation, and we can eventually provide them in the default json for FireCloud, so I think we don't need to set them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that's better. Left num_samples_per_scatter_block
with no default and pushed down default for preemptible_attempts
into select_first
of each task
#### optional basic arguments #### | ||
################################## | ||
File? gatk4_jar_override | ||
Int preemptible_attempts = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
File input_array_file = write_lines(input_array) | ||
|
||
command <<< | ||
python <<CODE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps a comment here describing what the python code is doing would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, done
@@ -1,4 +1,5 @@ | |||
# Workflow for running GATK GermlineCNVCaller on a single case sample. Supports both WGS and WES. | |||
# Workflow for running GATK GermlineCNVCaller on a multiple case samples using a trained model (obtained from running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a multiple case samples -> on multiple case samples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Int num_inputs_in_scatter_block | ||
String gatk_docker | ||
|
||
Int machine_mem_mb = 4000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, I think it is probably fine to hardcode defaults for this simple task (rather than use select_first
and expose these parameters at the workflow level), since it is unlikely that we'll need to adjust these for any given run.
} | ||
|
||
output { | ||
Array[Array[String]] splitArray = read_tsv("input_array_split.tsv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splitArray -> split_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
############################################## | ||
#### optional arguments for CollectCounts #### | ||
############################################## | ||
String? collect_counts_format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any particular reason why you changed format -> collect_counts_format? should we make the same change over in the somatic WDLs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I was confused by it's occurrence somewhere out of the context of CollectCounts task, so I decided to make it more specific. I can make the same change in the somatic WDLs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it in the somatic workflows as well
-Added multi-sample functionality to gCNV case mode WDL, and added a wrapper for gCNV case mode WDL to help optimize cloud computation cost. Also optimized how data is sent to postprocessing task in gCNV WDLs.
69d9f24
to
1b4ede7
Compare
Addresses #4397 and #5054.
Restructured gCNV WDLs to pass data more efficiently to postprocessing tasks, and added a wrapper workflow for gCNV case WDL that scatters samples in multiple blocks. Also cleaned up some of the unused cromwell travis tests.