Add denoised coverage file concatenation output to gCNV postprocessor #5823

asmirnov239 · 2019-03-21T19:56:51Z

This PR adds an option to PostprocessGermlineCNVCalls to concatenate all denoised copy ratio files from the call directories output by GermlineCNVCaller.

codecov-io · 2019-03-21T20:50:33Z

Codecov Report

❗ No coverage uploaded for pull request base (master@06df7e8).
The diff coverage is 75.41%.

@@            Coverage Diff             @@
##             master     #5823   +/-   ##
==========================================
  Coverage          ?   86.834%           
  Complexity        ?     32337           
==========================================
  Files             ?      1994           
  Lines             ?    149405           
  Branches          ?     16492           
==========================================
  Hits              ?    129735           
  Misses            ?     13654           
  Partials          ?      6016

Impacted Files	Coverage Δ	Complexity Δ
...ls/copynumber/gcnv/GermlineCNVNamingConstants.java	`0% <ø> (ø)`	`0 <0> (?)`
...ons/CopyNumberPosteriorDistributionCollection.java	`73.684% <0%> (ø)`	`6 <0> (?)`
...mats/collections/BaselineCopyNumberCollection.java	`63.636% <100%> (ø)`	`3 <0> (?)`
...ls/copynumber/formats/records/LinearCopyRatio.java	`47.059% <47.059%> (ø)`	`5 <5> (?)`
...formats/collections/LinearCopyRatioCollection.java	`55.556% <55.556%> (ø)`	`3 <3> (?)`
...mats/collections/NonLocatableDoubleCollection.java	`70% <70%> (ø)`	`3 <3> (?)`
...er/PostprocessGermlineCNVCallsIntegrationTest.java	`96.117% <88.889%> (ø)`	`18 <0> (?)`
.../tools/copynumber/PostprocessGermlineCNVCalls.java	`93.689% <94.643%> (ø)`	`30 <7> (?)`

samuelklee

Thanks for adding this, @asmirnov239!

I'm a bit conflicted about what the column headers should be for the new file formats introduced here. Perhaps @mwalker174 can chime in.

We have two formats: 1) per-shard files that contain only the denoised CR posterior means (without intervals), which are essentially just internal files used to pass results from the Python code to the Java code, and 2) the final output containing both intervals and the denoised CR posterior means.
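As a rough illustration of how the two formats relate (a sketch only, not the Java implementation in this PR): the per-shard files carry one value per interval and no coordinates, while the final output pairs the concatenated values with their intervals. The function name, the in-memory representation, and the `LINEAR_COPY_RATIO` header used below are assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end)


def concatenate_shards(shard_intervals: List[List[Interval]],
                       shard_values: List[List[float]]) -> List[str]:
    """Pair each shard's non-locatable values with its intervals and
    concatenate the shards, in order, into tab-separated output lines."""
    if len(shard_intervals) != len(shard_values):
        raise ValueError("Need one value list per interval list.")
    # Hypothetical header; the actual column name is what this thread debates.
    lines = ["CONTIG\tSTART\tEND\tLINEAR_COPY_RATIO"]
    for intervals, values in zip(shard_intervals, shard_values):
        if len(intervals) != len(values):
            raise ValueError("Shard has mismatched interval and value counts.")
        for (contig, start, end), value in zip(intervals, values):
            lines.append(f"{contig}\t{start}\t{end}\t{value}")
    return lines
```

The length checks matter because the per-shard values are only meaningful by position: a value with no matching interval cannot be placed on the genome.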

My initial inclination was to use generic NonLocatableLinearCopyRatio records to represent the first and LinearCopyRatio records to represent the second. This means we'd need to change the gCNV code (which currently uses the column header DENOISED_COPY_RATIO_MEAN for the first, which is somewhere between generic and precise). This is because we use a generic CopyRatio record (with the column header LOG2_COPY_RATIO) to represent both log2 standardized and denoised CRs over in the somatic pipeline. My comments in the PR so far assume we go in this direction.

However, we may want to go in the opposite direction, and instead make the column headers specific, verbose, and precise. In that case, DENOISED_LINEAR_COPY_RATIO_POSTERIOR_MEAN could be used as the column header for both, and we'd have corresponding NonLocatableDenoisedLinearCopyRatioPosteriorMean and DenoisedLinearCopyRatioPosteriorMean records. In this case, I would also be inclined to just go ahead and add the posterior standard deviations as another column (which would require a second internal format if we want to be specific---alternatively, we could just use a generic VALUE column header for both the non-locatable mean and standard deviation).

This is all made a bit more confusing by the fact that the denoised CRs, as defined by gCNV, can formally be negative! (Otherwise, I would be inclined to just convert them to log2 and use CopyRatio records.)
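To make the obstacle concrete: log2 is undefined for zero or negative inputs, so a negative denoised copy ratio has no log2 representation. A minimal sketch, with a hypothetical helper name that is not part of GATK:

```python
import math
from typing import Optional


def to_log2_copy_ratio(linear_copy_ratio: float) -> Optional[float]:
    """Return log2 of a linear copy ratio, or None when it is undefined
    (zero or negative input)."""
    if linear_copy_ratio <= 0.0:
        return None
    return math.log2(linear_copy_ratio)
```

Any conversion to log2 records would therefore have to drop or clamp such values, which is one reason to keep a separate linear representation.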

samuelklee · 2019-03-22T19:36:25Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

 import org.broadinstitute.hellbender.tools.copynumber.formats.records.CopyNumberPosteriorDistribution;
+import org.broadinstitute.hellbender.tools.copynumber.formats.records.LinearNonLocatableCopyRatio;


Let's make this NonLocatableLinearCopyRatio. Sorry to be a stickler.

samuelklee · 2019-03-22T19:38:01Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

 import org.broadinstitute.hellbender.tools.copynumber.formats.records.CopyNumberPosteriorDistribution;
+import org.broadinstitute.hellbender.tools.copynumber.formats.records.LinearNonLocatableCopyRatio;
+import org.broadinstitute.hellbender.tools.copynumber.formats.records.DenoisedLocatableCopyRatio;


Likewise, this should probably be LinearCopyRatio. The reasoning is that we want to distinguish from log2 CopyRatio records, but since those are used to represent both standardized and denoised copy ratio, we probably don't need to add the Denoised modifier here.

samuelklee · 2019-03-22T19:38:33Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

@@ -38,7 +41,8 @@
 import java.util.stream.IntStream;

 /**
- * Postprocesses the output of {@link GermlineCNVCaller} and generates VCF files.
+ * Postprocesses the output of {@link GermlineCNVCaller} and generates VCF files as well as concatenated denoised


as concatenated -> as a concatenated

samuelklee · 2019-03-22T19:42:00Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

@@ -57,6 +61,9 @@
 * by the sex karyotype of the sample and is set to the pre-determined contig ploidy state fetched from the output
 * calls of {@link DetermineGermlineContigPloidy}.</p>
 *
+ * <p>Finally, the Postprocessor concatenates denoised copy ratio tables from all the call shards produced by the


Probably a bit misleading to refer to "tables" here. Suggested wording: "Finally, the tool concatenates posterior means for denoised copy ratios from all the call shards produced by the {@link GermlineCNVCaller} into a single file."

Update the Required inputs section below to include the output path.

Oops, also update the call to CopyNumberArgumentValidationUtils.validateOutputFiles to include the denoised CR path.

Oh good catch! Done

samuelklee · 2019-03-22T19:49:39Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

 * </pre>
 *
 * @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
 * @author Andrey Smirnov &lt;asmirnov@broadinstitute.org&gt;
 */
 @CommandLineProgramProperties(
-        summary = "Postprocesses the output of GermlineCNVCaller and generates VCF files",
-        oneLineSummary = "Postprocesses the output of GermlineCNVCaller and generates VCF files",
+        summary = "Postprocesses the output of GermlineCNVCaller, generates VCF files and concatenates denoised copy ratio files ",


Oxford comma!!!

That said, it's a bit odd to stress "concatenation" for just the denoised CRs, since we are pretty much concatenating other quantities to generate the VCFs. Maybe "Postprocesses the output of GermlineCNVCaller and generates VCFs and denoised copy ratios" is fine.

samuelklee · 2019-03-22T20:38:37Z

...te/hellbender/tools/copynumber/formats/collections/DenoisedLocatableCopyRatioCollection.java

+        CONTIG,
+        START,
+        END,
+        DENOISED_COPY_RATIO;


This should just be LINEAR_COPY_RATIO.

samuelklee · 2019-03-22T20:41:46Z

...ain/java/org/broadinstitute/hellbender/tools/copynumber/gcnv/GermlineCNVNamingConstants.java

@@ -15,4 +15,6 @@
    public final static String BASELINE_COPY_NUMBER_TABLE_COLUMN = "BASELINE_COPY_NUMBER";
    public final static String SAMPLE_PREFIX = "SAMPLE_";
    public final static String INTERVAL_LIST_FILE_NAME = "interval_list.tsv";
+    public final static String DENOISED_COPY_RATIO_MEAN_FILE_NAME = "mu_denoised_copy_ratio_t.tsv";
+    public final static String DENOISED_COPY_RATIO_MEAN_TABLE_COLUMN = "DENOISED_COPY_RATIO_MEAN";


Let's think carefully about what this column should be---we can discuss with @mwalker174 in person.

samuelklee · 2019-03-22T20:42:59Z

.../broadinstitute/hellbender/tools/copynumber/formats/records/LinearNonLocatableCopyRatio.java

+
+    @Override
+    public boolean equals(Object o) {
+        if (this == o) return true;


Add some braces, make that final, and clean up this automatically generated code.

samuelklee · 2019-03-22T20:43:06Z

...g/broadinstitute/hellbender/tools/copynumber/formats/records/DenoisedLocatableCopyRatio.java

+
+    @Override
+    public boolean equals(Object o) {
+        if (this == o) return true;


Add some braces, make that final, and clean up this automatically generated code.

samuelklee · 2019-03-22T20:43:56Z

...adinstitute/hellbender/tools/copynumber/gcnv-postprocess/denoised_copy_ratios_SAMPLE_000.tsv

@@ -0,0 +1,674 @@
+@HD	VN:1.6


Can you add a comment in #4007 to briefly describe what you did to regenerate this test data?

asmirnov239 · 2019-04-16T19:49:00Z

I addressed the PR comments and added the changes to the TSV file headers output by gCNV. @samuelklee and @mwalker174, could you please take a look?

mwalker174

Looks straightforward. I have one minor suggestion about the code. Did you check the test files thoroughly? At a glance the changes in output look insignificant, but we should really be sure.

mwalker174 · 2019-04-16T21:25:28Z

...roadinstitute/hellbender/tools/copynumber/formats/collections/LinearCopyRatioCollection.java

-                .append(denoisedLocatableCopyRatioCopyRatio.getStart())
-                .append(denoisedLocatableCopyRatioCopyRatio.getEnd())
-                .append(denoisedLocatableCopyRatioCopyRatio.getDenoisedCopyRatio().getDenoisedCopyRatio());
+    private static BiConsumer<LinearCopyRatio, DataLine> getDenoisedLocatableCopyRatioRecordToDataLineEncoder() {


Should this function be renamed to getLinearCopyRatioRecordToDataLineEncoder?

Yes, good catch

samuelklee

Thanks, looks good for the most part! A few more minor issues and then some questions about why there are non-trivial differences in the test files---apologies if we discussed this previously or in person, but I can't recall.

samuelklee · 2019-04-22T18:04:51Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

 * </pre>
 *
 * @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
 * @author Andrey Smirnov &lt;asmirnov@broadinstitute.org&gt;
 */
 @CommandLineProgramProperties(
-        summary = "Postprocesses the output of GermlineCNVCaller and generates VCF files",
-        oneLineSummary = "Postprocesses the output of GermlineCNVCaller and generates VCF files",
+        summary = "Postprocesses the output of GermlineCNVCaller and generates VCFs and denoised copy ratios.",


Remove periods after both summaries.

samuelklee · 2019-04-22T18:05:54Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/PostprocessGermlineCNVCalls.java

@@ -160,11 +169,22 @@
    )
    private File outputSegmentsVCFFile;

+    @Argument(
+            doc = "Output denoised copy ratio file concatenated together from call shards.",


Just "Output denoised copy ratio file." is probably fine.

samuelklee · 2019-04-22T18:08:13Z

...dinstitute/hellbender/tools/copynumber/formats/collections/NonLocatableDoubleCollection.java

+
+    public NonLocatableDoubleCollection(final File inputFile) {
+        super(inputFile,
+                new TableColumnCollection(GermlineCNVNamingConstants.DEFAULT_GCNV_OUTPUT_COLUMN_PREFIX + "0"),


Maybe extract GermlineCNVNamingConstants.DEFAULT_GCNV_OUTPUT_COLUMN_PREFIX + "0" as well.

samuelklee · 2019-04-22T18:09:58Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_commons.py

@@ -153,10 +153,10 @@ def assert_output_path_writable(output_path: str,

 def write_ndarray_to_tsv(output_file: str,
                         array: np.ndarray,
-                         comment=io_consts.default_comment_char,
+                         comment_char=io_consts.default_comment_char,


I think the comment parameter name was fine (since this is what is used by pandas, etc.), but up to you.

samuelklee · 2019-04-22T18:13:16Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_commons.py

-                    dtype = _get_value('dtype', stripped_line)
-                if shape is None:
-                    shape = _get_value('shape', stripped_line)
+                key, value = parse_sam_comment(stripped_line)


Might be worth checking that there aren't multiple dtype/shape lines present?

I agree, done
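A check along these lines could look like the sketch below. It is an illustration under stated assumptions, not the code in `io_commons.py`: the function name, the `@key=value` header syntax, and the default prefix are hypothetical, and the real parsing goes through `parse_sam_comment`, whose signature is not shown in this thread.

```python
from typing import Dict, Iterable


def read_header_values(lines: Iterable[str],
                       comment_prefix: str = "@",
                       keys=("dtype", "shape")) -> Dict[str, str]:
    """Collect key=value pairs from leading comment lines, rejecting
    any key that appears more than once."""
    found: Dict[str, str] = {}
    for line in lines:
        stripped = line.strip()
        if not stripped.startswith(comment_prefix):
            break  # header ends at the first non-comment line
        key, sep, value = stripped[len(comment_prefix):].partition("=")
        key = key.strip()
        if not sep or key not in keys:
            continue  # some other comment line, e.g. a read-group line
        if key in found:
            raise ValueError(f"Multiple '{key}' header lines found.")
        found[key] = value.strip()
    return found
```

Failing loudly on a repeated key avoids silently taking the first or the last value when a file has been concatenated or edited incorrectly.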

samuelklee · 2019-04-22T18:20:59Z

...stitute/hellbender/tools/copynumber/gcnv-postprocess/ploidy-calls/SAMPLE_0/contig_ploidy.tsv

-3	2	133.59170389084028
-X	1	147.62074680826697
-Y	1	153.52529778863038
+1	2	123.4391607890214


Why does this cover 1-5, X, Y now (as opposed to 1-3, X, Y previously)?

samuelklee · 2019-04-22T18:22:47Z

...ute/hellbender/tools/copynumber/gcnv-postprocess/ploidy-calls/SAMPLE_0/global_read_depth.tsv

@@ -1,3 +1,3 @@
 @RG	ID:GATKCopyNumber	SM:SAMPLE_000
 GLOBAL_READ_DEPTH	AVERAGE_PLOIDY
-830.2515779981966	1.6701807228915662
+402.21816418875244	1.9669421487603307


This also looks pretty different?

samuelklee · 2019-04-22T18:26:28Z

...bender/tools/copynumber/gcnv-postprocess/shard_0-calls/SAMPLE_0/mu_denoised_copy_ratio_t.tsv

@@ -0,0 +1,225 @@
+@RG	ID:GATKCopyNumber	SM:SAMPLE_000


Wait, so are these files supposed to have dtype and shape header lines? I thought all mu_ and std_ files would?

Fixed this; some calls to write_ndarray_to_tsv were overriding write_shape_info.

samuelklee · 2019-04-22T18:26:58Z

...ender/tools/copynumber/gcnv-postprocess/shard_0-calls/SAMPLE_0/std_denoised_copy_ratio_t.tsv

@@ -0,0 +1,225 @@
+@RG	ID:GATKCopyNumber	SM:SAMPLE_000


samuelklee · 2019-04-22T18:27:41Z

...llbender/tools/copynumber/gcnv-postprocess/shard_1-calls/SAMPLE_0/baseline_copy_number_t.tsv

-1
-1
-1
+2


Are these differences expected?

samuelklee · 2019-08-06T14:46:06Z

Looks like you need to resolve conflicts with the QC branch you just merged? But I'll trust that you addressed everything else and go ahead and approve.


gokalpcelik · 2019-08-17T09:58:17Z

Hi, it looks like this change breaks compatibility with models generated by older versions of the tool (4.1.2.0) when they are used for CASE analysis.

I've also posted a report about a possible regression in io_commons.py that probably breaks compatibility with the older files.

https://gatkforums.broadinstitute.org/gatk/discussion/24347/germlinecnv-tools-regression-or-incompatibility-between-gatk-env-4-1-3-0-and-gatk#latest

Can you comment on that?

asmirnov239 requested a review from samuelklee March 21, 2019 19:57

asmirnov239 force-pushed the as_concat_denoised_cr branch 2 times, most recently from 11c3256 to 5cc6c71 Compare March 21, 2019 22:11

asmirnov239 assigned samuelklee Mar 22, 2019

samuelklee requested changes Mar 22, 2019


asmirnov239 force-pushed the as_concat_denoised_cr branch from 1643491 to a8dcff4 Compare April 16, 2019 02:25

asmirnov239 assigned mwalker174 and samuelklee and unassigned samuelklee Apr 16, 2019

asmirnov239 requested a review from mwalker174 April 16, 2019 19:47

mwalker174 approved these changes Apr 16, 2019


asmirnov239 force-pushed the as_concat_denoised_cr branch 2 times, most recently from b21f646 to 0cf9802 Compare April 18, 2019 22:38

samuelklee requested changes Apr 22, 2019


asmirnov239 force-pushed the as_concat_denoised_cr branch 2 times, most recently from b2a1ea2 to 46c0ea0 Compare August 6, 2019 03:12

samuelklee approved these changes Aug 6, 2019


Germline CNV postprocessor now concatenates the denoised coverage results from shards output generated by GermlineCNVCaller

ace073a

asmirnov239 force-pushed the as_concat_denoised_cr branch from a45ae20 to ace073a Compare August 6, 2019 15:24

asmirnov239 added 2 commits August 6, 2019 12:29

Interval file fix

3bf8328

Another fix

85a5d39

asmirnov239 merged commit 22c0f97 into master Aug 7, 2019

samuelklee mentioned this pull request Aug 28, 2019

Change PostprocessGermlineCNVCalls to use output prefixes. #6128

Closed

samuelklee mentioned this pull request Oct 15, 2021

Expose number of samples for emitting denoised copy ratios in gCNV. #5754

Closed
