Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add denoised coverage file concatenation output to gCNV postprocessor #5823

Merged
merged 3 commits into from
Aug 7, 2019

Conversation

asmirnov239
Copy link
Collaborator

This PR adds an option to PostprocessGermlineCNVCalls to concatenate all denoised copy ratio files from call directories output by GermlineCNVCaller

@asmirnov239 asmirnov239 requested a review from samuelklee March 21, 2019 19:57
@codecov-io
Copy link

codecov-io commented Mar 21, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@06df7e8). Click here to learn what that means.
The diff coverage is 75.41%.

@@            Coverage Diff             @@
##             master     #5823   +/-   ##
==========================================
  Coverage          ?   86.834%           
  Complexity        ?     32337           
==========================================
  Files             ?      1994           
  Lines             ?    149405           
  Branches          ?     16492           
==========================================
  Hits              ?    129735           
  Misses            ?     13654           
  Partials          ?      6016
Impacted Files Coverage Δ Complexity Δ
...ls/copynumber/gcnv/GermlineCNVNamingConstants.java 0% <ø> (ø) 0 <0> (?)
...ons/CopyNumberPosteriorDistributionCollection.java 73.684% <0%> (ø) 6 <0> (?)
...mats/collections/BaselineCopyNumberCollection.java 63.636% <100%> (ø) 3 <0> (?)
...ls/copynumber/formats/records/LinearCopyRatio.java 47.059% <47.059%> (ø) 5 <5> (?)
...formats/collections/LinearCopyRatioCollection.java 55.556% <55.556%> (ø) 3 <3> (?)
...mats/collections/NonLocatableDoubleCollection.java 70% <70%> (ø) 3 <3> (?)
...er/PostprocessGermlineCNVCallsIntegrationTest.java 96.117% <88.889%> (ø) 18 <0> (?)
.../tools/copynumber/PostprocessGermlineCNVCalls.java 93.689% <94.643%> (ø) 30 <7> (?)

@asmirnov239 asmirnov239 force-pushed the as_concat_denoised_cr branch 2 times, most recently from 11c3256 to 5cc6c71 Compare March 21, 2019 22:11
Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this, @asmirnov239!

I'm a bit conflicted by what the column headers should be for the new file formats introduced here. Perhaps @mwalker174 can chime in.

We have two formats: 1) per-shard files that contain only the denoised CR posterior means (without intervals), which are just essentially internal files used to pass results from the python code to the Java code, and 2) the final output containing both intervals and the denoised CR posterior means.

My initial inclination was to use generic NonLocatableLinearCopyRatio records to represent the first and LinearCopyRatio records to represent the second. This means we'd need to change the gCNV code (which currently uses the column header DENOISED_COPY_RATIO_MEAN for the first, which is somewhere between generic and precise). This is because we use a generic CopyRatio record (with the column header LOG2_COPY_RATIO) to represent both log2 standardized and denoised CRs over in the somatic pipeline. My comments in the PR so far assume we go in this direction.

However, we may want to go in the opposite direction, and instead make the column headers specific, verbose, and precise. In that case, DENOISED_LINEAR_COPY_RATIO_POSTERIOR_MEAN could be used as the column header for both, and we'd have corresponding NonLocatableDenoisedLinearCopyRatioPosteriorMean and DenoisedLinearCopyRatioPosteriorMean records. In this case, I would also be inclined to just go ahead and add the posterior standard deviations as another column (which would require a second internal format if we want to be specific---alternatively, we could just use a generic VALUE column header for both the non-locatable mean and standard deviation).

This is all made a bit more confusing by the fact that the denoised CRs, as defined by gCNV, can formally be negative! (Otherwise, I would be inclined to just convert them to log2 and use CopyRatio records.)

import org.broadinstitute.hellbender.tools.copynumber.formats.records.CopyNumberPosteriorDistribution;
import org.broadinstitute.hellbender.tools.copynumber.formats.records.LinearNonLocatableCopyRatio;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this NonLocatableLinearCopyRatio. Sorry to be a stickler.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import org.broadinstitute.hellbender.tools.copynumber.formats.records.CopyNumberPosteriorDistribution;
import org.broadinstitute.hellbender.tools.copynumber.formats.records.LinearNonLocatableCopyRatio;
import org.broadinstitute.hellbender.tools.copynumber.formats.records.DenoisedLocatableCopyRatio;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, this should probably be LinearCopyRatio. The reasoning is that we want to distinguish from log2 CopyRatio records, but since those are used to represent both standardized and denoised copy ratio, we probably don't need to add the Denoised modifier here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -38,7 +41,8 @@
import java.util.stream.IntStream;

/**
* Postprocesses the output of {@link GermlineCNVCaller} and generates VCF files.
* Postprocesses the output of {@link GermlineCNVCaller} and generates VCF files as well as concatenated denoised
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as concatenated -> as a concatenated

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -57,6 +61,9 @@
* by the sex karyotype of the sample and is set to the pre-determined contig ploidy state fetched from the output
* calls of {@link DetermineGermlineContigPloidy}.</p>
*
* <p>Finally, the Postprocessor concatenates denoised copy ratio tables from all the call shards produced by the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a bit misleading to refer to "tables" here. Finally, the tool concatenates posterior means for denoised copy ratios from all the call shards produced by the {@link GermlineCNVCaller} into a single file.

Update the Required inputs section below to include the output path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, also update the call to CopyNumberArgumentValidationUtils.validateOutputFiles to include the denoised CR path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good catch! Done

* </pre>
*
* @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
* @author Andrey Smirnov &lt;asmirnov@broadinstitute.org&gt;
*/
@CommandLineProgramProperties(
summary = "Postprocesses the output of GermlineCNVCaller and generates VCF files",
oneLineSummary = "Postprocesses the output of GermlineCNVCaller and generates VCF files",
summary = "Postprocesses the output of GermlineCNVCaller, generates VCF files and concatenates denoised copy ratio files ",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oxford comma!!!

That said, it's a bit odd to stress "concatenation" for just the denoised CRs, since we are pretty much concatenating other quantities to generate the VCFs. Maybe Postprocesses the output of GermlineCNVCaller and generates VCFs and denoised copy ratios is fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

CONTIG,
START,
END,
DENOISED_COPY_RATIO;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just be LINEAR_COPY_RATIO.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -15,4 +15,6 @@
public final static String BASELINE_COPY_NUMBER_TABLE_COLUMN = "BASELINE_COPY_NUMBER";
public final static String SAMPLE_PREFIX = "SAMPLE_";
public final static String INTERVAL_LIST_FILE_NAME = "interval_list.tsv";
public final static String DENOISED_COPY_RATIO_MEAN_FILE_NAME = "mu_denoised_copy_ratio_t.tsv";
public final static String DENOISED_COPY_RATIO_MEAN_TABLE_COLUMN = "DENOISED_COPY_RATIO_MEAN";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's think carefully about what this column should be---we can discuss with @mwalker174 in person.


@Override
public boolean equals(Object o) {
if (this == o) return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some braces, make that final, and clean up this automatically generated code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


@Override
public boolean equals(Object o) {
if (this == o) return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some braces, make that final, and clean up this automatically generated code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -0,0 +1,674 @@
@HD VN:1.6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment in #4007 to briefly describe what you did to regenerate this test data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

@asmirnov239
Copy link
Collaborator Author

I addressed PR comments and added the changed to the TSV file headers output by gCNV. @samuelklee and @mwalker174, could you please take a look?

Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks straight-forward. I have one minor suggestion about the code. Did you check the test files thoroughly? At a glance the changes in output look insignificant, but we should really be sure.

.append(denoisedLocatableCopyRatioCopyRatio.getStart())
.append(denoisedLocatableCopyRatioCopyRatio.getEnd())
.append(denoisedLocatableCopyRatioCopyRatio.getDenoisedCopyRatio().getDenoisedCopyRatio());
private static BiConsumer<LinearCopyRatio, DataLine> getDenoisedLocatableCopyRatioRecordToDataLineEncoder() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this function be renamed to getLinearCopyRatioRecordToDataLineEncoder?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch

@asmirnov239 asmirnov239 force-pushed the as_concat_denoised_cr branch 2 times, most recently from b21f646 to 0cf9802 Compare April 18, 2019 22:38
Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good for the most part! A few more minor issues and then some questions about why there are non-trivial differences in the test files---apologies if we discussed this previously or in person, but I can't recall.

* </pre>
*
* @author Mehrtash Babadi &lt;mehrtash@broadinstitute.org&gt;
* @author Andrey Smirnov &lt;asmirnov@broadinstitute.org&gt;
*/
@CommandLineProgramProperties(
summary = "Postprocesses the output of GermlineCNVCaller and generates VCF files",
oneLineSummary = "Postprocesses the output of GermlineCNVCaller and generates VCF files",
summary = "Postprocesses the output of GermlineCNVCaller and generates VCFs and denoised copy ratios.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove periods after both summaries.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -160,11 +169,22 @@
)
private File outputSegmentsVCFFile;

@Argument(
doc = "Output denoised copy ratio file concatenated together from call shards.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just "Output denoised copy ratio file." is probably fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


public NonLocatableDoubleCollection(final File inputFile) {
super(inputFile,
new TableColumnCollection(GermlineCNVNamingConstants.DEFAULT_GCNV_OUTPUT_COLUMN_PREFIX + "0"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extract GermlineCNVNamingConstants.DEFAULT_GCNV_OUTPUT_COLUMN_PREFIX + "0" as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -153,10 +153,10 @@ def assert_output_path_writable(output_path: str,

def write_ndarray_to_tsv(output_file: str,
array: np.ndarray,
comment=io_consts.default_comment_char,
comment_char=io_consts.default_comment_char,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment parameter name was fine (since this is what is used by pandas, etc.), but up to you.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

dtype = _get_value('dtype', stripped_line)
if shape is None:
shape = _get_value('shape', stripped_line)
key, value = parse_sam_comment(stripped_line)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth checking that there aren't multiple dtype/shape lines present?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, done

3 2 133.59170389084028
X 1 147.62074680826697
Y 1 153.52529778863038
1 2 123.4391607890214
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this cover 1-5, X, Y now (as opposed to 1-3, X, Y previously)?

@@ -1,3 +1,3 @@
@RG ID:GATKCopyNumber SM:SAMPLE_000
GLOBAL_READ_DEPTH AVERAGE_PLOIDY
830.2515779981966 1.6701807228915662
402.21816418875244 1.9669421487603307
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also looks pretty different?

@@ -0,0 +1,225 @@
@RG ID:GATKCopyNumber SM:SAMPLE_000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, so are these files supposed to have dtype and shape header lines? I thought all mu_ and std_ files would?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this, some calls to write_ndarray_to_tsv overriden write_shape_info

@@ -0,0 +1,225 @@
@RG ID:GATKCopyNumber SM:SAMPLE_000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

1
1
1
2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these differences expected?

@asmirnov239 asmirnov239 force-pushed the as_concat_denoised_cr branch 2 times, most recently from b2a1ea2 to 46c0ea0 Compare August 6, 2019 03:12
@samuelklee
Copy link
Contributor

Looks like you need to resolve conflicts with the QC branch you just merged? But I'll trust that you addressed everything else and go ahead and approve.

…ults from shards output generated by GermlineCNVCaller
@asmirnov239 asmirnov239 force-pushed the as_concat_denoised_cr branch from a45ae20 to ace073a Compare August 6, 2019 15:24
@asmirnov239 asmirnov239 merged commit 22c0f97 into master Aug 7, 2019
@gokalpcelik
Copy link
Contributor

Hi looks like this change breaks the compatibility between old models generated by older versions of the tool (4.1.2.0) for the use of CASE analysis.

I've posted a regression also regarding io.commons.py that is breaking the compatibility with the older files. (probably?)

https://gatkforums.broadinstitute.org/gatk/discussion/24347/germlinecnv-tools-regression-or-incompatibility-between-gatk-env-4-1-3-0-and-gatk#latest

Can you comment on that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants