-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add denoised coverage file concatenation output to gCNV postprocessor #5823
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5823 +/- ##
==========================================
Coverage ? 86.834%
Complexity ? 32337
==========================================
Files ? 1994
Lines ? 149405
Branches ? 16492
==========================================
Hits ? 129735
Misses ? 13654
Partials ? 6016
|
11c3256
to
5cc6c71
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this, @asmirnov239!
I'm a bit conflicted by what the column headers should be for the new file formats introduced here. Perhaps @mwalker174 can chime in.
We have two formats: 1) per-shard files that contain only the denoised CR posterior means (without intervals), which are just essentially internal files used to pass results from the python code to the Java code, and 2) the final output containing both intervals and the denoised CR posterior means.
My initial inclination was to use generic NonLocatableLinearCopyRatio
records to represent the first and LinearCopyRatio
records to represent the second. This means we'd need to change the gCNV code (which currently uses the column header DENOISED_COPY_RATIO_MEAN
for the first, which is somewhere between generic and precise). This is because we use a generic CopyRatio
record (with the column header LOG2_COPY_RATIO
) to represent both log2 standardized and denoised CRs over in the somatic pipeline. My comments in the PR so far assume we go in this direction.
However, we may want to go in the opposite direction, and instead make the column headers specific, verbose, and precise. In that case, DENOISED_LINEAR_COPY_RATIO_POSTERIOR_MEAN
could be used as the column header for both, and we'd have corresponding NonLocatableDenoisedLinearCopyRatioPosteriorMean
and DenoisedLinearCopyRatioPosteriorMean
records. In this case, I would also be inclined to just go ahead and add the posterior standard deviations as another column (which would require a second internal format if we want to be specific---alternatively, we could just use a generic VALUE
column header for both the non-locatable mean and standard deviation).
This is all made a bit more confusing by the fact that the denoised CRs, as defined by gCNV, can formally be negative! (Otherwise, I would be inclined to just convert them to log2 and use CopyRatio
records.)
import org.broadinstitute.hellbender.tools.copynumber.formats.records.CopyNumberPosteriorDistribution; | ||
import org.broadinstitute.hellbender.tools.copynumber.formats.records.LinearNonLocatableCopyRatio; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this NonLocatableLinearCopyRatio
. Sorry to be a stickler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
import org.broadinstitute.hellbender.tools.copynumber.formats.records.CopyNumberPosteriorDistribution; | ||
import org.broadinstitute.hellbender.tools.copynumber.formats.records.LinearNonLocatableCopyRatio; | ||
import org.broadinstitute.hellbender.tools.copynumber.formats.records.DenoisedLocatableCopyRatio; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, this should probably be LinearCopyRatio
. The reasoning is that we want to distinguish from log2 CopyRatio
records, but since those are used to represent both standardized and denoised copy ratio, we probably don't need to add the Denoised
modifier here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -38,7 +41,8 @@ | |||
import java.util.stream.IntStream; | |||
|
|||
/** | |||
* Postprocesses the output of {@link GermlineCNVCaller} and generates VCF files. | |||
* Postprocesses the output of {@link GermlineCNVCaller} and generates VCF files as well as concatenated denoised |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as concatenated -> as a concatenated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -57,6 +61,9 @@ | |||
* by the sex karyotype of the sample and is set to the pre-determined contig ploidy state fetched from the output | |||
* calls of {@link DetermineGermlineContigPloidy}.</p> | |||
* | |||
* <p>Finally, the Postprocessor concatenates denoised copy ratio tables from all the call shards produced by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a bit misleading to refer to "tables" here. Finally, the tool concatenates posterior means for denoised copy ratios from all the call shards produced by the {@link GermlineCNVCaller} into a single file.
Update the Required inputs
section below to include the output path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, also update the call to CopyNumberArgumentValidationUtils.validateOutputFiles
to include the denoised CR path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good catch! Done
* </pre> | ||
* | ||
* @author Mehrtash Babadi <mehrtash@broadinstitute.org> | ||
* @author Andrey Smirnov <asmirnov@broadinstitute.org> | ||
*/ | ||
@CommandLineProgramProperties( | ||
summary = "Postprocesses the output of GermlineCNVCaller and generates VCF files", | ||
oneLineSummary = "Postprocesses the output of GermlineCNVCaller and generates VCF files", | ||
summary = "Postprocesses the output of GermlineCNVCaller, generates VCF files and concatenates denoised copy ratio files ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oxford comma!!!
That said, it's a bit odd to stress "concatenation" for just the denoised CRs, since we are pretty much concatenating other quantities to generate the VCFs. Maybe Postprocesses the output of GermlineCNVCaller and generates VCFs and denoised copy ratios
is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
CONTIG, | ||
START, | ||
END, | ||
DENOISED_COPY_RATIO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just be LINEAR_COPY_RATIO
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -15,4 +15,6 @@ | |||
public final static String BASELINE_COPY_NUMBER_TABLE_COLUMN = "BASELINE_COPY_NUMBER"; | |||
public final static String SAMPLE_PREFIX = "SAMPLE_"; | |||
public final static String INTERVAL_LIST_FILE_NAME = "interval_list.tsv"; | |||
public final static String DENOISED_COPY_RATIO_MEAN_FILE_NAME = "mu_denoised_copy_ratio_t.tsv"; | |||
public final static String DENOISED_COPY_RATIO_MEAN_TABLE_COLUMN = "DENOISED_COPY_RATIO_MEAN"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's think carefully about what this column should be---we can discuss with @mwalker174 in person.
|
||
@Override | ||
public boolean equals(Object o) { | ||
if (this == o) return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some braces, make that
final, and clean up this automatically generated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
@Override | ||
public boolean equals(Object o) { | ||
if (this == o) return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some braces, make that
final, and clean up this automatically generated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -0,0 +1,674 @@ | |||
@HD VN:1.6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment in #4007 to briefly describe what you did to regenerate this test data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
1643491
to
a8dcff4
Compare
I addressed PR comments and added the changed to the TSV file headers output by gCNV. @samuelklee and @mwalker174, could you please take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks straight-forward. I have one minor suggestion about the code. Did you check the test files thoroughly? At a glance the changes in output look insignificant, but we should really be sure.
.append(denoisedLocatableCopyRatioCopyRatio.getStart()) | ||
.append(denoisedLocatableCopyRatioCopyRatio.getEnd()) | ||
.append(denoisedLocatableCopyRatioCopyRatio.getDenoisedCopyRatio().getDenoisedCopyRatio()); | ||
private static BiConsumer<LinearCopyRatio, DataLine> getDenoisedLocatableCopyRatioRecordToDataLineEncoder() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this function be renamed to getLinearCopyRatioRecordToDataLineEncoder
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch
b21f646
to
0cf9802
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good for the most part! A few more minor issues and then some questions about why there are non-trivial differences in the test files---apologies if we discussed this previously or in person, but I can't recall.
* </pre> | ||
* | ||
* @author Mehrtash Babadi <mehrtash@broadinstitute.org> | ||
* @author Andrey Smirnov <asmirnov@broadinstitute.org> | ||
*/ | ||
@CommandLineProgramProperties( | ||
summary = "Postprocesses the output of GermlineCNVCaller and generates VCF files", | ||
oneLineSummary = "Postprocesses the output of GermlineCNVCaller and generates VCF files", | ||
summary = "Postprocesses the output of GermlineCNVCaller and generates VCFs and denoised copy ratios.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove periods after both summaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -160,11 +169,22 @@ | |||
) | |||
private File outputSegmentsVCFFile; | |||
|
|||
@Argument( | |||
doc = "Output denoised copy ratio file concatenated together from call shards.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just "Output denoised copy ratio file." is probably fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
public NonLocatableDoubleCollection(final File inputFile) { | ||
super(inputFile, | ||
new TableColumnCollection(GermlineCNVNamingConstants.DEFAULT_GCNV_OUTPUT_COLUMN_PREFIX + "0"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe extract GermlineCNVNamingConstants.DEFAULT_GCNV_OUTPUT_COLUMN_PREFIX + "0"
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -153,10 +153,10 @@ def assert_output_path_writable(output_path: str, | |||
|
|||
def write_ndarray_to_tsv(output_file: str, | |||
array: np.ndarray, | |||
comment=io_consts.default_comment_char, | |||
comment_char=io_consts.default_comment_char, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment
parameter name was fine (since this is what is used by pandas, etc.), but up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
dtype = _get_value('dtype', stripped_line) | ||
if shape is None: | ||
shape = _get_value('shape', stripped_line) | ||
key, value = parse_sam_comment(stripped_line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth checking that there aren't multiple dtype/shape lines present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, done
3 2 133.59170389084028 | ||
X 1 147.62074680826697 | ||
Y 1 153.52529778863038 | ||
1 2 123.4391607890214 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this cover 1-5, X, Y now (as opposed to 1-3, X, Y previously)?
@@ -1,3 +1,3 @@ | |||
@RG ID:GATKCopyNumber SM:SAMPLE_000 | |||
GLOBAL_READ_DEPTH AVERAGE_PLOIDY | |||
830.2515779981966 1.6701807228915662 | |||
402.21816418875244 1.9669421487603307 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also looks pretty different?
@@ -0,0 +1,225 @@ | |||
@RG ID:GATKCopyNumber SM:SAMPLE_000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, so are these files supposed to have dtype
and shape
header lines? I thought all mu_
and std_
files would?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this, some calls to write_ndarray_to_tsv
overriden write_shape_info
@@ -0,0 +1,225 @@ | |||
@RG ID:GATKCopyNumber SM:SAMPLE_000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
1 | ||
1 | ||
1 | ||
2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these differences expected?
b2a1ea2
to
46c0ea0
Compare
Looks like you need to resolve conflicts with the QC branch you just merged? But I'll trust that you addressed everything else and go ahead and approve. |
…ults from shards output generated by GermlineCNVCaller
a45ae20
to
ace073a
Compare
Hi looks like this change breaks the compatibility between old models generated by older versions of the tool (4.1.2.0) for the use of CASE analysis. I've posted a regression also regarding io.commons.py that is breaking the compatibility with the older files. (probably?) Can you comment on that? |
This PR adds an option to PostprocessGermlineCNVCalls to concatenate all denoised copy ratio files from call directories output by GermlineCNVCaller