-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding base level comparison modes for CompareReferences tool #7987
Conversation
… do base comparison on >2 references
Github actions tests reported job failures from actions build 2841114156
|
ff3ef6b
to
2038fd1
Compare
Github actions tests reported job failures from actions build 2841313433
|
2038fd1
to
164bad3
Compare
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #7987 +/- ##
================================================
- Coverage 52.260% 20.735% -31.524%
+ Complexity 29146 12332 -16814
================================================
Files 2310 2331 +21
Lines 180344 181988 +1644
Branches 19840 19985 +145
================================================
- Hits 94247 37736 -56511
- Misses 80124 140700 +60576
+ Partials 5973 3552 -2421
|
…-- need intel build
164bad3
to
40c723f
Compare
Github actions tests reported job failures from actions build 2842345810
|
Github actions tests reported job failures from actions build 2842334324
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkpoint review of everything except for the mummer excectuor
@@ -141,6 +171,28 @@ public void traverse(){ | |||
for(ReferencePair pair : referencePairs){ | |||
System.out.println(pair); | |||
} | |||
|
|||
if(referencePairs.size() != 1 && baseComparisonMode != BaseComparisonMode.NO_BASE_COMPARISON){ | |||
throw new UserException.BadInput(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finish this warning message
} | ||
if(baseComparisonMode != BaseComparisonMode.NO_BASE_COMPARISON){ | ||
if(baseComparisonOutputDirectory == null) { | ||
throw new UserException.CouldNotCreateOutputFile(baseComparisonOutputDirectory, "Output directory not provided but required in base comparison mode."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in "--argument name BASE_COMPARISON" mode
@@ -141,6 +171,28 @@ public void traverse(){ | |||
for(ReferencePair pair : referencePairs){ | |||
System.out.println(pair); | |||
} | |||
|
|||
if(referencePairs.size() != 1 && baseComparisonMode != BaseComparisonMode.NO_BASE_COMPARISON){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move these sanity checks to before the tool starts doing any of the work, (perhaps in "custom commandline validation"
// pass fastas into mummer, get back a vcf and add to list | ||
MummerExecutor executor = new MummerExecutor(); | ||
logger.info("Running mummer alignment on sequence " + sequenceName); | ||
File tempSnpsDirectory = IOUtils.createTempDir("tempsnps"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you are making a whole new tmpdir within each of these loops for every mismatching contig... This can get to be a lot potentially and these are all java tmpdirs that close on the jvm exit too (with apprxoimately 2 uncompressed references' worth of inputs in here....
I think this method should probably manage the deletion of the tmp dir itself to cut down on disk usage. @droazen thoughs?
} | ||
// merge individual snps files | ||
File snps = IOUtils.createTempFile(String.format("%s_%s", refPair.getRef1AsString(), refPair.getRef2AsString()), ".snps"); | ||
/*new File(baseComparisonOutputDirectory.toPath().toString(), String.format("%s_%s.snps", refPair.getRef1AsString(), refPair.getRef2AsString()));*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commented code
SNPRecord record = new SNPRecord(sequenceName, position, new String(new byte[]{ref1Allele}), new String(new byte[]{ref2Allele}), refPair.getRef1AsString(), refPair.getRef2AsString()); | ||
writer.writeRecord(record); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a check at the end that both are finished when you kick out and throw an exception otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do i need this in addition to the check earlier in the loop that the sequence lengths for the current sequence are equal? if i want one specifically on these iterators, would that be a user exception? should be same sequence length as is being used in the first check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen This is an error that should never happen, since you've already checked that the sequence lengths are the same. If it does happen that one iterator finishes before the other, something has gone horribly wrong internally. You'd probably want to throw a ShouldNeverReachHereException
in that case, since we never expect it to happen in practice.
int sequenceLength = source.getSequenceDictionary().getSequence(sequenceName).getSequenceLength(); | ||
FastaReferenceWriter writer = new FastaReferenceWriterBuilder() | ||
.setFastaFile(output.toPath()) | ||
.setBasesPerLine(80) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
magic ints should be variales somewhere.
@@ -35,7 +35,7 @@ | |||
* between command/script/module execution. Using -i doesn't buy you anything (for this version of the executor, at | |||
* least) since the process is terminated after each command completes. | |||
*/ | |||
public class PythonScriptExecutor extends PythonExecutorBase { | |||
public class PythonScriptExecutor extends PythonExecutorBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accidental, please remove
@@ -16,7 +16,8 @@ | |||
* {@link #getApproximateCommandLine} | |||
* {@link #getScriptException} | |||
*/ | |||
public abstract class ScriptExecutor { | |||
public abstract class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accidental, please remove
|
||
// need intel build for MUMmer | ||
@Test(enabled = false) | ||
public void testExecuteMummer() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are going to need a test that asserts we get a nice user exception in the even that hte correct installation of mummer doesn't exist for the users machine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen Here's part 1 of my review -- still need to review the tests and MummerExecutor
@@ -0,0 +1,42 @@ | |||
package org.broadinstitute.hellbender.tools.reference; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unit test class should be in src/test/java/org/broadinstitute/hellbender/utils/alignment/
import java.io.File; | ||
import java.io.IOException; | ||
|
||
public class MummerExecutorUnitTest extends CommandLineProgramTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test classes should extend GATKBaseTest
, not CommandLineProgramTest
@@ -93,6 +105,15 @@ public class CompareReferences extends GATKTool { | |||
@Argument(fullName = "display-only-differing-sequences", doc = "If provided, only display sequence names that differ in their actual sequence.", optional = true) | |||
private boolean onlyDisplayDifferingSequences = false; | |||
|
|||
@Argument(fullName = "base-comparison", doc = "If provided, any mismatching, same-length sequences will be aligned for a base-comparison.", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argument description is incorrect, since full alignment mode can handle sequences of different lengths. Should be something like: Mode for base-level comparisons. Off by default, but can do either full alignment of mismatching sequences to find both SNPs and indels (FULL_ALIGNMENT), or can find SNPs only in mismatching sequences of the same length (FIND_SNPS_ONLY)
public enum BaseComparisonMode{ | ||
// no base comparison | ||
NO_BASE_COMPARISON, | ||
// run the mummer pipeline to generate a snps file containing SNPs and INDELs for any mismatching sequences of same name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to generate a snps file" -> "to generate a VCF"
NO_BASE_COMPARISON, | ||
// run the mummer pipeline to generate a snps file containing SNPs and INDELs for any mismatching sequences of same name | ||
FULL_ALIGNMENT, | ||
// do a base-by-base comparison of any mistmatching sequences of same name to output a table containing each base mismatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mistmatching -> mismatching
"of same name" -> "of same name and length"
// find the mismatch sequence | ||
for (String sequenceName : table.getAllSequenceNames()) { | ||
Set<ReferenceSequenceTable.TableRow> rows = table.queryBySequenceName(sequenceName); | ||
if (rows.size() == 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment explaining significance of two rows here
// if the lengths of the 2 sequences aren't equal, error - can't compare different sequence lengths, probably indel need alignment | ||
ReferenceSequenceTable.TableRow[] rowArray = rows.toArray(new ReferenceSequenceTable.TableRow[0]); | ||
if (rowArray[0].getLength() != rowArray[1].getLength()) { | ||
logger.warn("Sequence lengths are not equal and can't be compared. Consider running in FULL_ALIGNMENT mode."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include the sequence name in this logger message
} | ||
} | ||
|
||
private static class MummerIndel{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment explaining what this class is
@@ -16,7 +16,8 @@ | |||
* {@link #getApproximateCommandLine} | |||
* {@link #getScriptException} | |||
*/ | |||
public abstract class ScriptExecutor { | |||
public abstract class | |||
ScriptExecutor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert accidental whitespace change
@@ -35,7 +35,7 @@ | |||
* between command/script/module execution. Using -i doesn't buy you anything (for this version of the executor, at | |||
* least) since the process is terminated after each command completes. | |||
*/ | |||
public class PythonScriptExecutor extends PythonExecutorBase { | |||
public class PythonScriptExecutor extends PythonExecutorBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert accidental whitespace change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen Here's part 2 of my review
import java.util.*; | ||
|
||
/** | ||
* Class for executing MUMmer pipeline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a little more detail here about what the pipeline does and which tools it runs
* @param printStdout boolean to trigger displaying to stdout | ||
* @return the ProcessOutput of the run | ||
*/ | ||
public static ProcessOutput runPythonCommand(String script, List<String> scriptArguments, Map<String, String> additionalEnvironmentVars, File stdoutCaptureFile, boolean printStdout){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe runPythonCommand()
is now unused and can be deleted
} | ||
|
||
// FULL_ALIGNMENT tests: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These FULL_ALIGNMENT mode tests could be unified using a DataProvider
, since the actual test code is the same in every case. The DataProvider
would list the two fasta inputs, plus the expected output for each test case.
@Test | ||
public void testFullAlignmentModeInsertion() throws IOException{ | ||
final File ref1 = new File(getToolTestDataDir() + "hg19mini_chr1indel.fasta"); | ||
final File ref2 = new File(getToolTestDataDir() + "hg19mini.fasta"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment here documenting that the order in which these are specified determines whether it's an insertion or deletion
@Test | ||
public void testFullAlignmentModeDeletion() throws IOException{ | ||
final File ref1 = new File(getToolTestDataDir() + "hg19mini.fasta"); | ||
final File ref2 = new File(getToolTestDataDir() + "hg19mini_chr1indel.fasta"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment here documenting that the order in which these are specified determines whether it's an insertion or deletion
public void testPrepareMUMmerExecutionDirectory(){ | ||
MummerExecutor exec = new MummerExecutor(); | ||
File executableDirectory = exec.getMummerExecutableDirectory(); | ||
Assert.assertEquals(executableDirectory.listFiles().length, 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also check for the presence of the individual expected files (nucmer, delta-filter, etc.), by name
@@ -0,0 +1,5 @@ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there supposed to be a blank line at the start of this expected output file?
@@ -0,0 +1,3099 @@ | |||
MD5 Length hg38_better_alt_masked.fa Homo_sapiens_assembly38_masked.fasta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this DRAGEN expected output used in an actual test?
final File expectedOutput = new File(getToolTestDataDir(), "expected.SNPandINDEL.hg19mini.fasta_hg19mini_snpandindel.fasta.vcf"); | ||
final File actualOutput = new File(output, "hg19mini.fasta_hg19mini_snpandindel.fasta.vcf"); | ||
IntegrationTestSpec.assertEqualTextFiles(actualOutput, expectedOutput); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We ideally want a FULL_ALIGNMENT mode test case that comprehensively exercises all of the state transitions in the VCF creation code between SNPs, insertions, and deletions. Eg., something like this:
SNP
SNP
insertion
SNP
deletion
SNP
insertion
insertion
deletion
deletion
insertion
SNP
Is it possible to create a test case with all of the above on a single contig (and in that order)?
final File expectedOutput = new File(getToolTestDataDir(), "expected.hg19mini.fasta_hg19mini_chr2iupacsnps.fasta_snps.tsv"); | ||
final File actualOutput = new File(output, "hg19mini.fasta_hg19mini_chr2iupacsnps.fasta_snps.tsv"); | ||
|
||
IntegrationTestSpec.assertEqualTextFiles(actualOutput, expectedOutput); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an additional FIND_SNPS_ONLY mode test case that runs on a sequence with an indel, just to show that it doesn't explode in that case.
Github actions tests reported job failures from actions build 2849712241
|
CompareReferences FullAlignment integration tests & ExecuteMummer unit test failing currently due to MUMmer build |
…o select the right distribution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed a patch to add Intel x86_64 builds of MUMmer for Mac and Linux, with code to select the right distribution for the user's system. Branch looks good now, and all comments have been addressed -- will merge once tests are green!
No description provided.