-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sequence, slice, and read schema #83
Conversation
OOC, what's the backstory on these? On first glance, |
Test PASSed. |
Something I've been meaning to drop in for a while, possibly helpful in #54, could replace A use case in Mango came up in our standup on Thursday, and I volunteered to push something for review. |
I like this we could use this as a replacement for most cases of AlignmentRecord and nucleotideContigFragment. Does slicing allow you to reconfigure sequence lengths, perhaps to a smaller set of sequences? |
Slice is a view on a longer sequence, borrowed from the Ensembl Perl API In other words, it is the same as ADAM's ReferenceRegion but with sequence. |
Test PASSed. |
I see that this is tagged as a prerequisite in bigdatagenomics/adam#1048. CC @akmorrow13 @heuermh, do we need this in ADAM 0.20.0 for mango? If so, then we'll want to land this in the next bdg-formats release. |
Test PASSed. |
todo: update this to remove |
Test PASSed. |
Test PASSed. |
I see the use case a bit more clearly now:
def slice(region: ReferenceRegion): Slice { ... } Note: there might be a case for renaming AlignmentRecord to AlignedRead, in which case the method might be |
I think SequenceRDD makes a lot of sense. I am still not clear why we need both SliceRDD and SequenceRDD @heuermh ? |
Implementing the method |
Do these RDD's already exist or should I start implementing them? I imagine slice for SliceRDD would be similar to the current getReferenceString in NucleotideContigFragmentRDD |
I was waiting for review of these schema records first. I imagine implementing SliceRDD would be mostly search & replace in code paths for NucleotideContigFragment. SequenceRDD and ReadRDD could be based on FASTA and FASTQ support in Hadoop-BAM or stuff cribbed from elsewhere in ADAM. I've been in the NucleotideContigFragment code before to implement FASTA output, so if you want to start on this, I should be able to help. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
Ping to consider merging this for the 0.10 release. I propose adding these now, with support in ADAM 0.20 or 0.21, and then in bdg-formats 0.11 replacing When thinking on a converter API, allowing for non-Apache licensed extensions, it would be nice to have these schema for easier conversion to and from third party libraries such as Biojava. |
@fnothaft @akmorrow13 @jpdna Sorry, another ping on this one. These would be useful for the NMDP/Be The Match collaboration, in that they have a need for modeling protein sequences and short peptide fragments. |
OK, making a review pass now. As long as there's an application for this, I am cool with the new schemas. |
Protein alphabet. | ||
*/ | ||
PROTEIN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to have an OTHER
enum as well? E.g., for DNA + methylation? IDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-0 Guess I would wait until a use case arises. An OTHER
wouldn't necessarily map into any specific alphabet in Biojava, which supports arbitrary alphabets.
|
||
/** | ||
Sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs could be better here ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More concretely, this is a record that describes a DNA contig, or an RNA transcript, or a protein amino acid sequence, right? If so, the docs should say this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Name of this sequence. | ||
*/ | ||
union { null, string } name = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coming back to our discussion on bigdatagenomics/adam#1198 RE: what is name and what is description, it'd prolly be good to flesh out the docs here. Again, I think that the sequence metadata (database IDs) should fall into the description
field.
|
||
/** | ||
View on a contiguous region of a sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
View on a
reads a bit funny to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
View of a
better?
|
||
/** | ||
Name of the sequence this slice views. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the same as the name of the Sequence
, right? In the interest of verbosity, let's doc that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. If extension were possible in Avro schema, then Slice
and Read
would extend Sequence
Description for the sequence this slice views. | ||
*/ | ||
union { null, string } description = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we expect the slice
record to be used? If we expect to use it for short slices of sequence (e.g., <500bp), then carrying around metadata is going to be inefficient, and we should probably drop this field.
Actually, we may want to drop this field from Sequence
as well, and just store the sequence metadata in an associated Contig
record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind keeping the field and setting it to null where appropriate. The use case for slice
is the same as NucleotideContigFragment
, and also for view a region of a sequence with features, say a typical GenBank record, or a slice
in the Ensembl Perl APIs.
|
||
/** | ||
Strand for this slice, if any. Defaults to Strand.Independent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: prefer single space after periods for consistency with other documentation.
Sequence with quality scores. | ||
*/ | ||
record Read { // extends Sequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really much of a fan of adding this record, as I don't see what it really adds over our AlignmentRecord
and Fragment
schemas. There isn't much of an overhead to having the AlignmentRecord
schema with the additional alignment metadata fields (from prior measurement, those are <1% of the space on disk, read sequence is about 30% of space, quality scores are about 60%, and the remaining 10% is contig name and metadata/map qual --> note that these numbers were before we un-nested Contig
, so the contig name/metadata portion has gone down), and we'd need to add and maintain import/export/convert paths for another schema. I see the Fragment
schema as useful because it is a natural view over the data --> we create said view during duplicate marking, and would have use for it in variant calling as well.
If we were to add it, I would like the field names to be consistent with AlignmentRecord
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a world before samtools. ;)
Storing unaligned reads in SAM or related formats may have some practical benefits, but is conceptually strange for say assembly-specific or metagenomic workflows. Thus I don't mind the similar representations, at least for a short time. As with Slice
, which is mostly equivalent to NucleotideSequenceFragment
, there isn't really a way to start exploring refactoring in the ADAM codebase until the (experimental) schema make it into formats.
I haven't really looked at the Fragment
schema and code enough to know how it relates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +1 on having a read record. Lot's of my flavor of genomics never touches a bam file. I would also be in favor of modeling groups of reads (e.g., pairs from paired end, or clustered reads), which has been discussed for some time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(e.g., in #54)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to pull request a groups-of-reads record to this pull request if you have something in mind.
Minus cleaning up the nits, I would be +1 on merging this without the unaligned |
What about moving these to a separate .avro file, package |
DNA alphabet. | ||
*/ | ||
DNA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this strict? (GATC only?) Or IUPAC? Or should we have both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, probably not. Sounds good as is.
Alphabet for this sequence, defaults to Alphabet.DNA. | ||
*/ | ||
union { Alphabet, null } alphabet = "DNA"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the order of Alphabet
and null
reversed here deliberately? Is this something unique to enums?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type of the default value should come first in a union. I.e., if the default value is null
, then union { null, Alphabet }
would be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention is to default to Alphabet.DNA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah...
Length of this sequence. | ||
*/ | ||
union { null, long } length = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just an optimization to explicitly model the length? Or is this for cases where you have an unknown sequence with a known length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An optimization of sorts, similar to storing Variant.end
even though it can be calculated from Variant.start
and Variant.referenceAllele
.
FASTQ sequence format variant. | ||
*/ | ||
enum FastqVariant { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like the wrong name to me. If I understand correctly, this isn't really a variant of fastq, but rather a variant of how the quality scores are represented. Maybe QualityScoresVariant
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentionally the same as Biojava enum FastqVariant per the OBF FASTQ paper. Given that I generalized the Biojava Fastq
class name to Read
though, I suppose I could be persuaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol, I guess I'm ambivalent then if we're (partly) trying to be consistent with Biojava etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any non-FASTQ-format quality scores we might want to use in the Read
record?
Test PASSed. |
Merged! Thanks @heuermh! |
Add sequence, slice, and read schema.
@akmorrow13:
Please let me know if you think these might be easier to work with than
NucleotideContigFragment
.