Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sequence, slice, and read schema #83

Merged
merged 1 commit into from
Nov 2, 2016

Conversation

heuermh
Copy link
Member

@heuermh heuermh commented May 27, 2016

Add sequence, slice, and read schema.

@akmorrow13:
Please let me know if you think these might be easier to work with than NucleotideContigFragment.

@fnothaft
Copy link
Member

OOC, what's the backstory on these?

On first glance, Sequence/Slice do indeed seem preferable to NucleotideContigFragment.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/98/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented May 27, 2016

Something I've been meaning to drop in for a while, possibly helpful in #54, could replace NucleotideContigFragment in FASTA support, etc.

A use case in Mango came up in our standup on Thursday, and I volunteered to push something for review.

@akmorrow13
Copy link
Contributor

I like this we could use this as a replacement for most cases of AlignmentRecord and nucleotideContigFragment. Does slicing allow you to reconfigure sequence lengths, perhaps to a smaller set of sequences?

@heuermh
Copy link
Member Author

heuermh commented May 28, 2016

Slice is a view on a longer sequence, borrowed from the Ensembl Perl API
http://pre.ensembl.org/info/docs/api/core/core_tutorial.html#slices

In other words, it is the same as ADAM's ReferenceRegion but with sequence.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/99/
Test PASSed.

@fnothaft
Copy link
Member

I see that this is tagged as a prerequisite in bigdatagenomics/adam#1048. CC @akmorrow13 @heuermh, do we need this in ADAM 0.20.0 for mango? If so, then we'll want to land this in the next bdg-formats release.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/105/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented Jun 27, 2016

todo: update this to remove NucleotideContigFragment

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/107/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/113/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented Jul 19, 2016

I see the use case a bit more clearly now:

  • Read short sequences in FASTA format, each a full sequence

      
    loadSequences(...): SequenceRDD
  • Read short or long sequences in FASTA format, partitioned into slices (say 10kB each)

      
    loadSlices(...): SliceRDD
  • Read unaligned sequences with quality scores in FASTQ format

      
    loadReads(...): ReadRDD
  • Read aligned sequences with quality scores in SAM/BAM format

      
    loadAlignments(...): AlignmentRecordRDD

SequenceRDD and SliceRDD both extend from/implement ReferenceFile, such that they provide a method

def slice(region: ReferenceRegion): Slice { ... }

Note: there might be a case for renaming AlignmentRecord to AlignedRead, in which case the method might be loadAlignedReads(...): AlignedReadRDD; but of course a SAM/BAM file may contain unaligned reads, and could in fact contain only unaligned reads

@akmorrow13
Copy link
Contributor

I think SequenceRDD makes a lot of sense. I am still not clear why we need both SliceRDD and SequenceRDD @heuermh ?

@heuermh
Copy link
Member Author

heuermh commented Jul 20, 2016

SliceRDD handles the use case that NucleotideContigFragmentRDD is currently handling, where individual sequences might be too long to partition effectively and need to be chopped up into bite size pieces on load.

Implementing the method slice(ReferenceRegion) on SequenceRDD would be trivial, whereas it needs to be more clever on SliceRDD, recombining adjacent slices if necessary.

@akmorrow13
Copy link
Contributor

Do these RDD's already exist or should I start implementing them? I imagine slice for SliceRDD would be similar to the current getReferenceString in NucleotideContigFragmentRDD

@heuermh
Copy link
Member Author

heuermh commented Jul 20, 2016

I was waiting for review of these schema records first.

I imagine implementing SliceRDD would be mostly search & replace in code paths for NucleotideContigFragment.

SequenceRDD and ReadRDD could be based on FASTA and FASTQ support in Hadoop-BAM or stuff cribbed from elsewhere in ADAM.

I've been in the NucleotideContigFragment code before to implement FASTA output, so if you want to start on this, I should be able to help.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/115/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/118/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/122/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/126/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented Sep 30, 2016

Ping to consider merging this for the 0.10 release. I propose adding these now, with support in ADAM 0.20 or 0.21, and then in bdg-formats 0.11 replacing NucleotideContigFragment by Slice, with refactoring in ADAM 0.21.

When thinking on a converter API, allowing for non-Apache licensed extensions, it would be nice to have these schema for easier conversion to and from third party libraries such as Biojava.

@heuermh
Copy link
Member Author

heuermh commented Oct 7, 2016

@fnothaft @akmorrow13 @jpdna Sorry, another ping on this one. These would be useful for the NMDP/Be The Match collaboration, in that they have a need for modeling protein sequences and short peptide fragments.

@fnothaft
Copy link
Member

fnothaft commented Oct 7, 2016

OK, making a review pass now. As long as there's an application for this, I am cool with the new schemas.

Protein alphabet.
*/
PROTEIN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have an OTHER enum as well? E.g., for DNA + methylation? IDK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-0 Guess I would wait until a use case arises. An OTHER wouldn't necessarily map into any specific alphabet in Biojava, which supports arbitrary alphabets.


/**
Sequence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs could be better here ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More concretely, this is a record that describes a DNA contig, or an RNA transcript, or a protein amino acid sequence, right? If so, the docs should say this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Name of this sequence.
*/
union { null, string } name = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to our discussion on bigdatagenomics/adam#1198 RE: what is name and what is description, it'd prolly be good to flesh out the docs here. Again, I think that the sequence metadata (database IDs) should fall into the description field.


/**
View on a contiguous region of a sequence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

View on a reads a bit funny to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

View of a better?


/**
Name of the sequence this slice views.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the same as the name of the Sequence, right? In the interest of verbosity, let's doc that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. If extension were possible in Avro schema, then Slice and Read would extend Sequence

Description for the sequence this slice views.
*/
union { null, string } description = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we expect the slice record to be used? If we expect to use it for short slices of sequence (e.g., <500bp), then carrying around metadata is going to be inefficient, and we should probably drop this field.

Actually, we may want to drop this field from Sequence as well, and just store the sequence metadata in an associated Contig record.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind keeping the field and setting it to null where appropriate. The use case for slice is the same as NucleotideContigFragment, and also for view a region of a sequence with features, say a typical GenBank record, or a slice in the Ensembl Perl APIs.


/**
Strand for this slice, if any. Defaults to Strand.Independent.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: prefer single space after periods for consistency with other documentation.

Sequence with quality scores.
*/
record Read { // extends Sequence
Copy link
Member

@fnothaft fnothaft Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really much of a fan of adding this record, as I don't see what it really adds over our AlignmentRecord and Fragment schemas. There isn't much of an overhead to having the AlignmentRecord schema with the additional alignment metadata fields (from prior measurement, those are <1% of the space on disk, read sequence is about 30% of space, quality scores are about 60%, and the remaining 10% is contig name and metadata/map qual --> note that these numbers were before we un-nested Contig, so the contig name/metadata portion has gone down), and we'd need to add and maintain import/export/convert paths for another schema. I see the Fragment schema as useful because it is a natural view over the data --> we create said view during duplicate marking, and would have use for it in variant calling as well.

If we were to add it, I would like the field names to be consistent with AlignmentRecord.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a world before samtools. ;)

Storing unaligned reads in SAM or related formats may have some practical benefits, but is conceptually strange for say assembly-specific or metagenomic workflows. Thus I don't mind the similar representations, at least for a short time. As with Slice, which is mostly equivalent to NucleotideSequenceFragment, there isn't really a way to start exploring refactoring in the ADAM codebase until the (experimental) schema make it into formats.

I haven't really looked at the Fragment schema and code enough to know how it relates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 on having a read record. Lot's of my flavor of genomics never touches a bam file. I would also be in favor of modeling groups of reads (e.g., pairs from paired end, or clustered reads), which has been discussed for some time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(e.g., in #54)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to pull request a groups-of-reads record to this pull request if you have something in mind.

@fnothaft
Copy link
Member

fnothaft commented Oct 7, 2016

Minus cleaning up the nits, I would be +1 on merging this without the unaligned Read schema.

@heuermh
Copy link
Member Author

heuermh commented Oct 9, 2016

What about moving these to a separate .avro file, package org.bdgenomics.formats.avro.experimental or something similar?

DNA alphabet.
*/
DNA,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this strict? (GATC only?) Or IUPAC? Or should we have both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently I'm using it to map to and from Biojava Alphabet, which is based on IUPAC, but entirely extensible. Here I see it as simply informational. Unless of course we want to fully model sequences as symbol lists where symbols come from alphabets, which was discussed elsewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, probably not. Sounds good as is.

Alphabet for this sequence, defaults to Alphabet.DNA.
*/
union { Alphabet, null } alphabet = "DNA";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the order of Alphabet and null reversed here deliberately? Is this something unique to enums?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of the default value should come first in a union. I.e., if the default value is null, then union { null, Alphabet } would be correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is to default to Alphabet.DNA

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah...

Length of this sequence.
*/
union { null, long } length = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just an optimization to explicitly model the length? Or is this for cases where you have an unknown sequence with a known length?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An optimization of sorts, similar to storing Variant.end even though it can be calculated from Variant.start and Variant.referenceAllele.

FASTQ sequence format variant.
*/
enum FastqVariant {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the wrong name to me. If I understand correctly, this isn't really a variant of fastq, but rather a variant of how the quality scores are represented. Maybe QualityScoresVariant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

@heuermh heuermh Oct 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally the same as Biojava enum FastqVariant per the OBF FASTQ paper. Given that I generalized the Biojava Fastq class name to Read though, I suppose I could be persuaded.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, I guess I'm ambivalent then if we're (partly) trying to be consistent with Biojava etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any non-FASTQ-format quality scores we might want to use in the Read record?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/128/
Test PASSed.

@fnothaft fnothaft merged commit d90606a into bigdatagenomics:master Nov 2, 2016
@fnothaft
Copy link
Member

fnothaft commented Nov 2, 2016

Merged! Thanks @heuermh!

@heuermh heuermh deleted the sequence branch November 2, 2016 22:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants