Add sequence, slice, and read schema #83

heuermh · 2016-05-27T21:48:06Z

Add sequence, slice, and read schema.

@akmorrow13:
Please let me know if you think these might be easier to work with than NucleotideContigFragment.

fnothaft · 2016-05-27T21:50:21Z

OOC, what's the backstory on these?

On first glance, Sequence/Slice do indeed seem preferable to NucleotideContigFragment.

AmplabJenkins · 2016-05-27T21:52:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/98/
Test PASSed.

heuermh · 2016-05-27T21:56:48Z

Something I've been meaning to drop in for a while, possibly helpful in #54, could replace NucleotideContigFragment in FASTA support, etc.

A use case in Mango came up in our standup on Thursday, and I volunteered to push something for review.

akmorrow13 · 2016-05-28T18:22:35Z

I like this; we could use it as a replacement for most cases of AlignmentRecord and NucleotideContigFragment. Does slicing allow you to reconfigure sequence lengths, perhaps to a smaller set of sequences?

heuermh · 2016-05-28T21:16:36Z

Slice is a view on a longer sequence, borrowed from the Ensembl Perl API
http://pre.ensembl.org/info/docs/api/core/core_tutorial.html#slices

In other words, it is the same as ADAM's ReferenceRegion but with sequence.
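
The idea of a slice as "a region plus the bases it covers" can be sketched in a few lines. This is plain Python rather than the Avro schema or ADAM's Scala API; the names `Sequence`, `Slice`, and `slice_of`, and the 0-based, end-exclusive coordinates, are assumptions made for illustration and are not the fields defined in this pull request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sequence:
    name: str
    bases: str


@dataclass(frozen=True)
class Slice:
    name: str   # name of the sequence this slice views
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)
    bases: str  # the bases covered by [start, end)


def slice_of(sequence: Sequence, start: int, end: int) -> Slice:
    """Return a view of `sequence` over the half-open region [start, end)."""
    if not 0 <= start <= end <= len(sequence.bases):
        raise ValueError("region falls outside the sequence")
    return Slice(sequence.name, start, end, sequence.bases[start:end])
```

Calling `slice_of(Sequence("chr1", "ACGTACGT"), 2, 6)` yields a slice that carries both the region (`chr1`, 2, 6) and the bases for that region, which is the distinction from a bare region drawn above.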

AmplabJenkins · 2016-05-28T21:17:25Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/99/
Test PASSed.

fnothaft · 2016-06-15T18:39:49Z

I see that this is tagged as a prerequisite in bigdatagenomics/adam#1048. CC @akmorrow13 @heuermh, do we need this in ADAM 0.20.0 for mango? If so, then we'll want to land this in the next bdg-formats release.

AmplabJenkins · 2016-06-21T22:47:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/105/
Test PASSed.

heuermh · 2016-06-27T19:55:03Z

todo: update this to remove NucleotideContigFragment

AmplabJenkins · 2016-06-28T20:02:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/107/
Test PASSed.

AmplabJenkins · 2016-07-18T22:07:25Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/113/
Test PASSed.

heuermh · 2016-07-19T20:37:33Z

I see the use case a bit more clearly now:

Read short sequences in FASTA format, each a full sequence
→ loadSequences(...): SequenceRDD

Read short or long sequences in FASTA format, partitioned into slices (say 10kB each)
→ loadSlices(...): SliceRDD

Read unaligned sequences with quality scores in FASTQ format
→ loadReads(...): ReadRDD

Read aligned sequences with quality scores in SAM/BAM format
→ loadAlignments(...): AlignmentRecordRDD
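
As a rough illustration of the "partitioned into slices" case, the sketch below chops one sequence into fixed-length pieces. It is plain Python, not the proposed `loadSlices` implementation; the function name `chop`, the tuple layout, and the 0-based, end-exclusive coordinates are assumptions.

```python
from typing import Iterator, Tuple

# (sequence name, start, end, bases); a hypothetical stand-in for a Slice record
SliceTuple = Tuple[str, int, int, str]


def chop(name: str, bases: str, slice_length: int = 10_000) -> Iterator[SliceTuple]:
    """Yield consecutive, non-overlapping slices of at most `slice_length` bases."""
    if slice_length <= 0:
        raise ValueError("slice_length must be positive")
    for start in range(0, len(bases), slice_length):
        end = min(start + slice_length, len(bases))
        yield (name, start, end, bases[start:end])
```

The last slice is simply shorter when the sequence length is not a multiple of `slice_length`, and an empty sequence produces no slices.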

SequenceRDD and SliceRDD both extend from/implement ReferenceFile, such that they provide a method

def slice(region: ReferenceRegion): Slice = { ... }

Note: there might be a case for renaming AlignmentRecord to AlignedRead, in which case the method might be loadAlignedReads(...): AlignedReadRDD; but of course a SAM/BAM file may contain unaligned reads, and could in fact contain only unaligned reads

akmorrow13 · 2016-07-20T04:02:17Z

I think SequenceRDD makes a lot of sense. I am still not clear why we need both SliceRDD and SequenceRDD @heuermh ?

heuermh · 2016-07-20T15:34:18Z

SliceRDD handles the use case that NucleotideContigFragmentRDD is currently handling, where individual sequences might be too long to partition effectively and need to be chopped up into bite size pieces on load.

Implementing the method slice(ReferenceRegion) on SequenceRDD would be trivial, whereas it needs to be more clever on SliceRDD, recombining adjacent slices if necessary.
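
The "recombining adjacent slices" step could look roughly like the following. This is a sketch in plain Python over an in-memory list, not the SliceRDD code; the function name `extract`, the tuple layout, and the 0-based, end-exclusive coordinates are assumptions.

```python
from typing import List, Tuple

# (sequence name, start, end, bases); a hypothetical stand-in for a Slice record
SliceTuple = Tuple[str, int, int, str]


def extract(slices: List[SliceTuple], name: str, start: int, end: int) -> str:
    """Return the bases for [start, end) on `name`, stitching slices together."""
    overlapping = sorted(
        (s for s in slices if s[0] == name and s[1] < end and s[2] > start),
        key=lambda s: s[1],
    )
    pieces = []
    cursor = start  # next position that still needs bases
    for _, s_start, s_end, bases in overlapping:
        if s_start > cursor:
            raise ValueError(f"no slice covers position {cursor}")
        lo = max(cursor, s_start)
        hi = min(end, s_end)
        if hi > lo:
            # translate region coordinates into offsets within this slice
            pieces.append(bases[lo - s_start:hi - s_start])
            cursor = hi
    if cursor < end:
        raise ValueError(f"no slice covers position {cursor}")
    return "".join(pieces)
```

A region that lies inside a single slice touches only that slice, while a region that spans slice boundaries is assembled from each overlapping piece in coordinate order.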

akmorrow13 · 2016-07-20T16:03:01Z

Do these RDDs already exist, or should I start implementing them? I imagine slice for SliceRDD would be similar to the current getReferenceString in NucleotideContigFragmentRDD.

heuermh · 2016-07-20T16:20:53Z

I was waiting for review of these schema records first.

I imagine implementing SliceRDD would be mostly search & replace in code paths for NucleotideContigFragment.

SequenceRDD and ReadRDD could be based on FASTA and FASTQ support in Hadoop-BAM or stuff cribbed from elsewhere in ADAM.

I've been in the NucleotideContigFragment code before to implement FASTA output, so if you want to start on this, I should be able to help.

AmplabJenkins · 2016-07-26T18:47:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/115/
Test PASSed.

AmplabJenkins · 2016-08-24T19:02:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/118/
Test PASSed.

AmplabJenkins · 2016-09-01T00:12:23Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/122/
Test PASSed.

AmplabJenkins · 2016-09-14T18:07:22Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/126/
Test PASSed.

heuermh · 2016-09-30T14:57:18Z

Ping to consider merging this for the 0.10 release. I propose adding these now, with support in ADAM 0.20 or 0.21, and then in bdg-formats 0.11 replacing NucleotideContigFragment by Slice, with refactoring in ADAM 0.21.

When thinking on a converter API, allowing for non-Apache licensed extensions, it would be nice to have these schema for easier conversion to and from third party libraries such as Biojava.

heuermh · 2016-10-07T20:53:53Z

@fnothaft @akmorrow13 @jpdna Sorry, another ping on this one. These would be useful for the NMDP/Be The Match collaboration, in that they have a need for modeling protein sequences and short peptide fragments.

fnothaft · 2016-10-07T22:19:38Z

OK, making a review pass now. As long as there's an application for this, I am cool with the new schemas.

fnothaft · 2016-10-07T22:21:08Z

src/main/resources/avro/bdg.avdl

+   Protein alphabet.
+   */
+  PROTEIN


Would it make sense to have an OTHER enum as well? E.g., for DNA + methylation? IDK.

-0 Guess I would wait until a use case arises. An OTHER wouldn't necessarily map into any specific alphabet in Biojava, which supports arbitrary alphabets.

fnothaft · 2016-10-07T22:21:26Z

src/main/resources/avro/bdg.avdl

+
+/**
+ Sequence.


The docs could be better here ;)

More concretely, this is a record that describes a DNA contig, or an RNA transcript, or a protein amino acid sequence, right? If so, the docs should say this.

fnothaft · 2016-10-07T22:25:04Z

src/main/resources/avro/bdg.avdl

+   Name of this sequence.
+   */
+  union { null, string } name = null;


Coming back to our discussion on bigdatagenomics/adam#1198 RE: what is name and what is description, it'd prolly be good to flesh out the docs here. Again, I think that the sequence metadata (database IDs) should fall into the description field.

fnothaft · 2016-10-07T22:26:10Z

src/main/resources/avro/bdg.avdl

+
+/**
+ View on a contiguous region of a sequence.


"View on a" reads a bit funny to me.

Is "View of a" better?

fnothaft · 2016-10-07T22:26:42Z

src/main/resources/avro/bdg.avdl

+
+  /**
+   Name of the sequence this slice views.


This should be the same as the name of the Sequence, right? In the interest of verbosity, let's doc that.

Yeah. If extension were possible in Avro schema, then Slice and Read would extend Sequence

fnothaft · 2016-10-07T22:28:36Z

src/main/resources/avro/bdg.avdl

+   Description for the sequence this slice views.
+   */
+  union { null, string } description = null;


How do we expect the slice record to be used? If we expect to use it for short slices of sequence (e.g., <500bp), then carrying around metadata is going to be inefficient, and we should probably drop this field.

Actually, we may want to drop this field from Sequence as well, and just store the sequence metadata in an associated Contig record.

I don't mind keeping the field and setting it to null where appropriate. The use case for slice is the same as NucleotideContigFragment, and also for viewing a region of a sequence with features, say a typical GenBank record, or a slice in the Ensembl Perl APIs.

fnothaft · 2016-10-07T22:29:16Z

src/main/resources/avro/bdg.avdl

+
+  /**
+   Strand for this slice, if any.  Defaults to Strand.Independent.


Nit: prefer single space after periods for consistency with other documentation.

fnothaft · 2016-10-07T22:31:09Z

src/main/resources/avro/bdg.avdl

+ Sequence with quality scores.
+ */
+record Read { // extends Sequence


I'm not really much of a fan of adding this record, as I don't see what it really adds over our AlignmentRecord and Fragment schemas. There isn't much of an overhead to having the AlignmentRecord schema with the additional alignment metadata fields (from prior measurement, those are <1% of the space on disk, read sequence is about 30% of space, quality scores are about 60%, and the remaining 10% is contig name and metadata/map qual --> note that these numbers were before we un-nested Contig, so the contig name/metadata portion has gone down), and we'd need to add and maintain import/export/convert paths for another schema. I see the Fragment schema as useful because it is a natural view over the data --> we create said view during duplicate marking, and would have use for it in variant calling as well.

If we were to add it, I would like the field names to be consistent with AlignmentRecord.

There was a world before samtools. ;)

Storing unaligned reads in SAM or related formats may have some practical benefits, but is conceptually strange for, say, assembly-specific or metagenomic workflows. Thus I don't mind the similar representations, at least for a short time. As with Slice, which is mostly equivalent to NucleotideContigFragment, there isn't really a way to start exploring refactoring in the ADAM codebase until the (experimental) schema make it into formats.

I haven't really looked at the Fragment schema and code enough to know how it relates.

I'm +1 on having a read record. Lots of my flavor of genomics never touches a BAM file. I would also be in favor of modeling groups of reads (e.g., pairs from paired end, or clustered reads), which has been discussed for some time.

(e.g., in #54)

Feel free to pull request a groups-of-reads record to this pull request if you have something in mind.

fnothaft · 2016-10-07T22:36:06Z

Minus cleaning up the nits, I would be +1 on merging this without the unaligned Read schema.

heuermh · 2016-10-09T19:46:01Z

What about moving these to a separate .avdl file, package org.bdgenomics.formats.avro.experimental or something similar?

laserson · 2016-10-19T20:05:05Z

src/main/resources/avro/bdg.avdl

+   DNA alphabet.
+   */
+  DNA,


Is this strict? (GATC only?) Or IUPAC? Or should we have both?

Currently I'm using it to map to and from Biojava Alphabet, which is based on IUPAC, but entirely extensible. Here I see it as simply informational. Unless of course we want to fully model sequences as symbol lists where symbols come from alphabets, which was discussed elsewhere.

Hmm, probably not. Sounds good as is.
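
To make the strict versus IUPAC distinction concrete, here is a small sketch in plain Python. It uses the standard IUPAC nucleotide ambiguity codes; the constant and function names are hypothetical, and it does not reflect how Biojava or this schema validates symbols (the thread above treats the enum as informational only).

```python
# Strict DNA: the four unambiguous bases only.
STRICT_DNA = frozenset("ACGT")

# IUPAC DNA: the four bases plus the standard ambiguity codes.
IUPAC_DNA = frozenset("ACGTRYSWKMBDHVN")


def conforms(bases: str, alphabet: frozenset) -> bool:
    """True if every symbol in `bases` (case-insensitive) is in `alphabet`."""
    return set(bases.upper()) <= alphabet
```

A sequence such as `GATNACA` is rejected by the strict set and accepted by the IUPAC set, which is the difference the question above is getting at.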

laserson · 2016-10-19T20:05:53Z

src/main/resources/avro/bdg.avdl

+   Alphabet for this sequence, defaults to Alphabet.DNA.
+   */
+  union { Alphabet, null } alphabet = "DNA";


Is the order of Alphabet and null reversed here deliberately? Is this something unique to enums?

The type of the default value should come first in a union. I.e., if the default value is null, then union { null, Alphabet } would be correct.

The intention is to default to Alphabet.DNA
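
The ordering rule described above can be checked mechanically. The sketch below is plain Python over the JSON form of a union (a list of branches) and does not use the Avro library; the helper name is hypothetical, only a few branch types are handled, and the `Alphabet` symbols listed are just the two visible in this review.

```python
# JSON form of the enum, restricted to the symbols shown in the diff hunks above.
ALPHABET = {"type": "enum", "name": "Alphabet", "symbols": ["DNA", "PROTEIN"]}


def default_matches_first_branch(union: list, default) -> bool:
    """Check that `default` is a valid value for the first branch of `union`."""
    first = union[0]
    if first == "null":
        return default is None
    if first == "string":
        return isinstance(default, str)
    if first in ("int", "long"):
        return isinstance(default, int) and not isinstance(default, bool)
    if isinstance(first, dict) and first.get("type") == "enum":
        return default in first["symbols"]
    raise NotImplementedError(f"branch type not handled in this sketch: {first!r}")
```

Under this check, `union { Alphabet, null }` with default `"DNA"` passes, while `union { null, Alphabet }` with the same default fails, which is why the enum comes first in the field under review.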

laserson · 2016-10-19T20:06:43Z

src/main/resources/avro/bdg.avdl

+   Length of this sequence.
+   */
+  union { null, long } length = null;


Is this just an optimization to explicitly model the length? Or is this for cases where you have an unknown sequence with a known length?

An optimization of sorts, similar to storing Variant.end even though it can be calculated from Variant.start and Variant.referenceAllele.
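
The analogy can be written out as arithmetic. This is a sketch in plain Python with hypothetical function names; it assumes a 0-based start and an exclusive end for the variant, and it covers both readings of the question above (a derived length and a known length for an unknown sequence).

```python
from typing import Optional


def variant_end(start: int, reference_allele: str) -> int:
    """End derived from start and reference allele (assumes 0-based, exclusive end)."""
    return start + len(reference_allele)


def effective_length(bases: Optional[str], length: Optional[int]) -> Optional[int]:
    """Prefer the stored length; fall back to the bases; None if neither is known."""
    if bases is not None and length is not None and len(bases) != length:
        raise ValueError("stored length disagrees with the sequence")
    if length is not None:
        return length
    return len(bases) if bases is not None else None
```

Storing the value avoids recomputing it from the sequence on every access, and it also leaves room for a record whose bases are absent but whose length is known.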

laserson · 2016-10-19T20:12:16Z

src/main/resources/avro/bdg.avdl

+ FASTQ sequence format variant.
+ */
+enum FastqVariant {


This seems like the wrong name to me. If I understand correctly, this isn't really a variant of fastq, but rather a variant of how the quality scores are represented. Maybe QualityScoresVariant?

Intentionally the same as Biojava enum FastqVariant per the OBF FASTQ paper. Given that I generalized the Biojava Fastq class name to Read though, I suppose I could be persuaded.

lol, I guess I'm ambivalent then if we're (partly) trying to be consistent with Biojava etc.

Are there any non-FASTQ-format quality scores we might want to use in the Read record?
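
For context on what the variant actually changes, the sketch below decodes a quality string under the three encodings described in the OBF FASTQ paper: Sanger (PHRED scores, ASCII offset 33), Solexa (Solexa scores, offset 64), and Illumina 1.3+ (PHRED scores, offset 64). It is plain Python; the dictionary keys follow Biojava's enum naming as an assumption and may not match the symbols in this schema.

```python
import math

# ASCII offset per variant; key names are assumed, not taken from this schema.
OFFSETS = {"FASTQ_SANGER": 33, "FASTQ_SOLEXA": 64, "FASTQ_ILLUMINA": 64}


def decode(quality: str, variant: str) -> list:
    """Convert a quality string to integer scores using the variant's offset."""
    offset = OFFSETS[variant]
    return [ord(c) - offset for c in quality]


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa-scaled score to the PHRED scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)
```

The same character therefore means different things depending on the variant: `I` is 40 under Sanger, and `h` is 40 under Illumina, so the enum describes the score encoding rather than the file layout.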

AmplabJenkins · 2016-10-24T17:22:17Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/128/
Test PASSed.

fnothaft · 2016-11-02T21:31:35Z

Merged! Thanks @heuermh!

heuermh force-pushed the sequence branch from d28fb6c to b38316a on May 28, 2016 21:14

heuermh mentioned this pull request Jun 7, 2016

Release ADAM version 0.20.0 bigdatagenomics/adam#1048

Closed

61 tasks

heuermh mentioned this pull request Jun 15, 2016

Add schema for sample #84

Merged

heuermh force-pushed the sequence branch from b38316a to f165b8c on June 21, 2016 22:43

heuermh force-pushed the sequence branch from f165b8c to bf0cb94 on June 28, 2016 19:57

heuermh mentioned this pull request Jun 28, 2016

[BDG-FORMATS-54] Generalizing the Fragment type #56

Closed

heuermh force-pushed the sequence branch from bf0cb94 to e31e3fd on July 18, 2016 22:06

heuermh mentioned this pull request Jul 19, 2016

refactored ReferenceFile to require SequenceDictionary bigdatagenomics/adam#1086

Closed

heuermh force-pushed the sequence branch from e31e3fd to ceeb32d on July 26, 2016 18:42

heuermh force-pushed the sequence branch from ceeb32d to 81b302a on August 24, 2016 18:57

heuermh force-pushed the sequence branch from 81b302a to d643458 on September 1, 2016 00:07

heuermh force-pushed the sequence branch from d643458 to b0e3f1a on September 14, 2016 18:03

fnothaft requested changes Oct 7, 2016

fnothaft mentioned this pull request Oct 19, 2016

Support for INSDC Sequence records (i.e., Genbank/EMBL format)? bigdatagenomics/adam#1219

Closed

laserson reviewed Oct 19, 2016

Add sequence, slice, and read schema

264a135

heuermh force-pushed the sequence branch from b0e3f1a to 264a135 on October 24, 2016 17:18

fnothaft approved these changes Nov 2, 2016

fnothaft merged commit d90606a into bigdatagenomics:master Nov 2, 2016

heuermh deleted the sequence branch November 2, 2016 22:13

