[BDG-FORMATS-54] Generalizing the Fragment type #56

Closed
wants to merge 3 commits

Conversation

@laserson (Contributor) commented May 5, 2015

I think storing reads after a "groupby" operation has occurred is an excellent idea, as the reads will likely always be analyzed as a group. Having multiple reads from a single fragment of DNA is one such use case, but there are others. Droplet-seq is one that I am interested in. Incorporating random
barcodes is another.

Here is a summary of my proposal, based in part on @ilveroluca's work:

  • Support a fastq-like object Sequence, though I don't think this is strictly necessary.
  • Rename to Bucket, as it sounds more general to my ears.
  • Move run-specific or instrument-specific metadata into separate objects, as they don't necessarily make sense as top-level objects.
  • Remove fragmentSize, as it's specific to one use case and it's rather easily computable.
  • Support for multiple types of grouped objects. What's the best way to deal with this? union somehow? I envision that we may add more types in the future that we'll want to persist as grouped objects. At the moment, there is just a set of arrays for the type of objects that could be grouped. This could be extended as we desire the ability to group other object types.
  • Sequence and quality information from alignments should be retrieved from AlignmentRecords.
  • I don't think platform-specific information should be propagated through the entire chain of data types. Why would we include it in Genotype, then? In my mind, any platform-specific analysis happens very early on, generally even before the fastq stage. Therefore, I've moved platform-specific metadata into the Sequence object.
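For concreteness, here is a rough Avro IDL sketch of the kind of record this proposal points toward. The field names (bucketName, sequences, alignments) are illustrative guesses drawn from the discussion, not the actual schema:

```avdl
record Bucket {
  // Identifier shared by all grouped reads (e.g., read name or barcode).
  union { null, string } bucketName = null;
  // Optional unaligned, fastq-like records.
  array<Sequence> sequences = [];
  // Aligned reads; sequence/quality can be retrieved from these.
  array<AlignmentRecord> alignments = [];
}
```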

Fixes #54.

@laserson mentioned this pull request May 5, 2015
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/bdg-formats-prb/62/

@ilveroluca (Contributor)

I think the Bucket and the Fragment record types represent two different concepts.

The Bucket represents a generic group of reads, akin to the recordGroup concept that is present in SAM/BAM and has already been transferred to the AlignmentRecord.

The Fragment on the other hand is a single piece of DNA which has been read in parts. The parts are ordered and they may overlap.

As you said, the "Fragment" could be seen as a special case of the Bucket, as long as the Bucket is ordered (which in the case of your patch, it is). However, if a program starts processing a data set and finds a Bucket, how should it know whether to treat the sequences within the group as belonging to a single DNA fragment (e.g., important for alignment) or as a group of single-end reads? You'd have to tell it, somehow, and complicate processing to address the different possible behaviours in the code. Or you'd have to restrict the software to only work with "Fragment buckets", taking away the user's ability to use Buckets for logical read groups -- which wouldn't be nice.

I suppose that when dealing with single-end reads, or perhaps in downstream analyses, knowing whether two reads come from the same fragment might not be very useful. But in upstream analysis it happens often enough (every flowcell!) to warrant its own record type.

My two cents.

@laserson (Contributor, Author) commented May 7, 2015

I think the Bucket and the Fragment record types represent two different concepts.

Agreed, but I'd say that Fragment is a subclass of Bucket.

The Bucket represents a generic group of reads, akin to the recordGroup concept that is present in SAM/BAM and has already been transferred to the AlignmentRecord.

Except the BAM format is pre-groupBy. The key useful feature here is that the Bucket stores all the reads in a group together physically, while the readGroup annotation in BAM simply stores the group key with the read, but the reads can be scattered in the file.

if a program starts processing a data set and finds a Bucket, how should it know whether to treat the sequences within the group as belonging to a single DNA fragment (e.g., important for alignment) or as a group of single-end reads? You'd have to tell it, somehow, and complicate processing to address the different possible behaviours in the code. Or you'd have to restrict the software to only work with "Fragment buckets", taking away the user's ability to use Buckets for logical read groups -- which wouldn't be nice.

I'm not sure I understand the concern here. Say there are two use cases for Bucket, one called Fragment and another called Droplet. Say Fragment is always composed of exactly two reads in a particular orientation, and say that Droplet is composed of a variable/unknown number of reads where the order doesn't matter. You write a program that reads Buckets and assumes they're only composed of two reads, and I write a program that reads Buckets and assumes that each Bucket may have a different number of reads. No problem.

Actually, you could model your Fragment object as a Scala class or as a separate Avro object that's internal to your program. So when you load the RDD[Bucket], you immediately convert it to an RDD[Fragment].
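As a sketch of that conversion idea, here is plain Scala with hypothetical stand-in types (not the real bdg-formats classes), assuming a two-read fragment ordered by a readNum field:

```scala
// Hypothetical minimal stand-ins for the Avro records, for illustration only.
case class AlignmentRecord(readName: String, readNum: Int, sequence: String)
case class Bucket(bucketName: String, alignments: Seq[AlignmentRecord])

// A program-internal Fragment view: exactly two reads, ordered by readNum.
case class Fragment(name: String, read1: AlignmentRecord, read2: AlignmentRecord)

def toFragment(b: Bucket): Option[Fragment] =
  b.alignments.sortBy(_.readNum) match {
    case Seq(r1, r2) => Some(Fragment(b.bucketName, r1, r2))
    case _           => None // not a two-read bucket; the caller decides what to do
  }

// With Spark, the conversion is then just: rdd.flatMap(toFragment)
```

A program wanting Droplet semantics would simply keep the Bucket as-is, or define its own view the same way.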

@laserson (Contributor, Author) commented Sep 8, 2015

I don't want to hold up bigdatagenomics/adam#815, but after that merges, I'd like to resolve this before we add more functionality based on the Fragment type.

The DNA fragment that was targeted by the sequencer, resulting in
one or more reads.
A set of objects that should be physically analyzed/stored together.
Typically the result of some type of groupby operation.
Member

two nits on this comment:

  1. "A set of objects" feels vague bordering on cryptic. They're always "reads", right?
  2. I'm confused by the idea that you'd get these as a result of a groupby; I assume we're all talking about taking an RDD[AlignmentRecord] and doing a .map(r => r.getReadName -> r).groupByKey to it… but then where does Bucket.sequences come from / get populated? Is the answer just that it's a little more complicated than that? i.e.
def groupReadsIntoBuckets(rdd: RDD[AlignmentRecord]): RDD[Bucket] =
  // (Scala-to-Java collection conversions for the Avro builder elided)
  rdd
    .map(r => r.getReadName -> r)
    .groupByKey
    .mapValues(reads =>
      Bucket
        .newBuilder
        .setBucketName(reads.head.getReadName)
        .setSequences(reads.map(_.getSequence))  // is this the right field?
        .setAlignments(reads)
        .build
    )
    .values  // mapValues leaves key-value pairs; drop the keys to get RDD[Bucket]

Contributor Author

"A set of objects" feels vague bordering on cryptic. They're always "reads", right?

That's correct.

I'm confused by the idea that you'd get these as a result of a groupby; I assume we're all talking about taking an RDD[AlignmentRecord] and doing a .map(r => r.getReadName -> r).groupByKey to it… but then where does Bucket.sequences come from / get populated?

I'm not entirely sure of the utility of storing a separate array<Sequence>, as you should just be able to pull the sequences from the AlignmentRecord objects. I see something like this:

def readsToBuckets(rdd: RDD[AlignmentRecord], key: (AlignmentRecord) => String): RDD[Bucket] = {
  rdd
    .keyBy(key)
    .groupByKey
    .map { case (name, reads) =>
      Bucket
        .newBuilder
        .setBucketName(name)
        .setAlignments(reads)
        .build
    }
}

Member

I'm not entirely sure of the utility of storing a separate array, as you should just be able to pull the sequences from the AlignmentRecord objects. I see something like this:

I think as @ryan-williams was noting here, there's a bit of confusion around the array<Sequence>. To copy my comments from the other thread, I was thinking we'd null out the sequence and quality in the nested AlignmentRecords and then repopulate them on-demand.
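A plain-Scala sketch of that null-out/repopulate idea, using hypothetical stand-in types and field names (not the actual schema):

```scala
// Hypothetical stand-ins for the Avro records, for illustration only.
case class Read(readNum: Int, sequence: Option[String], quality: Option[String])
case class Bucket(name: String,
                  sequences: Map[Int, (String, String)], // readNum -> (sequence, quality)
                  reads: Seq[Read])

// On write: hoist sequence/quality up to the Bucket and null them in the reads,
// so each base/quality string is stored exactly once per fragment.
def compact(name: String, reads: Seq[Read]): Bucket =
  Bucket(
    name,
    reads.collect { case Read(n, Some(s), Some(q)) => (n, (s, q)) }.toMap,
    reads.map(_.copy(sequence = None, quality = None)))

// On read: repopulate the nested reads on demand.
def inflate(b: Bucket): Seq[Read] =
  b.reads.map { r =>
    b.sequences.get(r.readNum) match {
      case Some((s, q)) => r.copy(sequence = Some(s), quality = Some(q))
      case None         => r
    }
  }
```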

Contributor Author

On the other hand, @ryan-williams, perhaps there is interest in letting a Bucket hold an array of Buckets. Does Avro support this? May be getting too fancy, though, and it's not clear that there is any immediate demand.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/bdg-formats-prb/68/

@laserson (Contributor, Author)

@ryan-williams Looking again at my comments, I think there may be a use for keeping both an array of Sequence and an array of AlignmentRecord. In my comment, I specified that exactly one of these should be non-null. I guess this is from the perspective where you might be working with reads that don't need to be mapped in the same way. Still, I'd be fine making it only AlignmentRecord as long as it's easy to work with non-mapped AlignmentRecords.

@ryan-williams (Member)

I think that we have some ability and precedent for letting ARs be mapped or unmapped.

Let me know if there's some case for having both that I've missed; so far the two I've heard (on adam#815) are:

  1. normalize the sequence for efficiency in the case of chimeric reads
  2. store hard-clipped bases that aren't currently stored in ARs.

The first is not persuasive to me, and the second would be better solved by putting that info into ARs if we actually care about it.

Separately from all of this, I'd be inclined to agree that if we had both arrays we should try to enforce exactly one of them being non-null at all times, but that's still a kind of gross solution and I'd much rather avoid it if we don't have a compelling need for it.

Finally, if there was a concrete case where we want to bucket things besides ARs, that would also warrant consideration here, but it doesn't sound like we have one. Moreover, in the event that we had one, that might be best served by a new Bucket-like record-class that is specific to that thing; there's no use making Bucket so general that anyone that uses it has to layer assumptions all over it at runtime… just make a new record type at that point!

I think Bucket as basically an object that wraps an array of ARs is a nice bite-size abstraction to add.

@laserson (Contributor, Author)

sgtm with all your points. I'll push another commit to remove the Sequence array.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/bdg-formats-prb/69/

@heuermh (Member) commented Jun 28, 2016

Coming back to this one after having proposed Alphabet, Sequence, Slice, and Read in #83.

Might the proposal here be updated to create two new record types:

record Read[s, Bucket, Group, Fragment, Set, List, ...] {
  union { null, string } name = null;
  array<Read> reads = [];
}
record AlignmentRecord[s, Bucket, Group, Fragment, Set, List, ...] {
  union { null, string } name = null;
  array<AlignmentRecord> alignments = [];
}

I prefer Group over Bucket, even though that is slightly overloaded by BAM's recordGroup.

@heuermh added this to the 0.14.0 milestone Jul 2, 2019
@heuermh (Member) commented Jul 2, 2019

Closing as WontFix

@heuermh closed this Jul 2, 2019
Successfully merging this pull request may close these issues.

Update Fragment type to support more general "read-bucketing"