[BDG-FORMATS-54] Generalizing the Fragment type #56

Closed
wants to merge 3 commits

Conversation

@laserson (Contributor) commented May 5, 2015

I think storing reads after a "groupby" operation has occurred is an excellent idea, as the reads will likely always be analyzed as a group. Having multiple reads from a single fragment of DNA is one such use case, but there are others. Droplet-seq is one that I am interested in. Incorporating random
barcodes is another.

Here is a summary of my proposal, based in part on @ilveroluca's work:

  • Support a fastq-like object Sequence, though I don't think this is strictly necessary.
  • Rename to Bucket, as it sounds more general to my ears.
  • Move run-specific or instrument-specific metadata into separate objects, as they don't necessarily make sense as top-level objects.
  • Remove fragmentSize, as it's specific to one use case and it's rather easily computable.
  • Support for multiple types of grouped objects. What's the best way to deal with this? union somehow? I envision that we may add more types in the future that we'll want to persist as grouped objects. At the moment, there is just a set of arrays for the type of objects that could be grouped. This could be extended as we desire the ability to group other object types.
  • Sequence and quality information from alignments should be retrieved from AlignmentRecords.
  • I don't think platform-specific information should be propagated through the entire chain of data types. Why would we include it in Genotype, then? In my mind, any platform-specific analysis happens very early on, generally even before the fastq stage. Therefore, I've moved platform-specific metadata into the Sequence object.
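For concreteness, here is a rough Avro IDL sketch of the kind of record this proposal points toward. The field names (bucketName, sequences, alignments) are illustrative guesses drawn from the discussion, not the actual schema:

```avdl
record Bucket {
  // Identifier shared by all grouped reads (e.g., read name or barcode).
  union { null, string } bucketName = null;
  // Optional unaligned, fastq-like records.
  array<Sequence> sequences = [];
  // Aligned reads; sequence/quality can be retrieved from these.
  array<AlignmentRecord> alignments = [];
}
```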

Fixes #54.

@laserson mentioned this pull request May 5, 2015
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/bdg-formats-prb/62/

@ilveroluca (Contributor)

I think the Bucket and the Fragment record types represent two different concepts.

The Bucket represents a generic group of reads, akin to the recordGroup concept that is present in SAM/BAM and has already been transferred to the AlignmentRecord.

The Fragment on the other hand is a single piece of DNA which has been read in parts. The parts are ordered and they may overlap.

As you said, the "Fragment" could be seen as a special case of the Bucket, as long as the Bucket is ordered (which in the case of your patch, it is). However, if a program starts processing a data set and finds a Bucket, how should it know whether to treat the sequences within the group as belonging to a single DNA fragment (e.g., important for alignment) or as a group of single-end reads? You'd have to tell it, somehow, and complicate processing to address the different possible behaviours in the code. Or you'd have to restrict the software to only work with "Fragment buckets", taking away the user's ability to use Buckets for logical read groups -- which wouldn't be nice.

I suppose that when dealing with single-end reads, or perhaps in downstream analyses, knowing whether two reads come from the same fragment might not be very useful. But in upstream analysis it happens often enough (every flowcell!) to warrant its own record type.

My two cents.

@laserson (Contributor, Author) commented May 7, 2015

I think the Bucket and the Fragment record types represent two different concepts.

Agreed, but I'd say that Fragment is a subclass of Bucket.

The Bucket represents a generic group of reads, akin to the recordGroup concept that is present in SAM/BAM and has already been transferred to the AlignmentRecord.

Except the BAM format is pre-groupBy. The key useful feature here is that the Bucket stores all the reads in a group together physically, while the readGroup annotation in BAM simply stores the group key with the read, but the reads can be scattered in the file.

if a program starts processing a data set and finds a Bucket, how should it know whether to treat the sequences within the group as belonging to a single DNA fragment (e.g., important for alignment) or as a group of single-end reads? You'd have to tell it, somehow, and complicate processing to address the different possible behaviours in the code. Or you'd have to restrict the software to only work with "Fragment buckets", taking away the user's ability to use Buckets for logical read groups -- which wouldn't be nice.

I'm not sure I understand the concern here. Say there are two use cases for Bucket, one called Fragment and another called Droplet. Say Fragment is always composed of exactly two reads in a particular orientation, and say that Droplet is composed of a variable/unknown number of reads where the order doesn't matter. You write a program that reads Buckets and assumes they're only composed of two reads, and I write a program that reads Buckets and assumes that each Bucket may have a different number of reads. No problem.

Actually, you could model your Fragment object as a Scala class or as a separate Avro object that's internal to your program. So when you load the RDD[Bucket], you immediately convert it to an RDD[Fragment].
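As a sketch of that conversion idea, here is plain Scala with hypothetical stand-in types (not the real bdg-formats classes), assuming a two-read fragment ordered by a readNum field:

```scala
// Hypothetical minimal stand-ins for the Avro records, for illustration only.
case class AlignmentRecord(readName: String, readNum: Int, sequence: String)
case class Bucket(bucketName: String, alignments: Seq[AlignmentRecord])

// A program-internal Fragment view: exactly two reads, ordered by readNum.
case class Fragment(name: String, read1: AlignmentRecord, read2: AlignmentRecord)

def toFragment(b: Bucket): Option[Fragment] =
  b.alignments.sortBy(_.readNum) match {
    case Seq(r1, r2) => Some(Fragment(b.bucketName, r1, r2))
    case _           => None // not a two-read bucket; the caller decides what to do
  }

// With Spark, the conversion is then just: rdd.flatMap(toFragment)
```

A program wanting Droplet semantics would simply keep the Bucket as-is, or define its own view the same way.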

@laserson (Contributor, Author) commented Sep 8, 2015

I don't want to hold up bigdatagenomics/adam#815, but after that merges, I'd like to resolve this before we add more functionality based on the Fragment type.

The DNA fragment that was targeted by the sequencer, resulting in
one or more reads.
A set of objects that should be physically analyzed/stored together.
Typically the result of some type of groupby operation.
Member

two nits on this comment:

  1. "A set of objects" feels vague bordering on cryptic. They're always "reads", right?
  2. I'm confused by the idea that you'd get these as a result of a groupby; I assume we're all talking about taking an RDD[AlignmentRecord] and doing a .map(r => r.getReadName -> r).groupByKey to it… but then where does Bucket.sequences come from / get populated? Is the answer just that it's a little more complicated than that? i.e.
def groupReadsIntoBuckets(rdd: RDD[AlignmentRecord]): RDD[Bucket] =
  // (Scala-to-Java collection conversions for the Avro builder elided)
  rdd
    .map(r => r.getReadName -> r)
    .groupByKey
    .mapValues(reads =>
      Bucket
        .newBuilder
        .setBucketName(reads.head.getReadName)
        .setSequences(reads.map(_.getSequence))  // is this the right field?
        .setAlignments(reads)
        .build
    )
    .values  // mapValues leaves key-value pairs; drop the keys to get RDD[Bucket]

Contributor Author

"A set of objects" feels vague bordering on cryptic. They're always "reads", right?

That's correct.

I'm confused by the idea that you'd get these as a result of a groupby; I assume we're all talking about taking an RDD[AlignmentRecord] and doing a .map(r => r.getReadName -> r).groupByKey to it… but then where does Bucket.sequences come from / get populated?

I'm not entirely sure of the utility of storing a separate array<Sequence>, as you should just be able to pull the sequences from the AlignmentRecord objects. I see something like this:

def readsToBuckets(rdd: RDD[AlignmentRecord], key: (AlignmentRecord) => String): RDD[Bucket] = {
  rdd
    .keyBy(key)
    .groupByKey
    .map { case (name, reads) =>
      Bucket
        .newBuilder
        .setBucketName(name)
        .setAlignments(reads)
        .build
    }
}

Member

I'm not entirely sure of the utility of storing a separate array, as you should just be able to pull the sequences from the AlignmentRecord objects. I see something like this:

I think as @ryan-williams was noting here, there's a bit of confusion around the array<Sequence>. To copy my comments from the other thread, I was thinking we'd null out the sequence and quality in the nested AlignmentRecords and then repopulate them on-demand.
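A plain-Scala sketch of that null-out/repopulate idea, using hypothetical stand-in types and field names (not the actual schema):

```scala
// Hypothetical stand-ins for the Avro records, for illustration only.
case class Read(readNum: Int, sequence: Option[String], quality: Option[String])
case class Bucket(name: String,
                  sequences: Map[Int, (String, String)], // readNum -> (sequence, quality)
                  reads: Seq[Read])

// On write: hoist sequence/quality up to the Bucket and null them in the reads,
// so each base/quality string is stored exactly once per fragment.
def compact(name: String, reads: Seq[Read]): Bucket =
  Bucket(
    name,
    reads.collect { case Read(n, Some(s), Some(q)) => (n, (s, q)) }.toMap,
    reads.map(_.copy(sequence = None, quality = None)))

// On read: repopulate the nested reads on demand.
def inflate(b: Bucket): Seq[Read] =
  b.reads.map { r =>
    b.sequences.get(r.readNum) match {
      case Some((s, q)) => r.copy(sequence = Some(s), quality = Some(q))
      case None         => r
    }
  }
```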

Contributor Author

On the other hand, @ryan-williams, perhaps there is interest in letting a Bucket hold an array of Buckets. Does Avro support this? May be getting too fancy, though, and it's not clear that there is any immediate demand.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/bdg-formats-prb/68/

@laserson (Contributor, Author)

@ryan-williams Looking again at my comments, I think there may be a use for keeping both an array of Sequence and an array of AlignmentRecord. In my comment, I specified that exactly one of these should be non-null. I guess this is from the perspective where you might be working with reads that don't need to be mapped in the same way. Still, I'd be fine making it only AlignmentRecord as long as it's easy to work with non-mapped AlignmentRecords.

@ryan-williams (Member)

I think that we have some ability and precedent for letting ARs be mapped or unmapped.

Let me know if there's some case for having both that I've missed; so far the two I've heard (on adam#815) are:

  1. normalize the sequence for efficiency in the case of chimeric reads
  2. store hard-clipped bases that aren't currently stored in ARs.

The first is not persuasive to me, and the second would be better solved by putting that info into ARs if we actually care about it.

Separately from all of this, I'd be inclined to agree that if we had both arrays we should try to enforce exactly one of them being non-null at all times, but that's still a kind of gross solution and I'd much rather avoid it if we don't have a compelling need for it.

Finally, if there was a concrete case where we want to bucket things besides ARs, that would also warrant consideration here, but it doesn't sound like we have one. Moreover, in the event that we had one, that might be best served by a new Bucket-like record-class that is specific to that thing; there's no use making Bucket so general that anyone that uses it has to layer assumptions all over it at runtime… just make a new record type at that point!

I think Bucket as basically an object that wraps an array of ARs is a nice bite-size abstraction to add.

@laserson (Contributor, Author)

sgtm with all your points. I'll push another commit to remove the Sequence array.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/bdg-formats-prb/69/

@heuermh (Member) commented Jun 28, 2016

Coming back to this one after having proposed Alphabet, Sequence, Slice, and Read in #83.

Might the proposal here be updated to create two new record types:

record Read[s, Bucket, Group, Fragment, Set, List, ...] {
  union { null, string } name = null;
  array<Read> reads = [];
}
record AlignmentRecord[s, Bucket, Group, Fragment, Set, List, ...] {
  union { null, string } name = null;
  array<AlignmentRecord> alignments = [];
}

I prefer Group over Bucket, even though that is slightly overloaded by BAM's recordGroup.

@heuermh added this to the 0.14.0 milestone Jul 2, 2019
@heuermh (Member) commented Jul 2, 2019

Closing as WontFix

@heuermh closed this Jul 2, 2019
Successfully merging this pull request may close these issues.

Update Fragment type to support more general "read-bucketing"