[BDG-FORMATS-54] Generalizing the Fragment type #56
Conversation
Test PASSed.
As you said, the "Fragment" could be seen as a special case of the more general grouped type. I suppose that when dealing with single-end reads, or perhaps in downstream analyses, knowing whether two reads come from the same fragment might not be very useful. But in upstream analysis it is something that happens often enough (every flowcell!) to warrant its own record type. My two cents.
Agreed, but I'd say that…
Except the BAM format is pre-groupBy. The key useful feature here is that the…
I'm not sure I understand the concern here. Say there are two use cases for… Actually, you could model your…
I don't want to hold up bigdatagenomics/adam#815, but after that merges, I'd like to resolve this before we add more functionality based on the…
The DNA fragment that was targeted by the sequencer, resulting in
one or more reads.
A set of objects that should be physically analyzed/stored together.
Typically the result of some type of groupby operation.
two nits on this comment:
- "A set of objects" feels vague bordering on cryptic. They're always "reads", right?
- I'm confused by the idea that you'd get these as a result of a groupby; I assume we're all talking about taking an `RDD[AlignmentRecord]` and doing a `.map(r => r.getReadName -> r).groupByKey` to it… but then where does `Bucket.sequences` come from / get populated? Is the answer just that it's a little more complicated than that? i.e.
```scala
def groupReadsIntoBuckets(rdd: RDD[AlignmentRecord]): RDD[Bucket] =
  rdd
    .map(r => r.getReadName -> r)
    .groupByKey
    .mapValues(reads =>
      Bucket
        .newBuilder
        .setBucketName(reads.head.getReadName)
        .setSequences(reads.map(_.getSequence)) // is this the right field?
        .setAlignments(reads)
        .build
    )
    .values
```
"A set of objects" feels vague bordering on cryptic. They're always "reads", right?
That's correct.
I'm confused by the idea that you'd get these as a result of a groupby; I assume we're all talking about taking an RDD[AlignmentRecord] and doing a .map(r => r.getReadName -> r).groupByKey to it… but then where does Bucket.sequences come from / get populated?
I'm not entirely sure of the utility of storing a separate `array<Sequence>`, as you should just be able to pull the sequences from the `AlignmentRecord` objects. I see something like this:
```scala
def readsToBuckets(rdd: RDD[AlignmentRecord], key: AlignmentRecord => String): RDD[Bucket] = {
  rdd
    .keyBy(key)
    .groupByKey
    .map { case (name, reads) =>
      Bucket
        .newBuilder
        .setBucketName(name)
        .setAlignments(reads)
        .build
    }
}
```
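To make the grouping step concrete, here is a minimal, runnable sketch of the same idea on plain Scala collections (no Spark). The case classes are hypothetical stand-ins for the Avro-generated classes, not the actual bdg-formats schema:

```scala
// Hypothetical stand-ins for the Avro-generated classes, just to make the
// grouping step runnable on plain collections.
case class AlignmentRecord(readName: String, sequence: String)
case class Bucket(bucketName: String, alignments: Seq[AlignmentRecord])

// Collection-based analogue of the keyBy/groupByKey step: the key function
// decides what "belongs together" (read name here, but it could just as
// well be a barcode).
def readsToBuckets(reads: Seq[AlignmentRecord],
                   key: AlignmentRecord => String): Seq[Bucket] =
  reads
    .groupBy(key)
    .map { case (name, rs) => Bucket(name, rs) }
    .toSeq

val buckets = readsToBuckets(
  Seq(
    AlignmentRecord("frag1", "ACGT"),
    AlignmentRecord("frag1", "TTGA"),
    AlignmentRecord("frag2", "GGCC")
  ),
  _.readName
)
```

Note there is no separate sequences array here: the sequences stay inside the nested records, matching the point above about pulling them from the `AlignmentRecord` objects.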
I'm not entirely sure of the utility of storing a separate array, as you should just be able to pull the sequences from the AlignmentRecord objects. I see something like this:
I think as @ryan-williams was noting here, there's a bit of confusion around the `array<Sequence>`. To copy my comments from the other thread, I was thinking we'd null out the sequence and quality in the nested `AlignmentRecord`s and then repopulate them on-demand.
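A minimal sketch of that null-out/repopulate idea, using simplified hypothetical case classes (`Option` stands in for Avro's nullable fields; names are illustrative, not the actual schema):

```scala
// Simplified, hypothetical records; Option stands in for nullable Avro fields.
case class AlignmentRecord(readName: String,
                           sequence: Option[String],
                           quality: Option[String])
case class Bucket(bucketName: String,
                  sequences: Seq[(String, String)],  // (sequence, quality), stored once
                  alignments: Seq[AlignmentRecord])  // nested records, seq/qual nulled out

// Move sequence/quality out of the nested records into the bucket-level array.
def compact(name: String, reads: Seq[AlignmentRecord]): Bucket =
  Bucket(
    name,
    reads.map(r => (r.sequence.getOrElse(""), r.quality.getOrElse(""))),
    reads.map(_.copy(sequence = None, quality = None))
  )

// Repopulate on demand by zipping the arrays back together (order is preserved).
def expand(b: Bucket): Seq[AlignmentRecord] =
  b.alignments.zip(b.sequences).map { case (r, (s, q)) =>
    r.copy(sequence = Some(s), quality = Some(q))
  }
```

The round trip `expand(compact(name, reads))` returns the original records, which is the invariant this design would rely on.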
On the other hand, @ryan-williams, perhaps there is interest in letting a `Bucket` hold an array of `Bucket`s. Does Avro support this? May be getting too fancy though, and it's not clear that there is any immediate demand.
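For what it's worth, Avro does allow a named record type to reference itself, so a self-nesting Bucket is representable. A toy sketch of the shape in plain case classes (names illustrative only):

```scala
// Toy sketch: a Bucket that can hold sub-Buckets. Avro permits this kind of
// recursive reference to a named record type; whether it's worth the added
// complexity is the open question here.
case class AlignmentRecord(readName: String)
case class Bucket(bucketName: String,
                  alignments: Seq[AlignmentRecord] = Nil,
                  subBuckets: Seq[Bucket] = Nil)

// Nesting depth of a bucket tree.
def depth(b: Bucket): Int =
  if (b.subBuckets.isEmpty) 1 else 1 + b.subBuckets.map(depth).max

// e.g. a barcode-level bucket wrapping per-fragment buckets
val nested = Bucket(
  "barcode-AACG",
  subBuckets = Seq(
    Bucket("frag1", alignments = Seq(AlignmentRecord("r1"), AlignmentRecord("r2"))),
    Bucket("frag2", alignments = Seq(AlignmentRecord("r3")))
  )
)
```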
What I think is an excellent idea is storing reads after a "groupby" operation has occurred, as the reads will likely always be analyzed as a group. Having multiple reads from a single fragment of DNA is one such use case, but there are others. Droplet-seq is one that I am interested in. Incorporating random barcodes is another.

Here is a summary of my proposal, based in part on @ilveroluca's work:

* Support a fastq-like object `Sequence`, though I don't think this is strictly necessary.
* Rename to `Bucket`, as it sounds more general to my ears.
* Move run-specific or instrument-specific metadata into separate objects, as they don't necessarily make sense as top-level objects.
* Remove `fragmentSize`, as it's specific to one use case and it's rather easily computable.
* Support multiple types of grouped objects. What's the best way to deal with this? `union` somehow? I envision that we may add more types in the future that we'll want to persist as grouped objects. At the moment, there is just a set of arrays for the types of objects that could be grouped. This could be extended as we desire the ability to group other object types.
* Sequence and quality information for alignments should be retrieved from `AlignmentRecord`s.
* I don't think platform-specific information should be propagated through the entire chain of data types. Why don't we include it in `Genotype`, then? In my mind, any platform-specific analysis happens very early on, generally even before the fastq stage. Therefore, I've moved platform-specific metadata into the `Sequence` object.

Fixes bigdatagenomics#54.
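To make the droplet-seq/barcode use cases concrete, here is a small hypothetical example (field names invented for illustration) showing that the same group-then-wrap step works for any grouping key:

```scala
// Hypothetical read carrying a droplet barcode; names are illustrative only.
case class Read(readName: String, barcode: String, sequence: String)
case class Bucket(bucketName: String, reads: Seq[Read])

// The grouping key is pluggable: read name for multiple reads from one
// fragment, droplet barcode for droplet-seq, a random barcode for UMIs, etc.
def bucketBy(reads: Seq[Read], key: Read => String): Seq[Bucket] =
  reads.groupBy(key).map { case (k, rs) => Bucket(k, rs) }.toSeq

val reads = Seq(
  Read("r1", "AAAC", "ACGT"),
  Read("r2", "AAAC", "GGTA"),
  Read("r3", "TTGG", "CCAT")
)
val byBarcode = bucketBy(reads, _.barcode)
val byName    = bucketBy(reads, _.readName)
```

The record type stays the same; only the key function changes per use case, which is the generality the rename to `Bucket` is after.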
Force-pushed from 8597644 to c8449ec.
Test PASSed.
@ryan-williams Looking again at my comments, I think there may be a use for keeping both an array of…
I think that we have some ability and precedent for letting ARs be mapped or unmapped. Let me know if there's some case for having both that I've missed; so far the two I've heard (on adam#815) are:
The first is not persuasive to me, and the second would be better solved by putting that info into ARs if we actually care about it. Separately from all of this, I'd be inclined to agree that if we had both arrays we should try to enforce exactly one of them being non-null at all times, but that's still a kind of gross solution and I'd much rather avoid it if we don't have a compelling need for it. Finally, if there were a concrete case where we want to bucket things besides ARs, that would also warrant consideration here, but it doesn't sound like we have one. Moreover, in the event that we had one, that might be best served by a new Bucket-like record class that is specific to that thing; there's no use making Bucket so general that anyone who uses it has to layer assumptions all over it at runtime… just make a new record type at that point! I think Bucket as basically an object that wraps an array of ARs is a nice bite-size abstraction to add.
sgtm with all your points. I'll push another commit to remove the…
Test PASSed.
Coming back to this one after having proposed… Might the proposal here be updated to create two new record types?
I prefer Group over Bucket, even though that is slightly overloaded by BAM's recordGroup.
Closing as WontFix