-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sequence, slice, and read schema #83
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1196,7 +1196,7 @@ record Feature { | |
|
||
/** | ||
Sample. | ||
*/ | ||
*/ | ||
record Sample { | ||
|
||
/** | ||
|
@@ -1221,4 +1221,167 @@ record Sample { | |
*/ | ||
map<string> attributes = {}; | ||
} | ||
} | ||
|
||
/** | ||
Alphabet. | ||
*/ | ||
enum Alphabet { | ||
|
||
/** | ||
DNA alphabet. | ||
*/ | ||
DNA, | ||
|
||
/** | ||
RNA alphabet. | ||
*/ | ||
RNA, | ||
|
||
/** | ||
Protein alphabet. | ||
*/ | ||
PROTEIN | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would it make sense to have an There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. -0 Guess I would wait until a use case arises. An |
||
} | ||
|
||
/** | ||
Contiguous sequence from an alphabet, e.g. a DNA contig, an RNA transcript, | ||
or a protein translation. | ||
*/ | ||
record Sequence { | ||
|
||
/** | ||
Name of this sequence. | ||
*/ | ||
union { null, string } name = null; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Coming back to our discussion on bigdatagenomics/adam#1198 RE: what is name and what is description, it'd prolly be good to flesh out the docs here. Again, I think that the sequence metadata (database IDs) should fall into the |
||
|
||
/** | ||
Description for this sequence. | ||
*/ | ||
union { null, string } description = null; | ||
|
||
/** | ||
Alphabet for this sequence, defaults to Alphabet.DNA. | ||
*/ | ||
union { Alphabet, null } alphabet = "DNA"; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is the order of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The type of the default value should come first in a union. I.e., if the default value is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The intention is to default to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh yeah... |
||
|
||
/** | ||
Sequence. | ||
*/ | ||
union { null, string } sequence = null; | ||
|
||
/** | ||
Length of this sequence. | ||
*/ | ||
union { null, long } length = null; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this just an optimization to explicitly model the length? Or is this for cases where you have an unknown sequence with a known length? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. An optimization of sorts, similar to storing |
||
} | ||
|
||
/** | ||
View of a contiguous region of a sequence. | ||
*/ | ||
record Slice { // extends Sequence | ||
|
||
/** | ||
Name of the sequence this slice views. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should be the same as the name of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah. If extension were possible in Avro schema, then |
||
*/ | ||
union { null, string } name = null; | ||
|
||
/** | ||
Description for the sequence this slice views. | ||
*/ | ||
union { null, string } description = null; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do we expect the Actually, we may want to drop this field from There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't mind keeping the field and setting it to null where appropriate. The use case for |
||
|
||
/** | ||
Alphabet for the sequence this slice views, defaults to Alphabet.DNA. | ||
*/ | ||
union { Alphabet, null } alphabet = "DNA"; | ||
|
||
/** | ||
Sequence for this slice. | ||
*/ | ||
union { null, string } sequence = null; | ||
|
||
/** | ||
Start position for this slice on the sequence this slice views, in 0-based coordinate | ||
system with closed-open intervals. | ||
*/ | ||
union { null, long } start = null; | ||
|
||
/** | ||
End position for this slice on the sequence this slice views, in 0-based coordinate | ||
system with closed-open intervals. | ||
*/ | ||
union { null, long } end = null; | ||
|
||
/** | ||
Strand for this slice, if any, defaults to Strand.Independent. | ||
*/ | ||
union { Strand, null } strand = "Independent"; | ||
|
||
/** | ||
Length of this slice. | ||
*/ | ||
union { null, long } length = null; | ||
} | ||
|
||
/** | ||
Quality score variant. | ||
*/ | ||
enum QualityScoreVariant { | ||
|
||
/** | ||
Sanger and Illumina version >= 1.8 FASTQ quality score variant. | ||
*/ | ||
FASTQ_SANGER, | ||
|
||
/** | ||
Solexa and Illumina version 1.0 FASTQ quality score variant. | ||
*/ | ||
FASTQ_SOLEXA, | ||
|
||
/** | ||
Illumina version >= 1.3 and < 1.8 FASTQ quality score variant. | ||
*/ | ||
FASTQ_ILLUMINA | ||
} | ||
|
||
/** | ||
Sequence with quality scores. | ||
*/ | ||
record Read { // extends Sequence | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not really much of a fan of adding this record, as I don't see what it really adds over our If we were to add it, I would like the field names to be consistent with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a world before samtools. ;) Storing unaligned reads in SAM or related formats may have some practical benefits, but is conceptually strange for say assembly-specific or metagenomic workflows. Thus I don't mind the similar representations, at least for a short time. As with I haven't really looked at the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm +1 on having a read record. Lot's of my flavor of genomics never touches a bam file. I would also be in favor of modeling groups of reads (e.g., pairs from paired end, or clustered reads), which has been discussed for some time. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (e.g., in #54) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Feel free to pull request a groups-of-reads record to this pull request if you have something in mind. |
||
|
||
/** | ||
Name of this read. | ||
*/ | ||
union { null, string } name = null; | ||
|
||
/** | ||
Description for this read. | ||
*/ | ||
union { null, string } description = null; | ||
|
||
/** | ||
Alphabet for this read, defaults to Alphabet.DNA. | ||
*/ | ||
union { Alphabet, null } alphabet = "DNA"; | ||
|
||
/** | ||
Sequence for this read. | ||
*/ | ||
union { null, string } sequence = null; | ||
|
||
/** | ||
Length of this read. | ||
*/ | ||
union { null, long } length = null; | ||
|
||
/** | ||
Quality scores for this read. | ||
*/ | ||
union { null, string } qualityScores = null; | ||
|
||
/** | ||
Quality score variant for this read, defaults to QualityScoreVariant.FASTQ_SANGER. | ||
*/ | ||
union { QualityScoreVariant, null } qualityScoreVariant = "FASTQ_SANGER"; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this strict? (GATC only?) Or IUPAC? Or should we have both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently I'm using it to map to and from Biojava Alphabet, which is based on IUPAC, but entirely extensible. Here I see it as simply informational. Unless of course we want to fully model sequences as symbol lists where symbols come from alphabets, which was discussed elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, probably not. Sounds good as is.