Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sequence, slice, and read schema #83

Merged
merged 1 commit into from
Nov 2, 2016
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
167 changes: 165 additions & 2 deletions src/main/resources/avro/bdg.avdl
Original file line number Diff line number Diff line change
Expand Up @@ -1196,7 +1196,7 @@ record Feature {

/**
Sample.
*/
*/
record Sample {

/**
Expand All @@ -1221,4 +1221,167 @@ record Sample {
*/
map<string> attributes = {};
}
}

/**
Alphabet.
*/
enum Alphabet {

/**
DNA alphabet.
*/
DNA,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this strict? (GATC only?) Or IUPAC? Or should we have both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently I'm using it to map to and from Biojava Alphabet, which is based on IUPAC, but entirely extensible. Here I see it as simply informational. Unless of course we want to fully model sequences as symbol lists where symbols come from alphabets, which was discussed elsewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, probably not. Sounds good as is.


/**
RNA alphabet.
*/
RNA,

/**
Protein alphabet.
*/
PROTEIN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have an OTHER enum as well? E.g., for DNA + methylation? IDK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-0 Guess I would wait until a use case arises. An OTHER wouldn't necessarily map into any specific alphabet in Biojava, which supports arbitrary alphabets.

}

/**
Contiguous sequence from an alphabet, e.g. a DNA contig, an RNA transcript,
or a protein translation.
*/
record Sequence {

/**
Name of this sequence.
*/
union { null, string } name = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to our discussion on bigdatagenomics/adam#1198 RE: what is name and what is description, it'd prolly be good to flesh out the docs here. Again, I think that the sequence metadata (database IDs) should fall into the description field.


/**
Description for this sequence.
*/
union { null, string } description = null;

/**
Alphabet for this sequence, defaults to Alphabet.DNA.
*/
union { Alphabet, null } alphabet = "DNA";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the order of Alphabet and null reversed here deliberately? Is this something unique to enums?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of the default value should come first in a union. I.e., if the default value is null, then union { null, Alphabet } would be correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is to default to Alphabet.DNA

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah...


/**
Sequence.
*/
union { null, string } sequence = null;

/**
Length of this sequence.
*/
union { null, long } length = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just an optimization to explicitly model the length? Or is this for cases where you have an unknown sequence with a known length?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An optimization of sorts, similar to storing Variant.end even though it can be calculated from Variant.start and Variant.referenceAllele.

}

/**
View of a contiguous region of a sequence.
*/
record Slice { // extends Sequence

/**
Name of the sequence this slice views.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the same as the name of the Sequence, right? In the interest of verbosity, let's doc that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. If extension were possible in Avro schema, then Slice and Read would extend Sequence

*/
union { null, string } name = null;

/**
Description for the sequence this slice views.
*/
union { null, string } description = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we expect the slice record to be used? If we expect to use it for short slices of sequence (e.g., <500bp), then carrying around metadata is going to be inefficient, and we should probably drop this field.

Actually, we may want to drop this field from Sequence as well, and just store the sequence metadata in an associated Contig record.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind keeping the field and setting it to null where appropriate. The use case for slice is the same as NucleotideContigFragment, and also for view a region of a sequence with features, say a typical GenBank record, or a slice in the Ensembl Perl APIs.


/**
Alphabet for the sequence this slice views, defaults to Alphabet.DNA.
*/
union { Alphabet, null } alphabet = "DNA";

/**
Sequence for this slice.
*/
union { null, string } sequence = null;

/**
Start position for this slice on the sequence this slice views, in 0-based coordinate
system with closed-open intervals.
*/
union { null, long } start = null;

/**
End position for this slice on the sequence this slice views, in 0-based coordinate
system with closed-open intervals.
*/
union { null, long } end = null;

/**
Strand for this slice, if any, defaults to Strand.Independent.
*/
union { Strand, null } strand = "Independent";

/**
Length of this slice.
*/
union { null, long } length = null;
}

/**
Quality score variant.
*/
enum QualityScoreVariant {

/**
Sanger and Illumina version &gt;= 1.8 FASTQ quality score variant.
*/
FASTQ_SANGER,

/**
Solexa and Illumina version 1.0 FASTQ quality score variant.
*/
FASTQ_SOLEXA,

/**
Illumina version &gt;= 1.3 and &lt; 1.8 FASTQ quality score variant.
*/
FASTQ_ILLUMINA
}

/**
Sequence with quality scores.
*/
record Read { // extends Sequence
Copy link
Member

@fnothaft fnothaft Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really much of a fan of adding this record, as I don't see what it really adds over our AlignmentRecord and Fragment schemas. There isn't much of an overhead to having the AlignmentRecord schema with the additional alignment metadata fields (from prior measurement, those are <1% of the space on disk, read sequence is about 30% of space, quality scores are about 60%, and the remaining 10% is contig name and metadata/map qual --> note that these numbers were before we un-nested Contig, so the contig name/metadata portion has gone down), and we'd need to add and maintain import/export/convert paths for another schema. I see the Fragment schema as useful because it is a natural view over the data --> we create said view during duplicate marking, and would have use for it in variant calling as well.

If we were to add it, I would like the field names to be consistent with AlignmentRecord.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a world before samtools. ;)

Storing unaligned reads in SAM or related formats may have some practical benefits, but is conceptually strange for say assembly-specific or metagenomic workflows. Thus I don't mind the similar representations, at least for a short time. As with Slice, which is mostly equivalent to NucleotideSequenceFragment, there isn't really a way to start exploring refactoring in the ADAM codebase until the (experimental) schema make it into formats.

I haven't really looked at the Fragment schema and code enough to know how it relates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 on having a read record. Lot's of my flavor of genomics never touches a bam file. I would also be in favor of modeling groups of reads (e.g., pairs from paired end, or clustered reads), which has been discussed for some time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(e.g., in #54)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to pull request a groups-of-reads record to this pull request if you have something in mind.


/**
Name of this read.
*/
union { null, string } name = null;

/**
Description for this read.
*/
union { null, string } description = null;

/**
Alphabet for this read, defaults to Alphabet.DNA.
*/
union { Alphabet, null } alphabet = "DNA";

/**
Sequence for this read.
*/
union { null, string } sequence = null;

/**
Length of this read.
*/
union { null, long } length = null;

/**
Quality scores for this read.
*/
union { null, string } qualityScores = null;

/**
Quality score variant for this read, defaults to QualityScoreVariant.FASTQ_SANGER.
*/
union { QualityScoreVariant, null } qualityScoreVariant = "FASTQ_SANGER";
}
}