Inheritance in GA4GH schemas #264
Hi @AAMargolin, It's been a topic for quite some time. We don't need to use Avro IDL. We can either change it - which many companies do internally - or use something like WebIDL which provides inheritance capabilities: If we keep working around the limitations of Avro, the schema will become cumbersome to support and expand upon. If we do want to stay with Avro, we can consider super-classes as separate files which we then import and play around with the namespaces to match the inheritance. Paul |
I've been playing around with a 'workaround' (discussed on #254). In this method, there are fields for connections to the different levels (links to Sample, Individual, Feature, Variant) that are null unless needed.
Then the 'type' string would define which of the fields are linked by the association, i.e. "Variant,Phenotype". Is this the "GA4GH style"? |
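As a hedged sketch of that workaround (the record and field names below are hypothetical, not the actual proposal), the association record could carry one nullable link per level plus the type string, expressed here as a JSON Avro schema built and checked from Python:
import json
import avro.schema   # older versions spell the entry point parse(), some spell it Parse()

association_schema = {
    "type": "record",
    "name": "Association",
    "namespace": "org.ga4gh.models",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "type", "type": "string"},          # e.g. "Variant,Phenotype"
        # Only the links relevant to this particular association are non-null.
        {"name": "variantId",    "type": ["null", "string"], "default": None},
        {"name": "featureId",    "type": ["null", "string"], "default": None},
        {"name": "individualId", "type": ["null", "string"], "default": None},
        {"name": "sampleId",     "type": ["null", "string"], "default": None},
        {"name": "phenotypeId",  "type": "string"},
    ],
}

# Check that this is valid Avro.
parsed = avro.schema.parse(json.dumps(association_schema))
print(parsed.name)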
What I was trying to suggest on the DWG call this morning was to use a union, which will be treated generically like an Object. If we "type" the association, this could be used to disambiguate the type of the subject, like:
record Association {
string id;
org.ga4gh.models.OntologyTerm associationType;
Evidence evidence;
union {null, float} weight = null; #also need a unit
union {null, array<org.ga4gh.models.Feature>, org.ga4gh.models.Individual, org.ga4gh.models.Sample} subject = null;
org.ga4gh.models.Phenotype phenotype; #not null
}
The associationType could be leveraged to cast the subject into its proper class. So, instead of creating enums of permissible combinations, these could be defined with proper semantics/logical definitions in an association ontology, and implementing systems could handle the class inheritance at implementation time. @cmungall has put some thought into how to model associations in OBAN. From what I understand of the Avro documentation, each implementation does not have to deal with all of the kinds of Object types; so if one implementation only has patients and another only has variants, each can contribute data using their relevant schema, since Avro can handle differences in the reader/writer unions through recursive resolution. This would allow just a single model for Associations, and not require class inheritance to be encoded in the Avro schema. |
Here's a possible approach. It works by preprocessing a schema with subclassing declarations into fully expanded definitions that are usable with Avro. With the following JSON Avro schema:
[{
"name": "superclass",
"type": "record",
"fields": [{ "name": "example", "type": "another_record"}]
},
{
"name": "subclass",
"type": "record",
"extends": "superclass",
"specialize": {"another_record": "more_specific_record"},
"fields": [{ "name": "example2", "type": "int"}]
}]
"extends" copies the records from the superclass into the subclass. "specialize" replaces instances of some type in the superclass schema with some other type. Here's the expanded definition of "subclass" after applying the "extends" and "specialize" fields:
{
"name": "superclass",
"type": "record",
"fields": [{ "name": "example", "type": "another_record"}]
}
{
"name": "subclass",
"type": "record",
"fields": [
{ "name": "example", "type": "more_specific_record"},
{ "name": "example2", "type": "int"}]
}
Here's some Python code to implement the "extends" and "specialize" fields (it needs copy and json from the standard library):
import copy
import json

def specialize(items, spec):
    # Recursively replace type names listed in 'spec' throughout a schema fragment.
    if isinstance(items, dict):
        for n in ("type", "items", "values"):
            if n in items:
                items[n] = specialize(items[n], spec)
        return items
    if isinstance(items, list):
        n = []
        for i in items:
            n.append(specialize(i, spec))
        return n
    if isinstance(items, basestring):  # use str on Python 3
        if items in spec:
            return spec[items]
    return items

def extend_avro(items):
    # Expand "extends"/"specialize" declarations into plain Avro record definitions.
    types = {t["name"]: t for t in items}
    n = []
    for t in items:
        if "extends" in t:
            # Start from a deep copy of the superclass record...
            r = copy.deepcopy(types[t["extends"]])
            r["name"] = t["name"]
            # ...optionally narrow the types of inherited fields...
            if "specialize" in t:
                r["fields"] = specialize(r["fields"], t["specialize"])
            # ...then append the subclass's own fields.
            r["fields"].extend(t["fields"])
            types[t["name"]] = r
            t = r
        n.append(t)
    return n

avro_schema = extend_avro(json.load(example_schema)) |
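As a hedged usage sketch (it assumes the specialize and extend_avro functions above are defined in the same module), the example schema can be fed through the preprocessor and the expanded records printed:
import json

example = """
[{"name": "superclass", "type": "record",
  "fields": [{"name": "example", "type": "another_record"}]},
 {"name": "subclass", "type": "record",
  "extends": "superclass",
  "specialize": {"another_record": "more_specific_record"},
  "fields": [{"name": "example2", "type": "int"}]}]
"""

expanded = extend_avro(json.loads(example))   # extend_avro/specialize as defined above
print(json.dumps(expanded, indent=2))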
(Currently the GA4GH isn't using the JSON version of the Avro schema; however, this approach could probably be applied to the Avro IDL syntax with a bit more effort.) |
Hi all. I agree with the comment by @delagoya on the call and with @pgrosu on this thread that:
The proposed workarounds are reasonable within the constraints of the language, but I think any of these workarounds would result in Frankenstein code, especially as we plan to expand the association concept -- e.g. first to different types of feature/association/phenotype, then to computationally inferred associations, then to even more general association types that have been discussed. The solution by @tetron amounts to essentially creating an extension of Avro IDL -- e.g. a new format that supports inheritance and a parser to convert this new format into current Avro IDL. Perhaps this is a worthy endeavor if someone in our group wanted to take it on and propose it to be added to Apache Avro (no idea how difficult this is). I think this is an important discussion topic as our new use cases are clearly hitting the limits of Avro. @pgrosu , perhaps you can expand on some of the options you mentioned in your initial post? Anyone who has experience with potential solutions for inheritance modeling, please add ideas. |
I missed the call, so I may not have the full context. However, trying to work around this with some kind of extension that compiles down to Avro, or custom Python extensions, seems to me to be asking for trouble, especially if this is not implemented across the whole GA4GH. We don't want to end up implementing our own custom ORM. Mapping OO inheritance hierarchies to 'flat' relational structures has a long history; see for example some of the design patterns in Martin Fowler's Patterns of Enterprise Application Architecture, e.g. single table inheritance. We shouldn't be reinventing the wheel here - although there may be some differences - e.g. @nlwashington's elegant solution using a union to defer binding wouldn't be possible in an RDBMS. |
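For readers unfamiliar with the pattern, here is a minimal, hedged sketch of single table inheritance using Python's built-in sqlite3 module (table and column names are hypothetical): one table holds every subclass, a discriminator column records the concrete type, and subclass-specific columns are simply NULL for the other types.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature (
        id            TEXT PRIMARY KEY,
        feature_type  TEXT NOT NULL,      -- discriminator: 'gene', 'transcript', ...
        reference     TEXT,
        start         INTEGER,
        "end"         INTEGER,
        gene_symbol   TEXT,               -- only used when feature_type = 'gene'
        parent_gene   TEXT                -- only used when feature_type = 'transcript'
    )
""")
conn.execute("INSERT INTO feature VALUES ('g1', 'gene', 'chr1', 100, 5000, 'BRCA2', NULL)")
conn.execute("INSERT INTO feature VALUES ('t1', 'transcript', 'chr1', 100, 5000, NULL, 'g1')")

for row in conn.execute("SELECT id, feature_type, gene_symbol FROM feature"):
    print(row)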
@AAMargolin I will bring my approach to the Avro mailing list and see what the response is. A longer term solution (moving away from Avro) would be to consider a linked data API which is expressed using json-ld, and from which an RDF graph can automatically be extracted (permitting full reasoning about ontologies). However, the tooling and standards around this such as Hydra (http://www.markus-lanthaler.com/hydra/) are relatively immature. |
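As a hedged illustration of the linked-data direction (the terms and values below are purely illustrative, not a proposed GA4GH vocabulary), an association expressed as JSON-LD is just a JSON document plus a context that maps its keys to IRIs, which is what lets an RDF graph be extracted from it:
import json

association = {
    "@context": {
        "ga4gh": "http://example.org/ga4gh/terms/",      # hypothetical vocabulary IRI
        "subject": {"@id": "ga4gh:subject", "@type": "@id"},
        "phenotype": {"@id": "ga4gh:phenotype", "@type": "@id"},
        "evidence": "ga4gh:evidence",
    },
    "@id": "http://example.org/associations/1",
    "@type": "ga4gh:FeaturePhenotypeAssociation",
    "subject": "http://example.org/variants/1",
    "phenotype": "http://purl.obolibrary.org/obo/HP_0001250",
    "evidence": "http://example.org/evidence/1",
}

print(json.dumps(association, indent=2))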
@tetron I really like this approach, and it seems poised to become more common in the future. It would change the work here to focus more on developing a standard set of ontological relationships for genomic/health data. This would seem to suggest moving away from the SOAP-like API approach currently being followed, towards something focused on discoverable, interchangeable data and RESTful endpoints as an API. |
We're fans of semantic graph modeling, json-ld etc in our group. However, what you're proposing seems more in line with what the W3C HCLS group are doing and not really compatible with decisions taken thus far in the GA4GH. |
This will be a little long, but will hopefully provide an overview of what is possible. It will cover a lot of aspects that tie together different parts of the GA4GH system, which hopefully will spark ideas of what is possible in terms of implementation via a set of examples near the end. The scope is to start from very general ideas, and then drill down to specifics.

The Data Models

So before even tackling an API, the most fundamental design component of the GA4GH initiative is the transmitted data models. These define the minimum amount of information required to be communicated between the APIs residing on clients and server(s) - and possibly clients-to-clients - in order to fulfill the different API requirements (reference variants, read data, metadata, etc.) and their interaction. Basically we would be defining the interfaces and protocols. Keep in mind that these are the ones that are transmitted, and there will be a different set of models for what is stored. Each will have its own set of optimizations (i.e. for storage, for transmission, etc.). Regarding each type of transmitted data model, there would be methods that request data and others that receive a response to those requests. Afterwards it would be the API's job to assemble these into coherent structures. But the key here is that we want to be as minimal in the transmission as possible, since there will be a lot of communication. This might not matter much if it is from a local server to another local server, but it has a huge impact when many clients are transmitting data. Having said that, there are quite a lot of data serialization formats, and below is a list:

http://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats

When designing the data model, it is natural to think about how it is transmitted. Thus the language of the interface between two or more computers - for communicating by it - would need to be a standard for understanding each other. The more established the standard, the more ubiquitous and adopted it becomes (i.e. SAM, VCF, HTML). This is where the Interface Description Language (IDL) comes in. We are using Avro, since that is something common and decided before I joined the discussions last May, but it also makes sense since it is easy to learn and has a good set of features. It also has the JSON formatting that @tetron mentioned. Apache's Thrift is also nice in that it provides a richer set of features, but only has inheritance for services and the documentation is a little sparse. Protocol Buffers - which were developed by Google - could be all-encompassing and allow for inheritance and nested types, though the features are compacted to their most fundamental. For instance, a list in Thrift would be a

http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro

If you want to know more about Thrift, below is a nice link:

http://diwakergupta.github.io/thrift-missing-guide/

Below are the features of Protocol Buffers:

https://developers.google.com/protocol-buffers/docs/proto

So Avro is the easiest, Thrift is richer but less documented, and Protocol Buffers offers just about everything one would need, which is what I'm leaning more towards for us. The key idea is that it is most important to keep the big picture - of what the whole system might look like - in mind at all times.
Having this system-view at all times will help make concrete the types of features we want to have in our possible future implementation, and is something which I will talk about shortly. Even if these are modeled via UML or very general Protocol Buffers, it would keep us on track for the whole project. The link I put previously for WebIDL incorporates inheritance, but that is more for web browsers. There are YAML and HDF5, which are nice, but might not be ideal. Usually in industry the approach of data modeling is performed via tools such as ER/Studio, but the purpose there is different. Here the scope is much larger and global. Again, keep in mind that JSON schemas are associated to data, which means that schemas are meant for interpretation and validation of the data. The data does not have to be stored in this fashion. Again this all stems from the fact that we want to transmit the minimal necessary and sufficient information. So regarding the data models, it would be ideal to keep them as minimally necessary and sufficient as possible. This way we do not transmit as much - and the computation would be quicker - but if limitations get in the way, and we really need more complex structures, just write them how you think is most natural. This can be via UML or a diagram/drawing. Then we take that and determine the optimal way to structure and implement it. We have plenty of expertise here in this group. Keep in mind that we're not always transmitting data, but sometimes just sending a request to the server - which already has the data - such as for initiating an analysis. For instance, we might send a request to the GA4GH Alignment Subsystem with a

The Whole GA4GH Ecosystem

Now having said that, it is really important to think at the very beginning about the possible infrastructure, since that will dictate the models and expectations of what we want to do with the data. Here it is very important to think really BIG, to ensure future growth in terms of capacity-planning of features, and then specifics as a prioritized list of features to implement. As an example, I might want to select subregions of the genome and it would dynamically show disease prevalence as data is constantly being uploaded, processed and annotated. We are talking many gigabytes and possibly terabytes of transmission. Ideally we would need to plan for petabytes of throughput, which you probably heard me mention too many times over the past year. Keep in mind this large collection of systems would most likely contain different services that are optimized for their specific tasks, but would also communicate with each other. For instance, we might have a graph server that is optimized for performing analysis and searches in that domain, like performing path mapping - as noted here: #275, GFA extension, and #183. We could also have an annotation server. Then another server for just uploading and processing reads. Another one for just processing variants. Below is a diagram to drive this idea:

For quite many years people have worked on the next phase of data management systems, which are dataspaces. In this approach we have a large collection of data sources. These can be databases, files, etc. There are also relationships describing these data sources. Then from these relations the query managers interface with them. We can even separate the data from the schemas. Below is a figure describing this. One important goal is that we want data to be consistent between user sessions.
Thus, we cannot rely on clients to be online one day, and offline another. This would mean that clients probably only perform the following tasks:

1) Query the data on servers.
2) Upload, annotate, delete and change access to data on servers.
3) Initiate analysis of data on servers.

Now let's think about the data models, through the lens of some of the key requirements of the DWG - I will list the encompassing needs without loss of generality:

1) Data models would have the ability to be updated over time. This also falls under the category of schema evolution for data-warehouses/databases, which means that the schemas associated to the various types of data can change. The grammar is checked by the schema parser, and usually a default value is needed to ensure that the system keeps functioning.
2) Data and schemas are decoupled, and can be stored on separate systems. This means we do not need to reload the data.
3) We want the ability to analyze data, and to store the data and analysis results. Analysis results can be used to query other parts of the system. For instance, I have a variant's call and I want to know all samples also containing that call.
4) To ensure consistency, the data would be stored on the servers. The reason is that we do not want data to appear in one session when connecting to a client, and then disappear if the client loses power.
5) Ideally the servers should get requests for processing data, in order to ensure reproducibility. For example, a user can ask: take this

One key aspect is data freshness, in that data might take time to propagate throughout the whole system, which most likely would be distributed. The eventual distributed files that these would be stored to would contain the Access Control List for permissions, which would be stored at the file level or at the stored object/bucket level. Thus if users were to attempt to access the data at a later time, then on the first accessed file it would quickly return an error if access is denied. Again we want a scalable implementation, and I will talk about some of the options next.

The NoSQL (Not only SQL) Universe

So when we are dealing with such large volumes of data - which might be semi-structured - we are interested in querying and analyzing them, while also being scalable. This falls under the umbrella of NoSQL. These fall into four different categories:
Columnar storage can provide efficiencies in storage and queries. If one wants, one can even have protocol buffers associated to such stored data in order to aid with streaming, transforming and accessing the data. Examples of databases having this storage type are Google's BigTable, Apache HBase, and Cloudera Impala with Apache Parquet; Apache Hive also implements some of these concepts.
We most likely will go with a combination of 1 and 2. Any graph implementation in our systems would probably be customized and optimized for our needs, having its own subsystem. In fact we can have tables with structures such as in the following example. This is using Apache Hive, which I will demonstrate in the next section:
We can even grow to a massive scale with MapR, to ensure scalability and dependability:

http://doc.mapr.com/display/MapR/MapR+Overview

Even though there are many possibilities, we should still try to keep the data models as simple as possible.

Applied Example using Apache Hive

Now having said all that, let's make this an applied example using a very simple scalable implementation through Apache Hive. The scalability implementation in Apache Hive can be highly customized, and I used only the bare minimum in my example. Our possible future implementation will be much more elaborate - it would contain different implemented components, where Hive might be one of them - but this will hopefully help illustrate most of the concepts I described earlier:
Below is the JSON version of
Below are the results:
Below are the results in Hive:
If I want to, I can even alter the original table. In our implementation for GA4GH we would probably need to rewrite some code to perform schema evolution for the Avro table. One thing to keep in mind is that what is presented to the user and how it is stored do not need to be the same, since there are optimizations in storage and schema updates we can take advantage of. This would mean that new data columns can be added with a default value, or we can remove columns by just ignoring them and later garbage-collecting them. Keep in mind that data does not have to stay as JSON in the database(s)/datawarehouse(s) - it can be any format - preferably columnar-based (tree-like). We can always have converter processes/subsystems, which can transform the data into JSON, which I will demonstrate next. In any case, below is how one can alter an existing table by adding a new column called
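To make the default-value point concrete at the Avro (rather than Hive) level, here is a hedged Python sketch - the record and field names are hypothetical, and depending on the avro library version the parser is spelled parse() or Parse() - showing data written with an older schema being read back through an evolved reader schema that adds a field with a default:
import io
import json
import avro.schema
import avro.io

old_schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "Call",
    "fields": [{"name": "sampleId", "type": "string"}],
}))
new_schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "Call",
    "fields": [
        {"name": "sampleId", "type": "string"},
        {"name": "quality", "type": ["null", "double"], "default": None},  # added later
    ],
}))

# Write a record with the old schema...
buf = io.BytesIO()
avro.io.DatumWriter(old_schema).write({"sampleId": "S1"}, avro.io.BinaryEncoder(buf))

# ...and read it back with the new schema; the missing field takes its default.
buf.seek(0)
record = avro.io.DatumReader(old_schema, new_schema).read(avro.io.BinaryDecoder(buf))
print(record)   # {'sampleId': 'S1', 'quality': None}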
But really this is not as flexible as running the system as a service, with which many programs then can interact with and perform post-processing as necessary. Below is an example of a client interacting with the Hive server-service:
First I start Hive as a server:
Next I wrote two programs:
Below is how I went about performing this:
Below is the result:
Below is the actual code:
For GetDataFromHive2JSON.java:
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
import java.io.IOException;
import org.codehaus.jackson.JsonGenerationException;
import org.codehaus.jackson.map.JsonMappingException;
import org.codehaus.jackson.map.ObjectMapper;
public class GetDataFromHive2JSON {
private static String driverName = "org.apache.hive.jdbc.HiveDriver";
public static void main(String[] args) throws SQLException {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
e.printStackTrace();
System.exit(1);
}
Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
Statement stmt = con.createStatement();
String tableName = "Position";
String sql = "select * from " + tableName ;
System.out.println("Running: " + sql);
ResultSet res = stmt.executeQuery(sql);
if (res.next()) {
//System.out.println(res.getString(1) + "\t" + res.getString(2) + "\t" + res.getInt(3));
@SuppressWarnings("deprecation")
Position position = new Position( res.getString(1), res.getString(2), Integer.valueOf(res.getString(3)));
ObjectMapper mapper = new ObjectMapper();
try
{
System.out.println( mapper.defaultPrettyPrintingWriter().writeValueAsString( position ));
} catch (JsonGenerationException e)
{
e.printStackTrace();
} catch (JsonMappingException e)
{
e.printStackTrace();
} catch (IOException e)
{
e.printStackTrace();
}
}
}
}
For Position.java:
public class Position
{
private String sequenceId;
private String referenceName;
private Integer position;
public Position(){
}
public Position(String sequenceId, String referenceName, Integer position){
this.sequenceId = sequenceId;
this.referenceName = referenceName;
this.position = position;
}
public String getsequenceId()
{
return sequenceId;
}
public void setsequenceId(String sequenceId)
{
this.sequenceId = sequenceId;
}
public String getreferenceName()
{
return referenceName;
}
public void setreferenceName(String referenceName)
{
this.referenceName = referenceName;
}
public Integer getposition()
{
return position;
}
public void setposition(Integer position)
{
this.position = position;
}
@Override
public String toString()
{
return "Position [sequenceId=" + sequenceId + ", referenceName=" + referenceName + ", " +
"position=" + String.valueOf( position ) + "]";
}
}
So this was a bit long, but hopefully this will spark ideas of what might be possible in the implementation of the GA4GH system and all of its component subsystems. Please let me know what you think, since hashing out ideas at this stage will help us eventually have the best and most customizable, scalable implementation for the future, allowing us flexibility in the features we wish to incorporate. Thank you, |
@pgrosu Google's protobuf serialization was not chosen as the serialization system for the GA4GH because it is controlled exclusively by Google. While Google has been gracious enough to share the code under an open-source license, there has not been a single commit to the source repo from a non-Google employee. In contrast, Apache Avro (and Thrift) have broad community involvement and welcome anyone to make changes to these systems, including GA4GH members. The GA4GH has a goal of being vendor-neutral -- we don't want to lock ourselves into technology controlled by a single corporation. We value community as much as (maybe more than) code. There is no software silver bullet here: any system we use will have its limitations and quirks. The nice thing about Avro (and Thrift, Parquet, etc.) is that we are welcome to make changes to address those issues. |
So just a bit of context for my comment:
I think that Avro is fine as a basic structure and data serialization
I also think that if you guys want to encode business logic in the data
If I were to have a preference, I would encode a flexible and simplistic
@pgrosu apologies I have not read your full email yet, I've been traveling -angel |
Matt - I whole-heartedly agree with being vendor-neutral, though I thought that since Google is part of GA4GH we all might work at it to adapt it to our needs. I understand and agree with using open and well-supported implementations, though I'm fairly comfortable with tweaking large-scale source code and, having taken a course in designing programming languages and interpreters/compilers, I'm okay with reinventing a better wheel if the old wheels don't quite fit. I'm not pushing this, only saying that I'm flexible. I agree that there's no magic solution, and I was just using Hive as a simple, illustrative scalable example. Our system begs to have quite a variety of customized interconnected components. I'm just presenting what's currently out there and suggesting that we need to think of all the moving parts - and how they affect each other at the whole-system level - before we set too much in stone. Angel - No apologies necessary - I totally understand and no rush :) Thank you for the GMOD Sequence database schema link. The implementation is similar to something else I worked with in the past. I definitely agree that having a controlled vocabulary, like in their implementation, will systematize things and keep things within scope. The schema I think will also help with inspiring ideas of how we might want to approach some aspects of our implementation. Possibly parts of Feature Locations and Feature Location Graph might prove useful, but I need to think a little more about them. I agree that keeping our data models at a more fundamental level - without encoding business logic - would streamline our critical path. As we are still in the design phase, it might be easier to draw out what we need across the whole system, in a general way. Then we can think of what parts will run where, to determine what should be serialized and what would be implemented in other parts of the system. As a minor side-note, based on the following link, it seems that luckily more languages - aside from Java - have implemented RPC protocols using Avro. I am really happy we are having this discussion. Thanks, |
Angel Pizarro notifications@github.com writes:
Very much agreed. The discussion of extending Avro has been
I do find the lack of polymorphism in Avro frustrating.
The web API is not about storage or programming, it's about
Instead of trying to optimize the programming model by adding
While the code-generation for an API is helpful, these still
Stick to the plan. |
I agree with @diekhans here. Avro has its problems (lack of inheritance, spotty cross-language support and more), but it works well enough as a low-level serialisation format. In the reference implementation, we've written a lot of code that assumes certain patterns are followed in the schema, and it would be an awful lot of work for us (for no gain) if these patterns were changed. Can we not follow the existing pattern for inheritance? For example, In the end, it doesn't really matter to our code in the reference server or client whether this is elegantly encoded in the Avro schema via a natural inheritance mechanism or if it's nastily bashed in (as it is). We process the Avro schemas and generate our own Python code (see https://github.com/ga4gh/server/blob/develop/scripts/process_schemas.py), which properly captures the inheritance structure (and a bunch of other things which are useful for us, but wouldn't make any sense in general). As @diekhans says, developers will interact with the client side libraries, not the raw protocol. We should concentrate our energies on making the client side library programming models as easy to use as possible, using language-specific idioms. If we try to make a single programming model that will work well in Java, R, C, Perl and Python (for example) we will fail. |
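As a hedged sketch of that general idea (this is not the actual output of process_schemas.py, and the class and field names are hypothetical), the flat Avro records can be mapped onto a Python class hierarchy on the client side, so inheritance lives in the generated or hand-written library rather than in the wire schema:
# The wire schema stays flat; shared fields live in a base class that subclasses extend.
class GenomicFeature(object):
    """Fields common to every feature-like record in the (flat) Avro schema."""
    def __init__(self, id_, reference_name, start, end):
        self.id = id_
        self.reference_name = reference_name
        self.start = start
        self.end = end

    @classmethod
    def from_avro_dict(cls, record):
        # 'record' is the plain dict produced by the Avro decoder.
        return cls(record["id"], record["referenceName"], record["start"], record["end"])


class Gene(GenomicFeature):
    def __init__(self, id_, reference_name, start, end, symbol):
        super(Gene, self).__init__(id_, reference_name, start, end)
        self.symbol = symbol

    @classmethod
    def from_avro_dict(cls, record):
        base = GenomicFeature.from_avro_dict(record)
        return cls(base.id, base.reference_name, base.start, base.end, record["symbol"])


gene = Gene.from_avro_dict({"id": "g1", "referenceName": "chr13", "start": 100,
                            "end": 5000, "symbol": "BRCA2"})
print(gene.symbol, isinstance(gene, GenomicFeature))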
I for one am strongly in favor of protocol buffers. It may be true that the
core implementation is maintained exclusively by Google, but the community
adoption is widest, the on-wire format most efficient, and the feature set
most precise of any similar platform.
Additionally protobuf version 3 introduces a number of primitives that are
quite useful such as maps.
I am concerned that in this project we are seeing the perfect, ideal
solution being preferred over the expedient for political reasons. Given the
wide adoption I do not think it matters that protobuf is embedded in
Google. If they were to take it closed source an army of hackers would rise
up and fork it.
Some of the concerns that led to avro are reminiscent of those that
motivated OLPC (one laptop per child) to build a window manager and GUI in
Python. That group decided on language by following an ideology that put
the capacity of the system to be modified by its users above all other
concerns. The result was possibly the most uncomfortable user experience
ever seen in a production laptop. People who had never seen a computer
before turned away in boredom as simple apps took minutes to start. To my
knowledge very few if any students ever modified the system, and those who
did probably would have been just as capable if it had been written in C as
python.
|
Given the year that we have spent with Avro without it having been used seriously by anyone, is it too contrary to suggest a parallel protobuf representation/implementation of the same basic
Richard
|
What would we actually gain by switching to protobuf @richarddurbin? I don't particularly like Avro, but there is a definite and large cost to switching to a different IDL language now. There is nothing stopping us using protobuf as an on-the-wire format for our existing protocol, defined using Avro IDL. Once we have correctly isolated client developers from the details of the protocol using language specific libraries, the on-the-wire format can be easily changed. This is where we can make the protocol more user and developer friendly. |
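As a hedged illustration of that isolation argument (all class and method names below are hypothetical, not part of the reference client), client code can talk to a small library interface while the wire codec behind it - JSON today, something else tomorrow - is swapped without touching user code:
import json

class WireCodec(object):
    def decode_variant(self, payload):
        raise NotImplementedError

class JsonCodec(WireCodec):
    def decode_variant(self, payload):
        return json.loads(payload)

class VariantClient(object):
    def __init__(self, codec):
        self._codec = codec          # the only place that knows the wire format

    def get_variant(self, payload):
        record = self._codec.decode_variant(payload)
        return record["id"], record["referenceName"]

client = VariantClient(JsonCodec())
print(client.get_variant('{"id": "v1", "referenceName": "chr1"}'))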
First, fundamentally, there's not that huge of a difference between Avro, Thrift, and Protobuf. It's not really correct to say that no one uses Avro; Avro is arguably biggest in the Hadoop world, where Thrift is also pretty big. Each one has its pros and cons; it's unlikely that the cost of rewriting our schemas would be paid off by any benefits we would see from making the move. None of Avro, Thrift, or Protobuf is a silver bullet. Second, the choice of serialization environment seems like a distraction from the real issue here (esp. since the GA4GH REST API uses JSON on the wire). The way I see it, the real problem people are running into is the classical issue of developing a schema for semi-structured data. However, since our semi-structured schemas have a fairly small number of fields, we can handle this fairly efficiently with nested schemas that make use of optional fields. E.g., exactly what @kellrott proposed in the 3rd comment on this thread.
The various data serialization implementations all make optional fields fairly efficient on the wire. We don't need to switch to a different IDL, nor do we need to build ingenious approaches for compiling the schemas. |
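A hedged sketch of why the null branch of a union is cheap on the wire, using the Python avro library (the record and field names are illustrative only; older library versions spell the parser parse(), some spell it Parse()): an unset optional field costs one byte for the union branch index and nothing more.
import io
import json
import avro.schema
import avro.io

schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "Association",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "weight", "type": ["null", "float"], "default": None},
    ],
}))

def encoded_size(datum):
    buf = io.BytesIO()
    avro.io.DatumWriter(schema).write(datum, avro.io.BinaryEncoder(buf))
    return len(buf.getvalue())

print(encoded_size({"id": "a1", "weight": None}))   # null branch: 1 extra byte
print(encoded_size({"id": "a1", "weight": 0.5}))    # float branch: 1 + 4 bytes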
Historic note: It was Google that originally suggested Avro to the Berkeley AMPLab, which was how it became part of GA4GH. @richarddurbin I don't believe the lack of serious use is related to Avro. It's the fact that we are defining schemas but not building applications to validate them. We need a release policy that includes implementation in the reference server and conformance suite. The only way to know if an API is right is to use it. |
@ekg I just don't see what protocol buffers buys us over Avro. Protocol buffers doesn't support polymorphism either. Avro already has maps and is more dynamic, with one being able to do introspection on the schema. The lack of normative documentation for the schema is a far bigger problem than the IDL being used. My biggest wish for Avro right now is a much better documentation generator (say a Sphinx plugin). |
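For example, a hedged sketch of the kind of runtime schema introspection being referred to, using the Python avro library on a Position-like record (the field list is abbreviated, and the API spelling varies slightly between avro versions):
import json
import avro.schema   # older versions spell the entry point parse(), some spell it Parse()

schema_json = json.dumps({
    "type": "record", "name": "Position", "namespace": "org.ga4gh.models",
    "fields": [
        {"name": "sequenceId", "type": "string"},
        {"name": "referenceName", "type": "string"},
        {"name": "position", "type": "long"},
    ],
})

record = avro.schema.parse(schema_json)
print(record.fullname)                     # org.ga4gh.models.Position
for field in record.fields:                # introspect fields and their types at runtime
    print(field.name, field.type)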
I withdraw my not-sufficiently-informed comments about Avro. I am still concerned about uptake and use of our designs so far, notwithstanding Richard
|
I tend to agree with @diekhans - unless protobuf has some much needed features that are absent from avro, it would be good to stick to avro. My 2c. |
I agree with @fnothaft about the underlying problem, but take some issue with the proposed example from @kellrott and would prefer to have the refinement proposed by @nlwashington in the fourth comment (copied below for the lazy):
record Association {
string id;
org.ga4gh.models.OntologyTerm associationType;
Evidence evidence;
union {null, float} weight = null; #also need a unit
union {null, array<org.ga4gh.models.Feature>,org.ga4gh.models.Individual,org.ga4gh.models.Sample} subject = null;
org.ga4gh.models.Phenotype phenotype ; #not null
}
The only issue I have for either solution is for us to make certain that the set of [Variant, Feature, Individual, Sample, Phenotype] represents the full set of
One thing I'll add from my vantage point at Cloudera is that Avro is definitely widely used across industry, along with Thrift too. I find that Avro tends to be used more for data analytics/storage, and not at all for RPC. Thrift seems to be used less for data storage, but is widely used for RPC. Another Avro advantage(?) is that it doesn't require code-gen. IMO, the docs are lacking for both of them. |
I also think @kellrott's proposal would be ok, but lament that you lose some of the advantages of static typing. It means your code will now be riddled with checks for the type of the objects, and it'll be harder to reason about what the code is doing. |
I have to apologize that I have not brought my proposal to the Avro community yet, but want to chime in that we should not conflate two distinct issues here, which is data modeling vs data representation. Inheritance is a useful semantic relationship that can be represented a variety of ways. Avro is good for describing representations, less so for relationships. So I don't think it is unreasonable to adopt a semantically richer data modeling framework that gets translated down to a base Avro schema. |
In general (not commenting on this particular design issue), Uri Laserson notifications@github.com writes:
|
I agree with @diekhans, to test if Avro will remain viable for the future of GA4GH data, it must be tested in real life implementations. I've been discussing / developing the Matchmaker Exchange API, and implementing it on the Decipher project over the past 4-5 months. We looked at the Avro schemas, and we considered trying to stay close, or at least make it possible to convert our docs to Avro, which we did do, but I'm not sure anyone could see a benefit for using the schemas, for the MME project. I appreciate that it wouldn't make sense for the Beacon project to use said schemas, but MME is returning data. I understand (if I'm correct) that the schemas are designed to make implementing data exchange easier between parties, but if a project that is supposed to be an exemplar feels, as a group, unable to use them, then do we as a group need to evaluate that fact and re-assess? (Obviously this is a bigger question than can be addressed via github issues). I don't see the full picture, so I'm not sure if other groups are actively exchanging data using these schemas with real production data. If not, then our focus must be on the example implementation. I'm sure we all know that we learn the practicality of theory when we test its real world applications. I have no doubt that the schema design is extremely well informed! You people are experts in your fields! However I think the process we have undertaken within the MME project represents a sort of accelerated version of what GA4GH could expect on a larger, longer scale. Our first version of our API specification looked ok, but when it came to actual implementation, it was immediately clear that we needed to change parts. I strongly feel that we should be focusing on developing the example implementation server as a priority, as secondary issues will then be better informed. |
How is avro used to transmit streams of unbounded size? Billions of objects
|
@ekg correct me if I'm wrong, but Avro is about the data structure and not the transmission. A few articles from people doing this seem to confirm it (http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial https://connect.liveperson.com/community/developers/blog/2013/09/30/how-to-stream-avro-encoded-messages-with-kafka-and-storm ) |
@Relequestual, I think you're noticing the issues, as different protocols are optimized for different purposes. Also not everything has to be transmitted over the wire, and we want to keep the same minimal response-time as data grows. That usually means more machines. Basically it comes down to how long are you willing to wait for something to transmit, process and return since at least for responses currently we have it designed on pages and tokens - and some peer-2-peer architectures would not be optimal in this instance. The faster it is, then it will enable much more complex analysis such as: "Build me a probabilistic molecular evolution model from these samples", "Are there other modeled collections of k-samples which correlate with parts of my model or other models", "Has their distribution become stationary"? But having said all that, I'm a practical man. Would you by any chance have a coded data-driven implementation for Matchmaker with different topologies which we can benchmark for different implementations, with something along these lines: https://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2 https://amplab.cs.berkeley.edu/benchmark/ https://github.com/IBMStreams/benchmarks I think if we actually perform a real test on a cluster, Amazon, Google, etc - which I mentioned previously here - it would help solidify the implementation(s) we would optimally want to design for. Paul |
@pgrosu One of the immediate things we noticed with MME is that each group has their own different data, and structure. Some only have gene information, some only have phenotypic information. As for us (Decipher), we have genotypic and phenotypic data, but only variant level data (not whole sequence). I think what I'm trying to say is, what a benchmark might reveal for one group may not hold true for another. Our problem isn't the amount of data in transit, it's extracting that data and calculating metadata (like a match score) on the fly, on a web server which doesn't have access to our farm. Serialising that data and transit is almost a secondary issue for MME right now. |
@Relequestual, not sure if this might help, but several months ago I put up a post with some simple code using a few concepts from Google's MillWheel: Fault-Tolerant Stream Processing at Internet Scale, whose original paper you can download via the following link: http://research.google.com/pubs/archive/41378.pdf There I described how to process data on a stream - only via performing a checksum for validation purposes - but as you noticed from other links, there are many ways of integrating with Kafka, Storm, Spark, etc. It might be easier to show us some code you are running into trouble with, when trying to score in parallel on a stream as you are trying to scale. Thanks, |
@pgrosu I don't have the level of data in the project I work on that warrants streaming data. The costly time in the request is processing, not transmission. We aren't running into any problems here. We spent a few days optimising the scoring code. It's currently not run in parallel, which would obviously be a huge speed increase if speed were an issue for us. |
@Relequestual Just curious, where is the scoring code located? Thanks, |
@pgrosu Information on the phenotype scoring can be found at https://github.com/MatchMakerExchange/mme-apis/wiki/Phenotype-matching |
@Relequestual Thank you for the link and information. No worries - whenever it's ready to be released is fine :) I was just curious about the code in case some ideas might come to me that could be beneficial to the group. Thanks, |
Re the historical note from @diekhans -- here's the original machine-readable schema: Avro or protobuf? thread from March 2014. My summary (with quotes from @massie):
Two more recent thoughts on our Google-specific implementation:
|
Hi David, Thank you for pointing us to the original discussion - since I didn't see it before - and I definitely enjoyed reading the fence jumping adventure :) Maybe we could discuss here different ideas regarding wire-serialization and storage implementations we could experiment with. I'm sure others might be interested in that as well. I would definitely be interested in hearing more of the streaming approach you guys have implemented. Something quick that comes to mind that might work here would be a distributed hash table (DHT). In that instance the peers would connect to each other (P2P) on demand for the stream of reads/variants that are being requested. Then one can just perform an offset with a limit into the stream that they would like, which would be pretty neat. I think Mathias Buus gave a talk about this last year. I put together an example below, which illustrates a subset of that idea:
$ cat readstats.avro
{
"type" : "record",
"name" : "ReadStats",
"fields" : [ {
"name" : "alignedReadCount",
"type" : [ "null", "long" ],
"doc" : "The number of aligned reads.",
"default" : null
}, {
"name" : "unalignedReadCount",
"type" : [ "null", "long" ],
"doc" : "The number of unaligned reads.",
"default" : null
}, {
"name" : "baseCount",
"type" : [ "null", "long" ],
"doc" : "The total number of bases.\n This is equivalent to the sum of `alignedSequence.length` for all reads.",
"default" : null
} ]
}
$
import sys
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
import subprocess
schema = avro.schema.parse(open("readstats.avro").read())

# Toy "distributed hash table": each record is written to its own Avro container
# file, keyed by the file's md5 checksum, with a second index from record number
# to checksum.
dist_hash_table = {}
dht_indices = {}

for i in range(0, 3):
    filename = "readstats_" + str(i) + ".json"
    writer = DataFileWriter(open( filename, "w"), DatumWriter(), schema)
    writer.append({"alignedReadCount": 10*i, "unalignedReadCount": 30, "baseCount": 50})
    writer.close()
    system_command = subprocess.Popen(['md5sum', filename], stdout=subprocess.PIPE)
    system_out, system_err = system_command.communicate()
    dist_hash_table[ system_out.split()[0] ] = filename
    dht_indices[ i ] = system_out.split()[0]

def seek_readstats( offset, limit ):
    # Stream back the records whose indices fall in [offset, offset + limit).
    for i in range( offset, offset + limit ):
        if( i in dht_indices.keys() ):
            reader = DataFileReader(open(dist_hash_table[ dht_indices[i] ], "r"), DatumReader())
            for readstats_contents in reader:
                print readstats_contents
            reader.close()

def main():
    print("\nReturn records starting at index " + sys.argv[1] + " with a limit of " + sys.argv[2] + ":\n")
    seek_readstats( int(sys.argv[1]), int(sys.argv[2]) )
    print("")

if __name__ == '__main__':
    main()
$ python stream_readstats.py 0 2
Return records starting at index 0 with a limit of 2:
{u'unalignedReadCount': 30, u'alignedReadCount': 0, u'baseCount': 50}
{u'unalignedReadCount': 30, u'alignedReadCount': 10, u'baseCount': 50}
$ python stream_readstats.py 2 12
Return records starting at index 2 with a limit of 12:
{u'unalignedReadCount': 30, u'alignedReadCount': 20, u'baseCount': 50}
$ The datastores that have these hashed key-value paired subsets of the data can be distributed and replicated. The clients and servers would thus be able to connect to each other for on-demand streaming of the portions they are requesting. In theory it should become faster the more clients and servers are online. I'm sure your implementation is much more elaborate :) Looking forward to seeing it, |
Hi David, I noticed that in my earlier post I forgot to address how to code streams in Avro. We could implement the fundamentals of Lisp/Scheme/Functional Programming, and then we have streams as a natural extension: For instance the
We know that
I didn't include the code for So voilà, we have streams in Avro! :) Yes it is super-general and maybe we should consider other approaches of modeling our schemas, since Avro is made mainly for how to structure data and not for modeling the specifics of how data would be processed. Let me know what you think. Thanks, |
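As a hedged sketch of the cons/car/cdr idea in Avro terms (this mirrors the recursive LongList example from the Avro specification rather than anything omitted above), a recursive record models a linked list, which is the usual starting point for building streams functionally:
import json
import avro.schema   # parsing checks that the recursive schema is valid Avro

cons_schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "LongList",
    "fields": [
        {"name": "car", "type": "long"},
        {"name": "cdr", "type": ["null", "LongList"], "default": None},
    ],
}))
print(cons_schema.name)

# Build (1 2 3) as nested records and walk it with car/cdr accessors.
cell = None
for value in (3, 2, 1):
    cell = {"car": value, "cdr": cell}

def car(c): return c["car"]
def cdr(c): return c["cdr"]

node = cell
while node is not None:
    print(car(node))
    node = cdr(node)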
Just noting that we have now officially moved to protocol buffers. The topic branch will merge to master very shortly. Does that mean we can close this issue? |
Sounds like it. Closing. |
A topic that has come up in several task teams is the need to use inheritance concepts to model class hierarchies in our data models. This has come up specifically in the G2P and annotation task teams, as each is trying to define representations of abstract concepts (e.g. genes, genomic features, transcripts) that clearly relate to each other and are best represented using object models that support subclassing.
Avro, on which we currently rely, does not support inheritance, so this thread is intended to discuss whether and how GA4GH schemas may support inheritance. Possibilities suggested during task team discussions have included: 1) limit scope to data serialization and not modeling of class hierarchies; 2) use a parallel system for defining our object models, including inheritance, and sync this system up with Avro to define data serialization; 3) use a system other than Avro that can represent both class hierarchies and data serialization. Of course, please suggest other possibilities as well.