graph subsetting operations and serialization format #275

ekg · 2015-04-02T13:59:39Z

I'd like to query a part of the graph using coordinates against an arbitrary reference path encoded in it.

I have a implementation of this kind of feature that can serve as a conceptual starting point. In vg you'd do this like vg find -n 314159 -c 10 x.vg >q.vg, and q.vg would then contain a subgraph with all nodes and edges reached in a breadth first search of 10 steps starting at node 314159. I use a context size defined in nodes rather than edges because it is simpler in many situations, and in practice I construct graphs with a maximum node size of 50bp using vg construct ... -m 50, but it seems possible to do this using an actual base distance rather than a number of steps in a BFS. You can also do vg find -p chr19:100-200 x.vg to get the graph overlapping a particular range in any named path that has been stored in the graph (in this case chr19). vg find can also take a specific kmer and return the subgraph associated with that kmer (which may not be fully-connected).

There are other typical operations that I've found useful when generating subsets of variation graphs, which are perhaps best summarized by the interface to vg find:

usage: vg find [options] <graph.vg> >sub.vg
options:
    -n, --node ID          find node, return 1-hop context as graph
    -f, --edges-from ID    return edges from node with ID
    -t, --edges-to ID      return edges from node with ID
    -k, --kmer STR         return a graph of edges and nodes matching this kmer
    -T, --table            instead of a graph, return a table of kmers
    -c, --context STEPS    expand the context of the kmer hit subgraphs
    -s, --sequence STR     search for sequence STR using --kmer-size kmers
    -j, --kmer-stride N    step distance between succesive kmers in sequence (default 1)
    -z, --kmer-size N      split up --sequence into kmers of size N
    -C, --kmer-count       report approximate count of kmer (-k) in db
    -p, --path TARGET      find the node(s) in the specified path range TARGET=path[:pos1[-pos2]]
    -P, --position-in PATH find the position of the node (specified by -n) in the given path
    -r, --node-range N:M   get nodes from N to M
    -d, --db-name DIR      use this db (defaults to <graph>.index/)

My point is that the natural way to interrogate the graph involves queries over kmers, sequences, path positions, node ids, edge ids, and node id ranges. The return from such a query should be a subgraph, and ideally one with a 1:1 mapping to components in the source graph.

Things are pretty straightforward in variation graphs (as in vg). These are collections of three types of objects: (1) nodes with sequences, (2) edges between node ends representing allowed linkages, and (3) paths over nodes and edges which define coordinate systems and names. Making subsets of this structure is trivial. You just collect the nodes, connected edges, and path annotations to create a subgraph that is a strict subset of the components of the original graph.

But how do we generate subgraphs of a side graph?

There seem to be some conceptual hurdles to generating subsets from the side graph model. Useful side graph subsets are not directly equivalent to subsets of the components original side graph. If nodes are long (and some in the side graph are effectively as long as whole chromosomes) then we need to generate new synthetic nodes that start and end within those from the graph we are querying.

As joins are defined with node-relative coordinates, then the entire coordinate system within the subgraph needs to be modified relative to the original graph in order to maintain a consistent structure.
Alternatively, we could maintain offsets relative to the original nodes in our subgraph representation to allow mapping back into the original coordinate space.
Yet another possibility would be to define the subset as a side graph built on the original graph. This would seem to introduce new problems by making it impossible to relate sides within this directly back to the original graph.

These are not insignificant issues. We can pick a solution among these and it will work. However, it will add nontrivial complexity to the model. Subsetting the graph is just about the most basic thing you can do, and it should not present a conceptual hurdle to users and developers.

This is not just a theoretical critique. Some months ago I began to implement (2) but stopped as it required the addition of additional fields defining relative offsets to "real" nodes to each component of the graph. The semantics of the graph would then change whether or not a node was in "offset" space. The coordination of these offsets seemed extremely likely to generate bugs. (As described above, I resolved the problem of graph subsets by breaking up the graph into pieces of a consistent maximum size.)

The side graph makes great sense to me as a way to define a coordinate system, but very little sense as the basis for the data model that's used to store, query, and transmit the graph.

There is a reasonable alternative. We can maintain a side graph coordinate system to provide distinct names for all the sequences and paths in the graph. However, this can be a purely virtual thing, in that we can allow any underlying graph structure to be returned from queries. Provided the nodes and edges in a sequence/variation graph are annotated with their coordinates in the side graph, we will get all the benefits of the side graph without the awkwardness of defining standard algorithms to manipulate it.

The text was updated successfully, but these errors were encountered:

adamnovak · 2015-04-02T18:58:14Z

I think you're missing an option (4) in your list, which is to just work in the coordinate space of the original graph, even when looking only at a subgraph. The subgraph isn't made out of the same types as the full graph (since e.g. two regions of the same Sequence can't be different Sequences in the subgraph), but do regions and reference sets really need to support the same operations?

My proposed subgraphRadius() method in #269 just returns the Sequences and Joins that are touched by the subgraph you are asking about. So if part of a Sequence is involved in your subgraph, you get the full Sequence record (which isn't any larger than a Segment would be).

Since that makes it a bit difficult to know what part of the Sequences you get back was actually in the range you were asking about (without walking the returned graph again yourself), we might want a type that defines a subgraph, consisting of multiple Segments (in the coordinate space of the underlying full graph) and some Joins.

lh3 · 2015-04-02T19:21:20Z

Conceptually, I more like to think a side graph as a graph with each base as a node. Each node/base in this graph has a unique and persistent label such as chr1:1012 (perhaps plus orientation). When we do a vg-like traversal, the command line would be sg find -s chr1:1012:5 -c 10. This means we start from the 5'-end of base chr1:1012 and extend over at most 10 bifurcations with BFS. Or we may replace -c 10 with -l 5 to extend at most 5bp. Suppose we have a SNP at chr1:1014. ~~With (2), the output should be:~~

S={"chr1_1012_1017":AGCCG,"snp1":T}
J={<"snp1:0:5","chr1_1012_1017:1:3">,<"snp1:0:3","chr1_1012_1017:3:5">}

~~Here as this 5bp sequence is named as chr1_1012_1017, each base on it has the same label as the reference graph.~~

@adamnovak's solution (4) may be better (his comment just popped up). We can change the definition of S a little and get such a subgraph:

S={(chr1,1012,AGCCG),(snp1,0,T)}
J={<"snp1:0:5","chr1:1013:3">,<"snp1:0:3","chr1:1015:5">}

Now S is a set of 3-tuples (name,offset,sequence). (S,J) describes the local topology.

EDIT: simplified the example a little bit.

lh3 · 2015-04-02T19:24:25Z

In addition, @ekg, how can we know node 314159 in vg is a reference or a named allele? ~~How should we specify the reference coordinate to search its local topology?~~ When we add new variants or merge vg graphs, will node 314159 be split? i.e., is node immutable with incremental graph construction?

benedictpaten · 2015-04-03T01:01:23Z

On Thu, Apr 2, 2015 at 12:21 PM, Heng Li notifications@github.com wrote:

Conceptually, I more like to think a side graph as a graph with each base
as a node. Each node/base in this graph has a unique and persistent label
such as chr1:1012 (perhaps plus orientation). When we do a vg-like
traversal, the command line would be sg find -s chr1:1012:5 -c 10. This
means we start from the 5'-end of base chr1:1012 and extend over at most 10
bifurcations with BFS. Or we may replace -c 10 with -l 1000 to extend at
most 1000bp. Suppose we have a SNP at chr1:1512. With (2), the output
should be:

S={"chr1_1012_2012":1kb-sequence,"snp1":T}
J={<"snp1:0:5","chr1_1012_2012:499:3">,<"snp1:0:3","chr1_1012_2012:501:5">}

Here as this 1kb-sequence is named as chr1_1012_2012, each base on it has
the same label as the reference graph.

@adamnovak https://github.com/adamnovak's solution (4) may be better
(his comment just popped up). We can change the definition of S a little
and get such a subgraph:

S={(chr1,1012,1kb-sequence),(snp1,0,T)}
J={<"snp1:0:5","chr1:1511:3">,<"snp1:0:3","chr1:1513:5">}

Now S is a set of 3-tuples (name,offset,sequence). (S,J) describes the
local topology.

Yes, I like Adam's suggestion too - a subgraph in a sidegraph is a set of
segments and a set of joins, such that all the joins connect positions
within the set of segments (no edges to nowhere). Adam, I suggest changing
your pull request to reflect that.

—
Reply to this email directly or view it on GitHub
#275 (comment).

haussler · 2015-04-03T03:23:21Z

This sounds really sensible.

On Thu, Apr 2, 2015 at 6:01 PM, Benedict Paten notifications@github.com
wrote:

On Thu, Apr 2, 2015 at 12:21 PM, Heng Li notifications@github.com wrote:

Conceptually, I more like to think a side graph as a graph with each base
as a node. Each node/base in this graph has a unique and persistent label
such as chr1:1012 (perhaps plus orientation). When we do a vg-like
traversal, the command line would be sg find -s chr1:1012:5 -c 10. This
means we start from the 5'-end of base chr1:1012 and extend over at most
10
bifurcations with BFS. Or we may replace -c 10 with -l 1000 to extend at
most 1000bp. Suppose we have a SNP at chr1:1512. With (2), the output
should be:

S={"chr1_1012_2012":1kb-sequence,"snp1":T}

J={<"snp1:0:5","chr1_1012_2012:499:3">,<"snp1:0:3","chr1_1012_2012:501:5">}

Here as this 1kb-sequence is named as chr1_1012_2012, each base on it has
the same label as the reference graph.

@adamnovak https://github.com/adamnovak's solution (4) may be better
(his comment just popped up). We can change the definition of S a little
and get such a subgraph:

S={(chr1,1012,1kb-sequence),(snp1,0,T)}
J={<"snp1:0:5","chr1:1511:3">,<"snp1:0:3","chr1:1513:5">}

Now S is a set of 3-tuples (name,offset,sequence). (S,J) describes the
local topology.

Yes, I like Adam's suggestion too - a subgraph in a sidegraph is a set of
segments and a set of joins, such that all the joins connect positions
within the set of segments (no edges to nowhere). Adam, I suggest changing
your pull request to reflect that.

—
Reply to this email directly or view it on GitHub
#275 (comment).

—
Reply to this email directly or view it on GitHub
#275 (comment).

ekg · 2015-04-03T10:43:08Z

I think you're missing an option (4) in your list, which is to just work in the coordinate space of the original graph, even when looking only at a subgraph.

@adamnovak, it seems this is the same as my:

(2) Alternatively, we could maintain offsets relative to the original nodes in our subgraph representation to allow mapping back into the original coordinate space.

I'm starting from the same premise, but focusing on the fact that we need a way to maintain the same coordinate space even if we are working on a tiny fragment which is not strictly identical to any piece of the larger graph.

This introduces some practical problems, such as when I align something to the subgraph, I'll need to transform its mapping coordinates into the global space to put it in the context of the entire graph in my particular version of the reference. Also, it makes merging subgraphs from the same system cumbersome. If I only ever work with whole nodes in my subsets, then I can just take the union of nodes and edges (using internal ids as identifiers) in a set of graphs and make a larger one that is strictly part of the source graph.

@lh3 answered this concern by noting that nodes could be single bases. If so, then we could say we are working with a true subset. There are some practical reasons not to force nodes to be single bases, but in the long term they may tend to become smaller. If we collapse successive bases into single nodes when there are no bifurcations, we see about 11bp per node in 1000Gp3 + GRCh37. In other species with large effective population sizes (such as Anopheles), we already have higher rates of variation and consequently smaller nodes in the graph.

The subgraph isn't made out of the same types as the full graph (since e.g. two regions of the same Sequence can't be different Sequences in the subgraph), but do regions and reference sets really need to support the same operations?

Yes, they should support the same operations, and they should have the same serialization format. Having different models will limit the kinds of synergies that are possible. For example, if the two are the same, then I can extract a portion of a graph, write it to another index, and then treat it like a new reference without any problem. If they are different, then we need conversions between the two models for this to be possible.

ekg · 2015-04-03T11:03:01Z

@lh3 asked:

how can we know node 314159 in vg is a reference or a named allele?

(In vg) we can know if it's a named allele if it resides in any paths. A path is just a series of nodes and edges. Each node can have many (or no) names. Named paths are effectively annotations on nodes.

As I understand, paths of this form can be equivalent to sides. I can encode a side graph coordinate system in them without any change to the underlying graph in order to maintain a common coordinate space with other graphs.

When we add new variants or merge vg graphs, will node 314159 be split? i.e., is node immutable with incremental graph construction?

A node id is not meant to be something stable or canonical. It's just an internal thing that may be useful for working on a particular version of the reference system. If new things are added, then it is likely to change. I apologize that this might have made things confusing in this example.

Although it is not required, in my implementation the node id is exploited as a partially-ordered coordinate. This is extremely helpful in a number of situations and has a natural relationship to the coordinate systems that have been discussed. But, the rank in the partial order can't be a stable identifier if the graph changes.

richarddurbin · 2015-04-03T12:49:42Z

What I like about side graphs is that the components are stable as you add more variation. We can all refer to well-known sequences which we call References.
In Erik's vg graphs his nodes are not stable, they can be split. He gets stability in his paths. i.e. for him References are paths. As he says, the nodes are internal.
If Erik restricts himself to non-overlapping Reference paths then the models are very similar. Erik can keep an index for each of his nodes as to which path it is in,
with which offset, and hence provide a side graph interface by returning these not his nodes, and can read in side graphs and split them into his internal nodes
keeping the longer side sequences as paths.

I don't see Erik's problems with subgraph operations on side graphs. There are two solutions, either (1) having offsets as Adam and Heng propose, or (2) just
returning a true subgraph including whole chromosome objects.

(1) I do quite like the suggestion of sequences in the side graphs having an offset, i.e. being a segment. I think this corresponds to what Heng is saying. If we
go this way, why not extend this to all sequences? From memory Adam added something like this to the definition of Reference, so he could refer to a segment
of a chromosome in a local ReferenceSet. This allows coordinates to be stable within subgraph primary sequence objects that do not extend the length of the
whole chromosome.

(2) The other alternative is to return as a subgraph of a side graph a true subset of sequences and joins. This would mean that you get a sequence object
for the whole primary reference chromosome when you ask for a local subgraph around a locus. But so what? It is just a structure, the same size as a sequence
object for a SNP. You don't need to get the entire 200 million base sequence corresponding to the object. In this world all radius/diameter operations would
be in base units. I would not allow the -c option, which would get all the SNPs all the way along the whole chromosome. I think this is natural and correct -
what you care about is base distance context/size, not graph complexity. Why care about how many graph steps you are away? This way combining subgraphs
is natural - you just take the union of the set of sequences and the union of the set of joins. When you want to realise the base strings you could either be lazy
and get whole chromosomes, or be a bit more sophisticated and keep track of the window of interest and retrieve just the segment of interest and keep track of
coordinates as in (1), but this would be a client choice. Erik's client working internally in vg would support the local representation.

I slightly prefer (2) as conceptually simpler. Seen another way, (1) exposes some internal workings that are more complex and not strictly necessary.

Finally, Erik's point about the nodes having a natural order is important. For him this gives the partial order on his graph. For the side graph we also wanted
this to give unique natural paths back to primary references. For this reason, I would require sequences in a graph to have a positive integer identifier, to
enforce this order.

On 3 Apr 2015, at 12:03, Erik Garrison notifications@github.com wrote:

@lh3 https://github.com/lh3 asked:

how can we know node 314159 in vg is a reference or a named allele?

(In vg) we can know if it's a named allele if it resides in any paths. A path is just a series of nodes and edges. Each node can have many (or no) names. Named paths are effectively annotations on nodes.

As I understand, paths of this form can be equivalent to sides. I can encode a side graph coordinate system in them without any change to the underlying graph in order to maintain a common coordinate space with other graphs.

When we add new variants or merge vg graphs, will node 314159 be split? i.e., is node immutable with incremental graph construction?

A node id is not meant to be something stable or canonical. It's just an internal thing that may be useful for working on a particular version of the reference system. If new things are added, then it is likely to change. I apologize that this might have made things confusing in this example.

Although it is not required, in my implementation the node id is exploited as a partially-ordered coordinate. This is extremely helpful in a number of situations and has a natural relationship to the coordinate systems that have been discussed. But, the rank in the partial order can't be a stable identifier if the graph changes.

—
Reply to this email directly or view it on GitHub #275 (comment).

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

lh3 · 2015-04-03T13:57:31Z

@ekg your (2) and @adamnovak's (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin wants to go a step further by even not specifying the offset.

@richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work.

ekg · 2015-04-03T14:10:48Z

@lh3 I think there is a misunderstanding. I'm saying that if you have a serialized version of a subset the side graph, you'll want to specify the source offset in the original graph, or it won't be possible to project things from the subset back into the general graph coordinate space. This offset effectively propagates across all the entities in the subgraph, whether you record it everywhere or not. I can't use the original coordinates on a subset of the graph any more than we can use the whole-genome relative coordinates when working with substrings from the reference genome. So as far as I can tell, this is in principle exactly the same as what @adamnovak is suggesting.

The difference is one of perspective. In practice "just marking the sequence as a subsequence" must be accounted for within each method that works with the graph entity. This adds complexity without obvious benefit. Now every element would need to have an optional offset field, or we need to have two different types of elements, one with offsets and one without.

ekg · 2015-04-03T14:39:58Z

@richarddurbin observes that

If Erik restricts himself to non-overlapping Reference paths then the models are very similar.

It seems there is a conflict here because I would like to encode many different overlapping coordinate systems in the graph. A simple case would be two canonical reference sequences. Encoding both in the same context is beneficial because it means that everything on the graph can be surjected into either reference. A more complex situation would be an overlapping set of references relative to exons, gene starts, and other genomic features. Both of these needs can be met provided we have named paths with annotations. I've implemented this in vg, and I've been discussing the correct way to extend GFA format to support it with @lh3 and @jts.

How can I encode this in a side graph? It does not seem possible without adding another annotation mechanism. As far as I can tell, such a mechanism would be equivalent to vg paths, which are sufficient to define any number of virtual side graphs on a sequence graph.

The current approach mixes the coordinate system and the graph representation. The result is less flexible than if we decoupled them.

richarddurbin · 2015-04-06T07:38:57Z

@lh3 The only sequences I can see as a problem for obtaining the full sequence with my proposal are the well-known primary chromosome
references, and I am happy to cache them locally. Then surely my proposal is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph you need to remap to a new set of paths. You have implemented
path mapping, which is powerful, but I suggest that this is not simpler than having an offset on sequence segments as @lh3 and @adamnovak
propose. In any case, I still think that given you have this, mapping between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li notifications@github.com wrote:

@ekg https://github.com/ekg your (2) and @adamnovak https://github.com/adamnovak's (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak https://github.com/adamnovak just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin https://github.com/richarddurbin wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work.

—
Reply to this email directly or view it on GitHub #275 (comment).

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

haussler · 2015-04-06T16:54:47Z

I would agree with @richarddurbin https://github.com/richarddurbin that
because of its simplicity, the "simple subgraph" of a side graph that is
merely a subset of the sequences and a subset of the joins will be very
popular. As he points out, you can combine these by taking a simple union.
Sure they include some extra bases due to the fact that you always have
whole sequences, not just the parts you need, but this small drawback is
outweighed by their simplicity. I think we should definitely support these,
because it is inevitable that they will be popular.

@ekg https://github.com/ekg is bringing up other very important issues
that we should discuss. These revolve around operations that go back and
forth between paths and subgraphs. He is right that another important type
of subgraph, commonly used in graph manipulations, is the the minimal
subgraph supporting a given set of paths. This could also be given as a
simple subgraph (i.e. set of sequences and joins), allowing as above this
subgraph structure to also contain extra bases as needed so as not to break
sequences.

However, @ekg https://github.com/ekg points out another issue, which is
the need for an operation to transform from one set of reference sequences
to another, either virtually or literally. He mentions conversion from one
reference version to another. I also ran into this need when working with
canonical forms and isomorphism. Starting with a given side graph, and a
set of disjoint paths on that side graph, one would like to be able to
define a new side graph in which the roles of sequences and paths were
swapped: what originally were paths become sequences, and what originally
were sequences become paths. Note: here we must be able to convert explicit
joins to implicit joins and vice-versa. I think somebody will inevitably
want to support this too, and efficiency will perhaps be an issue.

The deeper issue, I think also broached by @ekg https://github.com/ekg,
is what to do when the paths are not disjoint, but rather share a base
(i.e. node). If you require that what was originally a path become a
sequence, this forces a split. When a node is split in a graph, all its
adjacent edges are copied, and when two nodes are merged, any pair of
repeated edges that are produced are merged as well. These operations are
not recursive. We could imagine somebody might be supporting them on side
graphs at some time in the future, and we might want to make their job
easier. When is the time to discuss this?

-D

On Mon, Apr 6, 2015 at 12:39 AM, Richard Durbin notifications@github.com
wrote:

@lh3 The only sequences I can see as a problem for obtaining the full
sequence with my proposal are the well-known primary chromosome
references, and I am happy to cache them locally. Then surely my proposal
is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph you
need to remap to a new set of paths. You have implemented
path mapping, which is powerful, but I suggest that this is not simpler
than having an offset on sequence segments as @lh3 and @adamnovak
propose. In any case, I still think that given you have this, mapping
between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li notifications@github.com wrote:

@ekg https://github.com/ekg your (2) and @adamnovak <
https://github.com/adamnovak>'s (4) are different. In my understanding,
you offset everything, including sequences, joins and the coordinate
system. You need to apply offset everywhere when we translate between the
whole graph and the subgraph. @adamnovak https://github.com/adamnovak
just marks a sequence in the subgraph as a subsequence of the whole graph.
We don't need to offset joins and most importantly we use the same
coordinate system. @richarddurbin https://github.com/richarddurbin
wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a
subgraph, we would usually like to retrieve subsequences included in this
subgraph. Having an offset field will make this easier. Of course you can
retrieve subsequence later with your proposal but getting the coordinates
needs extra work.

—
Reply to this email directly or view it on GitHub <
https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

—
Reply to this email directly or view it on GitHub
#275 (comment).

ekg · 2015-04-07T18:13:10Z

I guess I am still confused about the design decisions forced by side graphs. If a labeled sequence graph (nodes have sequences/labels, edges are linkages) with named paths is enough to provide a general, stable coordinate system, then what are we gaining by introducing the novel side graph concept?

The former is extremely easy to explain and has an existing open source implementation which satisfies the basic requirements driving this group's work. It also has a deep relationship with a large body of work on assembly graphs and multiple sequence alignments. The latter has no existing implementation and it remains to be seen if it is a sensible way to arrange things.

Side graphs seem to complicate the process of adding multiple coordinate systems to the same sequence graph. They also appear to make it more complex to extract regions from the graph in that the resulting subgraphs are not semantically equivalent to subsets of the original side graph. However, they do seem to be an ideal way to generate the names of paths in a labeled graph, which is something that has not been done before.

In summary, I'm skeptical that we should use side graphs as a literal interchange format.

Perhaps we should table this discussion and come back to it when there is a working side graph implementation that we can use to align genomes. I suspect the process of implementing one will explain a lot of my concerns in a more convincing way than I can here. If it can be shown that they are a more effective way to work with pan-genome representations, I will be happy to change what I am doing to match. We have to convince a community to follow this effort, and I don't think it will if we aren't proposing something sensible.

benedictpaten · 2015-04-07T18:28:03Z

@richarddurbin / @lh3 / @anovak / @haussler - I think it would be better to
formulate a sub-graph of a side graph in practice as a set of segments and
joins, rather than a set of sequences and joins.

The pros of this are:

(1) For a given position + radius query the client gets back exactly the
subsequences within the radius - there is no need to recompute this on the
client.

(2) We can use segment+join subgraphs to describe annotations, such as
exons or introns, that include the variation that's present in the
population. The sequence+join subgraph, which does not bound the
subsequences, does not make it easy to express this.

The cons (as I see it) are:
A small increase in the amount of data that needs to be transferred between
server and client, because you need to specify the additional subsequence
data.

Did I miss anything compelling as to why we wouldn't make the job of the
client a little easier?

On Mon, Apr 6, 2015 at 12:38 AM, Richard Durbin notifications@github.com
wrote:

@lh3 The only sequences I can see as a problem for obtaining the full
sequence with my proposal are the well-known primary chromosome
references, and I am happy to cache them locally. Then surely my proposal
is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph you
need to remap to a new set of paths. You have implemented
path mapping, which is powerful, but I suggest that this is not simpler
than having an offset on sequence segments as @lh3 and @adamnovak
propose. In any case, I still think that given you have this, mapping
between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li notifications@github.com wrote:

@ekg https://github.com/ekg your (2) and @adamnovak <
https://github.com/adamnovak>'s (4) are different. In my understanding,
you offset everything, including sequences, joins and the coordinate
system. You need to apply offset everywhere when we translate between the
whole graph and the subgraph. @adamnovak https://github.com/adamnovak
just marks a sequence in the subgraph as a subsequence of the whole graph.
We don't need to offset joins and most importantly we use the same
coordinate system. @richarddurbin https://github.com/richarddurbin
wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a
subgraph, we would usually like to retrieve subsequences included in this
subgraph. Having an offset field will make this easier. Of course you can
retrieve subsequence later with your proposal but getting the coordinates
needs extra work.

—
Reply to this email directly or view it on GitHub <
https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

—
Reply to this email directly or view it on GitHub
#275 (comment).

ekg · 2015-04-07T20:47:59Z

@benedictpaten What are the benefits of using side graphs as a serialization format?

What can we do with a side graph that we can't with a labeled graph where labels may be sequences, names, or other annotations?

The labeled graph does not have the restriction of being hierarchical. This is a fine property for a coordinate system but I think we should be cautious about making the side graph the basic currency of the system because this will limit applications. It may also add complexity to the processing of the reference system and subsets of it.

anovak · 2015-04-07T20:58:07Z

Hi Gents,

I seem to be getting looped in as anovak (Anthony Novak), so Adam Novak may not be getting these notifications. Cheers

benedictpaten · 2015-04-08T00:54:47Z

So sorry - will use correct handle in future, which is @adamnovak

On Apr 7, 2015 1:58 PM, "anovak" notifications@github.com wrote:

Hi Gents,

I seem to be getting looped in as anovak (Anthony Novak), so Adam Novak
may not be getting these notifications. Cheers

—
Reply to this email directly or view it on GitHub
#275 (comment).

benedictpaten · 2015-04-08T01:13:23Z

On Tue, Apr 7, 2015 at 1:48 PM, Erik Garrison notifications@github.com
wrote:

@benedictpaten https://github.com/benedictpaten What are the benefits
of using side graphs as a serialization format?

I think:

(1) Sufficiency. Ability to properly represent connections between the ends
of molecules. This is not a unique property of side graphs. String graphs,
side graphs, insert graphs, break point graphs, bidirected graphs - all
these graphs can be used to represent the fact that DNA strings have two
ends (5' and 3'), and distinguish between connections to each end. I think
@lh3 made this point earlier in the thread, but graphs in which nodes are
labeled with sequences and edges are simple pairs or pair-sets do not have
that property. I interpret "labeled graph" to fall into this latter
category.

(2) Compactness. If we assume that we ultimately want a graph with all
"common" (I appreciate this is a nebulous term) snps, indels and structural
variations then, assuming snps outnumber everything else, we end up with a
graph with around 50 million sequences, and 100 million joins. Not too bad.
If we break every time we have a SNP, as in a string graph, then we have 4x
more sequences, roughly speaking. Perhaps I'm way off, not sure.

(3) Natural extension. The existing reference is a special case of the side
graph, containing just the primary reference sequences. This gives us a
nice backwards compatible path from where we are today to a richer graph
world tomorrow.

(4) Translatable. There are some simple algorithms to go from a string
graph to a side graph and back again, resulting in the same original string
graph. @lh3 made this point, and exposed the issue with zero length
sequences (which are not allowed in side graphs), that were present for
insert graphs and which made this problematic.

(5) Equivalence testing. @haussler has pointed out that ranked side graphs
can be tested for equivalence quickly - which is a very nice property to
have. No general subgraph isomorphism problem in that case.

Disagree? Did I miss some?

What can we do with a side graph that we can't with a labeled graph where

labels may be sequences, names, or other annotations?

The labeled graph does not have the restriction of being hierarchical.
This is a fine property for a coordinate system but I think we should be
cautious about making the side graph the basic currency of the system
because this will limit applications. It may also add complexity to the
processing of the reference system and subsets of it.

—
Reply to this email directly or view it on GitHub
#275 (comment).

lh3 · 2015-04-08T02:06:20Z

@ekg I have not thought of multiple coordinate systems. It is a good catch. However, I am not sure about the use case of that. It would be overcomplicated if we encode multiple versions of the genome. If you are thinking of the coexistence of genomic and transcriptomic coordinates, I think it is equally nature to annotate a side graph.

Side graph and properly path-labeled string graph are conceptually equivalent and can be easily converted to each other. A major difference is that side graph has one coordinate system throughout, but labeled string graph has two, the native unstable node coordinate and the added stable path coordinate. The two coordinate systems lead to two concerns. Firstly, most of time we don't care about the unstable node coordinates. We want to query with "chr1:12345" and get results back in a similar way. Having the node coordinate in the way complicates some operations. We don't have this problem with side graph. Secondly, because the path coordinates are not native, we can generate illegal paths in various ways (e.g. looped paths, disconnected paths, overlapping paths and nodes not on any paths) and end up with an illegal coordinate system. In contrast, side graph disallows the many ways to shoot ourselves in the foot. This is to me a critical advantage. I always believe the right representation should naturally impose the constraints we want to apply.

I can now read a side graph into C objects (see lh3/sdg). The implementation is more complex than an string/assembly graph but probably as complex as path-labeled string graph. In addition, you can implement a side graph on top of your vg data structure once vg also supports orientation. You just hide the internal node IDs away from endusers.

lh3 · 2015-04-08T03:52:47Z

The question boils down to: do we allow a base to be identified with two or more labels at the same time (e.g. chr1v1:12345 and chr1v2:23456)? If the answer is no, side graph is clearly the right representation because it is naturally imposes this constraint. If the answer is yes, path+string graph shows its strength because the asymmetry of sequence and path in a side graph is problematic.

richarddurbin · 2015-04-08T09:16:43Z

@lh3 Surely the right thing to do with respect to alternate reference systems is to have a function that maps a side graph plus a set of disjoint paths on that side graph
representing a new coordinate system to a new side graph representing the same topology but based on the new coordinate system. I assert that the complexity in
doing this is comparable to the complexity of handling path coordinates on an @ekg style sequence graph.

@ekg It would be interesting if you could represent the GRCh38 reference chromosomes as paths in your current 1000 Genomes vg graph, and hence map all the
1000 Genomes phase 3 variation into GRCh38? What are the issues in doing that? My guess is that a lot of it will work well, but that there will be complications at
places where the chromosomes diverge.

Richard

On 8 Apr 2015, at 04:52, Heng Li notifications@github.com wrote:

The question boils down to: do we allow a base to be identified with two or more labels at the same time (e.g. chr1v1:12345 and chr1v2:23456)? If the answer is no, side graph is clearly the right representation because it is naturally imposes this constraint. If the answer is yes, path+string graph shows its strength because the asymmetry of sequence and path in a side graph is problematic.

—
Reply to this email directly or view it on GitHub #275 (comment).

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

lh3 · 2015-04-08T11:57:18Z

A side graph has a native coordinate system. Accessing with this coordinate system is much easier than alternative systems. This asymmetry seems awkward. Each path coordinate systems in a string graph is equivalent. It is a more natural way to describe parallel coordinates. @haussler talked about transforming one side graph coordinate to another, but in my understanding, the transformed side graph still has one coordinate system. We don't have two parallel systems.

It is worth noting that liftover between genome builds is nontrivial. UCSC and NCBI are providing different mappings. Parallel coordinate systems may sound easier with path+string graph in theory, but in practice, I doubt any solutions so far work.

ekg · 2015-04-08T12:32:55Z

I agree with @lh3 that path+string graphs will not solve the difficulties of the liftover problem. However, they will at least enable us to represent particular mappings between different assemblies in one system.

richarddurbin · 2015-04-08T16:00:53Z

Yes - the side graphs come with a single reference coordinate system. I think this is OK/good.
We may want to transform coordinate systems, but as all agree this is a complex issue and I think it is fine for it to be a separate operation that generates a new side graph.

We still need paths and the ability to do things with them.

Richard

On 8 Apr 2015, at 12:57, Heng Li notifications@github.com wrote:

A side graph has a native coordinate system. Accessing with this coordinate system is much easier than alternative systems. This asymmetry seems awkward. Each path coordinate systems in a string graph is equivalent. It is a more natural way to describe parallel coordinates. @haussler https://github.com/haussler talked about transforming one side graph coordinate to another, but in my understanding, the transformed side graph still has one coordinate system. We don't have two parallel systems.

It is worth noting that liftover between genome builds is nontrivial. UCSC and NCBI are providing different mappings. Parallel coordinate systems may sound easier with path+string graph in theory, but in practice, I doubt any solutions so far work.

—
Reply to this email directly or view it on GitHub #275 (comment).

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

haussler · 2015-04-08T18:25:02Z

As I understand it, the controversy of

subgraph = set of sequences and joins

vs

subgraph = set of segments of sequences and joins

revolves around where and how the coordinate transforms are done. I'm
afraid if we specify subgraph = set of segments of sequences and joins we
start down a slippery slope. It will bake in the coordinate transform
required when going from having a sequence establish the basic coordinate
system to having a segment of that sequence establish the basic coordinate
system. However, almost as frequently we will want to concatenate two
sequences and make a single coordinate system (i.e. sequence) for that
concatenation. Most generally, we will want to concatenate a series of
segments and make a single coordinate system for that object. I think we
should make all these coordinate transforms, including sequence -> segment,
one and the same, for simplicity. That means having (just) two distinct
notions:

(simple) subgraph = set of sequences and joins taken from an existing
side graph. Defining a simple subgraph is an operation that does not create
a new side graph, it only references part of an existing side graph.
subgraph transform = creation of a new side graph that has its own new
coordinate system but is demonstrably homomorphic to a subgraph of (the
most granular form of) an existing side graph.

Here by "most granular form" I mean the side graph obtained by making every
sequence have length 1. By "demonstrably homomorphic to" I mean that there
is a defined procedure for going from a side s in the new side graph S to
the corresponding side s' in the original side graph S' that satisfies the
conditions of a graph homomorphism, namely that if there is a join between
sides s and t in S, then there is a join between the corresponding sides s'
and t' in S'. (Also, obviously, we require that the two sides of a base in
S are the two sides of a corresponding base in S' with the same label.) A
homomorphism will essentially be a coordinate transform. Note that a
homomorphism can map two sides in S to a single side in S'. If it does not,
then it is a (subgraph) isomorphism.

A subgraph transform is a very useful thing. As a special case, one could
transform the whole side graph, not just a subgraph of it. So this includes
the operation of "defining a new set of primary paths on a reference side
graph", which is something like a liftover-style mapping from one version
of the reference to another. It also includes the operation of taking the
two paths in a reference side graph S' that represent the two copies of
chromosome 1 in an individual and making a new side graph S representing
the diploid genome of that individual that has two separate sequences, one
for each of these paths. Here we expect the two paths to share many nodes
in the reference S', so this is a homomorphism, not a subgraph
isomorphism.

I like the subgraph transform much better than defining an explicit "split"
operation on nodes of a reference side graph. If you are going to split,
this is messy enough that you should be doing your work on an entirely new
side graph, and keeping track of a homomorphism between the new side graph
you are creating and the original side graph.

-D

On Tue, Apr 7, 2015 at 11:28 AM, Benedict Paten notifications@github.com
wrote:

@richarddurbin / @lh3 / @anovak / @haussler - I think it would be better to
formulate a sub-graph of a side graph in practice as a set of segments and
joins, rather than a set of sequences and joins.

The pros of this are:

(1) For a given position + radius query the client gets back exactly the
subsequences within the radius - there is no need to recompute this on the
client.

(2) We can use segment+join subgraphs to describe annotations, such as
exons or introns, that include the variation that's present in the
population. The sequence+join subgraph, which does not bound the
subsequences, does not make it easy to express this.

The cons (as I see it) are:
A small increase in the amount of data that needs to be transferred between
server and client, because you need to specify the additional subsequence
data.

Did I miss anything compelling as to why we wouldn't make the job of the
client a little easier?

On Mon, Apr 6, 2015 at 12:38 AM, Richard Durbin notifications@github.com

wrote:

@lh3 The only sequences I can see as a problem for obtaining the full
sequence with my proposal are the well-known primary chromosome
references, and I am happy to cache them locally. Then surely my proposal
is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph
you
need to remap to a new set of paths. You have implemented
path mapping, which is powerful, but I suggest that this is not simpler
than having an offset on sequence segments as @lh3 and @adamnovak
propose. In any case, I still think that given you have this, mapping
between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li notifications@github.com wrote:

@ekg https://github.com/ekg your (2) and @adamnovak <
https://github.com/adamnovak>'s (4) are different. In my understanding,
you offset everything, including sequences, joins and the coordinate
system. You need to apply offset everywhere when we translate between the
whole graph and the subgraph. @adamnovak https://github.com/adamnovak
just marks a sequence in the subgraph as a subsequence of the whole
graph.
We don't need to offset joins and most importantly we use the same
coordinate system. @richarddurbin https://github.com/richarddurbin
wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a
subgraph, we would usually like to retrieve subsequences included in this
subgraph. Having an offset field will make this easier. Of course you can
retrieve subsequence later with your proposal but getting the coordinates
needs extra work.

—
Reply to this email directly or view it on GitHub <
https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

—
Reply to this email directly or view it on GitHub
#275 (comment).

—
Reply to this email directly or view it on GitHub
#275 (comment).

richarddurbin · 2015-04-08T23:12:44Z

I am happy with this with respect to the simple subgraph.

With respect to the more complex thing, I am happy with the isomorphism.
You would specify that as an ordered list of non-overlapping paths. Then one takes all the joins linking bases in these paths. This defines a new side graph.
The simple case is segments. But this also handles for you the establishment of a new reference system as we were discussing earlier today. The fact that
the paths are non-overlapping makes it an isomorphism, and allows you to keep track of the mapping and its inverse, which allows you to merge subgraphs.

I am not so happy with the homomorphism where one base in the original side graph is mapped to two (or more) separate bases in the new graph. I am not
sure I need the individual side graph containing two separated haplotypes that you describe. But maybe I don't see valuable things that you see. At any
rate, I would make that a separate step. For the purposes we were discussing previously, the isomorphism is sufficient, and it would just be made of segments
of the original.

Richard

On 8 Apr 2015, at 19:25, haussler notifications@github.com wrote:

As I understand it, the controversy of

subgraph = set of sequences and joins

vs

subgraph = set of segments of sequences and joins

revolves around where and how the coordinate transforms are done. I'm
afraid if we specify subgraph = set of segments of sequences and joins we
start down a slippery slope. It will bake in the coordinate transform
required when going from having a sequence establish the basic coordinate
system to having a segment of that sequence establish the basic coordinate
system. However, almost as frequently we will want to concatenate two
sequences and make a single coordinate system (i.e. sequence) for that
concatenation. Most generally, we will want to concatenate a series of
segments and make a single coordinate system for that object. I think we
should make all these coordinate transforms, including sequence -> segment,
one and the same, for simplicity. That means having (just) two distinct
notions:

(simple) subgraph = set of sequences and joins taken from an existing
side graph. Defining a simple subgraph is an operation that does not create
a new side graph, it only references part of an existing side graph.

subgraph transform = creation of a new side graph that has its own new
coordinate system but is demonstrably homomorphic to a subgraph of (the
most granular form of) an existing side graph.

Here by "most granular form" I mean the side graph obtained by making every
sequence have length 1. By "demonstrably homomorphic to" I mean that there
is a defined procedure for going from a side s in the new side graph S to
the corresponding side s' in the original side graph S' that satisfies the
conditions of a graph homomorphism, namely that if there is a join between
sides s and t in S, then there is a join between the corresponding sides s'
and t' in S'. (Also, obviously, we require that the two sides of a base in
S are the two sides of a corresponding base in S' with the same label.) A
homomorphism will essentially be a coordinate transform. Note that a
homomorphism can map two sides in S to a single side in S'. If it does not,
then it is a (subgraph) isomorphism.

A subgraph transform is a very useful thing. As a special case, one could
transform the whole side graph, not just a subgraph of it. So this includes
the operation of "defining a new set of primary paths on a reference side
graph", which is something like a liftover-style mapping from one version
of the reference to another. It also includes the operation of taking the
two paths in a reference side graph S' that represent the two copies of
chromosome 1 in an individual and making a new side graph S representing
the diploid genome of that individual that has two separate sequences, one
for each of these paths. Here we expect the two paths to share many nodes
in the reference S', so this is a homomorphism, not a subgraph
isomorphism.

I like the subgraph transform much better than defining an explicit "split"
operation on nodes of a reference side graph. If you are going to split,
this is messy enough that you should be doing your work on an entirely new
side graph, and keeping track of a homomorphism between the new side graph
you are creating and the original side graph.

-D

On Tue, Apr 7, 2015 at 11:28 AM, Benedict Paten notifications@github.com
wrote:

@richarddurbin / @lh3 / @anovak / @haussler - I think it would be better to
formulate a sub-graph of a side graph in practice as a set of segments and
joins, rather than a set of sequences and joins.

The pros of this are:

(1) For a given position + radius query the client gets back exactly the
subsequences within the radius - there is no need to recompute this on the
client.

(2) We can use segment+join subgraphs to describe annotations, such as
exons or introns, that include the variation that's present in the
population. The sequence+join subgraph, which does not bound the
subsequences, does not make it easy to express this.

The cons (as I see it) are:
A small increase in the amount of data that needs to be transferred between
server and client, because you need to specify the additional subsequence
data.

Did I miss anything compelling as to why we wouldn't make the job of the
client a little easier?

On Mon, Apr 6, 2015 at 12:38 AM, Richard Durbin notifications@github.com

wrote:

@lh3 The only sequences I can see as a problem for obtaining the full
sequence with my proposal are the well-known primary chromosome
references, and I am happy to cache them locally. Then surely my proposal
is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph
you
need to remap to a new set of paths. You have implemented
path mapping, which is powerful, but I suggest that this is not simpler
than having an offset on sequence segments as @lh3 and @adamnovak
propose. In any case, I still think that given you have this, mapping
between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li notifications@github.com wrote:

@ekg https://github.com/ekg your (2) and @adamnovak <
https://github.com/adamnovak>'s (4) are different. In my understanding,
you offset everything, including sequences, joins and the coordinate
system. You need to apply offset everywhere when we translate between the
whole graph and the subgraph. @adamnovak https://github.com/adamnovak
just marks a sequence in the subgraph as a subsequence of the whole
graph.
We don't need to offset joins and most importantly we use the same
coordinate system. @richarddurbin https://github.com/richarddurbin
wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a
subgraph, we would usually like to retrieve subsequences included in this
subgraph. Having an offset field will make this easier. Of course you can
retrieve subsequence later with your proposal but getting the coordinates
needs extra work.

—
Reply to this email directly or view it on GitHub <
https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

—
Reply to this email directly or view it on GitHub
#275 (comment).

—
Reply to this email directly or view it on GitHub
#275 (comment).

—
Reply to this email directly or view it on GitHub #275 (comment).

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

haussler · 2015-04-10T22:52:01Z

This issue is also coming up an another thread
Updating methods to work with side graphs. (#269)

Can we make a connection here? Over there is discussion of string graph -
to - side graph transformation by specifying a disjoint set of paths in a
string graph that then become sequences in the side graph. The way I see
this in general is that there are two kinds of operations:

Declaring a subgraph of an existing graph, which does not demand the
actual creation of or separate existence of a second derived graph to
represent this subgraph
Establishing a relationship between two different graphs, call one the
master (assumed already to exist) and the other the derived graph (which
may already exist or may be being defined). Unlike (1), this demands the
separate existence of the derived graph, and thus if necessary, its
creation.

With regard to the master graph, the most fundamental objects are
sequences, joins and paths. We want everything to be expressed in terms of
these as much as possible. The simplest form of operation (1) is to use as
Richard suggested only the 2 most fundamental objects: sequences and joins,
which is what I think we have now agreed on.

In operation (2), establishing a relationship between a derived graph and a
master graph, as Erik points out, the most fundamental thing is to identify
a path in the master graph with a sequence in the derived graph. When the
consecutive bases of a sequence in the derived graph are mapped to
identical consecutive bases of a path in the master graph we get our
simplest and most fundamental relationship between the master and the
derived graph. We definitely want to have an operation that establishes
this. By mapping oriented bases, this operation establishes a mapping M
from each side s in the derived graph to a corresponding side M(s) in the
master graph. Mapping joins from the derived graph to the master graph then
follows: a join (s,t) in the derived graph must map to the join (M(s),
M(t)) in the master graph. We should be able to add any such joins we want
to the derived graph.

Any mapping M from the derived graph to the master graph in which the
consecutive bases in each sequence of the derived graph map to consecutive
identical bases in a path on the master graph, and any (explicit) join
(s,t) in the derived graph maps to the (implicit or explicit) join (M(s),
M(t)), is a graph homomorphism from the derived graph into the master
graph. Conversely, every graph homomorphism from a derived graph to the
master graph has this form. Since graph homomorphism is the most
fundamental mapping in graph theory, this strongly argues that we should
support it.

Isomorphism is just two homomorphisms, one from the derived graph to the
master, and another from the master to the derived graph. So we can get
isomorphism easily if we have homomorphism. However, we can't get
homomorphism from isomorphism.

For these reasons I suggest that there be an operation that creates a
derived side graph from a master side graph by homomorphism. The input
would be (1) a set P of paths on the master graph (which will define the
homomorphism M) and (2) a set J of joins between sides in the path set P
such that if sides s and t are joined, then the corresponding sides M(s)
and M(t) in the master graph are joined. The output would be a new derived
side graph G(P) and a homomorphism M from G(P) into the master graph such
that (1) each path in P is a separate sequence in G(P), even when these
paths overlap in the master graph, and (2) for each (explicit) join in G(P)
between sides s and t in the paths of P, there is a join in the master
graph between M(s) and M(t).

A homomorphism M is a mapping, so formally it is a set of ordered pairs of
the form (x, M(x)) where x is an object in the derived graph G(P) and M(x)
is the corresponding object in the master graph. To be represented
efficiently, we should probably represent M as a set of pairs (x, M(x))
where x is a segment or a join. These are enough.

Again, isomorphism is just a special case of this relatively clean and
general setup. I gave an example in my previous post where a homomorphism
that is not an isomorphism will be useful.

-D

On Wed, Apr 8, 2015 at 5:11 PM, David Haussler haussler@soe.ucsc.edu
wrote:

On Wed, Apr 8, 2015 at 4:12 PM, Richard Durbin notifications@github.com
wrote:

I am happy with this with respect to the simple subgraph.

With respect to the more complex thing, I am happy with the isomorphism.
You would specify that as an ordered list of non-overlapping paths. Then
one takes all the joins linking bases in these paths. This defines a new
side graph.

Yes, with the recognition that we allow, in an extreme case, a path to be
a single join with no bases, which connects to bases that appear in other
paths that do contain bases. This case is needed, e.g. when the new side
graph you are building consists of two base-containing paths from the
original graph with a join between them somewhere in the middle. The join
is considered to be a third path. The definition of overlap is of course
sharing a base. So this third path does not overlap with the first two
paths.

The simple case is segments. But this also handles for you the
establishment of a new reference system as we were discussing earlier
today. The fact that
the paths are non-overlapping makes it an isomorphism, and allows you to
keep track of the mapping and its inverse,

yes

which allows you to merge subgraphs.

yes but you can just as easily merge graphs defined by homomorphisms as
isomorphisms, it is just that you have a choice as to whether to keep two
nodes that map the same place separate or merge them together. Your rule
here can be tailored to the app.

I am not so happy with the homomorphism where one base in the original
side graph is mapped to two (or more) separate bases in the new graph.

Let's talk of the mapping going the other way, because then it is a
mapping. In a homomorphism, each base in the new graph maps to one and only
one place in the original graph. So it is a legitimate mapping, just not
always one-to-one.

I am not

sure I need the individual side graph containing two separated haplotypes
that you describe. But maybe I don't see valuable things that you see.

My guess is the most popular use of the not-one-to-one homomorphism will
to be when people want to make customized reference structures that are
conveniently defined as side graphs, and use them to map their raw reads,
read pairs or read sets to, or to count things or do other various
analysis. These would be used when you have partial phasing information
for the sample you are working with that allows you to split some nodes of
the standard reference structure. For example, in the standard reference
structure there may be a sequence abcde with a variant b' and a variant d'
coming off as 1 base bubbles. But if you know that the primes are phased
together in the sample you are dealing with, you may prefer to map your raw
data to a custom version of the reference with c split into c and c', so
that there is a primary path abcde with a bubble b'c'd'. You would keep
track of your work via a homomorphism from the new graph to the original
reference structure in which c and c' in the new graph both map to c in the
original reference graph.

At any

rate, I would make that a separate step. For the purposes we were
discussing previously, the isomorphism is sufficient, and it would just be
made of segments
of the original.

But I think when you work it out, the homomorphism requires almost no
additional machinery beyond the isomorphism, and is more flexible. -D

Richard

On 8 Apr 2015, at 19:25, haussler notifications@github.com wrote:

As I understand it, the controversy of

subgraph = set of sequences and joins

vs

subgraph = set of segments of sequences and joins

revolves around where and how the coordinate transforms are done. I'm
afraid if we specify subgraph = set of segments of sequences and joins
we
start down a slippery slope. It will bake in the coordinate transform
required when going from having a sequence establish the basic
coordinate
system to having a segment of that sequence establish the basic
coordinate
system. However, almost as frequently we will want to concatenate two
sequences and make a single coordinate system (i.e. sequence) for that
concatenation. Most generally, we will want to concatenate a series of
segments and make a single coordinate system for that object. I think we
should make all these coordinate transforms, including sequence ->
segment,
one and the same, for simplicity. That means having (just) two distinct
notions:

(simple) subgraph = set of sequences and joins taken from an existing
side graph. Defining a simple subgraph is an operation that does not
create
a new side graph, it only references part of an existing side graph.

subgraph transform = creation of a new side graph that has its own
new
coordinate system but is demonstrably homomorphic to a subgraph of (the
most granular form of) an existing side graph.

Here by "most granular form" I mean the side graph obtained by making
every
sequence have length 1. By "demonstrably homomorphic to" I mean that
there
is a defined procedure for going from a side s in the new side graph S
to
the corresponding side s' in the original side graph S' that satisfies
the
conditions of a graph homomorphism, namely that if there is a join
between
sides s and t in S, then there is a join between the corresponding
sides s'
and t' in S'. (Also, obviously, we require that the two sides of a base
in
S are the two sides of a corresponding base in S' with the same label.)
A
homomorphism will essentially be a coordinate transform. Note that a
homomorphism can map two sides in S to a single side in S'. If it does
not,
then it is a (subgraph) isomorphism.

A subgraph transform is a very useful thing. As a special case, one
could
transform the whole side graph, not just a subgraph of it. So this
includes
the operation of "defining a new set of primary paths on a reference
side
graph", which is something like a liftover-style mapping from one
version
of the reference to another. It also includes the operation of taking
the
two paths in a reference side graph S' that represent the two copies of
chromosome 1 in an individual and making a new side graph S representing
the diploid genome of that individual that has two separate sequences,
one
for each of these paths. Here we expect the two paths to share many
nodes
in the reference S', so this is a homomorphism, not a subgraph
isomorphism.

I like the subgraph transform much better than defining an explicit
"split"
operation on nodes of a reference side graph. If you are going to split,
this is messy enough that you should be doing your work on an entirely
new
side graph, and keeping track of a homomorphism between the new side
graph
you are creating and the original side graph.

-D

On Tue, Apr 7, 2015 at 11:28 AM, Benedict Paten <
notifications@github.com>
wrote:

@richarddurbin / @lh3 / @anovak / @haussler - I think it would be
better to
formulate a sub-graph of a side graph in practice as a set of
segments and
joins, rather than a set of sequences and joins.

The pros of this are:

(1) For a given position + radius query the client gets back exactly
the
subsequences within the radius - there is no need to recompute this
on the
client.

(2) We can use segment+join subgraphs to describe annotations, such as
exons or introns, that include the variation that's present in the
population. The sequence+join subgraph, which does not bound the
subsequences, does not make it easy to express this.

The cons (as I see it) are:
A small increase in the amount of data that needs to be transferred
between
server and client, because you need to specify the additional
subsequence
data.

Did I miss anything compelling as to why we wouldn't make the job of
the
client a little easier?

On Mon, Apr 6, 2015 at 12:38 AM, Richard Durbin <
notifications@github.com>

wrote:

@lh3 The only sequences I can see as a problem for obtaining the
full
sequence with my proposal are the well-known primary chromosome
references, and I am happy to cache them locally. Then surely my
proposal
is simplest, with no requirement for coordinate offsets or path
mapping.

@ekg You are right that to map between ReferenceSets on the same
graph
you
need to remap to a new set of paths. You have implemented
path mapping, which is powerful, but I suggest that this is not
simpler
than having an offset on sequence segments as @lh3 and @adamnovak
propose. In any case, I still think that given you have this,
mapping
between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li notifications@github.com
wrote:

@ekg https://github.com/ekg your (2) and @adamnovak <
https://github.com/adamnovak>'s (4) are different. In my
understanding,
you offset everything, including sequences, joins and the coordinate
system. You need to apply offset everywhere when we translate
between the
whole graph and the subgraph. @adamnovak <
https://github.com/adamnovak>
just marks a sequence in the subgraph as a subsequence of the whole
graph.
We don't need to offset joins and most importantly we use the same
coordinate system. @richarddurbin <https://github.com/richarddurbin

wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we
retrieve a
subgraph, we would usually like to retrieve subsequences included
in this
subgraph. Having an offset field will make this easier. Of course
you can
retrieve subsequence later with your proposal but getting the
coordinates
needs extra work.

—
Reply to this email directly or view it on GitHub <
https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

—
Reply to this email directly or view it on GitHub
<#275 (comment)
.

—
Reply to this email directly or view it on GitHub
#275 (comment).

—
Reply to this email directly or view it on GitHub <
https://github.com/ga4gh/schemas/issues/275#issuecomment-90995341>.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

—
Reply to this email directly or view it on GitHub
#275 (comment).

pgrosu · 2015-04-15T17:39:02Z

It's great to see some mathematical analysis :)

~p

pgrosu mentioned this issue Apr 6, 2015

Inheritance in GA4GH schemas #264

Closed

adamnovak mentioned this issue Oct 22, 2015

Moving over to per-base graphs? #444

Open

david4096 added the graph label Mar 9, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

graph subsetting operations and serialization format #275

graph subsetting operations and serialization format #275

ekg commented Apr 2, 2015

adamnovak commented Apr 2, 2015

lh3 commented Apr 2, 2015

lh3 commented Apr 2, 2015

benedictpaten commented Apr 3, 2015

haussler commented Apr 3, 2015

ekg commented Apr 3, 2015

ekg commented Apr 3, 2015

richarddurbin commented Apr 3, 2015

lh3 commented Apr 3, 2015

ekg commented Apr 3, 2015

ekg commented Apr 3, 2015

richarddurbin commented Apr 6, 2015

haussler commented Apr 6, 2015

ekg commented Apr 7, 2015

benedictpaten commented Apr 7, 2015

ekg commented Apr 7, 2015

anovak commented Apr 7, 2015

benedictpaten commented Apr 8, 2015

benedictpaten commented Apr 8, 2015

lh3 commented Apr 8, 2015

lh3 commented Apr 8, 2015

richarddurbin commented Apr 8, 2015

lh3 commented Apr 8, 2015

ekg commented Apr 8, 2015

richarddurbin commented Apr 8, 2015

haussler commented Apr 8, 2015

richarddurbin commented Apr 8, 2015

haussler commented Apr 10, 2015

pgrosu commented Apr 15, 2015

graph subsetting operations and serialization format #275

graph subsetting operations and serialization format #275

Comments

ekg commented Apr 2, 2015

adamnovak commented Apr 2, 2015

lh3 commented Apr 2, 2015

lh3 commented Apr 2, 2015

benedictpaten commented Apr 3, 2015

haussler commented Apr 3, 2015

ekg commented Apr 3, 2015

ekg commented Apr 3, 2015

richarddurbin commented Apr 3, 2015

lh3 commented Apr 3, 2015

ekg commented Apr 3, 2015

ekg commented Apr 3, 2015

richarddurbin commented Apr 6, 2015

haussler commented Apr 6, 2015

ekg commented Apr 7, 2015

benedictpaten commented Apr 7, 2015

ekg commented Apr 7, 2015

anovak commented Apr 7, 2015

benedictpaten commented Apr 8, 2015

benedictpaten commented Apr 8, 2015

lh3 commented Apr 8, 2015

lh3 commented Apr 8, 2015

richarddurbin commented Apr 8, 2015

lh3 commented Apr 8, 2015

ekg commented Apr 8, 2015

richarddurbin commented Apr 8, 2015

haussler commented Apr 8, 2015

richarddurbin commented Apr 8, 2015

haussler commented Apr 10, 2015

pgrosu commented Apr 15, 2015