-
Notifications
You must be signed in to change notification settings - Fork 112
graph subsetting operations and serialization format #275
Comments
I think you're missing an option (4) in your list, which is to just work in the coordinate space of the original graph, even when looking only at a subgraph. The subgraph isn't made out of the same types as the full graph (since e.g. two regions of the same My proposed Since that makes it a bit difficult to know what part of the |
Conceptually, I more like to think a side graph as a graph with each base as a node. Each node/base in this graph has a unique and persistent label such as
@adamnovak's solution (4) may be better (his comment just popped up). We can change the definition of
Now EDIT: simplified the example a little bit. |
In addition, @ekg, how can we know node 314159 in vg is a reference or a named allele? |
On Thu, Apr 2, 2015 at 12:21 PM, Heng Li notifications@github.com wrote:
Yes, I like Adam's suggestion too - a subgraph in a sidegraph is a set of
|
This sounds really sensible. On Thu, Apr 2, 2015 at 6:01 PM, Benedict Paten notifications@github.com
|
@adamnovak, it seems this is the same as my:
I'm starting from the same premise, but focusing on the fact that we need a way to maintain the same coordinate space even if we are working on a tiny fragment which is not strictly identical to any piece of the larger graph. This introduces some practical problems, such as when I align something to the subgraph, I'll need to transform its mapping coordinates into the global space to put it in the context of the entire graph in my particular version of the reference. Also, it makes merging subgraphs from the same system cumbersome. If I only ever work with whole nodes in my subsets, then I can just take the union of nodes and edges (using internal ids as identifiers) in a set of graphs and make a larger one that is strictly part of the source graph. @lh3 answered this concern by noting that nodes could be single bases. If so, then we could say we are working with a true subset. There are some practical reasons not to force nodes to be single bases, but in the long term they may tend to become smaller. If we collapse successive bases into single nodes when there are no bifurcations, we see about 11bp per node in 1000Gp3 + GRCh37. In other species with large effective population sizes (such as Anopheles), we already have higher rates of variation and consequently smaller nodes in the graph.
Yes, they should support the same operations, and they should have the same serialization format. Having different models will limit the kinds of synergies that are possible. For example, if the two are the same, then I can extract a portion of a graph, write it to another index, and then treat it like a new reference without any problem. If they are different, then we need conversions between the two models for this to be possible. |
@lh3 asked:
(In As I understand, paths of this form can be equivalent to sides. I can encode a side graph coordinate system in them without any change to the underlying graph in order to maintain a common coordinate space with other graphs.
A node id is not meant to be something stable or canonical. It's just an internal thing that may be useful for working on a particular version of the reference system. If new things are added, then it is likely to change. I apologize that this might have made things confusing in this example. Although it is not required, in my implementation the node id is exploited as a partially-ordered coordinate. This is extremely helpful in a number of situations and has a natural relationship to the coordinate systems that have been discussed. But, the rank in the partial order can't be a stable identifier if the graph changes. |
What I like about side graphs is that the components are stable as you add more variation. We can all refer to well-known sequences which we call References. I don't see Erik's problems with subgraph operations on side graphs. There are two solutions, either (1) having offsets as Adam and Heng propose, or (2) just (1) I do quite like the suggestion of sequences in the side graphs having an offset, i.e. being a segment. I think this corresponds to what Heng is saying. If we (2) The other alternative is to return as a subgraph of a side graph a true subset of sequences and joins. This would mean that you get a sequence object I slightly prefer (2) as conceptually simpler. Seen another way, (1) exposes some internal workings that are more complex and not strictly necessary. Finally, Erik's point about the nodes having a natural order is important. For him this gives the partial order on his graph. For the side graph we also wanted
The Wellcome Trust Sanger Institute is operated by Genome Research |
@ekg your (2) and @adamnovak's (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin wants to go a step further by even not specifying the offset. @richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work. |
@lh3 I think there is a misunderstanding. I'm saying that if you have a serialized version of a subset the side graph, you'll want to specify the source offset in the original graph, or it won't be possible to project things from the subset back into the general graph coordinate space. This offset effectively propagates across all the entities in the subgraph, whether you record it everywhere or not. I can't use the original coordinates on a subset of the graph any more than we can use the whole-genome relative coordinates when working with substrings from the reference genome. So as far as I can tell, this is in principle exactly the same as what @adamnovak is suggesting. The difference is one of perspective. In practice "just marking the sequence as a subsequence" must be accounted for within each method that works with the graph entity. This adds complexity without obvious benefit. Now every element would need to have an optional offset field, or we need to have two different types of elements, one with offsets and one without. |
@richarddurbin observes that
It seems there is a conflict here because I would like to encode many different overlapping coordinate systems in the graph. A simple case would be two canonical reference sequences. Encoding both in the same context is beneficial because it means that everything on the graph can be surjected into either reference. A more complex situation would be an overlapping set of references relative to exons, gene starts, and other genomic features. Both of these needs can be met provided we have named paths with annotations. I've implemented this in vg, and I've been discussing the correct way to extend GFA format to support it with @lh3 and @jts. How can I encode this in a side graph? It does not seem possible without adding another annotation mechanism. As far as I can tell, such a mechanism would be equivalent to vg paths, which are sufficient to define any number of virtual side graphs on a sequence graph. The current approach mixes the coordinate system and the graph representation. The result is less flexible than if we decoupled them. |
@lh3 The only sequences I can see as a problem for obtaining the full sequence with my proposal are the well-known primary chromosome @ekg You are right that to map between ReferenceSets on the same graph you need to remap to a new set of paths. You have implemented
The Wellcome Trust Sanger Institute is operated by Genome Research |
I would agree with @richarddurbin https://github.com/richarddurbin that @ekg https://github.com/ekg is bringing up other very important issues However, @ekg https://github.com/ekg points out another issue, which is The deeper issue, I think also broached by @ekg https://github.com/ekg, -D On Mon, Apr 6, 2015 at 12:39 AM, Richard Durbin notifications@github.com
|
I guess I am still confused about the design decisions forced by side graphs. If a labeled sequence graph (nodes have sequences/labels, edges are linkages) with named paths is enough to provide a general, stable coordinate system, then what are we gaining by introducing the novel side graph concept? The former is extremely easy to explain and has an existing open source implementation which satisfies the basic requirements driving this group's work. It also has a deep relationship with a large body of work on assembly graphs and multiple sequence alignments. The latter has no existing implementation and it remains to be seen if it is a sensible way to arrange things. Side graphs seem to complicate the process of adding multiple coordinate systems to the same sequence graph. They also appear to make it more complex to extract regions from the graph in that the resulting subgraphs are not semantically equivalent to subsets of the original side graph. However, they do seem to be an ideal way to generate the names of paths in a labeled graph, which is something that has not been done before. In summary, I'm skeptical that we should use side graphs as a literal interchange format. Perhaps we should table this discussion and come back to it when there is a working side graph implementation that we can use to align genomes. I suspect the process of implementing one will explain a lot of my concerns in a more convincing way than I can here. If it can be shown that they are a more effective way to work with pan-genome representations, I will be happy to change what I am doing to match. We have to convince a community to follow this effort, and I don't think it will if we aren't proposing something sensible. |
@richarddurbin / @lh3 / @anovak / @haussler - I think it would be better to The pros of this are: (1) For a given position + radius query the client gets back exactly the (2) We can use segment+join subgraphs to describe annotations, such as The cons (as I see it) are: Did I miss anything compelling as to why we wouldn't make the job of the On Mon, Apr 6, 2015 at 12:38 AM, Richard Durbin notifications@github.com
|
@benedictpaten What are the benefits of using side graphs as a serialization format? What can we do with a side graph that we can't with a labeled graph where labels may be sequences, names, or other annotations? The labeled graph does not have the restriction of being hierarchical. This is a fine property for a coordinate system but I think we should be cautious about making the side graph the basic currency of the system because this will limit applications. It may also add complexity to the processing of the reference system and subsets of it. |
Hi Gents, I seem to be getting looped in as anovak (Anthony Novak), so Adam Novak may not be getting these notifications. Cheers |
So sorry - will use correct handle in future, which is @adamnovak On Apr 7, 2015 1:58 PM, "anovak" notifications@github.com wrote:
|
On Tue, Apr 7, 2015 at 1:48 PM, Erik Garrison notifications@github.com
I think: (1) Sufficiency. Ability to properly represent connections between the ends (2) Compactness. If we assume that we ultimately want a graph with all (3) Natural extension. The existing reference is a special case of the side (4) Translatable. There are some simple algorithms to go from a string (5) Equivalence testing. @haussler has pointed out that ranked side graphs Disagree? Did I miss some? What can we do with a side graph that we can't with a labeled graph where
|
@ekg I have not thought of multiple coordinate systems. It is a good catch. However, I am not sure about the use case of that. It would be overcomplicated if we encode multiple versions of the genome. If you are thinking of the coexistence of genomic and transcriptomic coordinates, I think it is equally nature to annotate a side graph. Side graph and properly path-labeled string graph are conceptually equivalent and can be easily converted to each other. A major difference is that side graph has one coordinate system throughout, but labeled string graph has two, the native unstable node coordinate and the added stable path coordinate. The two coordinate systems lead to two concerns. Firstly, most of time we don't care about the unstable node coordinates. We want to query with "chr1:12345" and get results back in a similar way. Having the node coordinate in the way complicates some operations. We don't have this problem with side graph. Secondly, because the path coordinates are not native, we can generate illegal paths in various ways (e.g. looped paths, disconnected paths, overlapping paths and nodes not on any paths) and end up with an illegal coordinate system. In contrast, side graph disallows the many ways to shoot ourselves in the foot. This is to me a critical advantage. I always believe the right representation should naturally impose the constraints we want to apply. I can now read a side graph into C objects (see lh3/sdg). The implementation is more complex than an string/assembly graph but probably as complex as path-labeled string graph. In addition, you can implement a side graph on top of your vg data structure once vg also supports orientation. You just hide the internal node IDs away from endusers. |
The question boils down to: do we allow a base to be identified with two or more labels at the same time (e.g. chr1v1:12345 and chr1v2:23456)? If the answer is no, side graph is clearly the right representation because it is naturally imposes this constraint. If the answer is yes, path+string graph shows its strength because the asymmetry of sequence and path in a side graph is problematic. |
@lh3 Surely the right thing to do with respect to alternate reference systems is to have a function that maps a side graph plus a set of disjoint paths on that side graph @ekg It would be interesting if you could represent the GRCh38 reference chromosomes as paths in your current 1000 Genomes vg graph, and hence map all the Richard
The Wellcome Trust Sanger Institute is operated by Genome Research |
A side graph has a native coordinate system. Accessing with this coordinate system is much easier than alternative systems. This asymmetry seems awkward. Each path coordinate systems in a string graph is equivalent. It is a more natural way to describe parallel coordinates. @haussler talked about transforming one side graph coordinate to another, but in my understanding, the transformed side graph still has one coordinate system. We don't have two parallel systems. It is worth noting that liftover between genome builds is nontrivial. UCSC and NCBI are providing different mappings. Parallel coordinate systems may sound easier with path+string graph in theory, but in practice, I doubt any solutions so far work. |
I agree with @lh3 that path+string graphs will not solve the difficulties of the liftover problem. However, they will at least enable us to represent particular mappings between different assemblies in one system. |
Yes - the side graphs come with a single reference coordinate system. I think this is OK/good. We still need paths and the ability to do things with them. Richard
The Wellcome Trust Sanger Institute is operated by Genome Research |
As I understand it, the controversy of subgraph = set of sequences and joins vs subgraph = set of segments of sequences and joins revolves around where and how the coordinate transforms are done. I'm
Here by "most granular form" I mean the side graph obtained by making every A subgraph transform is a very useful thing. As a special case, one could I like the subgraph transform much better than defining an explicit "split" -D On Tue, Apr 7, 2015 at 11:28 AM, Benedict Paten notifications@github.com
|
I am happy with this with respect to the simple subgraph. With respect to the more complex thing, I am happy with the isomorphism. I am not so happy with the homomorphism where one base in the original side graph is mapped to two (or more) separate bases in the new graph. I am not Richard
The Wellcome Trust Sanger Institute is operated by Genome Research |
This issue is also coming up an another thread Can we make a connection here? Over there is discussion of string graph -
With regard to the master graph, the most fundamental objects are In operation (2), establishing a relationship between a derived graph and a Any mapping M from the derived graph to the master graph in which the Isomorphism is just two homomorphisms, one from the derived graph to the For these reasons I suggest that there be an operation that creates a A homomorphism M is a mapping, so formally it is a set of ordered pairs of Again, isomorphism is just a special case of this relatively clean and -D On Wed, Apr 8, 2015 at 5:11 PM, David Haussler haussler@soe.ucsc.edu
|
It's great to see some mathematical analysis :) ~p |
I'd like to query a part of the graph using coordinates against an arbitrary reference path encoded in it.
I have a implementation of this kind of feature that can serve as a conceptual starting point. In
vg
you'd do this likevg find -n 314159 -c 10 x.vg >q.vg
, and q.vg would then contain a subgraph with all nodes and edges reached in a breadth first search of 10 steps starting at node 314159. I use a context size defined in nodes rather than edges because it is simpler in many situations, and in practice I construct graphs with a maximum node size of 50bp usingvg construct ... -m 50
, but it seems possible to do this using an actual base distance rather than a number of steps in a BFS. You can also dovg find -p chr19:100-200 x.vg
to get the graph overlapping a particular range in any named path that has been stored in the graph (in this casechr19
).vg find
can also take a specific kmer and return the subgraph associated with that kmer (which may not be fully-connected).There are other typical operations that I've found useful when generating subsets of variation graphs, which are perhaps best summarized by the interface to
vg find
:My point is that the natural way to interrogate the graph involves queries over kmers, sequences, path positions, node ids, edge ids, and node id ranges. The return from such a query should be a subgraph, and ideally one with a 1:1 mapping to components in the source graph.
Things are pretty straightforward in variation graphs (as in vg). These are collections of three types of objects: (1) nodes with sequences, (2) edges between node ends representing allowed linkages, and (3) paths over nodes and edges which define coordinate systems and names. Making subsets of this structure is trivial. You just collect the nodes, connected edges, and path annotations to create a subgraph that is a strict subset of the components of the original graph.
But how do we generate subgraphs of a side graph?
There seem to be some conceptual hurdles to generating subsets from the side graph model. Useful side graph subsets are not directly equivalent to subsets of the components original side graph. If nodes are long (and some in the side graph are effectively as long as whole chromosomes) then we need to generate new synthetic nodes that start and end within those from the graph we are querying.
These are not insignificant issues. We can pick a solution among these and it will work. However, it will add nontrivial complexity to the model. Subsetting the graph is just about the most basic thing you can do, and it should not present a conceptual hurdle to users and developers.
This is not just a theoretical critique. Some months ago I began to implement (2) but stopped as it required the addition of additional fields defining relative offsets to "real" nodes to each component of the graph. The semantics of the graph would then change whether or not a node was in "offset" space. The coordination of these offsets seemed extremely likely to generate bugs. (As described above, I resolved the problem of graph subsets by breaking up the graph into pieces of a consistent maximum size.)
The side graph makes great sense to me as a way to define a coordinate system, but very little sense as the basis for the data model that's used to store, query, and transmit the graph.
There is a reasonable alternative. We can maintain a side graph coordinate system to provide distinct names for all the sequences and paths in the graph. However, this can be a purely virtual thing, in that we can allow any underlying graph structure to be returned from queries. Provided the nodes and edges in a sequence/variation graph are annotated with their coordinates in the side graph, we will get all the benefits of the side graph without the awkwardness of defining standard algorithms to manipulate it.
The text was updated successfully, but these errors were encountered: