Graph set-operations and bnodes #195

ghost · 2012-03-12T01:03:16Z

Reported by project member gromgull, Jan 27, 2012

When discussing Issue 185 we came across this:

In theory, bnode IDs are only valid inside a single graph. That is, any merging of graphs, or any serialization/parsing or SPARQL round-trip, should at least: 1. make sure all bnode IDs are unique, and maybe 2. canonicalize the bnode IDs and solve the fun graph-isomorphism problem.

That is, this would affect the set-theoretic operations on graphs (add, iadd, etc.), and probably other things.
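
To make the concern concrete, here is a minimal sketch that models each graph as a plain Python set of triples rather than an rdflib Graph. That is an assumption made to keep the example dependency-free, and the `_:b1` label and `ex:` names are invented for illustration. Two graphs built independently happen to reuse one bnode label, and a naive union treats them as the same node:

```python
# Model (not rdflib's API): a graph is a set of (subject, predicate, object)
# triples, and strings starting with "_:" stand in for bnodes.
g1 = {("_:b1", "ex:name", "Alice")}
g2 = {("_:b1", "ex:name", "Bob")}

# Naive set-theoretic union, analogous in spirit to adding two graphs.
merged = g1 | g2

# Both names now hang off a single node, although the two source graphs
# described two different anonymous resources.
names = sorted(o for s, p, o in merged if s == "_:b1" and p == "ex:name")
print(names)
```

Whether that result is a bug or a feature depends on whether the two graphs were meant to share the node, which is exactly the question here.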

On the other hand - the current behaviour is also useful in many settings. If you have an application that throws graphs around, you probably DO want bnode IDs to be stable and remain the same in all graphs.

Also, what exactly is the identity of a graph, i.e. when would we want to trigger the "make sure all bnode IDs are unique" code? If two graphs are part of the same ConjunctiveGraph they are probably NOT different "enough".

A semi-related issue is bnode IDs in SPARQL, where they did it "correctly": bnode IDs are only valid inside a single result set, i.e. there is no way to do one query, get some bnodes back, and then query for more information about them (although many proprietary SPARQL extensions exist for this).

I vote we change nothing, but document the issue in the docstrings for add etc., and add a warning that bnodes are handled "naively".

Comment 1 by gjhiggins, Jan 27, 2012

This is a slightly tangled issue and this is something of a long-ish post, sorry about that.

Initially concentrating on Issue 185:

"When rdflib is used in a application that fork the current Python process, for exemple when using flup.server.*_fork, BNode's value generation in these processes: share the same _prefix and use independant serial number generators that start with the same value"

What's "wrong" about that?
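
The behaviour described in that report can be imitated without actually forking. The sketch below is an assumption-laden stand-in: `SerialBNodeIds` is a hypothetical class, not rdflib's generator, and `copy.copy` plays the role of `fork()` duplicating the process memory. Each "child" inherits the same prefix and the same counter value, so they go on to mint identical IDs:

```python
import copy


class SerialBNodeIds:
    """Hypothetical ID generator: a fixed prefix plus a serial number."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.serial = 0

    def next_id(self):
        self.serial += 1
        return "%s%d" % (self.prefix, self.serial)


parent = SerialBNodeIds("N1a2b3c")
parent.next_id()  # the parent hands out one ID before "forking"

# fork() gives each child a copy of the parent's memory;
# copy.copy stands in for that here.
child_a = copy.copy(parent)
child_b = copy.copy(parent)

ids_a = [child_a.next_id() for _ in range(3)]
ids_b = [child_b.next_id() for _ in range(3)]
print(ids_a == ids_b)  # True: both children produce the same sequence
```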

It violated at least two users' expectations (the OP's and, I find, mine too).
Bnode identifiers themselves are outside the RDF spec, and Gunnar is correct to identify a documentation issue here, although some preparations have already been made:

i) http://rdflib.readthedocs.org/en/latest/howto.html#merging-graphs
ii) http://rdfextras.readthedocs.org/en/latest/store/bnode_drama.html
iii) http://rdflib.readthedocs.org/en/latest/graphs_bnodes.html

The last of these includes URLs for two relevant and highly illuminating posts by Pat Hayes:

http://www.ihmc.us/users/phayes/RDFGraphSyntax.html
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0153.html

Drifting slightly away to the consequences of two graphs having the same bnode id but for different statements...

Gunnar observes:

++ the current behaviour is also useful in many settings. If you have
++ an application that throws graphs around, you probably DO want
++ bnode IDs to be stable and remain the same in all graphs.
++ Also, what exactly is the identity of a graph, i.e. when would we
++ want to trigger the "make sure all bnode IDs are unique code". If
++ two graphs are part of the same ConjunctiveGraph they are probably
++ NOT different "enough".

Just for info, I'll observe that bnode ids in rdflib-parsed graphs are "standardized apart" by default. So it's just the set-theoretic operators which are characterisable as "naive", however ...
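
For readers unfamiliar with the term, "standardizing apart" can be sketched as follows. This is only an illustration over plain sets of triples; `standardize_apart` is a hypothetical helper, and how rdflib's parsers actually mint their labels is not shown here. Every bnode label in an incoming document is mapped to a freshly generated one, consistently within that document:

```python
import uuid


def standardize_apart(triples):
    """Return a copy of `triples` with each bnode label replaced by a fresh one.

    Terms are plain strings; those starting with "_:" are treated as bnodes.
    """
    fresh = {}

    def rename(term):
        if term.startswith("_:"):
            if term not in fresh:
                fresh[term] = "_:N" + uuid.uuid4().hex
            return fresh[term]
        return term

    return {tuple(rename(term) for term in triple) for triple in triples}


doc = {("_:b1", "ex:name", "Alice"), ("_:b1", "ex:knows", "_:b2")}

# "Parsing" the same document twice yields two graphs with disjoint bnodes.
first = standardize_apart(doc)
second = standardize_apart(doc)
print(len(first | second))  # 4 triples: nothing was conflated by the union
```

Within each result the two triples still share a subject, so the shape of the graph is preserved while the labels no longer clash with anything else.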

There is a solution which seems to tick all the boxes - at least for graphs generated and serialized by RDFLib.

The extant code for generating bnode ids dates back to 2005, prior to the introduction of the uuid module (in Python 2.5). Given that the current code attempts to generate a "(hopefully) unique prefix", we might usefully switch to using uuid (and faking one for Python 2.4).

Using uuid.uuid4() to generate bnode ids would enormously reduce the probability of bnode id collisions between (rdflib-generated) graphs and they could be confidently processed by naive set-theoretic operators, no Skolemization required.
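
A rough sketch of what that could look like, with `new_bnode_id` as a hypothetical helper name and the leading "N" as an assumed convention so that the label never starts with a digit. A version 4 UUID carries 122 random bits, so the chance of any collision among n generated IDs is roughly n² / 2¹²³:

```python
import uuid


def new_bnode_id():
    """Hypothetical helper: a bnode label built from a random UUID."""
    return "N" + uuid.uuid4().hex


print(new_bnode_id())

# Generate a batch and count the distinct values.
ids = {new_bnode_id() for _ in range(100000)}
print(len(ids))

# Birthday-bound estimate of a collision among a billion IDs.
n = 10 ** 9
print(n * n / 2 ** 123)
```

The estimate for a billion IDs comes out at around 1e-19, which is why naive set operations over such labels can reasonably be treated as safe.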

There'd be an associated cost in terms of an increase in storage space requirements but I feel that's worth the gain in robustness.

Returning to Issue 185:

Using bnode ids based on uuid.uuid4() would also obviate any need to go mucking about with re-seeding the random number generator in forked processes, as it fixes Issue 185 as a side effect. This was demonstrated in the investigative tests that I've been repeatedly committing, apologies for that: I discovered that tests of my putative solution were all passing on 32-bit machines, but I was seeing some failures on 64-bit machines.

My vote is: yes, it is a documentation issue, but we could ameliorate some of the practical issues and at the same time improve the codebase by replacing the bnode id generating code with the uuid module from the standard library.

Cheers,

Graham

gromgull · 2012-04-21T07:52:36Z

Now that #185 is fixed I added some docs here, and it's probably ok.

Now, if you let RDFLib generate all your BNodeIDs this will never cause you any problem.

Of course, you can still shoot yourself in the foot if you try: the BNode constructor lets you pass in the ID, so you can create as many non-unique IDs as you want.

Also, I tested some parsers: both the n3/turtle and rdf/xml parsers generate new local IDs for incoming bnodes (although not using UUIDs like we do now; we could align this). The NTriples parser will just keep the given IDs, though. I made a new issue for this, see #204.

ghost mentioned this issue Mar 28, 2012

BNode value not random enough #185

Closed

gromgull closed this as completed in 47531a7 Apr 21, 2012

gromgull mentioned this issue Apr 21, 2012

NTriples parsed does not internalise BNodeIDs #204

Closed
