The behaviour reported in Issue 185 violated at least two users' expectations (the OP's and, I find, mine too).
bnode identifiers themselves are outside the RDF spec and Gunnar is correct to identify a documentation issue here, although some preparations have already been made (see links i to iii in Comment 1).
Now that #185 is fixed I added some docs here, and it's probably ok.
Now, if you let RDFLib generate all your BNode IDs, this should not cause you any problems.
Of course, you can still shoot yourself in the foot if you try: the BNode constructor lets you pass in the ID, so you can create IDs as non-unique as you want.
I also tested some parsers: both the n3/turtle and rdf/xml parsers generate new local IDs for incoming bnodes (although not using UUIDs as we do now; we could align this). The NTriples parser just keeps the given IDs, though; I opened a new issue for this, see #204.
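The relabelling that the n3/turtle and rdf/xml parsers are described as doing can be pictured with a toy model. This is only a sketch and does not use RDFLib: triples are plain tuples, blank nodes are strings starting with `_:`, and `relabel_bnodes` is a hypothetical helper invented for illustration.

```python
import uuid


def is_bnode(term):
    """Toy convention (an assumption): blank nodes are strings starting with '_:'."""
    return isinstance(term, str) and term.startswith("_:")


def relabel_bnodes(triples):
    """Give every incoming blank node a fresh local ID, consistently within one parse."""
    mapping = {}

    def fresh(term):
        if not is_bnode(term):
            return term
        if term not in mapping:
            # One new ID per distinct incoming label, reused for later occurrences.
            mapping[term] = "_:N" + uuid.uuid4().hex
        return mapping[term]

    return [(fresh(s), p, fresh(o)) for (s, p, o) in triples]


incoming = [("_:a", "ex:knows", "_:b"), ("_:a", "ex:name", '"Alice"')]
parsed_once = relabel_bnodes(incoming)
parsed_twice = relabel_bnodes(incoming)
```

Within one call the two occurrences of `_:a` map to the same new ID, while a second call over the same input yields different IDs. Keeping the given labels, as the NTriples parser is reported to do, corresponds to skipping this mapping step.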
Reported by project member gromgull, Jan 27, 2012
When discussing Issue 185 we came across this:
In theory, bnode IDs are only valid inside a single graph. I.e. any merging of graphs or serialization/parsing / sparql-roundtripping should at least: 1. make sure all bnode IDs are unique. And maybe 2. canonicalize the bnode IDs and solve the fun graph-isomorphism problem.
i.e. this would affect the set-theoretic operations on graphs (`__add__`, `__iadd__`, etc.), and probably other things.
On the other hand - the current behaviour is also useful in many settings. If you have an application that throws graphs around, you probably DO want bnode IDs to be stable and remain the same in all graphs.
Also, what exactly is the identity of a graph, i.e. when would we want to trigger the "make sure all bnode IDs are unique code". If two graphs are part of the same ConjunctiveGraph they are probably NOT different "enough".
A semi-related issue is bnode IDs in SPARQL, where they did it "correctly" and bnode IDs are only valid inside a single result set, i.e. there is no way to do one query, get some bnodes back, then query for more information about them. (Although many proprietary extensions to SPARQL exist for this.)
I vote we change nothing - but document that issue in the doc-strings for add etc, and add a warning that bnodes are handled "naively"
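To make the collision concrete, here is a minimal sketch in plain Python rather than RDFLib. Graphs are modelled as sets of (subject, predicate, object) string tuples and blank nodes as strings starting with `_:`; `naive_union` and `standardize_apart` are hypothetical names used for illustration only, not RDFLib functions.

```python
def is_bnode(term):
    """Toy convention (an assumption): blank nodes are strings starting with '_:'."""
    return term.startswith("_:")


def naive_union(g1, g2):
    """Set union that trusts bnode IDs as-is, like a 'naive' graph addition."""
    return set(g1) | set(g2)


def standardize_apart(graph, tag):
    """Rename every bnode with a per-graph suffix so two graphs cannot share IDs."""
    def rename(term):
        return f"{term}_{tag}" if is_bnode(term) else term

    return {(rename(s), p, rename(o)) for (s, p, o) in graph}


# Two unrelated graphs that happen to use the same bnode label.
g1 = {("_:b1", "ex:name", '"Alice"')}
g2 = {("_:b1", "ex:name", '"Bob"')}

merged_naive = naive_union(g1, g2)
naive_subjects = {s for (s, _, _) in merged_naive}

merged_apart = naive_union(standardize_apart(g1, "g1"), standardize_apart(g2, "g2"))
apart_subjects = {s for (s, _, _) in merged_apart}

print(len(naive_subjects))  # 1: the two anonymous resources were conflated
print(len(apart_subjects))  # 2: they stay distinct after renaming
```

The naive union is exactly what you want when the same bnode really does travel between graphs in one application, which is the trade-off described above.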
Comment 1 by gjhiggins, Jan 27, 2012
This is a slightly tangled issue and this is something of a long-ish post, sorry about that.
Initially concentrating on Issue 185:
"When rdflib is used in an application that forks the current Python process, for example when using flup.server.*_fork, BNode's value generation in these processes: share the same _prefix and use independent serial number generators that start with the same value"
What's "wrong" about that?
i) http://rdflib.readthedocs.org/en/latest/howto.html#merging-graphs
ii) http://rdfextras.readthedocs.org/en/latest/store/bnode_drama.html
iii) http://rdflib.readthedocs.org/en/latest/graphs_bnodes.html
the latter includes URLs for two relevant and highly illuminating posts by Pat Hayes:
http://www.ihmc.us/users/phayes/RDFGraphSyntax.html
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0153.html
Drifting slightly away to the consequences of two graphs having the same bnode id but for different statements...
Gunnar observes:
++ the current behaviour is also useful in many settings. If you have
++ an application that throws graphs around, you probably DO want
++ bnode IDs to be stable and remain the same in all graphs.
++ Also, what exactly is the identity of a graph, i.e. when would we
++ want to trigger the "make sure all bnode IDs are unique code". If
++ two graphs are part of the same ConjunctiveGraph they are probably
++ NOT different "enough".
Just for info, I'll observe that bnode ids in rdflib-parsed graphs are "standardized apart" by default. So it's just the set-theoretic operators which are characterisable as "naive", however ...
There is a solution which seems to tick all the boxes - at least for graphs generated and serialized by RDFLib.
The extant code for generating bnode ids dates back to 2005, prior to the introduction of the uuid module (in Python 2.5). Given that the current code attempts to generate a "(hopefully) unique prefix", we might usefully switch to using uuid (and faking one for Python 2.4).
Using uuid.uuid4() to generate bnode ids would enormously reduce the probability of bnode id collisions between (rdflib-generated) graphs and they could be confidently processed by naive set-theoretic operators, no Skolemization required.
There'd be an associated cost in terms of an increase in storage space requirements but I feel that's worth the gain in robustness.
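As a rough illustration of the proposal, the sketch below generates IDs from `uuid.uuid4()`. The function name `new_bnode_id` and the leading `"N"` (so the ID starts with a letter) are assumptions made for this example, not a statement of what RDFLib ships.

```python
import uuid


def new_bnode_id():
    """Hypothetical generator: a letter prefix followed by 32 hex digits from uuid4."""
    return "N" + uuid.uuid4().hex


# Generate a batch and check that no two IDs coincide.
ids = {new_bnode_id() for _ in range(10_000)}
id_lengths = {len(i) for i in ids}

print(len(ids))
print(id_lengths)
```

Each ID is 33 characters long, which is where the extra storage cost mentioned above comes from compared with a short prefix plus a small serial number.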
Returning to Issue 185:
Using bnode ids based on uuid.uuid4() would also remove any need to go mucking about with re-seeding the random number generator in forked processes, as it fixes Issue 185 as a side-effect. This was demonstrated in the investigative tests that I've been repeatedly committing, apologies for that. I discovered that tests of my putative solution were all passing on 32-bit machines but I was seeing some failures on 64-bit machines.
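The fork problem can be simulated without actually forking. In this sketch, `SerialBNodeIds` is a hypothetical prefix-plus-counter generator written in the spirit of the scheme quoted from Issue 185 (it is not the RDFLib code), and `copy.deepcopy` stands in for the state a child process inherits at fork time.

```python
import copy
import uuid


class SerialBNodeIds:
    """Hypothetical scheme: a fixed prefix plus an incrementing serial number."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.serial = 0

    def next_id(self):
        self.serial += 1
        return f"{self.prefix}{self.serial}"


parent = SerialBNodeIds(prefix=uuid.uuid4().hex[:8])
parent.next_id()  # the parent has already handed out one ID

# Simulated fork: each child starts from an identical copy of the parent's state.
child_a = copy.deepcopy(parent)
child_b = copy.deepcopy(parent)

serial_a = [child_a.next_id() for _ in range(3)]
serial_b = [child_b.next_id() for _ in range(3)]

# uuid4 draws fresh randomness on every call and keeps no counter to inherit.
uuid_a = [uuid.uuid4().hex for _ in range(3)]
uuid_b = [uuid.uuid4().hex for _ in range(3)]

print(serial_a == serial_b)
print(set(uuid_a).isdisjoint(uuid_b))
```

Both simulated children produce the same serial-based IDs, while the uuid4-based ones do not overlap. In CPython, `uuid.uuid4()` takes its bytes from `os.urandom`, so no re-seeding after a fork should be needed; a real test would use `os.fork()` rather than this simulation.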
My vote is: yes, it is a documentation issue, but we could ameliorate some of the practical issues and at the same time improve the codebase by replacing the bnode id generating code with the uuid module from the standard library.
Cheers,
Graham