-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend XML::Reader with more LibXML methods #5740
Conversation
src/xml/reader.cr
Outdated
LibXML.xmlTextReaderNext(@reader) == 1 | ||
end | ||
|
||
def next_sibling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libXML docs say this and next
are the same, are they? More docs in general would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they are not the same.
See this example:
<root>
<childa/>
<childb/><!-- current reader node -->
</root>
If you call reader.next
would return true
and it will move to </root>
, but reader.next_sibling
would return false
, because there is no other sibling under the root node. The comment that next_sibling
only works on documents means that it can't be called before the first call to reader.read
which sets the document, while reader.next
would move to the root node in that case.
See:
That's the theory at least, the system /usr/lib/libxml2.2.dylib
on my mac which should be 2.9.4 according to xml2-config --version
always returns -1
, looks like I'll have to recompile against the latest version from homebrew to see if it's a bug in that version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that appears to be a bug in libxml2, it works after applying the following patch to master:
diff --git a/xmlreader.c b/xmlreader.c
index 4053269b..c195d875 100644
--- a/xmlreader.c
+++ b/xmlreader.c
@@ -2037,7 +2037,7 @@ int
xmlTextReaderNextSibling(xmlTextReaderPtr reader) {
if (reader == NULL)
return(-1);
- if (reader->doc == NULL) {
+ if (reader->doc != NULL) {
/* TODO */
return(-1);
}
I created GNOME/libxml2#13 to get this fixed in libxml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed a workaround that makes next_sibling
work even with the incomplete xmlTextReaderNextSibling()
.
Note: The libxml2 patch posted above is outdated and broken, see GNOME/libxml2#13 for the revised version.
I'm fine with this but I think the new methods should have docs and specs. Even if the surrounding methods don't have docs. If you want to document the whole class though, that would be fantastic :) |
Yeah, none of the I could copy over the comments from the libxml docs and adjust for the slightly different return values. |
If the function is broken on all released versions of libxml2, what's the point in adding it? Adding in a function that we know is going to be buggy for many years to come because of out-of-date distros is just going to be painful. Can we work around this bug in crystal? |
We could implement our own version of the next_sibling method. I think I'll have to add some more bindings for that, but it should work. |
This adds a couple of method bindings that come in handy when doing pull parsing or hybrid parsing (search with pull then expand node).
The current implementation of xmlTextReaderNextSibling() only works on preparsed documents, so we need to detect the error returned if the reader is not using a preparsed document and implement our own next sibling by looking at reader internals.
a11bb01
to
f08e8a1
Compare
I'll start working on specs for |
This avoids segfaults when those methods are called before the first or after the last read.
This fixes a problem where XML::Reader#node_type would return zero before the first or after the last read, which previously had no mapping in the XML::Type enum, so the value couldn't be checked.
@RX14 Please review. As suggested in your earlier review I've documented all the methods in I've also fixed a few methods that could crash, if called before the first or after the last call to Would it be better to adjust the code so that they raised That would reduce the number of nil checks required when working with them and with proper usage the |
d092c9f
to
c336759
Compare
c336759
to
ee945e6
Compare
I think that should be a separate PR, since it's a breaking change. Or actually - if they currently segfault in this condition, then it's not really a breaking change to make them raise. |
instead return an empty string if the methods are called in an invalid reader state (before the first or after the last read). A special case is the #value method, which could also return nil if called on a node without a text value, like `<tag>`, but here an empty string also makes sense.
and implement behavior similar to XML::Node.
I changed the getter methods that could return either a String or Nil to always return a String. In the cases where nil was returned previosuly, I now return an empty string. This avoids breaking backwards compatibility and makes more sense than raising an error. I also renamed the |
Returning an empty string is unacceptable. Lets revert all the changes to behaviour in other methods, and keep this PR focussed on docs and adding new methods. We can discuss those methods in another PR, otherwise this one will just get delayed by deciding on how to change behaviour. |
Oh come on, unacceptable? I had that idea by looking at the API of XML::Node which does exactly that, see:
So please enlighten me why you think it is unacceptable in this case? I think it is a good idea since it avoids nil checks in user code just to handle one edge case. |
Let's recap why I chose empty strings instead of raising:
So I urge you to reconsider your point of view of allowing empty strings in these methods. |
Oh and if you're worried about this PR getting too far off-topic, I can simply slice it up into multiple PRs once we're happy with the outcome. |
Please slice it up already. It's too confusing to discuss several topics in one place. Let's have one discussion about adding new stuff and one about changing existing methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping unchecked NULL pointers in stdlib is what's unacceptable. Returning an empty String is a valid solution; yet, reading your arguments, what about raising in #name
for reporting the error (invalid state)?
def name
name = LibXML.xmlTextReaderConstName(@reader)
raise Error.new("Can't get name: no currently on a node") unless name
String.new(name)
end
Having #value
return an empty string seems valid, but maybe we could introduce a #value?
version to report this difference, and maybe detect and raise on an error (invalid state)?
def value
value? || ""
end
def value?
if value = LibXML.xmlTextReaderConstValue(@reader)
String.new(value)
elsif !empty_element?
raise Error.new("Can't get value: no currently on a node")
end
end
I think special casing There are a lot of methods that do something like I think either all reader methods should raise on error or none. The way you usually use the reader api is like this: reader = XML::Reader.new(io)
while reader.read
# do stuff
end So you will never hit those edge cases unless you use the api in a completely wrong way. |
It should raise then. |
@straight-shoota Look, I added tests for the entire If the entires API of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think special casing
#name
and#value
to raise an error, while all the other existingXML::Reader
methods silenty swallow errors is not a good idea, since it makes the API inconsistent.
Good point. Swallowing the error and returning an empty string is acceptable for the time being. It fixes potential segfaults, is consistent with the current API, and isn't a breaking change (since they used to segfault). Let's have a follow up issue or pull request to review and/or change cases where libxml returns a NULL pointer.
@RX14 I think your review is stale. Can this be merged? |
Lets just rip out |
For now, this can be merged so the fixes are in if/when we split XML into a shard. |
@RX14 I'm unopinionated wether this should be in stdlib or not, but if you choose to extract it into a shard feel free to ping me – I might be able to help with maintenance. |
This adds a couple of method bindings that come in handy when doing pull parsing or hybrid parsing (search with pull then expand node).