Extend XML::Reader with more LibXML methods #5740

felixbuenemann · 2018-02-22T22:56:38Z

This adds a couple of method bindings that come in handy when doing pull parsing or hybrid parsing (search with pull then expand node).
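As a rough illustration of the hybrid approach mentioned above, the sketch below pulls through a document until it reaches an element of interest and then expands only that subtree into an XML::Node. The sample document and the variable names are assumptions made up for this example; it relies on XML::Reader#expand and XML::Node#xpath_node as found in the Crystal standard library.

```crystal
require "xml"

# Made-up sample input for illustration only.
xml = %(<catalog><book id="1"><title>Pull parsing</title></book><book id="2"><title>Hybrid parsing</title></book></catalog>)

reader = XML::Reader.new(xml)
titles = [] of String

while reader.read
  # Pull phase: skip everything that is not an opening <book> element.
  next unless reader.node_type.element? && reader.name == "book"

  # Expand phase: materialize only this subtree as an XML::Node.
  # The expanded node is used here before the next call to #read.
  book = reader.expand
  if title = book.xpath_node("title")
    titles << title.content
  end
end

puts titles.inspect
```

The point of the design is that only the matched subtrees are built as trees, while the rest of the input is streamed.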

RX14 · 2018-02-22T23:25:44Z

src/xml/reader.cr

+    LibXML.xmlTextReaderNext(@reader) == 1
+  end
+
+  def next_sibling


libXML docs say this and next are the same, are they? More docs in general would be great.

No, they are not the same.

See this example:

<root>
  <childa/>
  <childb/>
</root>

With the reader positioned on <childb/>, reader.next returns true and moves to </root>, but reader.next_sibling returns false, because there is no other sibling under the root node. The comment that next_sibling only works on documents means that it can't be called before the first call to reader.read, which sets the document, while reader.next would move to the root node in that case.
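A minimal sketch of the difference described above, assuming the merged XML::Reader#next and #next_sibling behave as explained in this thread. The document is written without whitespace between elements so that no text nodes get in the way; the variable names are illustrative.

```crystal
require "xml"

xml = "<root><childa/><childb/></root>"

# Reader A: step onto <childb/>, then ask for another sibling.
a = XML::Reader.new(xml)
a.read         # <root>
a.read         # <childa/>
a.next_sibling # <childb/>
sibling_result = a.next_sibling

# Reader B: step onto <childb/>, then call #next instead.
b = XML::Reader.new(xml)
b.read         # <root>
b.read         # <childa/>
b.next_sibling # <childb/>
next_result = b.next

puts "next_sibling on last child: #{sibling_result}"
puts "next on last child: #{next_result} (now on #{b.node_type} #{b.name})"
```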

See:

xmlTextReaderNextSibling

xmlTextReaderNext

That's the theory at least. The system /usr/lib/libxml2.2.dylib on my Mac, which should be 2.9.4 according to xml2-config --version, always returns -1. It looks like I'll have to compile against the latest version from Homebrew to see if it's a bug in that version.

Yeah, that appears to be a bug in libxml2; it works after applying the following patch to master:

diff --git a/xmlreader.c b/xmlreader.c
index 4053269b..c195d875 100644
--- a/xmlreader.c
+++ b/xmlreader.c
@@ -2037,7 +2037,7 @@ int
 xmlTextReaderNextSibling(xmlTextReaderPtr reader) {
     if (reader == NULL)
         return(-1);
-    if (reader->doc == NULL) {
+    if (reader->doc != NULL) {
         /* TODO */
         return(-1);
     }

I created GNOME/libxml2#13 to get this fixed in libxml.

I have pushed a workaround that makes next_sibling work even with the incomplete xmlTextReaderNextSibling().

Note: The libxml2 patch posted above is outdated and broken, see GNOME/libxml2#13 for the revised version.

RX14 · 2018-02-22T23:27:38Z

I'm fine with this but I think the new methods should have docs and specs. Even if the surrounding methods don't have docs.

If you want to document the whole class though, that would be fantastic :)

felixbuenemann · 2018-02-23T04:17:29Z

Yeah, none of the XML::Reader class is currently tested.

I could copy over the comments from the libxml docs and adjust for the slightly different return values.

RX14 · 2018-02-23T12:41:37Z

If the function is broken on all released versions of libxml2, what's the point in adding it? Adding in a function that we know is going to be buggy for many years to come because of out-of-date distros is just going to be painful. Can we work around this bug in crystal?

felixbuenemann · 2018-02-24T12:44:12Z

We could implement our own version of the next_sibling method. I think I'll have to add some more bindings for that, but it should work.

The current implementation of xmlTextReaderNextSibling() only works on preparsed documents, so we need to detect the error returned if the reader is not using a preparsed document and implement our own next sibling by looking at reader internals.

felixbuenemann · 2018-06-23T15:21:10Z

I'll start working on specs for XML::Reader next…

This avoids segfaults when those methods are called before the first or after the last read.

This fixes a problem where XML::Reader#node_type would return zero before the first or after the last read, which previously had no mapping in the XML::Type enum, so the value couldn't be checked.

felixbuenemann · 2018-06-24T15:04:42Z

@RX14 Please review.

As suggested in your earlier review I've documented all the methods in XML::Reader and added specs.

I've also fixed a few methods that could crash, if called before the first or after the last call to XML::Reader#read, but as a consequence those methods can now return nil.

Would it be better to adjust the code so that they raised XML::Error instead?

That would reduce the number of nil checks required when working with them and with proper usage the nil case should never be encountered.

RX14 · 2018-06-25T14:30:54Z

> I've also fixed a few methods that could crash, if called before the first or after the last call to XML::Reader#read, but as a consequence those methods can now return nil.

I think that should be a separate PR, since it's a breaking change. Or actually - if they currently segfault in this condition, then it's not really a breaking change to make them raise.

instead return an empty string if the methods are called in an invalid reader state (before the first or after the last read). A special case is the #value method, which could also return nil if called on a node without a text value, like `<tag>`, but here an empty string also makes sense.

and implement behavior similar to XML::Node.

felixbuenemann · 2018-06-25T18:47:40Z

> > I've also fixed a few methods that could crash, if called before the first or after the last call to XML::Reader#read, but as a consequence those methods can now return nil.
>
> I think that should be a separate PR, since it's a breaking change. Or actually - if they currently segfault in this condition, then it's not really a breaking change to make them raise.

I changed the getter methods that could return either a String or Nil to always return a String. In the cases where nil was returned previously, I now return an empty string. This avoids breaking backwards compatibility and makes more sense than raising an error.

I also renamed the XML::Reader#attribute method to XML::Reader#[]? and added #[] which raises KeyError if the attribute is not found.
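A short sketch of how the two accessors described above differ, assuming the #[] and #[]? methods as they exist on XML::Reader after this rename. The sample element and attribute names are made up for illustration.

```crystal
require "xml"

reader = XML::Reader.new(%(<root id="1"/>))
reader.read # move onto <root id="1"/>

id = reader["id"]            # String, raises KeyError if the attribute is absent
missing = reader["missing"]? # String?, nil if the attribute is absent

raised = false
begin
  reader["missing"]
rescue KeyError
  raised = true
end

puts id
puts missing.inspect
puts raised
```

This mirrors the usual Crystal convention where the `?` variant returns nil and the plain variant raises.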

RX14 · 2018-06-25T19:01:08Z

Returning an empty string is unacceptable. Let's revert all the changes to behaviour in other methods, and keep this PR focussed on docs and adding new methods. We can discuss those methods in another PR, otherwise this one will just get delayed by deciding on how to change behaviour.

felixbuenemann · 2018-06-25T21:56:07Z

Oh come on, unacceptable?

I had that idea by looking at the API of XML::Node which does exactly that, see:

XML::Node#content (equivalent to XML::Reader#value)
XML::Node#name (equivalent to XML::Reader#name)

So please enlighten me why you think it is unacceptable in this case?

I think it is a good idea since it avoids nil checks in user code just to handle one edge case.

felixbuenemann · 2018-06-26T13:20:32Z

Let's recap why I chose empty strings instead of raising:

XML::Reader#name only returns a null pointer when the reader is not currently on a node, that is, before the first or after the last call to #read. The method returns an empty string only in this edge case; otherwise it returns either the element name or something like "#text" or "#comment". This means the edge case can be detected by checking for an empty string, instead of cluttering the code with nil checks just to handle an edge case that doesn't occur if the API is used properly.
XML::Reader#value returns a null pointer in two cases: when the node doesn't have text content, e.g. <root/>, or before the first or after the last #read. Since we can't differentiate between these two states, it would be a very bad idea to raise here: you might want to parse XML like <root><child/><child>text</child></root>, which means you would have to add extra #empty_element? checks just to avoid running into an exception. Returning an empty string for "empty content" seems perfectly fine for the majority of use cases, and if users really care whether the node has no text content, they can just as well call #empty_element?.
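To make the idea concrete, here is a small sketch of mapping a possibly-NULL C string to an empty String. The helper `string_or_empty` is hypothetical and not part of XML::Reader; the pointers merely stand in for what a libxml2 call might return.

```crystal
# Hypothetical helper, not part of XML::Reader: converts a C string pointer
# that may be NULL into a Crystal String, mapping NULL to "".
def string_or_empty(ptr : Pointer(UInt8)) : String
  ptr.null? ? "" : String.new(ptr)
end

name = "root"
on_node = name.to_unsafe       # stands in for a name pointer returned on a node
off_node = Pointer(UInt8).null # stands in for the NULL returned when not on a node

puts string_or_empty(on_node).inspect
puts string_or_empty(off_node).inspect
```

Without the null check, passing the NULL pointer to String.new is the kind of call that can crash, which is the segfault this part of the thread is about.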

So I urge you to reconsider your point of view of allowing empty strings in these methods.

felixbuenemann · 2018-06-26T13:22:55Z

Oh and if you're worried about this PR getting too far off-topic, I can simply slice it up into multiple PRs once we're happy with the outcome.

straight-shoota · 2018-06-27T08:34:58Z

Please slice it up already. It's too confusing to discuss several topics in one place. Let's have one discussion about adding new stuff and one about changing existing methods.

ysbaddaden

Keeping unchecked NULL pointers in stdlib is what's unacceptable. Returning an empty String is a valid solution; yet, reading your arguments, what about raising in #name for reporting the error (invalid state)?

def name
  name = LibXML.xmlTextReaderConstName(@reader)
  raise Error.new("Can't get name: not currently on a node") unless name
  String.new(name)
end

Having #value return an empty string seems valid, but maybe we could introduce a #value? version to report this difference, and maybe detect and raise on an error (invalid state)?

def value
  value? || ""
end

def value?
  if value = LibXML.xmlTextReaderConstValue(@reader)
    String.new(value)
  elsif !empty_element?
    raise Error.new("Can't get value: not currently on a node")
  end
end

felixbuenemann · 2018-06-27T09:17:52Z

I think special casing #name and #value to raise an error, while all the other existing XML::Reader methods silently swallow errors is not a good idea, since it makes the API inconsistent.

There are a lot of methods that do something like return LibXML.xmlFoo(@reader) == 1 and they all ignore that those methods can return -1 and treat it as false instead of nil or an exception.

I think either all reader methods should raise on error or none.
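The two policies being contrasted can be sketched as follows. Both helpers are hypothetical and only model the 1 / 0 / -1 return convention mentioned above; they are not part of XML::Reader and do not call libxml2.

```crystal
# Hypothetical helpers; the names are illustrative only.

# Current style: anything other than 1 becomes false, so -1 (error)
# is indistinguishable from 0 (no more nodes).
def swallow_error(result : Int32) : Bool
  result == 1
end

# Alternative style: -1 is surfaced as an exception.
def raise_on_error(result : Int32) : Bool
  raise "libxml2 reported an error" if result == -1
  result == 1
end

puts swallow_error(1)
puts swallow_error(-1)
puts raise_on_error(0)
```

Mixing the two styles within one class is the inconsistency the comment above argues against.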

The way you usually use the reader api is like this:

reader = XML::Reader.new(io)
while reader.read
  # do stuff
end

So you will never hit those edge cases unless you use the api in a completely wrong way.

straight-shoota · 2018-06-27T09:23:24Z

> So you will never hit those edge cases unless you use the api in a completely wrong way.

It should raise then.

felixbuenemann · 2018-06-27T09:29:27Z

@straight-shoota Look, I added tests for the entire XML::Reader here and documented the methods, as suggested by @RX14 and +1ed by you at the beginning of this PR. This revealed that two methods could segfault when the reader was in an invalid state due to missing NULL pointer checks, so I fixed them to make the tests pass, because having a PR with failing tests would be a bad idea.

If the entire API of XML::Reader should be changed to raise if any of the LibXML methods return an error, that is a huge breaking change and should be tackled in a separate PR.

ysbaddaden

> I think special casing #name and #value to raise an error, while all the other existing XML::Reader methods silently swallow errors is not a good idea, since it makes the API inconsistent.

Good point. Swallowing the error and returning an empty string is acceptable for the time being. It fixes potential segfaults, is consistent with the current API, and isn't a breaking change (since they used to segfault). Let's have a follow up issue or pull request to review and/or change cases where libxml returns a NULL pointer.

felixbuenemann · 2018-06-27T19:34:21Z

@RX14 I think your review is stale. Can this be merged?

RX14 · 2018-06-29T16:59:37Z

Let's just rip out XML entirely. Put it in a shard. It's not mature enough to be in the stdlib, as these revelations prove.

RX14 · 2018-06-29T17:32:01Z

For now, this can be merged so the fixes are in if/when we split XML into a shard.

felixbuenemann · 2018-07-04T21:51:53Z

@RX14 I'm unopinionated whether this should be in stdlib or not, but if you choose to extract it into a shard feel free to ping me – I might be able to help with maintenance.

RX14 requested changes Feb 22, 2018

felixbuenemann added 2 commits June 23, 2018 15:51

Extend XML::Reader with more LibXML methods

9c53526

This adds a couple of method bindings that come in handy when doing pull parsing or hybrid parsing (search with pull then expand node).

felixbuenemann force-pushed the extend-xml-reader branch from a11bb01 to f08e8a1 Compare June 23, 2018 14:48

felixbuenemann added 3 commits June 24, 2018 16:12

Fix XML::Reader#name/#value when not on node

07df3ff

This avoids segfaults when those methods are called before the first or after the last read.

Add XML::Type::NONE for XML::Reader#node_type

58e0243

This fixes a problem where XML::Reader#node_type would return zero before the first or after the last read, which previously had no mapping in the XML::Type enum, so the value couldn't be checked.

Document all XML::Reader methods

0b14ba2

felixbuenemann force-pushed the extend-xml-reader branch from d092c9f to c336759 Compare June 24, 2018 15:25

Add specs for all XML::Reader methods

ee945e6

felixbuenemann force-pushed the extend-xml-reader branch from c336759 to ee945e6 Compare June 24, 2018 15:29

felixbuenemann added 3 commits June 25, 2018 20:05

Use explicit type for XML:Reader attribute methods

cddf7a9

Rename XML::Reader#attribute to #[]/#[]?

4f6486f

and implement behavior similar to XML::Node.

ysbaddaden reviewed Jun 27, 2018

ysbaddaden approved these changes Jun 27, 2018

sdogruyol approved these changes Jun 28, 2018

RX14 approved these changes Jun 29, 2018

RX14 merged commit 3696bb1 into crystal-lang:master Jun 29, 2018

RX14 added this to the Next milestone Jun 29, 2018

RX14 added the kind:bug, topic:stdlib, and kind:feature labels Jun 29, 2018

felixbuenemann deleted the extend-xml-reader branch June 29, 2018 21:52

RX14 modified the milestones: Next, 0.26.0 Jul 30, 2018
