Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend XML::Reader with more LibXML methods #5740

Merged
merged 9 commits into from
Jun 29, 2018

Conversation

felixbuenemann
Copy link
Contributor

This adds a couple of method bindings that come in handy when doing pull parsing or hybrid parsing (search with pull then expand node).

LibXML.xmlTextReaderNext(@reader) == 1
end

def next_sibling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libXML docs say this and next are the same, are they? More docs in general would be great.

Copy link
Contributor Author

@felixbuenemann felixbuenemann Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they are not the same.

See this example:

<root>
  <childa/>
  <childb/><!-- current reader node -->
</root>

If you call reader.next would return true and it will move to </root>, but reader.next_sibling would return false, because there is no other sibling under the root node. The comment that next_sibling only works on documents means that it can't be called before the first call to reader.read which sets the document, while reader.next would move to the root node in that case.

See:

That's the theory at least, the system /usr/lib/libxml2.2.dylib on my mac which should be 2.9.4 according to xml2-config --version always returns -1, looks like I'll have to recompile against the latest version from homebrew to see if it's a bug in that version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that appears to be a bug in libxml2, it works after applying the following patch to master:

diff --git a/xmlreader.c b/xmlreader.c
index 4053269b..c195d875 100644
--- a/xmlreader.c
+++ b/xmlreader.c
@@ -2037,7 +2037,7 @@ int
 xmlTextReaderNextSibling(xmlTextReaderPtr reader) {
     if (reader == NULL)
         return(-1);
-    if (reader->doc == NULL) {
+    if (reader->doc != NULL) {
         /* TODO */
        return(-1);
     }

I created GNOME/libxml2#13 to get this fixed in libxml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed a workaround that makes next_sibling work even with the incomplete xmlTextReaderNextSibling().

Note: The libxml2 patch posted above is outdated and broken, see GNOME/libxml2#13 for the revised version.

@RX14
Copy link
Contributor

RX14 commented Feb 22, 2018

I'm fine with this but I think the new methods should have docs and specs. Even if the surrounding methods don't have docs.

If you want to document the whole class though, that would be fantastic :)

@felixbuenemann
Copy link
Contributor Author

Yeah, none of the XML::Reader class is currently tested.

I could copy over the comments from the libxml docs and adjust for the slightly different return values.

@RX14
Copy link
Contributor

RX14 commented Feb 23, 2018

If the function is broken on all released versions of libxml2, what's the point in adding it? Adding in a function that we know is going to be buggy for many years to come because of out-of-date distros is just going to be painful. Can we work around this bug in crystal?

@felixbuenemann
Copy link
Contributor Author

We could implement our own version of the next_sibling method. I think I'll have to add some more bindings for that, but it should work.

This adds a couple of method bindings that come in handy when doing
pull parsing or hybrid parsing (search with pull then expand node).
The current implementation of xmlTextReaderNextSibling() only works on
preparsed documents, so we need to detect the error returned if the
reader is not using a preparsed document and implement our own
next sibling by looking at reader internals.
@felixbuenemann
Copy link
Contributor Author

I'll start working on specs for XML::Reader next…

This avoids segfaults when those methods are called before the first or
after the last read.
This fixes a problem where XML::Reader#node_type would return zero
before the first or after the last read, which previously had no
mapping in the XML::Type enum, so the value couldn't be checked.
@felixbuenemann
Copy link
Contributor Author

@RX14 Please review.

As suggested in your earlier review I've documented all the methods in XML::Reader and added specs.

I've also fixed a few methods that could crash, if called before the first or after the last call to XML::Reader#read, but as a consequence those methods can now return nil.

Would it be better to adjust the code so that they raised XML::Error instead?

That would reduce the number of nil checks required when working with them and with proper usage the nil case should never be encountered.

@RX14
Copy link
Contributor

RX14 commented Jun 25, 2018

I've also fixed a few methods that could crash, if called before the first or after the last call to XML::Reader#read, but as a consequence those methods can now return nil.

I think that should be a separate PR, since it's a breaking change. Or actually - if they currently segfault in this condition, then it's not really a breaking change to make them raise.

instead return an empty string if the methods are called in an invalid
reader state (before the first or after the last read).

A special case is the #value method, which could also return nil if
called on a node without a text value, like `<tag>`, but here an empty
string also makes sense.
and implement behavior similar to XML::Node.
@felixbuenemann
Copy link
Contributor Author

I've also fixed a few methods that could crash, if called before the first or after the last call to XML::Reader#read, but as a consequence those methods can now return nil.

I think that should be a separate PR, since it's a breaking change. Or actually - if they currently segfault in this condition, then it's not really a breaking change to make them raise.

I changed the getter methods that could return either a String or Nil to always return a String. In the cases where nil was returned previosuly, I now return an empty string. This avoids breaking backwards compatibility and makes more sense than raising an error.

I also renamed the XML::Reader#attribute method to XML::Reader#[]? and added #[] which raises KeyError if the attribute is not found.

@RX14
Copy link
Contributor

RX14 commented Jun 25, 2018

Returning an empty string is unacceptable. Lets revert all the changes to behaviour in other methods, and keep this PR focussed on docs and adding new methods. We can discuss those methods in another PR, otherwise this one will just get delayed by deciding on how to change behaviour.

@felixbuenemann
Copy link
Contributor Author

felixbuenemann commented Jun 25, 2018

Oh come on, unacceptable?

I had that idea by looking at the API of XML::Node which does exactly that, see:

So please enlighten me why you think it is unacceptable in this case?

I think it is a good idea since it avoids nil checks in user code just to handle one edge case.

@felixbuenemann
Copy link
Contributor Author

Let's recap why I chose empty strings instead of raising:

  • XML::Reader#name only returns a null pointer, when it is not currently on a node, so this is before the first or after the last call to #read. This method will only return an empty string in this edge case, in other cases it will either return the element name, or something like "#text" or "#comment". This means the edge case can be detected by checking for en empty string instead of having to clutter the code with nil checks just to handle the edge case that doesn't happen if the api is properly used.
  • XML::Reader#value returns a null pointer in two cases: when the node doesn't have text content, eg <root/> or before the first or after the last #read. Since we can't differentiate between these two states, it would be a very bad idea to raise here, since you might want to parse xml like <root><child/><child>text</child></root> which means you would have to add additional checks to #empty_element? just to ensure you don't run into exception, while returning an empty string for "empty content" seems perfectly fine for the majority of use cases and if the user really cares if the node has no text content he can just as well call #empty_element?.

So I urge you to reconsider your point of view of allowing empty strings in these methods.

@felixbuenemann
Copy link
Contributor Author

Oh and if you're worried about this PR getting too far off-topic, I can simply slice it up into multiple PRs once we're happy with the outcome.

@straight-shoota
Copy link
Member

Please slice it up already. It's too confusing to discuss several topics in one place. Let's have one discussion about adding new stuff and one about changing existing methods.

Copy link
Contributor

@ysbaddaden ysbaddaden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping unchecked NULL pointers in stdlib is what's unacceptable. Returning an empty String is a valid solution; yet, reading your arguments, what about raising in #name for reporting the error (invalid state)?

def name
  name = LibXML.xmlTextReaderConstName(@reader)
  raise Error.new("Can't get name: no currently on a node") unless name
  String.new(name)
end

Having #value return an empty string seems valid, but maybe we could introduce a #value? version to report this difference, and maybe detect and raise on an error (invalid state)?

def value
  value? || ""
end

def value?
  if value = LibXML.xmlTextReaderConstValue(@reader)
    String.new(value)
  elsif !empty_element?
    raise Error.new("Can't get value: no currently on a node")
  end
end

@felixbuenemann
Copy link
Contributor Author

felixbuenemann commented Jun 27, 2018

I think special casing #name and #value to raise an error, while all the other existing XML::Reader methods silenty swallow errors is not a good idea, since it makes the API inconsistent.

There are a lot of methods that do something like return LibXML.xmlFoo(@reader) == 1 and they all ignore that those methods can return -1 and treat it as false instead of nil or an exception.

I think either all reader methods should raise on error or none.

The way you usually use the reader api is like this:

reader = XML::Reader.new(io)
while reader.read
  # do stuff
end

So you will never hit those edge cases unless you use the api in a completely wrong way.

@straight-shoota
Copy link
Member

So you will never hit those edge cases unless you use the api in a completely wrong way.

It should raise then.

@felixbuenemann
Copy link
Contributor Author

@straight-shoota Look, I added tests for the entire LibXML::Reader here and documented the methods, as suggested by @RX14 and +1ed by you at the beginning of this PR. This revealed that two methods could segfault when the reader was in an invalid state due to missing NULL pointer checks, so I fixed them making the tests pass, because having a PR with failing tests would be a bad idea.

If the entires API of XML::Reader should be changed to raise if any of the LibXML methods return an error, that is a huge breaking change and should be tackled in a separate PR.

Copy link
Contributor

@ysbaddaden ysbaddaden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think special casing #name and #value to raise an error, while all the other existing XML::Reader methods silenty swallow errors is not a good idea, since it makes the API inconsistent.

Good point. Swallowing the error and returning an empty string is acceptable for the time being. It fixes potential segfaults, is consistent with the current API, and isn't a breaking change (since they used to segfault). Let's have a follow up issue or pull request to review and/or change cases where libxml returns a NULL pointer.

@felixbuenemann
Copy link
Contributor Author

@RX14 I think your review is stale. Can this be merged?

@RX14
Copy link
Contributor

RX14 commented Jun 29, 2018

Lets just rip out XML entirely. Put it in a shard. It's not mature enough to be in the stdlib, as these revelations prove.

@RX14
Copy link
Contributor

RX14 commented Jun 29, 2018

For now, this can be merged so the fixes are in if/when we split XML into a shard.

@RX14 RX14 merged commit 3696bb1 into crystal-lang:master Jun 29, 2018
@RX14 RX14 added this to the Next milestone Jun 29, 2018
@RX14 RX14 added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib kind:feature labels Jun 29, 2018
@felixbuenemann felixbuenemann deleted the extend-xml-reader branch June 29, 2018 21:52
@felixbuenemann
Copy link
Contributor Author

@RX14 I'm unopinionated wether this should be in stdlib or not, but if you choose to extract it into a shard feel free to ping me – I might be able to help with maintenance.

@RX14 RX14 modified the milestones: Next, 0.26.0 Jul 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. kind:feature topic:stdlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants