Fix a bug in XML stream parsing where a previously unmatched node causes all subsequent valid matches to fail. #40
Recall that in streaming mode we have two XPaths: one for matching the element, and an optional one for additional filtering. Consider the following example, where the XML doc is:
The stream parser is created as:
Basically, we want the stream parser to return all the `BBB` nodes whose text isn't `b3`. Looking at the sample XML, we know it should return the `<BBB>` nodes whose texts are `b1`, `b2`, `b4`, and `b5`.

However, the current code only returns `b1` and `b2`.

The problem lies in the stream element matching inside `case xml.StartElement`. Currently the code does this:
We originally worked under the assumption that if the `streamElementXPath` query returns anything, it must be this node itself; thus, if it returns a result, this node is the stream node candidate.

But that is clearly wrong in the `b3` example above. The node `<BBB>b3</BBB>` is first considered a stream candidate, but the later filtering (`[. != 'b3']`) removes its stream node status, treats it like any other non-stream node, and keeps it in the node tree. The problem is that, by keeping it in the tree, any future XML element start will always "match" `streamElementXPath`. So in the example above, the node `<ZZZ>` is now erroneously considered a stream node, and its child nodes are not even tested for streaming anymore.

There are two fixes:
1. In the `xml.StartElement` stream match, instead of just doing a `QuerySelector(...) != nil` check, we issue a `QuerySelectorAll(...)` call and examine all the returned nodes; if the current node is one of them, it is considered a stream candidate.
2. Simpler: if a stream candidate is later filtered out inside the `case xml.EndElement` handling, simply remove it from the node tree, thus preventing future erroneous matches.

Fix 1 seems like overkill: if a stream candidate gets filtered out, it's hard to imagine the caller would want to interact with it in any capacity. Also, if an XML doc has lots of `<BBB>b3</BBB>` nodes, the memory growth would be really bad. All things considered, we chose fix 2.