Fix a bug in xml stream parsing where a previously unmatched node causing all subsequent valid matches fail. #40

jf-tech · 2020-09-13T11:34:35Z

Recall that for streaming mode, we have two xpaths: one for matching the element, the other (optionally) for adding additional filtering. Imagine the following example, where the xml doc is:

	<ROOT>
		<AAA>
			<CCC>c1</CCC>
			<BBB>b1</BBB>
			<DDD>d1</DDD>
			<BBB>b2<ZZZ z="1">z1</ZZZ></BBB>
			<BBB>b3</BBB>
		</AAA>
		<ZZZ>
			<BBB>b4</BBB>
			<BBB>b5</BBB>
			<CCC>c3</CCC>
		</ZZZ>
	</ROOT>

The stream parser is created as:

	CreateStreamParser(strings.NewReader(s), "/ROOT/*/BBB", "/ROOT/*/BBB[. != 'b3']")

Basically we want the stream parser to return all the BBB nodes whose text aren't b3. By looking at the sample XML, we know it should return: the <BBB> nodes whose texts are b1, b2, b4, and b5.

However, the current code only returns b1 and b2.

The problem lies in the stream element matching inside case xml.StartElement.

Currently the code does this:

case xml.StartElement:
	...
	...
	if p.streamElementXPath != nil {
		if p.streamNode == nil {
			if QuerySelector(p.doc, p.streamElementXPath) != nil {
				// Now we assume this node is the stream node candidate.
			}

We originally under the assumption that if the streamElementXPath query returns anything, it must be this node itself; thus if it returns, this node is the stream node candidate.

But it's clearly wrong in this b3 example above. For the node <BBB>b3</BBB> it is first considered as the stream candidate, but later filtering ([. != 'b3']) removes its stream node status, and treats it just like any other non-stream nodes, and keeps it in the node tree. But the problem is, by keeping it in the tree, any future XML element start will always "matches" streamElementXPath. So in the example above, the node <ZZZ> is now erroneously considered stream node, and any child nodes are not even tested for streaming anymore.

There are two fixes:

In xml.StartElement stream match, instead of just doing QuerySelector(...) != nil check, we need to issue a QuerySelectorAll(...) call and examine all the returned nodes, if the current node is one of them, then this current node is considered stream candidate.
Simpler: if a stream candidate is later filtered out inside case xml.EndElement handling, then simply remove it from the node tree, thus preventing future erroneous matches.

Fix 1) seems an overkill: if a stream candidate gets filtered out, it's hard to imagine caller would like to interact with it in any capacity. Also imagine if any XML doc has lots of <BBB>b3</BBB> nodes, the memory growth would be really bad. All things considered, chose fix 2).

…sing all subsequent valid matches fail. Recall that for streaming mode, we have two xpaths: one for matching the element, the other (optionally) for adding additional filtering. Imagine the following example, where the xml doc is: ``` <ROOT> <AAA> <CCC>c1</CCC> <BBB>b1</BBB> <DDD>d1</DDD> <BBB>b2<ZZZ z="1">z1</ZZZ></BBB> <BBB>b3</BBB> </AAA> <ZZZ> <BBB>b4</BBB> <BBB>b5</BBB> <CCC>c3</CCC> </ZZZ> </ROOT> ``` The stream parser is created as: ``` CreateStreamParser(strings.NewReader(s), "/ROOT/*/BBB", "/ROOT/*/BBB[. != 'b3']") ``` Basically we want the stream parser to return all the `BBB` nodes whose text aren't `b3`. By looking at the sample XML, we know it should return: the `<BBB>` nodes whose texts are `b1`, `b2`, `b4`, and `b5`. However, the current code only returns `b1` and `b2`. The problem lies in the stream element matching inside `case xml.StartElement`. Currently the code does this: ``` case xml.StartElement: ... ... if p.streamElementXPath != nil { if p.streamNode == nil { if QuerySelector(p.doc, p.streamElementXPath) != nil { // Now we assume this node is the stream node candidate. } ``` We originally under the assumption that if the `streamElementXPath` query returns anything, it must be this node itself; thus if it returns, this node is the stream node candidate. But it's clearly wrong in this `b3` example above. For the node `<BBB>b3</BBB>` it is first considered as the stream candidate, but later filtering (`[. != 'b3']`) removes its stream node status, and treats it just like any other non-stream nodes, and keeps it in the node tree. But the problem is, by keeping it in the tree, any future XML element start will **always** "matches" `streamElementXPath`. So in the example above, the node `<ZZZ>` is now erroneously considered stream node, and any child nodes are not even tested for streaming anymore. There are two fixes: 1) In `xml.StartElement` stream match, instead of just doing `QuerySelector(...) != nil` check, we need to issue a `QuerySelectorAll(...)` call and examine all the returned nodes, if the current node is one of them, then this current node is considered stream candidate. 2) Simpler: if a stream candidate is later filtered out inside `case xml.EndElement` handling, then simply remove it from the node tree, thus preventing future erroneous matches. 1) seems an overkill: if a stream candidate gets filtered out, it's hard to imagine caller would like to interact with it in any capacity. Thus chose fix 2).

coveralls · 2020-09-13T11:35:54Z

Coverage increased (+0.03%) to 91.59% when pulling e26cec5 on jf-tech:jf-tech/streamparser-fix into 5648b2f on antchfx:master.

jf-tech · 2020-09-14T07:51:36Z

@zhengchun just make sure it's on your radar with this bug fix.

zhengchun · 2020-09-14T10:19:26Z

I released a new version on https://github.com/antchfx/xmlquery/releases/tag/v1.3.1

zhengchun merged commit 1871a20 into antchfx:master Sep 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix a bug in xml stream parsing where a previously unmatched node causing all subsequent valid matches fail. #40

Fix a bug in xml stream parsing where a previously unmatched node causing all subsequent valid matches fail. #40

jf-tech commented Sep 13, 2020

coveralls commented Sep 13, 2020

jf-tech commented Sep 14, 2020

zhengchun commented Sep 14, 2020

Fix a bug in xml stream parsing where a previously unmatched node causing all subsequent valid matches fail. #40

Fix a bug in xml stream parsing where a previously unmatched node causing all subsequent valid matches fail. #40

Conversation

jf-tech commented Sep 13, 2020

coveralls commented Sep 13, 2020

jf-tech commented Sep 14, 2020

zhengchun commented Sep 14, 2020