Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug in xml stream parsing where a previously unmatched node causing all subsequent valid matches fail. #40

Merged
merged 1 commit into from
Sep 14, 2020

Conversation

jf-tech
Copy link

@jf-tech jf-tech commented Sep 13, 2020

Recall that for streaming mode, we have two xpaths: one for matching the element, the other (optionally) for adding additional filtering. Imagine the following example, where the xml doc is:

	<ROOT>
		<AAA>
			<CCC>c1</CCC>
			<BBB>b1</BBB>
			<DDD>d1</DDD>
			<BBB>b2<ZZZ z="1">z1</ZZZ></BBB>
			<BBB>b3</BBB>
		</AAA>
		<ZZZ>
			<BBB>b4</BBB>
			<BBB>b5</BBB>
			<CCC>c3</CCC>
		</ZZZ>
	</ROOT>

The stream parser is created as:

	CreateStreamParser(strings.NewReader(s), "/ROOT/*/BBB", "/ROOT/*/BBB[. != 'b3']")

Basically we want the stream parser to return all the BBB nodes whose text aren't b3. By looking at the sample XML, we know it should return: the <BBB> nodes whose texts are b1, b2, b4, and b5.

However, the current code only returns b1 and b2.

The problem lies in the stream element matching inside case xml.StartElement.

Currently the code does this:

case xml.StartElement:
	...
	...
	if p.streamElementXPath != nil {
		if p.streamNode == nil {
			if QuerySelector(p.doc, p.streamElementXPath) != nil {
				// Now we assume this node is the stream node candidate.
			}

We originally under the assumption that if the streamElementXPath query returns anything, it must be this node itself; thus if it returns, this node is the stream node candidate.

But it's clearly wrong in this b3 example above. For the node <BBB>b3</BBB> it is first considered as the stream candidate, but later filtering ([. != 'b3']) removes its stream node status, and treats it just like any other non-stream nodes, and keeps it in the node tree. But the problem is, by keeping it in the tree, any future XML element start will always "matches" streamElementXPath. So in the example above, the node <ZZZ> is now erroneously considered stream node, and any child nodes are not even tested for streaming anymore.

There are two fixes:

  1. In xml.StartElement stream match, instead of just doing QuerySelector(...) != nil check, we need to issue a QuerySelectorAll(...) call and examine all the returned nodes, if the current node is one of them, then this current node is considered stream candidate.

  2. Simpler: if a stream candidate is later filtered out inside case xml.EndElement handling, then simply remove it from the node tree, thus preventing future erroneous matches.

Fix 1) seems an overkill: if a stream candidate gets filtered out, it's hard to imagine caller would like to interact with it in any capacity. Also imagine if any XML doc has lots of <BBB>b3</BBB> nodes, the memory growth would be really bad. All things considered, chose fix 2).

…sing all subsequent valid matches fail.

Recall that for streaming mode, we have two xpaths: one for matching the element, the other (optionally) for
adding additional filtering. Imagine the following example, where the xml doc is:

```
	<ROOT>
		<AAA>
			<CCC>c1</CCC>
			<BBB>b1</BBB>
			<DDD>d1</DDD>
			<BBB>b2<ZZZ z="1">z1</ZZZ></BBB>
			<BBB>b3</BBB>
		</AAA>
		<ZZZ>
			<BBB>b4</BBB>
			<BBB>b5</BBB>
			<CCC>c3</CCC>
		</ZZZ>
	</ROOT>
```

The stream parser is created as:

```
	CreateStreamParser(strings.NewReader(s), "/ROOT/*/BBB", "/ROOT/*/BBB[. != 'b3']")
```

Basically we want the stream parser to return all the `BBB` nodes whose text aren't `b3`. By looking at the sample
XML, we know it should return: the `<BBB>` nodes whose texts are `b1`, `b2`, `b4`, and `b5`.

However, the current code only returns `b1` and `b2`.

The problem lies in the stream element matching inside `case xml.StartElement`.

Currently the code does this:

```
case xml.StartElement:
	...
	...
	if p.streamElementXPath != nil {
		if p.streamNode == nil {
			if QuerySelector(p.doc, p.streamElementXPath) != nil {
				// Now we assume this node is the stream node candidate.
			}
```
We originally under the assumption that if the `streamElementXPath` query returns anything, it must be this node itself;
thus if it returns, this node is the stream node candidate.

But it's clearly wrong in this `b3` example above. For the node `<BBB>b3</BBB>` it is first considered as the stream
candidate, but later filtering (`[. != 'b3']`) removes its stream node status, and treats it just like any other non-stream
nodes, and keeps it in the node tree. But the problem is, by keeping it in the tree, any future XML element start will
**always** "matches" `streamElementXPath`. So in the example above, the node `<ZZZ>` is now erroneously considered stream
node, and any child nodes are not even tested for streaming anymore.

There are two fixes:

1) In `xml.StartElement` stream match, instead of just doing `QuerySelector(...) != nil` check, we need to issue a
`QuerySelectorAll(...)` call and examine all the returned nodes, if the current node is one of them, then this current node
is considered stream candidate.

2) Simpler: if a stream candidate is later filtered out inside `case xml.EndElement` handling, then simply remove
it from the node tree, thus preventing future erroneous matches.

1) seems an overkill: if a stream candidate gets filtered out, it's hard to imagine caller would like to interact with it
in any capacity. Thus chose fix 2).
@coveralls
Copy link

Coverage Status

Coverage increased (+0.03%) to 91.59% when pulling e26cec5 on jf-tech:jf-tech/streamparser-fix into 5648b2f on antchfx:master.

@jf-tech
Copy link
Author

jf-tech commented Sep 14, 2020

@zhengchun just make sure it's on your radar with this bug fix.

@zhengchun zhengchun merged commit 1871a20 into antchfx:master Sep 14, 2020
@zhengchun
Copy link
Contributor

I released a new version on https://github.com/antchfx/xmlquery/releases/tag/v1.3.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants