Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Parsing XML from a URL #41

Merged
merged 3 commits into from
Sep 21, 2020
Merged

[FIX] Parsing XML from a URL #41

merged 3 commits into from
Sep 21, 2020

Conversation

quetzyg
Copy link
Contributor

@quetzyg quetzyg commented Sep 20, 2020

Problem

The logic introduced to fix issue #39 does a very basic string check. Because only the exact text/xml and application/xml strings are considered valid, when the Content-Type header contains values such as text/xml; charset=UTF-8 or application/xml; charset=UTF-8, we get an error.

Another problem is that there are other XML MIME types besides those two.

If we do a quick search here, we can find 400+ MIME types related to XML.

Solution

The mime type check logic has been improved by using a regular expression. The only thing that might take longer to understand in the changes is the regex itself, so here's a brief explanation:

(?i)  --> case insensitive matches (no need to convert the string to lowercase)
(
  (application|image|message|model) --> most of the types that make up an XML MIME type
  / --> type/subtype separator
  ((\w|\.|-)+\+?)?  --> this group may exist with characters (a-z, 0-9, _, . and -) (+ may not be there) at the start of a subtype
  |
  text --> remaining type that can be part of an XML MIME type
  / --> type/subtype separator
)
(wb)? --> Some XML MIME types might end in wbxml
xml --> the final part of the subtype

Caveats

The current regular expression isn't strict about what types should go with sub types, which means that non-existent/unofficial MIME types, like image/dash+xml or model/svg+xml are considered valid.

For the sake of simplicity, I think that's a good trade-off. It also means that when new XML MIME types are added in the future, this logic should require low maintenance, if any.

Task items:

  • Improve MIME type check logic in LoadURL();
  • Update existing tests for success and add tests to fail;

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 91.773% when pulling 0f199ae on quetzyg:master into 1871a20 on antchfx:master.

@zhengchun zhengchun merged commit 2fd993d into antchfx:master Sep 21, 2020
@zhengchun
Copy link
Contributor

merged. thanks!

https://github.com/antchfx/xmlquery/releases/tag/v1.3.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants