Nokogiri parser removes child element from anchor tag and add them as a separate element. #1876

Closed
rmishra-ror opened this issue Feb 26, 2019 · 1 comment

Comments

@rmishra-ror

The Nokogiri parser removes the child elements of an anchor tag and adds them as separate sibling elements.

To Reproduce

Here's an example:

require 'nokogiri'

parse_data = Nokogiri::HTML.parse("<a> <table> <tr> </tr></table> </a>")

parse_data.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<a> </a><table> <tr> </tr>\n</table> </body></html>\n"

So it removes the table from the anchor tag and adds it as a separate sibling element: <a> </a><table> <tr> </tr>\n</table>
The problem seems to be in the parser (Nokogiri::HTML.parse), which does not keep the table as a child of the anchor tag.

Nokogiri version: "1.10.1"

@flavorjones
Member

Hi! Thanks for asking this question. The short answer is that Nokogiri inherits this behavior from the underlying parser, libxml2, so there's unfortunately very little that Nokogiri can easily do to change it. But read on for a suggestion (hint: nokogumbo).

The slightly longer version: the HTML4 spec for an A anchor element defines only "inline" elements as valid subelements. If you recurse through the inline definition, I think you'll find that only these elements are valid within an A element:

  • "abbr"
  • "acronym"
  • "applet"
  • "b"
  • "basefont"
  • "bdo"
  • "big"
  • "br"
  • "button"
  • "cite"
  • "code"
  • "dfn"
  • "em"
  • "embed"
  • "font"
  • "i"
  • "iframe"
  • "img"
  • "input"
  • "kbd"
  • "label"
  • "map"
  • "object"
  • "q"
  • "s"
  • "samp"
  • "script"
  • "select"
  • "small"
  • "span"
  • "strike"
  • "strong"
  • "sub"
  • "sup"
  • "textarea"
  • "tt"
  • "u"
  • "var"

and this is in fact what libxml2 does.
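To make the rule above concrete, here is a minimal sketch in plain Ruby that models the list as an allowlist and checks a tag name against it. The constant `HTML4_ANCHOR_CHILDREN` and the helper `allowed_in_html4_anchor?` are hypothetical names made up for illustration; they are not part of Nokogiri or libxml2, and this is not how libxml2 is implemented internally.

```ruby
require 'set'

# The inline elements listed above as valid children of <a> under HTML4.
# Hypothetical constant, for illustration only.
HTML4_ANCHOR_CHILDREN = Set.new(%w[
  abbr acronym applet b basefont bdo big br button cite code dfn em embed
  font i iframe img input kbd label map object q s samp script select small
  span strike strong sub sup textarea tt u var
]).freeze

# Hypothetical helper: true if the tag appears in the list above.
def allowed_in_html4_anchor?(tag_name)
  HTML4_ANCHOR_CHILDREN.include?(tag_name.to_s.downcase)
end

puts allowed_in_html4_anchor?('span')
puts allowed_in_html4_anchor?('table')
```

Since `table` is not in the list, an HTML4 parser treats it as invalid inside the `a` element, which is consistent with the anchor being closed before the table in the output from the report.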

Now you may be saying to yourself, "But MDN says that table is a valid subelement!" and this is a very good point. I'll further note that Nokogiri, when run on JRuby (using the nekoHTML parsing library), does allow the table within the a element.

That is because allowing a table inside an a element was introduced in the HTML5 spec, which nekoHTML appears to at least partially support. However, libxml2 does NOT support HTML5, so Nokogiri-on-libxml2 inherits this limitation.

You may want to take a look at Nokogumbo, which aims to bring HTML5 support to Nokogiri.

I hope this explanation helps?
