Percent html entity does not decoded #25

Open
paveltyk opened this issue Aug 3, 2020 · 6 comments

paveltyk commented Aug 3, 2020

Expected: HtmlEntities.decode("100&percnt;") #=> "100%"
Actual: HtmlEntities.decode("100&percnt;") #=> "100&percnt;"

paveltyk commented Aug 3, 2020

This seems to be a comprehensive list of HTML entities: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

I can build a mapping file, if you are willing to use it in your project.

martinsvalin (Owner) commented

Hi, thank you for pointing this out. The list you referenced is actually what I used to generate this file, which is then used as a source for all the function clauses to cover these named entities.

The wikipedia page has since been updated to include entities defined in HTML 5.0, growing the list from a few hundred to a few thousand entities.

It's a reasonable addition, but I'll think about if this can be done in a nice way so that users who only need to decode old documents from back when entities were more commonplace can have a slimmer, more performant dependency. Functionally it's a backwards compatible change, but there will be some cost in performance and compiled file size. At least I need to check what the impact is on size and performance.

Where did you find a document in the wild with HTML 5.0 entities in it? I'm a little bit surprised as I don't see good reasons to encode characters beyond the ones needed to produce html-safe text these days.

paveltyk commented Aug 5, 2020

We do web scraping a lot, and there are many weird things in the wild :)

Please note there are quite a few entities with multiple codepoints. Also, I've noticed &amp and &amp; are both valid entities, so I had to sort the entities in Util.HtmlCharref.Util.load_entities by their length, longest first. Otherwise "Tom &amp; Jerry" could be decoded to "Tom &; Jerry".

My quick solution to this (excerpt from your codebase):

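defmodule Util.HtmlCharref do
  # Public entry point: decode binaries, pass anything else through unchanged.
  def decode(text) when is_binary(text), do: decode(text, [])
  def decode(text), do: text

  # https://html.spec.whatwg.org/entities.json
  @charref_filename "./lib/util/html_charref/entities.txt"
  codes = Util.HtmlCharref.Util.load_entities(@charref_filename)

  # One clause per entity, generated at compile time (longest names first).
  for {name, codepoints} <- codes do
    defp decode(<<unquote(name), rest::binary>>, acc) do
      # acc is in reverse order, so push the codepoints reversed too
      # (assumes load_entities returns them in document order).
      decode(rest, Enum.reverse(unquote(codepoints), acc))
    end
  end

  # No entity matched: copy one character and move on.
  defp decode(<<head::utf8, rest::binary>>, acc), do: decode(rest, [head | acc])

  # End of input: restore the order and build the string.
  defp decode(<<>>, acc), do: acc |> Enum.reverse() |> List.to_string()
end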

P.S. Thank you for a great lib.

martinsvalin (Owner) commented

Right, I noticed the footnote about which entities allow dropping the semi-colon now that I read the wiki entry more carefully. Let's open a separate issue for this. I'm currently working on creating a mix task to make it easy to generate my source file from a copy of the wikitable, and I started adding support for the [a] footnote, marking the entities in my list that allow no semi-colon.
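
A rough sketch of how such a marker might be used, assuming each row of the source list carries a boolean for "semi-colon optional". The module name, the row format, and the three sample rows are hypothetical and not taken from the library:

```elixir
defmodule SemicolonSketch do
  # Hypothetical rows: {name, codepoint, semi-colon optional?}
  @rows [
    {"amp", ?&, true},
    {"lt", ?<, true},
    {"percnt", ?%, false}
  ]

  # Expand each row into the literal strings a decoder would match on.
  def patterns do
    @rows
    |> Enum.flat_map(fn
      {name, cp, true} -> [{"&#{name};", cp}, {"&#{name}", cp}]
      {name, cp, false} -> [{"&#{name};", cp}]
    end)
    # Longest first, so "&amp;" is tried before "&amp".
    |> Enum.sort_by(fn {pattern, _cp} -> byte_size(pattern) end, :desc)
  end
end
```

Only the flagged rows produce a second pattern without the semi-colon, so "&percnt" on its own would stay undecoded.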

martinsvalin (Owner) commented

As for entities that can decode to multiple codepoints, that should be tackled in this issue, or the html 5 entities won't decode properly. Seems simple enough, we'll turn the codepoint part into a list, and replace the entity with all of them.
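
A minimal sketch of that idea, assuming a hand-written three-entry table in place of the generated source file. The module name and the table are hypothetical and only illustrate the list-of-codepoints shape:

```elixir
defmodule MultiCodepointSketch do
  # Hypothetical sample table; each entity maps to a list of codepoints.
  @entities [
    {"&amp;", [?&]},
    {"&percnt;", [?%]},
    # Defined in HTML 5 as two codepoints: U+2242 followed by U+0338.
    {"&NotEqualTilde;", [0x2242, 0x0338]}
  ]

  def decode(text) when is_binary(text), do: decode(text, [])

  # One function clause per entity, generated at compile time.
  for {name, codepoints} <- @entities do
    defp decode(<<unquote(name), rest::binary>>, acc) do
      # acc is built in reverse, so the codepoints are pushed reversed as well.
      decode(rest, Enum.reverse(unquote(codepoints), acc))
    end
  end

  # No entity matched: copy one character and continue.
  defp decode(<<head::utf8, rest::binary>>, acc), do: decode(rest, [head | acc])

  # End of input: restore the order and build the string.
  defp decode(<<>>, acc), do: acc |> Enum.reverse() |> List.to_string()
end
```

With this table, MultiCodepointSketch.decode("100&percnt;") returns "100%", and "&NotEqualTilde;" is replaced by both of its codepoints in order.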

paveltyk commented Aug 6, 2020

Take a look at this file: https://html.spec.whatwg.org/entities.json It might be worth using it instead of the wiki table.
