Percent HTML entity is not decoded #25

paveltyk · 2020-08-03T09:43:31Z

Expected: HtmlEntities.decode("100&percnt;") #=> "100%"
Actual: HtmlEntities.decode("100&percnt;") #=> "100&percnt;"

paveltyk · 2020-08-03T10:12:07Z

This seems to be a comprehensive list of HTML entities: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

I can build a mapping file, if you are willing to use it in your project.

martinsvalin · 2020-08-04T13:53:55Z

Hi, thank you for pointing this out. The list you referenced is actually what I used to generate this file, which is then used as a source for all the function clauses to cover these named entities.

The Wikipedia page has since been updated to include entities defined in HTML 5.0, growing the list from a few hundred to a few thousand entities.

It's a reasonable addition, but I'll think about whether this can be done in a nice way, so that users who only need to decode old documents from back when entities were more commonplace can have a slimmer, more performant dependency. Functionally it's a backwards-compatible change, but there will be some cost in performance and compiled file size. At the very least, I need to check what the impact is on size and performance.

Where did you find a document in the wild with HTML 5.0 entities in it? I'm a little bit surprised as I don't see good reasons to encode characters beyond the ones needed to produce html-safe text these days.

paveltyk · 2020-08-05T07:38:09Z

We do web scraping a lot, and there are many weird things in the wild :)

Please note there are quite a few entities with multiple codepoints. Also, I've noticed that &amp and &amp; are both valid entities, so I had to sort the entities in Util.HtmlCharref.Util.load_entities by length, longest first. Otherwise "Tom &amp; Jerry" could be decoded to "Tom &; Jerry".

My quick solution to this (excerpt from your codebase):

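defmodule Util.HtmlCharref do
  # Public entry point: binaries are decoded, anything else is returned unchanged.
  def decode(text) when is_binary(text), do: decode(text, [])
  def decode(text), do: text

  # https://html.spec.whatwg.org/entities.json
  @charref_filename "./lib/util/html_charref/entities.txt"
  # Loaded at compile time; entities are sorted by length inside load_entities.
  codes = Util.HtmlCharref.Util.load_entities(@charref_filename)

  # Generate one function clause per entity name.
  for {name, codepoints} <- codes do
    defp decode(<<unquote(name), rest::binary>>, acc) do
      # acc is built in reverse and flipped at the end, so this relies on
      # codepoints arriving already reversed for multi-codepoint entities.
      decode(rest, unquote(codepoints) ++ acc)
    end
  end

  # No entity at this position: keep the character as-is.
  defp decode(<<head::utf8, rest::binary>>, acc), do: decode(rest, [head | acc])

  # End of input: restore the original order and build the string.
  defp decode(<<>>, acc), do: acc |> Enum.reverse() |> List.to_string()
end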
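
The ordering concern can be reproduced in isolation. The following is only a minimal sketch, not the library's code: the module name EntityOrderSketch and its three-entry table are made up for illustration, and the table is inlined instead of being loaded from a file. Sorting longest-first makes the "&amp;" clause come before its prefix "&amp".

```elixir
defmodule EntityOrderSketch do
  # Hypothetical, hand-picked subset of entities (an assumption for this sketch).
  # "&amp" is deliberately listed before "&amp;" to show why sorting matters.
  @entities [
    {"&amp", [?&]},
    {"&amp;", [?&]},
    {"&percnt;", [?%]}
  ]

  def decode(text) when is_binary(text), do: decode(text, [])

  # Longest names first, so "&amp;" is tried before its prefix "&amp".
  sorted = Enum.sort_by(@entities, fn {name, _} -> byte_size(name) end, :desc)

  for {name, codepoints} <- sorted do
    defp decode(<<unquote(name), rest::binary>>, acc) do
      # acc is accumulated in reverse, so the codepoints are reversed here too.
      decode(rest, unquote(Enum.reverse(codepoints)) ++ acc)
    end
  end

  defp decode(<<head::utf8, rest::binary>>, acc), do: decode(rest, [head | acc])
  defp decode(<<>>, acc), do: acc |> Enum.reverse() |> List.to_string()
end

IO.puts(EntityOrderSketch.decode("Tom &amp; Jerry"))
IO.puts(EntityOrderSketch.decode("100&percnt;"))
```

Without the sort, the "&amp" clause would be generated first and would consume the first four bytes of "&amp;", leaving the stray ";" behind.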

P.S. Thank you for a great lib.

martinsvalin · 2020-08-06T09:10:02Z

Right, I noticed the footnote about which entities allow dropping the semicolon now that I've read the wiki entry more carefully. Let's open a separate issue for this. I'm currently working on a mix task to make it easy to generate my source file from a copy of the wikitable, and I started adding support for the [a] footnote, marking the entities in my list that may omit the semicolon.

martinsvalin · 2020-08-06T09:27:03Z

As for entities that can decode to multiple codepoints, that should be tackled in this issue, or the HTML 5 entities won't decode properly. It seems simple enough: we'll turn the codepoint part into a list and replace the entity with all of them.
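
A rough sketch of that idea, under stated assumptions: the module name MultiCodepointSketch, the two-entry map, and the regex-based lookup are all illustrative and are not how the library is implemented. Each entity name maps to a list of codepoints, and the whole list is emitted in place of the entity.

```elixir
defmodule MultiCodepointSketch do
  # Hypothetical subset of the entity table: every value is a LIST of codepoints,
  # even when the entity decodes to a single character.
  @entities %{
    "percnt" => [0x25],
    "NotEqualTilde" => [0x2242, 0x0338]
  }

  def decode(text) when is_binary(text) do
    # The callback receives the whole match and the captured entity name.
    Regex.replace(~r/&(\w+);/, text, fn whole, name ->
      case Map.fetch(@entities, name) do
        # Known entity: replace it with all of its codepoints.
        {:ok, codepoints} -> List.to_string(codepoints)
        # Unknown entity: leave the text untouched.
        :error -> whole
      end
    end)
  end
end

IO.inspect(MultiCodepointSketch.decode("100&percnt;"))
IO.inspect(String.to_charlist(MultiCodepointSketch.decode("&NotEqualTilde;")))
```

Storing a list everywhere keeps the replacement step uniform, at the cost of a one-element list for the common case.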

paveltyk · 2020-08-06T13:49:46Z

Take a look at this file: https://html.spec.whatwg.org/entities.json It might be worth using it instead of the wiki table.
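
For reference, entries in that file are keyed by the full entity text (with and without the trailing semicolon where both are allowed) and carry a "codepoints" list. Below is a sketch of turning that shape into a name/codepoints table. It assumes Elixir 1.18 or later, whose standard library includes the JSON module; the module name EntitiesJsonSketch and the inlined three-entry sample are made up for illustration.

```elixir
defmodule EntitiesJsonSketch do
  # A three-entry sample in the shape of entities.json (inlined for this sketch;
  # the real file has a couple of thousand entries).
  @sample ~S"""
  {
    "&amp": {"codepoints": [38], "characters": "\u0026"},
    "&amp;": {"codepoints": [38], "characters": "\u0026"},
    "&NotEqualTilde;": {"codepoints": [8770, 824], "characters": "\u2242\u0338"}
  }
  """

  def sample, do: @sample

  # Returns [{name, codepoints}], longest names first.
  def to_table(json) when is_binary(json) do
    json
    |> JSON.decode!()
    |> Enum.map(fn {name, %{"codepoints" => codepoints}} -> {name, codepoints} end)
    |> Enum.sort_by(fn {name, _} -> byte_size(name) end, :desc)
  end
end

IO.inspect(EntitiesJsonSketch.to_table(EntitiesJsonSketch.sample()))
```

Because the keys already include the semicolon-less variants, this source would also cover the case tracked separately for entities that may omit the semicolon.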

martinsvalin mentioned this issue Aug 6, 2020

Support entities without ending semicolon when this is allowed by the spec #26

Open
