Percent HTML entity is not decoded #25
Comments
This seems to be a comprehensive list of HTML entities: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references I can build a mapping file if you are willing to use it in your project.
Hi, thank you for pointing this out. The list you referenced is actually what I used to generate this file, which is then used as a source for all the function clauses that cover these named entities. The Wikipedia page has since been updated to include the entities defined in HTML 5.0, growing the list from a few hundred to a few thousand entities. It's a reasonable addition, but I need to think about whether it can be done in a way that still gives users who only decode old documents, from back when entities were more commonplace, a slimmer and faster dependency. Functionally it's a backwards-compatible change, but it will cost something in performance and compiled file size, so at the very least I need to measure that impact. Where did you find a document in the wild with HTML 5.0 entities in it? I'm a little surprised, as I don't see good reasons these days to encode characters beyond the ones needed to produce HTML-safe text.
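For readers unfamiliar with the "function clauses" approach mentioned above, here is a minimal sketch of generating one clause per entity at compile time. It is not the library's actual code: the module name `ClauseSketch`, the `lookup/1` function, and the inline three-entry table are all assumptions made for illustration, and a real project would read the table from a source file.

```elixir
defmodule ClauseSketch do
  # Hypothetical inline table of {entity name, replacement text} pairs.
  # A real project would load this from a generated source file instead.
  entities = [{"percnt", "%"}, {"amp", "&"}, {"lt", "<"}]

  # Emit one function clause per entity while the module is compiled,
  # so a lookup becomes a pattern match on the entity name.
  for {name, char} <- entities do
    def lookup(unquote(name)), do: {:ok, unquote(char)}
  end

  # Fallback clause for names that are not in the table.
  def lookup(_name), do: :error
end
```

Every entry adds a clause to the compiled module, which is where the concern about compiled file size comes from once the table grows from hundreds of entities to thousands.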
We do web scraping a lot, and there are many weird things in the wild :) Please note there are quite a few entities with multiple codepoints. Also, I've noticed My quick solution to this (excerpt from your codebase):
P.S. Thank you for a great lib.
Right, now that I've read the wiki entry more carefully, I noticed the footnote about which entities allow dropping the semicolon. Let's open a separate issue for this. I'm currently working on creating a mix task to make it easy to generate my source file from a copy of the wikitable, and I started adding support for the
As for entities that decode to multiple codepoints, that should be tackled in this issue, or the HTML 5 entities won't decode properly. It seems simple enough: we'll turn the codepoint part into a list and replace the entity with all of them.
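The list-of-codepoints idea could look roughly like the sketch below. It is an illustration under stated assumptions, not the library's implementation: the module name `EntitySketch`, its `decode/1` function, and the three-entry table are hypothetical, and the table covers only a handful of names.

```elixir
defmodule EntitySketch do
  # Hypothetical table: entity name => list of codepoints.
  # Most entities map to one codepoint; some, such as NotEqualTilde,
  # map to two (a base character plus a combining overlay).
  @entities %{
    "percnt" => [0x25],
    "amp" => [0x26],
    "NotEqualTilde" => [0x2242, 0x0338]
  }

  @doc "Replaces known named entities; unknown ones are left untouched."
  def decode(text) do
    Regex.replace(~r/&([A-Za-z][A-Za-z0-9]*);/, text, fn whole, name ->
      case Map.fetch(@entities, name) do
        # Convert the whole codepoint list into one UTF-8 string.
        {:ok, codepoints} -> List.to_string(codepoints)
        :error -> whole
      end
    end)
  end
end
```

Storing a list for every entity, including the single-codepoint ones, keeps the replacement step uniform. This sketch only handles the semicolon-terminated form.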
Take a look at this file: https://html.spec.whatwg.org/entities.json It might be worth using it instead of the wiki table.
Expected:
HtmlEntities.decode("100&percnt;") #=> "100%"
Actual:
HtmlEntities.decode("100&percnt;") #=> "100&percnt;"