-
-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider removing unprintable characters? #65
Comments
In principle this feature makes sense to me; however I'm curious as to why
|
Sure - the problem occurs downstream. In our particular case, we are parsing RSS feeds and then sanitizing the post content and storing it in a database. Later on, we are converting that post content to JSON (no problems yet, the unprintable characters are valid JSON) and then using a client-side Javascript framework (Backbone.js in our case) to render the data. The issues occur when we try to run those unprintable characters through the Javascript parser. More details here: http://timelessrepo.com/json-isnt-a-javascript-subset but the long and short of it: Javascript cannot handle these unprintable characters even though JSON can. "No string in JavaScript can contain a literal U+2028 or a U+2029." |
That's crazy! OK, got it loud and clear. How about this proposed usage: doc = Loofah.fragment(thing)
doc.scrub!(:unprintable) so that the non-printable characters can be removed by chaining If that usage makes sense, then are you able to take a first whack at an implementation? If not, I'll try to get to it. |
Possibly useful reference: here's how Rack::JSONP addressed this issue: |
👍 sounds good to me. I can probably get going on a pull request next week. |
Let me know if you have questions on where to start. Should be as easy as On Wed, Apr 9, 2014 at 12:25 PM, matt swanson notifications@github.comwrote:
|
Add new scrubber to remove unprintable Unicode characters, fixes #65
We've encountered some issues with HTML containing unprintable characters (namely \u2028\u2029) over on the Stringer project (stringer-rss/stringer#295). The issue manifests itself further down the chain once we try to load the HTML with unprintable characters as JSON.
We've resolved our issues by adding a
gsub(/[^[:print:]]/, '')
call after running the contents throughloofah
. I think this makes sense to add this as part ofloofah
's sanitizing process - any thoughts before I take a stab at a PR?The text was updated successfully, but these errors were encountered: