Character encoding strikes back #186

Closed
wingman-jr-addon opened this issue Apr 2, 2023 · 3 comments

Comments

@wingman-jr-addon
Owner

wingman-jr-addon commented Apr 2, 2023

User Drago gave me some great feedback on the ongoing battle to make character encoding detection work flawlessly. See #70 for past history.

On some websites (e.g. https://winfuture.de/news,123262.html), special characters like "ä", "ö", "ü", "ß" and probably others become broken and are shown as �. The developer has already reduced problematic pages like these to a minimum, so it's not a big deal.

Having an actual site to check against helps so much! I can reproduce the issue.

@wingman-jr-addon
Owner Author

I looked into this a bit. The issue seems to be that we receive raw bytes that may NOT be UTF-8 encoded, yet we always write them back out as UTF-8 using TextEncoder. That is certainly not the fully correct way to handle this; however, TextEncoder in Firefox does not support other character sets. See this comment in the code:

// 2) Ensures the output Content-Type is UTF-8 because that is what TextEncoder supports

and https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder. I'm playing around with PR #187 as a possible solution, but it may introduce other problems, as I haven't checked the regression tests yet.
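For anyone following along, here is a minimal sketch of the general idea: decode the raw bytes with the charset the page declares, then re-encode as UTF-8. This is not the add-on's actual code and not necessarily what #187 does. The helper names `charsetFromContentType` and `transcodeToUtf8` are hypothetical, and it assumes the charset is available from the Content-Type header, which real pages do not always provide.

```javascript
// Hypothetical helper (not from the add-on): extract the charset label
// from a Content-Type header value such as "text/html; charset=ISO-8859-1".
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: decode raw bytes using the declared charset
// (assumption: falling back to UTF-8 when none is declared or the label
// is unknown), then re-encode the text as UTF-8.
function transcodeToUtf8(rawBytes, contentType) {
  const label = charsetFromContentType(contentType) || 'utf-8';
  let decoder;
  try {
    decoder = new TextDecoder(label);
  } catch (e) {
    // TextDecoder throws a RangeError for labels it does not recognize.
    decoder = new TextDecoder('utf-8');
  }
  const text = decoder.decode(rawBytes);
  // TextEncoder always produces UTF-8; it takes no encoding argument.
  return new TextEncoder().encode(text);
}

// "äöüß" as single bytes in ISO-8859-1 / windows-1252.
const latin1Bytes = new Uint8Array([0xE4, 0xF6, 0xFC, 0xDF]);
const utf8Bytes = transcodeToUtf8(latin1Bytes, 'text/html; charset=ISO-8859-1');
```

Decoding those same four bytes directly as UTF-8 produces replacement characters (�), which matches the symptom reported above. If output is re-encoded this way, the outgoing Content-Type would also need to declare UTF-8.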

@wingman-jr-addon
Owner Author

All tests passed after some tweaks - keeping an eye on this for regressions

@Dragodraki

I don't want to bother you, but if you are looking for more special-character cases to fix, here is another site that seems to have problems (a German website):

https://uniconverter.wondershare.de/ogg/aac-vs-ogg.html

There you can find the "�" character again instead of "ä", "ö", "ü", "ß"
