Every accented characters are corrupted #13

Open
DrLuthor opened this issue Apr 2, 2019 · 0 comments


DrLuthor commented Apr 2, 2019

  • Platform: AWS Lambda

Expected Behavior

When you POST a request with only the url parameter, the response is correctly encoded as UTF-8.
When I also send the html parameter, the response should be correctly encoded as UTF-8 too.

The API should return a title like this: "Le démantèlement des réacteurs nucléaires, véritable filière industrielle"
And content like this:
... <p><strong>Dans les prochaines ann&#xE9;es, avec la transition &#xE9;nerg&#xE9;tique et le d&#xE9;mant&#xE8;lement ...

Current Behavior

Title returned: "Le d�mant�lement des r�acteurs nucl�aires, v�ritable fili�re industrielle"
Content returned:
... <p><strong>Dans les prochaines ann&#xFFFD;es, avec la transition &#xFFFD;nerg&#xFFFD;tique et le d&#xFFFD;mant&#xFFFD;lement ...
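
To make the difference between the expected and current output checkable, here is a small Python sketch that counts U+FFFD replacement characters in a response field, whether they appear literally or as the `&#xFFFD;` entity. The helper name `count_replacement_chars` is hypothetical and not part of the API; the sample strings are shortened from the fragments above.

```python
import html

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, shown as "�" in the title above


def count_replacement_chars(text: str) -> int:
    """Count U+FFFD characters, including ones written as HTML entities."""
    # html.unescape turns "&#xFFFD;" into the literal character
    # and "&#xE9;" into "é", so both forms are compared on equal terms.
    return html.unescape(text).count(REPLACEMENT_CHAR)


expected_content = "ann&#xE9;es, avec la transition &#xE9;nerg&#xE9;tique"
current_content = "ann&#xFFFD;es, avec la transition &#xFFFD;nerg&#xFFFD;tique"
current_title = "Le d\ufffdmant\ufffdlement des r\ufffdacteurs"

print(count_replacement_chars(expected_content))
print(count_replacement_chars(current_content))
print(count_replacement_chars(current_title))
```

A non-zero count in either the title or the content indicates that the original bytes were lost during decoding and cannot be recovered from the response alone.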

Steps to Reproduce

I send a POST request to the parse-html endpoint with this body:
{ "url": "https://www.europeanscientist.com/fr/energie/demantelement-reacteurs-nucleaires-dechets-pngmdr/", "html" : [copy_paste_of_html_code] }
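
For anyone trying to reproduce this, the request can be sketched in Python as follows. The endpoint URL and the sample HTML are placeholders (assumptions, not taken from this report), and the request is only built, not sent. The `ascii_only` flag is an extra diagnostic: it escapes every accented character as `\uXXXX`, so the body contains only ASCII bytes. If the response is still corrupted with such a body, a charset mismatch on the request in transit is unlikely to be the cause.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the URL of your own deployment.
ENDPOINT = "https://example.invalid/parse-html"

PAGE_URL = (
    "https://www.europeanscientist.com/fr/energie/"
    "demantelement-reacteurs-nucleaires-dechets-pngmdr/"
)

# Placeholder standing in for the pasted page source.
SAMPLE_HTML = "<p><strong>Dans les prochaines années</strong></p>"


def build_request(page_url: str, html_source: str,
                  ascii_only: bool = False) -> urllib.request.Request:
    """Build (but do not send) the POST request with url and html fields."""
    payload = {"url": page_url, "html": html_source}
    # With ensure_ascii=True, "é" is written as the escape \u00e9,
    # so the encoded body holds ASCII bytes only.
    body = json.dumps(payload, ensure_ascii=ascii_only).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


req_utf8 = build_request(PAGE_URL, SAMPLE_HTML)
req_ascii = build_request(PAGE_URL, SAMPLE_HTML, ascii_only=True)
```

Both bodies decode to the same JSON document; they differ only in how the accented characters are written on the wire.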

Possible Solution

I tried forcing the request's Content-Type header to UTF-8 with application/json; charset=utf-8, but it doesn't change the result.
While running this request locally, I got an iconv-lite deprecation warning related to encoding:
Iconv-lite warning: decode()-ing strings is deprecated. Refer to https://github.com/ashtuchkin/iconv-lite/wiki/Use-Buffers-when-decoding
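
The warning suggests one plausible cause, which I have not confirmed in this project's code: if an already-decoded string is turned back into bytes with one byte per character (Latin-1, which Node calls "binary") and those bytes are then decoded as UTF-8, every accented character becomes U+FFFD. The Python sketch below only illustrates that byte-level mechanism; it is not the parser's actual code path.

```python
original = "démantèlement"

# Assumed faulty path: write each character as a single byte (Latin-1),
# then decode those bytes as if they were UTF-8.
single_byte = original.encode("latin-1")        # é -> 0xE9, è -> 0xE8
corrupted = single_byte.decode("utf-8", errors="replace")

# Correct path: the bytes are decoded with the encoding they were written in.
intact = original.encode("utf-8").decode("utf-8")

print(corrupted)
print(intact)
```

The lone bytes 0xE9 and 0xE8 are not valid UTF-8 sequences, so the decoder substitutes U+FFFD for each of them, which matches the pattern in the title and content returned above.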

DrLuthor changed the title from "Title and excerpt have corrupted accented characters" to "Every accented characters are corrupted" on Apr 3, 2019