Avoid lossy html entities encoding by setting charset #24645
Sets the charset of the HTML passed into DomDocument to UTF-8. This replaces the mb_convert_encoding call that converted UTF-8 characters to HTML entities before handing off to DomDocument, and so avoids the need to convert back from HTML entities to UTF-8 afterward. That second mb_convert_encoding call was decoding not only the characters we had converted earlier but also any entity-encoded HTML stored inside data-* or other attributes of HTML elements. Fixes Automattic/wp-calypso#44897. Maintains the fix for WordPress#24445 (WordPress#24447).
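To make the lossy round trip concrete, here is a minimal sketch of the two `mb_convert_encoding` calls described above, applied to a made-up piece of block content. The sample markup and variable names are assumptions for illustration, not code from this PR. The `@` only silences the deprecation notice that PHP 8.2+ raises for the `HTML-ENTITIES` encoding.

```php
<?php
// Hypothetical block content: a non-ASCII character in the text, plus
// entity-encoded HTML stored inside a data-* attribute.
$content = '<div data-image-description="&lt;p&gt;caption&lt;/p&gt;">café</div>';

// Old approach, step 1: encode non-ASCII characters as HTML entities
// before handing the markup to DomDocument.
$encoded = @mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' );

// Old approach, step 2: convert entities back to UTF-8 afterwards.
// This decodes every entity it finds, not only the ones step 1 created.
$decoded = @mb_convert_encoding( $encoded, 'UTF-8', 'HTML-ENTITIES' );

// If the attribute's pre-existing entities were decoded too, the round
// trip no longer returns the original markup.
$round_trip_is_lossless = ( $decoded === $content );
var_dump( $round_trip_is_lossless );
echo $decoded, "\n";
```

Comparing `$decoded` with `$content` shows whether the entities that were already in the attribute survived the second call.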
Looking good so far. We should just add some unit test coverage for the cases this fixes 👍 @mcsf @youknowriad Can we get this into 8.8? The fix we got into 8.7.1 broke cases where server-side rendered blocks inject HTML (e.g. Automattic/wp-calypso#44897) 😕
Will do - got sidetracked today. Just added some screenshots that might help explain things a bit better to anyone stopping by.
Added unit tests (based on https://3v4l.org/EaZv9) so we can get this merged in time for the 8.8 release.
Approving. I've touched this, but the unit tests give us a pretty good guarantee that this covers a very broad class of content that was previously causing encoding issues.
Congratulations on your first merged pull request, @lsl! We'd like to credit you for your contribution in the post announcing the next WordPress release, but we can't find a WordPress.org profile associated with your GitHub account. When you have a moment, visit the following URL and click "link your GitHub account" under "GitHub Username" to link your accounts: https://profiles.wordpress.org/me/profile/edit/ And if you don't have a WordPress.org account, you can create one on this page: https://login.wordpress.org/register Kudos!
This change specifies the content type and charset of the html passed into `DomDocument` as `utf-8`. Replaces the `mb_convert_encoding` call which encodes `UTF-8` as `HTML-ENTITIES` before handing off to `DomDocument`. This change avoids the need to later revert the encoding back to `UTF-8` afterwards using `mb_convert_encoding`. This secondary `mb_convert_encoding` call was converting not only the `UTF-8` characters that were converted earlier but also any pre-existing entity encoded html stored inside block content. This issue was originally raised here: Automattic/wp-calypso#44897 as I wasn't sure of the root cause at the time, originally thinking it may be because of the way [Jetpack is injecting](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L434) html into the [`data-image-description` attributes](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L485). There are more situations where this can be a problem such as encoded html entities existing inside block content then being decoded breaking html validation. Co-authored-by: Bernie Reiter <ockham@raz.or.at>
Thanks for this!
Description
This change specifies the content type and charset of the html passed into `DomDocument` as `utf-8`.
Replaces the `mb_convert_encoding` call which encodes `UTF-8` as `HTML-ENTITIES` before handing off to `DomDocument`.
This change avoids the need to later revert the encoding back to `UTF-8` afterwards using `mb_convert_encoding`. This secondary `mb_convert_encoding` call was converting not only the `UTF-8` characters that were converted earlier but also any pre-existing entity encoded html stored inside block content.
This issue was originally raised here: Automattic/wp-calypso#44897 as I wasn't sure of the root cause at the time, originally thinking it may be because of the way Jetpack is injecting html into the `data-image-description` attributes.
There are more situations where this can be a problem such as encoded html entities existing inside block content then being decoded breaking html validation.
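The charset approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: the helper name `example_load_utf8_fragment` and the sample markup are hypothetical, and the `<meta http-equiv>` wrapper is one common way to declare the charset to `DomDocument`.

```php
<?php
// Hypothetical helper (not the function changed in this PR): parse UTF-8
// block content with DOMDocument without an entity-encoding round trip.
function example_load_utf8_fragment( string $content ): DOMDocument {
	// Declaring the content type and charset tells the parser to read
	// the bytes as UTF-8, so no mb_convert_encoding call is needed.
	$wrapped = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
		. $content
		. '</body></html>';

	$dom      = new DOMDocument( '1.0', 'utf-8' );
	$previous = libxml_use_internal_errors( true ); // Collect parser warnings silently.
	$dom->loadHTML( $wrapped );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	return $dom;
}

// Hypothetical block content with a non-ASCII character and
// entity-encoded HTML inside a data-* attribute.
$content = '<div data-image-description="&lt;p&gt;caption&lt;/p&gt;">café</div>';
$dom     = example_load_utf8_fragment( $content );
$div     = $dom->getElementsByTagName( 'div' )->item( 0 );

// DOM accessors return UTF-8 strings; the parser decodes the attribute
// value once, and the serializer is responsible for escaping it again.
$text        = $div->textContent;
$description = $div->getAttribute( 'data-image-description' );

echo $dom->saveHTML( $div ), "\n";
```

Because nothing converts entities back to UTF-8 after serialization, the attribute's stored markup is never decoded a second time.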
How has this been tested?
`npm run test-php`
Unit tests cover this somewhat, will add more soon.
Screenshots
Before
After
Types of changes
Checklist:
(re)Fixes #24445 (Mostly maintaining the same behavior introduced in #24447)