-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML API: Expose raw tag markup to support existing filters #5143
base: trunk
Are you sure you want to change the base?
HTML API: Expose raw tag markup to support existing filters #5143
Conversation
… existing filters) Many existing filters in Core pass segments of the raw HTML markup for matched tags where code implementing those hooks perform futher analyis on those segments. The HTML API attempts to hide the raw inner markup as much as possible, making it hard to provide full backwards compatability with these existing filters. In this patch a new function is added which extracts the raw markup to maintain those existing behaviors. It requires a disclaimer to operate to try and encourage folks to use the structural and semantic methods provided by the HTML API. Where possible, existing code and filters should be updated so that they no longer depend on the raw HTML.
$disclaimer = <<<INTERNAL_ONLY | ||
I understand that this is only for internal WordPress usage and something | ||
will likely break if used elsewhere. This function comes with no warranty. | ||
INTERNAL_ONLY; | ||
|
||
if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) { | ||
return null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following the meaning of your emoji.
The point I'm trying to make here is that I really want the API to discourage use of this function. It's dangerous because it carries the legacy of interacting with HTML as a string that led to all the issues that brought us here.
In fact the only reason I'm considering this is a purely pragmatic backwards-compatibility issue. Core (and Jetpack and many other popular plugins) expose full or near-full HTML syntax through filters so that extending code can do further processing. For example, a full link HTML is passed into wp_targeted_link_rel
and some plugins examine the URL to decide what additional rel
values to add.
We could reconstruct normative HTML for these filters, but that would involve iterating over the attributes to create strings when we have one already. Maybe it's a worthwhile tradeoff, but I think most of those filters are unused and so if we can pass along a substring of an existing string and not touch it, that will have far less impact.
I'd prefer we find ways to update existing code to work structurally instead stringily. We don't need this function today; it's merely an aid to refactors that need the HTML API. If we can deprecate all those filters and leave time to update, then we can avoid this entire thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the added context and explanation, much appreciated!
The emoji meant something to the effect, "Wowzers, this one blows the one we use for the constructor out of the water. I guess that's the only precedent we have, but it's probably warranted."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I did realize that I can't make a class const private, so the one in the constructor could be easily foiled by calling WP_HTML_Processor::CONSTRUCTOR_UNLOCK_CODE
.
In that case I'm less worried anyway because it's there to alert someone why something broke. In this case it's to try and put pressure against dangerous programming habits that will work if attempted, at least to a point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation, @dmsnell. Could you provide an example snippet used in Jetpack (or other popular plugin), and how that code should look like after this patch? Just to get a better sense of the changes introduced here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bor0 there is no code change required by this PR for other plugins and code authors. the point is to preserve backwards compatibility. however, here is one of many examples of how existing filters rely on the raw markup.
// formatting.php
$rel = apply_filters( 'wp_targeted_link_rel', 'noopener', $link_html );
here, the $link_html
is supposed to represent the full LINK
element's HTML, such as <link rel="stylesheet" href="some/url">
lots of existing code then parses this with a regex or string-find, and breaks down in the same cases that normal HTML parsing fails: boolean attribute values, different quoting than expected, duplicate or differently-ordered attributes, escaping issues, etc…
ideally we would move Core towards a future where instead of passing around these strings we could pass around a kind of read-only or otherwise limited HTML Tag Processor, so that filters could reliably read attribute values, but changing the filter would mean breaking those existing plugins
Sadly needs a rebase now 🙈 Would you mind taking care of that? I'll land it tomorrow morning 😊 |
Thanks @ockham - I should have created this as a draft as it needs much more consideration before merge. |
See WordPress#5143 This patch refactors `wp_targeted_link_rel` to be more reliable and clear by using the HTML API to replace existing PCRE code. It provides backwards compatability for the `rel` filter by creating a quick HTML string representing the opening A tag for the matched element and providing the `href`, `rel`, and `target` attributes on that tag.
See WordPress#5143 This patch refactors `wp_targeted_link_rel` to be more reliable and clear by using the HTML API to replace existing PCRE code. It provides backwards compatability for the `rel` filter by creating a quick HTML string representing the opening A tag for the matched element and providing the `href`, `rel`, and `target` attributes on that tag.
See WordPress#5143 This patch refactors `wp_targeted_link_rel` to be more reliable and clear by using the HTML API to replace existing PCRE code. It provides backwards compatability for the `rel` filter by creating a quick HTML string representing the opening A tag for the matched element and providing the `href`, `rel`, and `target` attributes on that tag.
See WordPress#5143 This patch refactors `wp_targeted_link_rel` to be more reliable and clear by using the HTML API to replace existing PCRE code. It provides backwards compatability for the `rel` filter by creating a quick HTML string representing the opening A tag for the matched element and providing the `href`, `rel`, and `target` attributes on that tag.
will likely break if used elsewhere. This function comes with no warranty. | ||
INTERNAL_ONLY; | ||
|
||
if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it only necessary to check for tag_name_starts_at
? If that property is set, does it follow that tag_ends_at
is set too? If not, we could add the check here, but if it does, adding a clarifying (assertion) comment would improve code readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's correct, and as internal values this isn't exposed. it's an idiom used throughout the Tag Processor. I've thought about different means to check what we're doing, but I don't think adding the same comment to twenty different places within the class would be that useful.
I do agree this is a bit indirect; more or less though I'm waiting until a much clearer and better solution appears.
$disclaimer = <<<INTERNAL_ONLY | ||
I understand that this is only for internal WordPress usage and something | ||
will likely break if used elsewhere. This function comes with no warranty. | ||
INTERNAL_ONLY; | ||
|
||
if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting approach. I haven't seen how we do things like this in WP - is this the first function of such nature?
Another potential idea is to prefix such functions with unsafe
or something, like React does it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gutenberg uses this in some places. we considered unsafe_
prefixes but I don't like those because of how normative they can become and how they can unintentionally lesson the impact of seeing unsafe_
in the code.
this is a method I think I would prefer we never merge, and instead of supporting these old filters we move to new ones entirely. if we do put these in I want to discourage people as much as possible to call them separately, but we need to make the methods public for Core's internal consumption.
Trac ticket: #59291-trac
Current status:
Probably this will never merge and never should merge. It's an exploration of one way to preserve backwards compatability.
Summary
Many existing filters in Core pass segments of the raw HTML markup for matched tags where code implementing those hooks perform futher analyis on those segments. The HTML API attempts to hide the raw inner markup as much as possible, making it hard to provide full backwards compatability with these existing filters.
In this patch a new function is added which extracts the raw markup to maintain those existing behaviors. It requires a disclaimer to operate to try and encourage folks to use the structural and semantic methods provided by the HTML API.
Where possible, existing code and filters should be updated so that they no longer depend on the raw HTML.
To Consider
We could canonicalize the output here using the HTML5 serialization algorithm. This implies writing all attributes as double-quoted strings with proper escaping. Whitespace between attributes and the tag name should be normalized. Characters are properly escaped and quoted. This is already trivial to reconstruct through the existing methods, but incurs a runtime cost.
In this patch I've opted to propose a RAW method to avoid that cost, particularly since this is designed to be used in places in Core that are running frequently and on all data (kses, formatting, etc…). Using the raw contents risks leaving "broken" markup broken, leaving existing breakages in place, but should not break existing code relying on those breakages.
If, on the other hand, we normalize the HTML, we could potentially close existing bugs that fail because of the typical kind of parsing failures, regexps looking for double-quoted attributes, etc…
We can introduce another function to generate the normalized output, but I am hesitant to make things look too easy and accidentally encourage people to work around the semantic methods in the HTML API in favor of "easier" regexp parsing.