HTML API: Add custom text decoder.
Provides a custom decoder for strings coming from HTML attributes and
markup. This custom decoder is necessary because of deficiencies in
PHP's `html_entity_decode()` function:

  - It isn't aware of 720 of the possible named character references in
    HTML, leaving many out that should be translated.

  - It isn't aware of the ambiguous ampersand rule, which allows
    conversion of character references in certain contexts when they
    are missing their closing `;`.

  - It doesn't draw a distinction for the ambiguous ampersand rule
    when decoding attribute values instead of markup values.

  - Use of `html_entity_decode()` requires manually passing non-default
    parameter values to ensure it decodes properly.
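
Two of the gaps listed above can be observed with a short script. This is a minimal sketch that calls only PHP's built-in `html_entity_decode()`; the sample strings are illustrative and do not come from the commit.

```php
<?php
// With the HTML 4.01 entity table, `&apos;` is not recognized and is left as-is.
$html401 = html_entity_decode( '&apos;', ENT_QUOTES | ENT_HTML401, 'UTF-8' );

// Passing ENT_HTML5 explicitly selects the larger HTML5 entity table.
$html5 = html_entity_decode( '&apos;', ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// A named reference missing its closing `;` is not decoded by PHP,
// even though HTML parsers decode `&amp` in text content.
$no_semicolon = html_entity_decode( 'Tom &amp Jerry', ENT_QUOTES | ENT_HTML5, 'UTF-8' );

var_dump( $html401, $html5, $no_semicolon );
```

Only the second call decodes its input, which shows why callers have to pass non-default flags, and why references without a trailing `;` need separate handling.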

This decoder also provides some conveniences, such as making a
single-pass and interruptible decode operation possible. This will
provide a number of opportunities to optimize detection and decoding
of things like value prefixes, and whether a value contains a given
substring.

Developed in WordPress/wordpress-develop#6387
Discussed in https://core.trac.wordpress.org/ticket/61072

Props dmsnell, gziolo, jonsurrell, jorbin, westonruter, zieladam.
Fixes #61072.

Built from https://develop.svn.wordpress.org/trunk@58281


git-svn-id: https://core.svn.wordpress.org/trunk@57741 1a063a9b-81f0-0310-95a4-ce76da25c4cd
dmsnell committed Jun 2, 2024
1 parent 7b88768 commit aa5a99b
Showing 5 changed files with 481 additions and 23 deletions.
30 changes: 15 additions & 15 deletions wp-includes/class-wp-token-map.php
@@ -435,8 +435,8 @@ public static function from_precomputed_table( $state ) {
 	 *
 	 * @since 6.6.0
 	 *
-	 * @param string $word Determine if this word is a lookup key in the map.
-	 * @param ?string $case_sensitivity 'ascii-case-insensitive' to ignore ASCII case or default of 'case-sensitive'.
+	 * @param string $word Determine if this word is a lookup key in the map.
+	 * @param string $case_sensitivity Optional. Pass 'ascii-case-insensitive' to ignore ASCII case when matching. Default 'case-sensitive'.
 	 * @return bool Whether there's an entry for the given word in the map.
 	 */
 	public function contains( $word, $case_sensitivity = 'case-sensitive' ) {
@@ -521,10 +521,10 @@ public function contains( $word, $case_sensitivity = 'case-sensitive' ) {
 	 * @since 6.6.0
 	 *
 	 * @param string $text String in which to search for a lookup key.
-	 * @param ?int $offset How many bytes into the string where the lookup key ought to start.
-	 * @param ?int &$matched_token_byte_length Holds byte-length of found token matched, otherwise not set.
-	 * @param ?string $case_sensitivity 'ascii-case-insensitive' to ignore ASCII case or default of 'case-sensitive'.
-	 * @return string|false Mapped value of lookup key if found, otherwise `false`.
+	 * @param int $offset Optional. How many bytes into the string where the lookup key ought to start. Default 0.
+	 * @param ?int &$matched_token_byte_length Optional. Holds byte-length of found token matched, otherwise not set. Default null.
+	 * @param string $case_sensitivity Optional. Pass 'ascii-case-insensitive' to ignore ASCII case when matching. Default 'case-sensitive'.
+	 * @return string|null Mapped value of lookup key if found, otherwise `null`.
 	 */
 	public function read_token( $text, $offset = 0, &$matched_token_byte_length = null, $case_sensitivity = 'case-sensitive' ) {
 		$ignore_case = 'ascii-case-insensitive' === $case_sensitivity;
@@ -539,7 +539,7 @@ public function read_token( $text, $offset = 0, &$matched_token_byte_length = nu
 			// Perhaps a short word then.
 			return strlen( $this->small_words ) > 0
 				? $this->read_small_token( $text, $offset, $matched_token_byte_length, $case_sensitivity )
-				: false;
+				: null;
 		}
 
 		$group = $this->large_words[ $group_at / ( $this->key_length + 1 ) ];
@@ -564,19 +564,19 @@ public function read_token( $text, $offset = 0, &$matched_token_byte_length = nu
 		// Perhaps a short word then.
 		return strlen( $this->small_words ) > 0
 			? $this->read_small_token( $text, $offset, $matched_token_byte_length, $case_sensitivity )
-			: false;
+			: null;
 	}
 
 	/**
 	 * Finds a match for a short word at the index.
 	 *
 	 * @since 6.6.0.
 	 *
-	 * @param string $text String in which to search for a lookup key.
-	 * @param ?int $offset How many bytes into the string where the lookup key ought to start.
-	 * @param ?int &$matched_token_byte_length Holds byte-length of found lookup key if matched, otherwise not set.
-	 * @param ?string $case_sensitivity 'ascii-case-insensitive' to ignore ASCII case or default of 'case-sensitive'.
-	 * @return string|false Mapped value of lookup key if found, otherwise `false`.
+	 * @param string $text String in which to search for a lookup key.
+	 * @param int $offset Optional. How many bytes into the string where the lookup key ought to start. Default 0.
+	 * @param ?int &$matched_token_byte_length Optional. Holds byte-length of found lookup key if matched, otherwise not set. Default null.
+	 * @param string $case_sensitivity Optional. Pass 'ascii-case-insensitive' to ignore ASCII case when matching. Default 'case-sensitive'.
+	 * @return string|null Mapped value of lookup key if found, otherwise `null`.
 	 */
 	private function read_small_token( $text, $offset, &$matched_token_byte_length, $case_sensitivity = 'case-sensitive' ) {
 		$ignore_case = 'ascii-case-insensitive' === $case_sensitivity;
@@ -616,7 +616,7 @@ private function read_small_token( $text, $offset, &$matched_token_byte_length,
 			return $this->small_mappings[ $at / ( $this->key_length + 1 ) ];
 		}
 
-		return false;
+		return null;
 	}
 
 	/**
@@ -692,7 +692,7 @@ public function to_array() {
 	 *
 	 * @since 6.6.0
 	 *
-	 * @param ?string $indent Use this string for indentation, or rely on the default horizontal tab character.
+	 * @param string $indent Optional. Use this string for indentation, or rely on the default horizontal tab character. Default "\t".
 	 * @return string Value which can be pasted into a PHP source file for quick loading of table.
 	 */
 	public function precomputed_php_source_table( $indent = "\t" ) {
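
The change from `false` to `null` as the "no match" value affects how callers can test the result. The sketch below illustrates one consequence, the null-coalescing pattern. It uses a hypothetical stand-in function, `lookup_token()`, backed by a plain array with made-up keys; it is not the real `WP_Token_Map` class and is not code from this commit.

```php
<?php
/**
 * Hypothetical stand-in for a token lookup: returns the mapped value,
 * or null when the key is absent (mirroring the new return contract).
 */
function lookup_token( array $mappings, string $key ): ?string {
	return array_key_exists( $key, $mappings ) ? $mappings[ $key ] : null;
}

// Made-up sample data; 'zero;' maps to the falsy-looking string "0".
$mappings = array(
	'amp;'  => '&',
	'zero;' => '0',
);

// `??` falls back only on null, so a mapped value such as "0" is preserved.
$found   = lookup_token( $mappings, 'zero;' ) ?? 'no match';
$missing = lookup_token( $mappings, 'bogus;' ) ?? 'no match';

var_dump( $found, $missing );
```

With a `false` return, the same fallback would need an explicit `false ===` comparison, because `??` does not treat `false` as missing.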