Update post grammar to include markers for inner blocks #11082
That'd work for my needs, but part of me wonders whether it would make more sense to inject some sort of comment into the HTML, to be str_replaced with the generated block markup, instead of an index into a string, which could more easily get mucked up if anything passes it through a filter that tweaks it |
Thank you for addressing this issue. My personal point of view is that the solution is very hacky, and that it closes the door to alternative parsers that can easily produce better outputs, like a real AST (Abstract Syntax Tree). For instance, with the Rust Gutenberg parser, the AST is defined as: pub enum Node<'a> {
Block {
name: (Input<'a>, Input<'a>),
attributes: Option<Input<'a>>,
children: Vec<Node<'a>>
},
Phrase(Input<'a>)
} And thus the following input: <!-- wp:outer-block -->
Check out my
<!-- wp:void-inner-block /-->
and my other
<!-- wp:inner-block -->
with its own content.
<!-- /wp:inner-block -->
<!-- /wp:outer-block --> produces the following output: [
/* Block */ {
"name": "core/outer-block",
"attributes": null,
"children": [
/* Phrase */ {
"phrase": "\nCheck out my\n"
},
/* Block */ {
"name": "core/void-inner-block",
"attributes": null,
"children": []
},
/* Phrase */ {
"phrase": "\nand my other\n"
},
/* Block */ {
"name": "core/inner-block",
"attributes": null,
"children": [
/* Phrase */ {
"phrase": "\nwith its own content.\n"
}
]
},
/* Phrase */ {
"phrase": "\n"
}
]
}
] The AST is clean and it reflects the document structure: A Keeping the new class Block {
constructor(name, attributes, children) {
this.blockName = name;
this.attrs = attributes;
this.innerBlocks = [];
this.innerHTML = '';
for (let child of children) {
if (child instanceof Block) {
this.innerBlocks.push(child);
} else if (child instanceof Phrase) {
this.innerHTML += child.innerHTML;
}
}
}
}
class Phrase {
constructor(phrase) {
this.attrs = {};
this.innerHTML = phrase;
}
} The way What I'm trying to say is: Instead of using the same :-) |
See my comment #11082 (comment).
The failing tests deal with a lot of Looks like all Core usages of iconv -- https://github.com/WordPress/WordPress/search?q=iconv -- have a function_exists check on it as the extension may not be in place. The ID3 library in Core does have a fallback though -- |
Thanks for all the work here. Some topics of review:
|
@mcsf I'm trying to make a change that doesn't break existing compatibility here so I think that altering the parse format is out of the question.
Here I'm going to go back to the motivation for these changes: we broke isomorphism when we decided to create Right now dynamic blocks are still unaware of their nested blocks because the server isn't fully parsing posts.
let's talk about your worries: "potential conversion failure" is a very unlikely case I think because the transform is well-defined and vetted.
sadly no, because PHP and JavaScript have different ideas of how many things long a multi-byte character is.
In PHP:
strlen( '𐀀' ) === 4
mb_strlen( "𐀀" ) === 1
strlen( '💩' ) === 4
mb_strlen( '💩' ) === 1
In JavaScript:
'𐀀'.length === 2
[ ...'𐀀' ].length === 1
'💩'.length === 2
[ ...'💩' ].length === 1
also don't those depend on having the multi-byte PHP extensions loaded just as the
we can move the conversion to JavaScript but I'm not exactly sure how. it's possible that using the string-cloning-by-iterating method will get a match at the cost of performance.
if it's running in WordPress then
what can you say about these changes right now for these goals:
|
35e4377
to
bff5592
Compare
@mcsf we could also switch to something like "bytes into document" which PHP handles quite well with |
for my own reference…
I found a function to compute UTF8 byte length in JS and it appears to be right from a code-audit standpoint and from a quick test. it iterates over the string but doesn't clone it. |
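For readers following along, here is a minimal sketch of what such a function could look like. It is not the function referenced in the comment above; the name `utf8ByteLength` is hypothetical, and the sketch assumes that a lone surrogate counts as three bytes (the size of the U+FFFD replacement character that standard encoders emit). It walks the UTF-16 code units in place and never builds a copy of the string.

```javascript
// Hypothetical helper (not from the PR): counts how many bytes `str`
// would occupy when encoded as UTF-8, without cloning the string.
function utf8ByteLength(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1; // ASCII
    } else if (code < 0x800) {
      bytes += 2; // U+0080..U+07FF
    } else if (code >= 0xd800 && code <= 0xdbff && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair: one code point above U+FFFF
        i++; // skip the low surrogate that was just consumed
      } else {
        bytes += 3; // assumption: lone high surrogate counted as U+FFFD
      }
    } else {
      bytes += 3; // rest of the BMP, including lone surrogates
    }
  }
  return bytes;
}

// These agree with PHP's strlen() values quoted earlier in the thread.
console.log(utf8ByteLength("💩")); // 4
console.log(utf8ByteLength("𐀀")); // 4
```

Because the loop only reads code units, the cost is one pass over the string with no intermediate allocation, which is the trade-off the comment above is pointing at.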
13cdfdd
to
a5f5afb
Compare
Closing in favor of #11334 |
Attempt three at including positional information from the parse to enable isomorphic reconstruction of the source `post_content` after parsing.

See alternate attempts: #11082, #11309
Motivated by: #7247, #8760, Automattic/jetpack#10256
Enables: #10463, #10108

## Abstract

Add new `innerContent` property to each block in parser output indicating where in the `innerHTML` each inner block was found.

## Status

- will update fixtures after design review indicates this is the desired approach
- all parsers passing new tests for fragment behavior

## Summary

Inner blocks, or nested blocks, or blocks-within-blocks, can exist in Gutenberg posts. They are serialized in `post_content` in place as normal blocks which exist in between another block's comment delimiters.

```html
<!-- wp:outerBlock -->
Check out my
<!-- wp:voidInnerBlock /-->
and my other
<!-- wp:innerBlock -->
with its own content.
<!-- /wp:innerBlock -->
<!-- /wp:outerBlock -->
```

The way this gets parsed leaves us in a quandary: we cannot reconstruct the original `post_content` after parsing because we lose the origin location information for each inner block, since they are only passed as an array of inner blocks.

```json
{
  "blockName": "core/outerBlock",
  "attrs": {},
  "innerBlocks": [
    {
      "blockName": "core/voidInnerBlock",
      "attrs": {},
      "innerBlocks": [],
      "innerHTML": ""
    },
    {
      "blockName": "core/innerBlock",
      "attrs": {},
      "innerBlocks": [],
      "innerHTML": "\nwith its own content.\n"
    }
  ],
  "innerHTML": "\nCheck out my\n\nand my other\n\n"
}
```

At this point we have parsed the blocks and prepared them for attaching into the JavaScript block code that interprets them, but we have lost our reverse transformation.

In this PR I'd like to introduce a new mechanism which shouldn't break existing functionality but which will enable us to go back and forth isomorphically between the `post_content` and the first stage of parsing. If we can tear apart a Gutenberg post and reassemble it, then it will let us do structurally-informed processing of the posts without needing to be aware of all the block JavaScript.

The proposed mechanism is a new property as a **list of HTML fragments with `null` values interspersed between those fragments where the blocks were found**.

```json
{
  "blockName": "core/outerBlock",
  "attrs": {},
  "innerBlocks": [
    {
      "blockName": "core/voidInnerBlock",
      "attrs": {},
      "innerBlocks": [],
      "blockMarkers": [],
      "innerHTML": ""
    },
    {
      "blockName": "core/innerBlock",
      "attrs": {},
      "innerBlocks": [],
      "blockMarkers": [],
      "innerHTML": "\nwith its own content.\n"
    }
  ],
  "innerHTML": "\nCheck out my\n\nand my other\n\n",
  "innerContent": [ "\nCheck out my\n", null, "\nand my other\n", null, "\n" ]
}
```

Doing this allows us to replace those `null` values with their associated block (sequentially) from `innerBlocks`.

## Questions

- Why not use a string token instead of an array?
  - See #11309. The fundamental problem with the token is that it could be valid content input from a person, and so there's a probability that we would fail to split the content accurately.
- Why add the `null` instead of leaving basic array splits like `[ 'before', 'after' ]`?
  - By inspection we can see that without an explicit marker we don't know if the block came before or after or between array elements. We could add empty strings `''` and say that blocks exist only _between_ array elements, but the parser code would have to be more complicated to make sure we appropriately add those empty strings. The empty strings are a bit odd anyway.
- Why add a new property?
  - Code already depends on `innerHTML` and `innerBlocks`; I don't want to break any existing behaviors, and adding is less risky than changing.
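The `null`-replacement step described above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not code from the PR: `rebuildInner` and the `serializeBlock` callback are hypothetical names, and the sketch assumes the number of `null` entries in `innerContent` equals the length of `innerBlocks`.

```javascript
// Hypothetical sketch: interleave the HTML fragments in `innerContent`
// with the serialized form of each inner block, in order.
// `serializeBlock` is supplied by the caller and is an assumption here.
function rebuildInner(block, serializeBlock) {
  let next = 0; // index of the next inner block to place
  return block.innerContent
    .map((piece) =>
      piece === null ? serializeBlock(block.innerBlocks[next++]) : piece
    )
    .join("");
}

// Usage with a stand-in serializer that just prints a placeholder.
const outer = {
  blockName: "core/outerBlock",
  innerBlocks: [
    { blockName: "core/voidInnerBlock" },
    { blockName: "core/innerBlock" },
  ],
  innerContent: ["\nCheck out my\n", null, "\nand my other\n", null, "\n"],
};
const rebuilt = rebuildInner(outer, (inner) => `[${inner.blockName}]`);
console.log(JSON.stringify(rebuilt));
```

A real serializer would emit the comment delimiters and recurse into each inner block's own `innerContent`; the placeholder keeps the sketch focused on the positional bookkeeping.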
Motivated by #8760
Necessary for #10463, #10108

## Abstract

Add new property to parsed block object indicating where in the `innerHTML` each `innerBlock` was found.

## Summary

Inner blocks, or nested blocks, or blocks-within-blocks, can exist in Gutenberg posts. They are serialized in `post_content` in place as normal blocks which exist in between another block's comment delimiters.

The way this gets parsed leaves us in a quandary: we cannot reconstruct the original `post_content` after parsing because we lose the origin location information for each inner block, since they are only passed as an array of inner blocks.

At this point we have parsed the blocks and prepared them for attaching into the JavaScript block code that interprets them, but we have lost our reverse transformation.

In this PR I'd like to introduce a new mechanism which shouldn't break existing functionality but which will enable us to go back and forth isomorphically between the `post_content` and the first stage of parsing. If we can tear apart a Gutenberg post and reassemble it, then it will let us do structurally-informed processing of the posts without needing to be aware of all the block JavaScript.

The proposed mechanism is a form of tombstone or `blockMarkers` that specify the index into `innerHTML` where the `innerBlocks` were found.

The block markers represent the ~~UCS-2 index~~ UTF-8 byte-length into the `innerHTML` where the block was found and where it should be replaced if we reserialize. ~~UCS-2 has its own quirks that become important to recognize when dealing with Unicode strings.~~ That is, we get the value by taking the portion of `innerHTML` preceding that inner block and computing how many bytes are required to represent it in UTF-8; that count is our index.

The array of markers is of the same length as the array of `innerBlocks`, and each location index corresponds to the block at the same array index in `innerBlocks`.

Simplified, `aa[1]bb[2]cc` would correspond to `text: "aabbcc", blocks: [1,2], markers: [2, 4 ]`
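
To make the byte-offset markers concrete, here is a minimal sketch of how a consumer might split `innerHTML` at those offsets in JavaScript. It is not code from the PR: `splitAtMarkers` is a hypothetical name, and the sketch assumes the markers are sorted and fall on UTF-8 character boundaries (which holds when they are computed as described above). It relies on the standard `TextEncoder` and `TextDecoder` globals.

```javascript
// Hypothetical sketch: split `innerHTML` into the text pieces that
// surround each inner block, given UTF-8 byte offsets as markers.
// Assumes markers are sorted and land on character boundaries.
function splitAtMarkers(innerHTML, markers) {
  const bytes = new TextEncoder().encode(innerHTML); // UTF-8 bytes
  const decoder = new TextDecoder(); // defaults to UTF-8
  const pieces = [];
  let start = 0;
  for (const marker of markers) {
    pieces.push(decoder.decode(bytes.subarray(start, marker)));
    start = marker;
  }
  pieces.push(decoder.decode(bytes.subarray(start))); // trailing text
  return pieces;
}

// The simplified example: text "aabbcc" with markers [2, 4].
console.log(splitAtMarkers("aabbcc", [2, 4])); // [ 'aa', 'bb', 'cc' ]
```

Working on the encoded bytes rather than on `String.prototype.slice` is what keeps the offsets consistent with PHP's byte-oriented string functions when the text contains multi-byte characters.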