Rewrite ROW to be Unicode capable #13626
Conversation
Force-pushed from eb4a04e to 2f01956.
src/buffer/out/Row.cpp (Outdated)
void ROW::_init() noexcept
{
    std::fill_n(_chars.begin(), _columnCount, UNICODE_SPACE);
    iota_n(_charOffsets.begin(), _columnCount + 1u, uint16_t{ 0 });
Are we worried about the expense of having, say, 9001 copies of an array of all numbers between 0 and 120?
The space requirement and initialization cost is about the same as before, when we used 16-bit sized `DbcsAttribute` instances for each `wchar_t` in a `ROW`. As such I'm not concerned about any regressions.
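For intuition, a back-of-the-envelope check of that claim; the 16-bit sizes are taken from the `uint16_t` offsets in `_init()` above and the "16-bit sized `DbcsAttribute`" mentioned here, everything else is made up for the example:

```cpp
#include <cstddef>
#include <cstdint>

int main()
{
    constexpr std::size_t columns = 120;
    constexpr std::size_t oldPerRow = columns * sizeof(std::uint16_t);       // one 16-bit DbcsAttribute per column
    constexpr std::size_t newPerRow = (columns + 1) * sizeof(std::uint16_t); // one char offset per column, plus a sentinel
    static_assert(newPerRow - oldPerRow == sizeof(std::uint16_t), "only the sentinel entry is extra");
    return 0;
}
```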
I would've liked to do lazy initialization for `ROW`s, but that would require us to remove the `const` variant of `GetRowByOffset`, which I wasn't willing to do in this already massive PR.
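For context, a minimal sketch (made-up names, not the actual `TextBuffer` API) of why lazy initialization clashes with a `const` accessor:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct Row
{
    std::vector<wchar_t> chars;
};

struct Buffer
{
    std::vector<std::optional<Row>> rows;

    Row& GetRowByOffset(std::size_t i)
    {
        if (!rows[i])
        {
            rows[i].emplace(); // lazy initialization mutates the buffer...
        }
        return *rows[i];
    }

    // ...which is why a const overload like this one would have to go
    // (or resort to mutable/const_cast tricks):
    // const Row& GetRowByOffset(std::size_t i) const;
};

int main()
{
    Buffer buffer;
    buffer.rows.resize(3);
    buffer.GetRowByOffset(1).chars.push_back(L'a');
    return 0;
}
```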
I love how removing UnicodeStorage obviates the need for rows to have IDs, which removes the entire machinery built around making sure row IDs remain sane.
@@ -343,68 +343,6 @@ class SelectionTests
        VERIFY_ARE_EQUAL(srOriginal.Left + sDeltaLeft, srSelection.Left);
        VERIFY_ARE_EQUAL(srOriginal.Right + sDeltaRight, srSelection.Right);
    }

    TEST_METHOD(TestBisectSelection)
Hmm. For some tests, it may be good practice to leave a comment on the PR explaining why it's not necessary any more. :)
"Bisect" in this case stands for "trailing wide character in the first / leading wide character in the last cell of a row", which we don't support anymore, even with CHAR_INFO
s. So this test was removed.
> which we don't support anymore
As in, this PR breaks that functionality? Or we officially stopped supporting it a while back?
Why is it ok that we're dropping support for that? I seem to be missing some context here haha.
> As in, this PR breaks that functionality? Or we officially stopped supporting it a while back?
> Why is it ok that we're dropping support for that? I seem to be missing some context here haha.
Both. Kinda. This PR breaks it harder than we already broke it in the past.
I'm not sure when, but at some point we stopped supporting storage of "half a wide glyph" in the first/last column of the console ("bisecting rows"). Instead it was supposed to either store a whole wide glyph or none at all. It shouldn't have been possible to bring conhost into a state where it stores such bisecting glyphs via any of the public console APIs for a long while now.
But we still kind of supported it internally; for instance, by rapidly resizing the console you could bring it into a state where the reflow algorithm puts glyphs into the wrong columns. This PR puts the final nail in the coffin and ensures that no one can mess with our TextBuffer anymore, not even ourselves. And so this test can't possibly work anymore, just like you already couldn't do it via any of the public APIs. To be entirely honest, I'm not even 100% sure why `TestBisectSelection` still existed in the first place, and our git history didn't bring up any clues either.
BTW a bug caused by bisecting glyphs: #14275
I'm sure there's tons more of that lingering around in our code base.
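To make the terminology concrete, here's a tiny illustrative model (hypothetical types, not the real `DbcsAttribute`) of a "bisected" row versus the padded state the rewritten buffer enforces:

```cpp
#include <cassert>
#include <vector>

// A wide glyph occupies a Leading and a Trailing cell. A bisected row starts
// with a lone Trailing cell or ends with a lone Leading cell; the new ROW
// makes that state unrepresentable and pads with a space instead.
enum class Half { Single, Leading, Trailing };

bool isBisected(const std::vector<Half>& row)
{
    return !row.empty() && (row.front() == Half::Trailing || row.back() == Half::Leading);
}

int main()
{
    const std::vector<Half> padded{ Half::Leading, Half::Trailing, Half::Single };  // glyph fits, last cell is a space
    const std::vector<Half> bisected{ Half::Single, Half::Single, Half::Leading };  // half a wide glyph in the last cell
    assert(!isBisected(padded));
    assert(isBisected(bisected));
    return 0;
}
```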
I haven't found it yet, but how do you handle the case where a double-cell character is inserted on a line where only a single-cell character could fit on the right side? We used to keep track of whether we had "forced" a wide glyph onto the next line, and we used that to reconstitute something for the API, I believe...
That's still all there in
void TextBufferTests::TestBufferRowByOffset()
{
    auto& textBuffer = GetTbi();
    auto csBufferHeight = GetBufferHeight();

    VERIFY_IS_TRUE(csBufferHeight > 20);

    auto sId = csBufferHeight / 2 - 5;

    const auto& row = textBuffer.GetRowByOffset(sId);
    VERIFY_ARE_EQUAL(row.GetId(), sId);
}
This test isn't helpful anymore, since `ROW`s no longer have IDs.
void swap(ROW& lhs, ROW& rhs) noexcept
{
    std::swap(lhs._charsBuffer, rhs._charsBuffer);
    std::swap(lhs._charsHeap, rhs._charsHeap);
    std::swap(lhs._chars, rhs._chars);
    std::swap(lhs._charOffsets, rhs._charOffsets);
    std::swap(lhs._attr, rhs._attr);
    std::swap(lhs._columnCount, rhs._columnCount);
    std::swap(lhs._lineRendition, rhs._lineRendition);
    std::swap(lhs._wrapForced, rhs._wrapForced);
    std::swap(lhs._doubleBytePadded, rhs._doubleBytePadded);
}
I'm not sure whether I should keep this `std::swap` specialization... It helps the `std::rotate` in `TextBuffer`, which relies on `std::swap` internally.
Eh, I vote keep it.
Any way to guarantee that it swaps all contents?
No, unfortunately not. The alternative is to implement an `operator=(ROW&&)`, which of course would suffer from the same safety issue...
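A sketch of that alternative on a stand-in type (not the real `ROW`), just to show the shape; it has the same caveat that nothing forces it to cover every member:

```cpp
#include <utility>
#include <vector>

struct RowLike
{
    std::vector<wchar_t> chars;

    friend void swap(RowLike& lhs, RowLike& rhs) noexcept
    {
        std::swap(lhs.chars, rhs.chars); // must be kept in sync with the members by hand
    }

    RowLike& operator=(RowLike&& other) noexcept
    {
        swap(*this, other); // reuse the member-by-member swap
        return *this;
    }
};

int main()
{
    RowLike a, b;
    a.chars = { L'x' };
    b = std::move(a); // routes through swap()
    return b.chars.size() == 1 ? 0 : 1;
}
```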
Force-pushed from 04dd61f to 1761d60.
This fixes an issue where overwriting parts of a row would only trigger that specific portion of the row to be redrawn. This isn't just problematic for combining characters, but also for things like the new `TestDbcsBisectWriteCells` test introduced in #13626. Benchmarks showed no impact on performance whatsoever.

## Validation Steps Performed
* Pick this commit into #13626
* Run the `TestDbcsBisectWriteCells` test and break before OpenConsole exits
* A correct "QいかなZYXWVUTに" output is visible ✅
I think it's worth mentioning that I was using the row IDs to track image segments in my sixel implementation. Not suggesting that's a good reason to keep them, since I'm not likely to do anything with that branch any time soon, but I do think there is value in having some mechanism via which additional metadata can be attached to a row. Another potential use case would be the scrollbar marks. I think we're currently tracking them with just a row number, but long term I'm hoping we'll change that, because I suspect the current approach will break if we ever get margins working in the terminal. That's perhaps a bit of an edge case, so not critical, but still.
Why not store or reference that data in
In the case of sixel, 99% of the time you're not likely to have any data attached to the row - most users will probably never use sixel - so I didn't want to introduce an additional field that would just be wasting space most of the time. And since the ID already existed, that provided an easy way to add sixel functionality without impacting anyone that wasn't using it. Having some kind of pointer on the row itself would certainly be easier, but it does mean more overhead. And what happens when we have some other bit of metadata that's also rarely used, and also needs to be associated with a row? Do we add yet another field? Maybe some kind of linked list of extension points? I don't know. I just wanted to point out that this use case exists, and row IDs were quite a compact way of addressing that need.
Not really. Sixel isn't so much an image format - it's more of a drawing protocol. So you need to think of it in terms of plotting arbitrary pixels at a point on the screen. The way I managed that was by creating a bitmap associated with each row that was touched by sixel output. Those bitmaps obviously need to be mutable so they can be updated when additional pixels are output. But any kind of reference in the row would be fine for this - I was just looking for a way to minimize the overhead involved, and potentially share it with other row extensions. Anyway, this isn't anything that has to be resolved now. I just wanted to raise the issue as something to consider for the future.
Thanks for the feedback! I don't disagree with you and I'd be absolutely open to adding back IDs, even if for no reason at all. 🙂 But unrelated to this PR and any designs you have in mind (which I'd approve regardless), I'd like to mention an opposing view to the overhead of unused fields, just because...
Any updates on this?
Just had a couple of questions!
pRow->ReplaceCharacters(x, 2, t.string);
x += 2;
Why don't we need to loop through the characters in the string (like we do below when the string isn't wide)? Is it guaranteed that if `wide == true`, then the string is just one glyph that's 2 cells wide? I know that the strings we pass in conform to this, but within this function, is that always true?
I wrote a different response, but you know what... I just realized I should just remove the `for` loop instead and place every character individually. After all, there's technically no guarantee that narrow glyphs are always only 1 character long. There's also only a single caller of `applyTestString`, so I'm probably just going to inline the `ReplaceCharacters` calls to make this less confusing.
## Summary of the Pull Request
Ensures that reading the buffer content actually returns the content.

## References
Regressed in #13626.

## PR Checklist
* [x] Closes #14378
* [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [x] Tests added/passed

## Validation Steps Performed
Added a test.
`TextBuffer` is buggy and allows a `Trailing` `DbcsAttribute` to be written into the first column. Since other code then blindly assumes that there's a preceding `Leading` character, we'll get called with an X coordinate of -1. This issue will be fixed by #13626 and this commit fixes it in the meantime. Additionally fixes an unimportant crash when the window height is 0px, because it was annoying during testing and doesn't hurt to be fixed.

## Validation Steps Performed
* Run a stress test that prints random Unicode at random positions
* Resize the window furiously at the same time
* Doesn't crash / fail-fast ✅

(cherry picked from commit bb4711d)
Service-Card-Id: 86353783
Service-Version: 1.16
When the buffer contains wide characters that occupy more than one cell, and those cells are scrolled horizontally by exactly one column, that operation can result in the wide characters being completely erased. This PR attempts to fix that bug, although it's not an ideal long term solution. Although not really to blame, it was PR #13626 that exposed this issue.

The root of the problem is that scrolling operations copy cells one by one, but wide characters are written to the buffer two cells at a time. So when you move a wide character one position to the left or right, it can overwrite itself before it's finished copying, and the end result is the whole character gets erased. I've attempted to solve this by getting the affected operations to read two cells in advance before they start writing, so there's no risk of losing the source data before it's fully output. This may not work in the long term, with characters wider than two cells, but it should at least be good enough for now.

I've also changed the `TextBuffer::Write` call to a `WriteLine` call to improve the handling of a wide character on the end of the line, where moving it right by one column would place it half off screen. It should just be dropped, but with the `Write` method, it would end up pushed onto the following line.

## Validation Steps Performed
I've manually confirmed this fixes all the test cases described in #14626, and also added some unit tests that replicate those scenarios.

Closes #14626
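To illustrate the failure mode and the read-ahead idea described above, here's a toy model with made-up types; none of this is the actual scroll code:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

enum class Half { None, Lead, Trail };
struct Cell
{
    wchar_t ch{ L' ' };
    Half half{ Half::None };
};

// Wide glyphs are always written two cells at a time - that write granularity
// is what makes naive in-place, cell-by-cell copies unsafe.
void writeWide(std::vector<Cell>& row, std::size_t col, wchar_t ch)
{
    row[col] = { ch, Half::Lead };
    row[col + 1] = { ch, Half::Trail };
}

int main()
{
    // 'A' stands in for a wide glyph occupying columns 0-1; scroll right by one.
    std::vector<Cell> row{ { L'A', Half::Lead }, { L'A', Half::Trail }, {}, {} };

    std::deque<Cell> pending;                       // source cells captured before writes can reach them
    std::size_t nextRead = 0;
    const std::size_t lastSource = row.size() - 2;  // shifting right by one: sources are columns 0..N-2

    for (std::size_t dst = 1; dst < row.size(); ++dst)
    {
        // Keep two cells of read-ahead: a wide write at dst touches dst and
        // dst + 1, both of which are still unread source columns.
        while (nextRead <= lastSource && nextRead <= dst + 1)
        {
            pending.push_back(row[nextRead++]);
        }

        const Cell cell = pending.front();
        pending.pop_front();

        if (cell.half == Half::Lead && dst + 1 < row.size())
        {
            writeWide(row, dst, cell.ch); // a glyph that would hang off the end is simply dropped
        }
        else if (cell.half == Half::None)
        {
            row[dst] = cell;
        }
        // A Trailing half is skipped: its Leading half already wrote both cells.
    }
    row[0] = {}; // the vacated first column becomes a blank cell

    assert(row[1].half == Half::Lead && row[2].half == Half::Trail); // glyph survived the shift
    return 0;
}
```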
#13626 contains a small "regression" compared to #13321: It now began to store trailers in the buffer wherever possible to allow a region of the buffer to be backed up and restored via Read/WriteConsoleOutput. But we're unfortunately still ill-equipped to handle anything but UCS-2 via WriteConsoleOutput, so it's best to again ignore trailers just like in #13321.

## Validation Steps Performed
* Added unit test ✅
#13626 contains a small "regression" compared to #13321: It now began to store trailers in the buffer wherever possible to allow a region of the buffer to be backed up and restored via Read/WriteConsoleOutput. But we're unfortunately still ill-equipped to handle anything but UCS-2 via WriteConsoleOutput, so it's best to again ignore trailers just like in #13321.

## Validation Steps Performed
* Added unit test ✅

(cherry picked from commit 9dcdcac)
Service-Card-Id: 88719297
Service-Version: 1.17
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates the number of columns that should be read. For single-column (narrow) surrogate pairs this previously clipped a trailing character of the returned string. In the major Unicode support update in #13626 surrogate pairs truly got stored as atomic units for the first time. This now meant that a 120 column read with such codepoints resulted in 121 characters. Other parts of conhost still assume UCS2 however, and so this results in the entire read failing. This fixes the issue by turning surrogate pairs into U+FFFD which makes it UCS2 compatible.

Closes #16892

## Validation Steps Performed
* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates the number of columns that should be read. For single-column (narrow) surrogate pairs this previously clipped a trailing character of the returned string. In the major Unicode support update in #13626 surrogate pairs truly got stored as atomic units for the first time. This now meant that a 120 column read with such codepoints resulted in 121 characters. Other parts of conhost still assume UCS2 however, and so this results in the entire read failing. This fixes the issue by turning surrogate pairs into U+FFFD which makes it UCS2 compatible.

Closes #16892

* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅

(cherry picked from commit 373faf0)
Service-Card-Id: 92121912
Service-Version: 1.20
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates the number of columns that should be read. For single-column (narrow) surrogate pairs this previously clipped a trailing character of the returned string. In the major Unicode support update in #13626 surrogate pairs truly got stored as atomic units for the first time. This now meant that a 120 column read with such codepoints resulted in 121 characters. Other parts of conhost still assume UCS2 however, and so this results in the entire read failing. This fixes the issue by turning surrogate pairs into U+FFFD which makes it UCS2 compatible.

Closes #16892

* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅

(cherry picked from commit 373faf0)
Service-Card-Id: 92129542
Service-Version: 1.19
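A minimal sketch of the clamping idea these fixes describe, with a hypothetical helper name and no claim about the actual conhost code path:

```cpp
#include <string>
#include <string_view>

// Any code point outside the BMP is returned as a single U+FFFD so that one
// column always maps to exactly one UTF-16 code unit again.
std::wstring clampToUcs2(std::wstring_view text)
{
    std::wstring out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const wchar_t ch = text[i];
        if (ch >= 0xD800 && ch <= 0xDBFF) // high surrogate: start of a pair
        {
            out.push_back(L'\xFFFD'); // replace the whole pair with U+FFFD
            if (i + 1 < text.size() && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF)
            {
                ++i; // skip the matching low surrogate
            }
        }
        else if (ch >= 0xDC00 && ch <= 0xDFFF) // stray low surrogate
        {
            out.push_back(L'\xFFFD');
        }
        else
        {
            out.push_back(ch);
        }
    }
    return out;
}

int main()
{
    // U+F15C0 is a surrogate pair in UTF-16; after clamping, the read-back is
    // one code unit per column again.
    return clampToUcs2(L"\U000F15C0").size() == 1 ? 0 : 1;
}
```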
This is simply some unused code from the days before the big text buffer rewrite in #13626.
STL iterators have a significant overhead. This improves performance of `GetLastNonSpaceColumn` by >100x (it's too large to measure), and reflow by ~15x in debug builds. This makes text reflow in debug builds today ~10x faster than it used to be in release builds before the large rewrites in #15701 and #13626.
This commit is a from-scratch rewrite of `ROW` with the primary goal to get rid of the rather bodgy `UnicodeStorage` class and improve Unicode support.

Previously a 120x9001 terminal buffer would store a vector of 9001 `ROW`s where each `ROW` stored exactly 120 `wchar_t`. Glyphs exceeding their allocated space would be stored in the `UnicodeStorage`, which was basically a `hashmap<Coordinate, String>`. Iterating over the text in a `ROW` would require us to check each glyph and fetch it from the map conditionally. On newlines we'd have to invalidate all map entries that are now gone, so for every invalidated `ROW` we'd iterate through all glyphs again and if a single one was stored in `UnicodeStorage`, we'd then iterate through the entire hashmap to remove all coordinates that were residing on that `ROW`. All in all, this wasn't the most robust nor performant code.

The new implementation is simple (from a design perspective): Store all text in a `ROW` in a regular string. Grow the string if needed. The association between columns and text works by storing character offsets in a column-wide array. This algorithm is <100 LOC and removes ~1000.
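To make that column-to-text association concrete, here's an illustrative sketch; the member names and the trailing-column convention are assumptions, and the real `ROW` works on preallocated buffers rather than `std::wstring`/`std::vector` members:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

struct RowSketch
{
    // All text of the row, tightly packed: 'a' (narrow), 'あ' (wide), 'b', ' '.
    std::wstring chars{ L"aあb " };
    // One entry per column plus a sentinel. Columns 1 and 2 share offset 1
    // because the wide glyph 'あ' spans both columns.
    std::vector<std::uint16_t> charOffsets{ 0, 1, 1, 2, 3, 4 };

    std::wstring_view GlyphAt(std::uint16_t column) const
    {
        const auto begin = charOffsets[column];
        auto end = begin;
        // Walk forward past columns that map to the same offset (the trailing
        // halves of a wide glyph) to find where the next glyph starts.
        for (std::uint16_t col = column + 1; end == begin; ++col)
        {
            end = charOffsets[col];
        }
        return { chars.data() + begin, static_cast<std::size_t>(end - begin) };
    }
};

int main()
{
    RowSketch row;
    // Column 0 -> "a"; columns 1 and 2 both resolve to the wide glyph "あ".
    return row.GlyphAt(0) == L"a" && row.GlyphAt(1) == L"あ" ? 0 : 1;
}
```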
As an aside this PR does a few more things that go hand in hand:

* `ROW` helper classes, which aren't needed anymore.
* `VirtualAlloc` call.
* `IsCursorDoubleWidth` to use `DbcsAttrAt` directly.

Improves overall performance by 10-20% and makes this implementation faster than the previous NxM storage, despite the added complexity.
Part of #8000
Validation Steps Performed