Rewrite ROW to be Unicode capable #13626
Conversation
Force-pushed from eb4a04e to 2f01956.
src/buffer/out/Row.cpp (Outdated)
void ROW::_init() noexcept
{
    std::fill_n(_chars.begin(), _columnCount, UNICODE_SPACE);
    iota_n(_charOffsets.begin(), _columnCount + 1u, uint16_t{ 0 });
Are we worried about the expense of having, say, 9001 copies of an array of all numbers between 0 and 120?
The space requirement and initialization cost is about the same as before, when we used 16-bit sized `DbcsAttribute` instances for each `wchar_t` in a `ROW`. As such I'm not concerned about any regressions.
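For intuition, a back-of-the-envelope check of that claim; the 16-bit sizes are taken from the `uint16_t` offsets in `_init()` above and the "16-bit sized `DbcsAttribute`" mentioned here, everything else is made up for the example:

```cpp
#include <cstddef>
#include <cstdint>

int main()
{
    constexpr std::size_t columns = 120;
    constexpr std::size_t oldPerRow = columns * sizeof(std::uint16_t);       // one 16-bit DbcsAttribute per column
    constexpr std::size_t newPerRow = (columns + 1) * sizeof(std::uint16_t); // one char offset per column, plus a sentinel
    static_assert(newPerRow - oldPerRow == sizeof(std::uint16_t), "only the sentinel entry is extra");
    return 0;
}
```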
I would've liked to do lazy initialization for `ROW`s, but that would require us to remove the `const` variant of `GetRowByOffset`, which I wasn't willing to do in this already massive PR.
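For context, a minimal sketch (made-up names, not the actual `TextBuffer` API) of why lazy initialization clashes with a `const` accessor:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct Row
{
    std::vector<wchar_t> chars;
};

struct Buffer
{
    std::vector<std::optional<Row>> rows;

    Row& GetRowByOffset(std::size_t i)
    {
        if (!rows[i])
        {
            rows[i].emplace(); // lazy initialization mutates the buffer...
        }
        return *rows[i];
    }

    // ...which is why a const overload like this one would have to go
    // (or resort to mutable/const_cast tricks):
    // const Row& GetRowByOffset(std::size_t i) const;
};

int main()
{
    Buffer buffer;
    buffer.rows.resize(3);
    buffer.GetRowByOffset(1).chars.push_back(L'a');
    return 0;
}
```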
I love how removing UnicodeStorage obviates the need for rows to have IDs, which removes the entire machinery built around making sure row IDs remain sane.
@@ -343,68 +343,6 @@ class SelectionTests
        VERIFY_ARE_EQUAL(srOriginal.Left + sDeltaLeft, srSelection.Left);
        VERIFY_ARE_EQUAL(srOriginal.Right + sDeltaRight, srSelection.Right);
    }

    TEST_METHOD(TestBisectSelection)
Hmm. For some tests, it may be good practice to leave a comment on the PR explaining why it's not necessary any more. :)
"Bisect" in this case stands for "trailing wide character in the first / leading wide character in the last cell of a row", which we don't support anymore, even with CHAR_INFO
s. So this test was removed.
> which we don't support anymore
As in, this PR breaks that functionality? Or we officially stopped supporting it a while back?
Why is it ok that we're dropping support for that? I seem to be missing some context here haha.
> As in, this PR breaks that functionality? Or we officially stopped supporting it a while back?
> Why is it ok that we're dropping support for that? I seem to be missing some context here haha.
Both. Kinda. This PR breaks it harder than we already broke it in the past.
I'm not sure when, but at some point we stopped supporting storage of "half a wide glyph" in the first/last column of the console ("bisecting rows"). Instead it was supposed to either store a whole wide glyph or none at all. It shouldn't have been possible to bring conhost into a state where it stores such bisecting glyphs via any of the public console APIs for a long while now.
But we still kind of supported it internally; for instance, by rapidly resizing the console you could bring it into a state where the reflow algorithm puts glyphs into the wrong columns. This PR puts the final nail in the coffin and ensures that no one can mess with our TextBuffer anymore, not even ourselves. And so this test can't possibly work anymore, just like you already couldn't do it via any of the public APIs. To be entirely honest, I'm not even 100% sure why `TestBisectSelection` still existed in the first place, and our git history didn't bring up any clues either.
BTW a bug caused by bisecting glyphs: #14275
I'm sure there's tons more of that lingering around in our code base.
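To make the terminology concrete, here's a tiny illustrative model (hypothetical types, not the real `DbcsAttribute`) of a "bisected" row versus the padded state the rewritten buffer enforces:

```cpp
#include <cassert>
#include <vector>

// A wide glyph occupies a Leading and a Trailing cell. A bisected row starts
// with a lone Trailing cell or ends with a lone Leading cell; the new ROW
// makes that state unrepresentable and pads with a space instead.
enum class Half { Single, Leading, Trailing };

bool isBisected(const std::vector<Half>& row)
{
    return !row.empty() && (row.front() == Half::Trailing || row.back() == Half::Leading);
}

int main()
{
    const std::vector<Half> padded{ Half::Leading, Half::Trailing, Half::Single };  // glyph fits, last cell is a space
    const std::vector<Half> bisected{ Half::Single, Half::Single, Half::Leading };  // half a wide glyph in the last cell
    assert(!isBisected(padded));
    assert(isBisected(bisected));
    return 0;
}
```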
I haven't found it yet, but how do you handle the case where a double-cell character is inserted on a line where only a single-cell character could fit on the right side? We used to keep track of whether we had "forced" a wide glyph onto the next line, and we used that to reconstitute something for the API, I believe...
That's still all there in
void TextBufferTests::TestBufferRowByOffset()
{
    auto& textBuffer = GetTbi();
    auto csBufferHeight = GetBufferHeight();

    VERIFY_IS_TRUE(csBufferHeight > 20);

    auto sId = csBufferHeight / 2 - 5;

    const auto& row = textBuffer.GetRowByOffset(sId);
    VERIFY_ARE_EQUAL(row.GetId(), sId);
}
This test isn't helpful anymore, since `ROW`s no longer have IDs.
void swap(ROW& lhs, ROW& rhs) noexcept
{
    std::swap(lhs._charsBuffer, rhs._charsBuffer);
    std::swap(lhs._charsHeap, rhs._charsHeap);
    std::swap(lhs._chars, rhs._chars);
    std::swap(lhs._charOffsets, rhs._charOffsets);
    std::swap(lhs._attr, rhs._attr);
    std::swap(lhs._columnCount, rhs._columnCount);
    std::swap(lhs._lineRendition, rhs._lineRendition);
    std::swap(lhs._wrapForced, rhs._wrapForced);
    std::swap(lhs._doubleBytePadded, rhs._doubleBytePadded);
}
I'm not sure whether I should keep this `std::swap` specialization... It helps the `std::rotate` in `TextBuffer`, which relies on `std::swap` internally.
Eh, I vote keep it.
Any way to guarantee that it swaps all contents?
No, unfortunately not. The alternative is to implement an `operator=(ROW&&)`, which of course would suffer from the same safety issue...
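A sketch of that alternative on a stand-in type (not the real `ROW`), just to show the shape; it has the same caveat that nothing forces it to cover every member:

```cpp
#include <utility>
#include <vector>

struct RowLike
{
    std::vector<wchar_t> chars;

    friend void swap(RowLike& lhs, RowLike& rhs) noexcept
    {
        std::swap(lhs.chars, rhs.chars); // must be kept in sync with the members by hand
    }

    RowLike& operator=(RowLike&& other) noexcept
    {
        swap(*this, other); // reuse the member-by-member swap
        return *this;
    }
};

int main()
{
    RowLike a, b;
    a.chars = { L'x' };
    b = std::move(a); // routes through swap()
    return b.chars.size() == 1 ? 0 : 1;
}
```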
Force-pushed from 04dd61f to 1761d60.
This fixes an issue where overwriting parts of a row would only trigger that specific portion of the row to be redrawn. This isn't just problematic for combining characters, but also for things like the new `TestDbcsBisectWriteCells` test introduced in #13626. Benchmarks showed no impact on performance whatsoever.

## Validation Steps Performed
* Pick this commit into #13626
* Run the `TestDbcsBisectWriteCells` test and break before OpenConsole exits
* A correct "QいかなZYXWVUTに" output is visible ✅
I think it's worth mentioning that I was using the row IDs to track image segments in my sixel implementation. Not suggesting that's a good reason to keep them, since I'm not likely to do anything with that branch any time soon, but I do think there is value in having some mechanism via which additional metadata can be attached to a row. Another potential use case would be the scrollbar marks. I think we're currently tracking them with just a row number, but long term I'm hoping we'll change that, because I suspect the current approach will break if we ever get margins working in the terminal. That's perhaps a bit of an edge case, so not critical, but still.
Why not store or reference that data in
In the case of sixel, 99% of the time you're not likely to have any data attached to the row - most users will probably never use sixel - so I didn't want to introduce an additional field that would just be wasting space most of the time. And since the ID already existed, that provided an easy way to add sixel functionality without impacting anyone that wasn't using it. Having some kind of pointer on the row itself would certainly be easier, but it does mean more overhead. And what happens when we have some other bit of metadata that's also rarely used, and also needs to be associated with a row? Do we add yet another field? Maybe some kind of linked list of extension points? I don't know. I just wanted to point out that this use case exists, and row IDs were quite a compact way of addressing that need.
Not really. Sixel isn't so much an image format - it's more of a drawing protocol. So you need to think of it in terms of plotting arbitrary pixels at a point on the screen. The way I managed that was by creating a bitmap associated with each row that was touched by sixel output. Those bitmaps obviously need to be mutable so they can be updated when additional pixels are output. But any kind of reference in the row would be fine for this - I was just looking for a way to minimize the overhead involved, and potentially share it with other row extensions. Anyway, this isn't anything that has to be resolved now. I just wanted to raise the issue as something to consider for the future.
Thanks for the feedback! I don't disagree with you and I'd be absolutely open to adding back IDs, even if for no reason at all. 🙂 But unrelated to this PR and any designs you have in mind (which I'd approve regardless), I'd like to mention an opposing view to the overhead of unused fields, just because...
Any updates on this?
Just had a couple of questions!
pRow->ReplaceCharacters(x, 2, t.string);
x += 2;
Why don't we need to loop through the characters in the string (like we do below when the string isn't wide)? Is it guaranteed that if `wide == true`, then the string is just one glyph that's 2 cells wide? I know that the strings we pass in conform to this, but within this function, is that always true?
I wrote a different response, but you know what... I just realized I should just remove the `for` loop instead and place every character individually. After all, there's technically no guarantee that narrow glyphs are always only 1 character long. There's also only a single caller of `applyTestString`, so I'm probably just going to inline the `ReplaceCharacters` calls to make this less confusing.
## Summary of the Pull Request
Ensures that reading the buffer content actually returns the content.

## References
Regressed in #13626.

## PR Checklist
* [x] Closes #14378
* [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [x] Tests added/passed

## Validation Steps Performed
Added a test.
`TextBuffer` is buggy and allows a `Trailing` `DbcsAttribute` to be written into the first column. Since other code then blindly assumes that there's a preceding `Leading` character, we'll get called with an X coordinate of -1. This issue will be fixed by #13626 and this commit fixes it in the meantime. Additionally fixes an unimportant crash when the window height is 0px, because it was annoying during testing and doesn't hurt to be fixed.

## Validation Steps Performed
* Run a stress test that prints random Unicode at random positions
* Resize the window furiously at the same time
* Doesn't crash / fail-fast ✅

(cherry picked from commit bb4711d)
Service-Card-Id: 86353783
Service-Version: 1.16
When the buffer contains wide characters that occupy more than one cell, and those cells are scrolled horizontally by exactly one column, that operation can result in the wide characters being completely erased. This PR attempts to fix that bug, although it's not an ideal long term solution. Although not really to blame, it was PR #13626 that exposed this issue.

The root of the problem is that scrolling operations copy cells one by one, but wide characters are written to the buffer two cells at a time. So when you move a wide character one position to the left or right, it can overwrite itself before it's finished copying, and the end result is the whole character gets erased. I've attempted to solve this by getting the affected operations to read two cells in advance before they start writing, so there's no risk of losing the source data before it's fully output. This may not work in the long term, with characters wider than two cells, but it should at least be good enough for now.

I've also changed the `TextBuffer::Write` call to a `WriteLine` call to improve the handling of a wide character on the end of the line, where moving it right by one column would place it half off screen. It should just be dropped, but with the `Write` method, it would end up pushed onto the following line.

## Validation Steps Performed
I've manually confirmed this fixes all the test cases described in #14626, and also added some unit tests that replicate those scenarios.

Closes #14626
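To illustrate the failure mode and the read-ahead idea described above, here's a toy model with made-up types; none of this is the actual scroll code:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

enum class Half { None, Lead, Trail };
struct Cell
{
    wchar_t ch{ L' ' };
    Half half{ Half::None };
};

// Wide glyphs are always written two cells at a time - that write granularity
// is what makes naive in-place, cell-by-cell copies unsafe.
void writeWide(std::vector<Cell>& row, std::size_t col, wchar_t ch)
{
    row[col] = { ch, Half::Lead };
    row[col + 1] = { ch, Half::Trail };
}

int main()
{
    // 'A' stands in for a wide glyph occupying columns 0-1; scroll right by one.
    std::vector<Cell> row{ { L'A', Half::Lead }, { L'A', Half::Trail }, {}, {} };

    std::deque<Cell> pending;                       // source cells captured before writes can reach them
    std::size_t nextRead = 0;
    const std::size_t lastSource = row.size() - 2;  // shifting right by one: sources are columns 0..N-2

    for (std::size_t dst = 1; dst < row.size(); ++dst)
    {
        // Keep two cells of read-ahead: a wide write at dst touches dst and
        // dst + 1, both of which are still unread source columns.
        while (nextRead <= lastSource && nextRead <= dst + 1)
        {
            pending.push_back(row[nextRead++]);
        }

        const Cell cell = pending.front();
        pending.pop_front();

        if (cell.half == Half::Lead && dst + 1 < row.size())
        {
            writeWide(row, dst, cell.ch); // a glyph that would hang off the end is simply dropped
        }
        else if (cell.half == Half::None)
        {
            row[dst] = cell;
        }
        // A Trailing half is skipped: its Leading half already wrote both cells.
    }
    row[0] = {}; // the vacated first column becomes a blank cell

    assert(row[1].half == Half::Lead && row[2].half == Half::Trail); // glyph survived the shift
    return 0;
}
```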
#13626 contains a small "regression" compared to #13321: It now began to store trailers in the buffer wherever possible to allow a region of the buffer to be backed up and restored via Read/WriteConsoleOutput. But we're unfortunately still ill-equipped to handle anything but UCS-2 via WriteConsoleOutput, so it's best to again ignore trailers just like in #13321.

## Validation Steps Performed
* Added unit test ✅
#13626 contains a small "regression" compared to #13321: It now began to store trailers in the buffer wherever possible to allow a region of the buffer to be backed up and restored via Read/WriteConsoleOutput. But we're unfortunately still ill-equipped to handle anything but UCS-2 via WriteConsoleOutput, so it's best to again ignore trailers just like in #13321.

## Validation Steps Performed
* Added unit test ✅

(cherry picked from commit 9dcdcac)
Service-Card-Id: 88719297
Service-Version: 1.17
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates the number of columns that should be read. For single-column (narrow) surrogate pairs this previously clipped a trailing character of the returned string. In the major Unicode support update in #13626 surrogate pairs truly got stored as atomic units for the first time. This now meant that a 120 column read with such codepoints resulted in 121 characters. Other parts of conhost still assume UCS2 however, and so this results in the entire read failing. This fixes the issue by turning surrogate pairs into U+FFFD which makes it UCS2 compatible.

Closes #16892

## Validation Steps Performed
* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates the number of columns that should be read. For single-column (narrow) surrogate pairs this previously clipped a trailing character of the returned string. In the major Unicode support update in #13626 surrogate pairs truly got stored as atomic units for the first time. This now meant that a 120 column read with such codepoints resulted in 121 characters. Other parts of conhost still assume UCS2 however, and so this results in the entire read failing. This fixes the issue by turning surrogate pairs into U+FFFD which makes it UCS2 compatible.

Closes #16892

* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅

(cherry picked from commit 373faf0)
Service-Card-Id: 92121912
Service-Version: 1.20
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates the number of columns that should be read. For single-column (narrow) surrogate pairs this previously clipped a trailing character of the returned string. In the major Unicode support update in #13626 surrogate pairs truly got stored as atomic units for the first time. This now meant that a 120 column read with such codepoints resulted in 121 characters. Other parts of conhost still assume UCS2 however, and so this results in the entire read failing. This fixes the issue by turning surrogate pairs into U+FFFD which makes it UCS2 compatible.

Closes #16892

* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅

(cherry picked from commit 373faf0)
Service-Card-Id: 92129542
Service-Version: 1.19
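A minimal sketch of the clamping idea these fixes describe, with a hypothetical helper name and no claim about the actual conhost code path:

```cpp
#include <string>
#include <string_view>

// Any code point outside the BMP is returned as a single U+FFFD so that one
// column always maps to exactly one UTF-16 code unit again.
std::wstring clampToUcs2(std::wstring_view text)
{
    std::wstring out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const wchar_t ch = text[i];
        if (ch >= 0xD800 && ch <= 0xDBFF) // high surrogate: start of a pair
        {
            out.push_back(L'\xFFFD'); // replace the whole pair with U+FFFD
            if (i + 1 < text.size() && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF)
            {
                ++i; // skip the matching low surrogate
            }
        }
        else if (ch >= 0xDC00 && ch <= 0xDFFF) // stray low surrogate
        {
            out.push_back(L'\xFFFD');
        }
        else
        {
            out.push_back(ch);
        }
    }
    return out;
}

int main()
{
    // U+F15C0 is a surrogate pair in UTF-16; after clamping, the read-back is
    // one code unit per column again.
    return clampToUcs2(L"\U000F15C0").size() == 1 ? 0 : 1;
}
```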
This is simply some unused code from the days before the big text buffer rewrite in #13626.
STL iterators have a significant overhead. This improves performance of `GetLastNonSpaceColumn` by >100x (it's too large to measure), and reflow by ~15x in debug builds. This makes text reflow in debug builds today ~10x faster than it used to be in release builds before the large rewrites in #15701 and #13626.
This commit is a from-scratch rewrite of `ROW` with the primary goal to get rid of the rather bodgy `UnicodeStorage` class and improve Unicode support.

Previously a 120x9001 terminal buffer would store a vector of 9001 `ROW`s where each `ROW` stored exactly 120 `wchar_t`. Glyphs exceeding their allocated space would be stored in the `UnicodeStorage`, which was basically a `hashmap<Coordinate, String>`. Iterating over the text in a `ROW` would require us to check each glyph and fetch it from the map conditionally. On newlines we'd have to invalidate all map entries that are now gone, so for every invalidated `ROW` we'd iterate through all glyphs again and if a single one was stored in `UnicodeStorage`, we'd then iterate through the entire hashmap to remove all coordinates that were residing on that `ROW`. All in all, this wasn't the most robust nor performant code.

The new implementation is simple (from a design perspective): Store all text in a `ROW` in a regular string. Grow the string if needed. The association between columns and text works by storing character offsets in a column-wide array. This algorithm is <100 LOC and removes ~1000.
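To make that column-to-text association concrete, here's an illustrative sketch; the member names and the trailing-column convention are assumptions, and the real `ROW` works on preallocated buffers rather than `std::wstring`/`std::vector` members:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

struct RowSketch
{
    // All text of the row, tightly packed: 'a' (narrow), 'あ' (wide), 'b', ' '.
    std::wstring chars{ L"aあb " };
    // One entry per column plus a sentinel. Columns 1 and 2 share offset 1
    // because the wide glyph 'あ' spans both columns.
    std::vector<std::uint16_t> charOffsets{ 0, 1, 1, 2, 3, 4 };

    std::wstring_view GlyphAt(std::uint16_t column) const
    {
        const auto begin = charOffsets[column];
        auto end = begin;
        // Walk forward past columns that map to the same offset (the trailing
        // halves of a wide glyph) to find where the next glyph starts.
        for (std::uint16_t col = column + 1; end == begin; ++col)
        {
            end = charOffsets[col];
        }
        return { chars.data() + begin, static_cast<std::size_t>(end - begin) };
    }
};

int main()
{
    RowSketch row;
    // Column 0 -> "a"; columns 1 and 2 both resolve to the wide glyph "あ".
    return row.GlyphAt(0) == L"a" && row.GlyphAt(1) == L"あ" ? 0 : 1;
}
```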
As an aside this PR does a few more things that go hand in hand:

* `ROW` helper classes, which aren't needed anymore.
* `VirtualAlloc` call.
* `IsCursorDoubleWidth` to use `DbcsAttrAt` directly.

Improves overall performance by 10-20% and makes this implementation faster than the previous NxM storage, despite the added complexity.
Part of #8000
Validation Steps Performed