Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support UTF-32 position encoding #14141

Merged
merged 2 commits into from
Feb 14, 2023
Merged

feat: Support UTF-32 position encoding #14141

merged 2 commits into from
Feb 14, 2023

Conversation

matklad
Copy link
Member

@matklad matklad commented Feb 14, 2023

Looks like this is a native encoding for Emacs at least!

Looks like this is a native encoding for Emacs at least!
@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 14, 2023
@matklad
Copy link
Member Author

matklad commented Feb 14, 2023

Together with emacs-lsp/lsp-mode#3958, this fixes the infamous Emacs vs 💩 problem.

@@ -192,9 +193,9 @@ pub(crate) fn apply_document_changes(
let mut line_index = LineIndex {
// the index will be overwritten in the bottom loop's first iteration
index: Arc::new(ide::LineIndex::new(&text)),
// We don't care about line endings or offset encoding here.
// We don't care about line endings here.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @lnicola , the comment about irrelevance of encoding here turned out to be wrong at least for tests. So I just blindly threaded the correct encoding here. I wonder if that might have fixed some latent bugs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's from c8b9ec8. But I don't remember the code well enough to say.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I took a second look here, and it does look like an outright bug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh dear, yeah, this one was busted pretty substantially. If we didn't get reports about that, that means that no-one actually uses utf-8 😢

It only worked with VS Code because Code uses utf16.

Co-authored-by: Stig Brautaset <stig@brautaset.org>
@matklad
Copy link
Member Author

matklad commented Feb 14, 2023

I considered also clientInfo to get "user agent string" and set encoding as utf32 for lsp-mode, but:

  • lsp-mode reports itself as "Emacs", not as "lsp-mode"
  • eglot does not set this string at all

So it seems safer to not do this.

@lnicola
Copy link
Member

lnicola commented Feb 14, 2023

In emacs-lsp/lsp-mode#3344 (comment) @yyoncho seems to be open to applying one patch (I'm not sure which), so maybe this can be fixed "properly" soon. Unfortunately, it seems that nobody bothered to even test the two previous patches (yours likely didn't work).

I still think the proper fix is to send the correct UTF-16 positions, because in the meanwhile everybody will move on to pester litigate this against the developers of the other language servers.

assert_eq!(col_index.utf16_to_utf8_col(1, 18), TextSize::from(20)); // space
assert_eq!(col_index.utf16_to_utf8_col(1, 19), TextSize::from(21)); // second メ at 21..24
let text: String = {
let mut chars: Vec<char> = ((0 as char)..char::MAX).collect(); // Neat!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find this impl, does it skip the surrogates? I guess it's not a problem today, but might be worth keeping it in mind.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The impl is here:

https://doc.rust-lang.org/stable/src/core/iter/range.rs.html#409

My understanding is that this includes chars which are encoded by surrogate pairs, but not individual surrogates (as those would produce invalid utf8)

@lnicola
Copy link
Member

lnicola commented Feb 14, 2023

LGTM, I'd be in favor of merging this rather sooner than later, so it has a bit of time to bake until next Monday.

@matklad
Copy link
Member Author

matklad commented Feb 14, 2023

For me, the fact that Emacs native coordinates are in unicode code points feels like a very strong argument to support this on the server.

Eglot today handles this correctly by re-encoding whole lines:

https://github.com/joaotavora/eglot/blob/e501275e06952889056268dabe08ccd0dbaf23e5/eglot.el#L1409-L1413

which feels pretty wasteful to me.

(at the same time, Emacs should have used Rust-style utf8 indexes for internal coordinates, as that's a superior scheme oveal)

@matklad
Copy link
Member Author

matklad commented Feb 14, 2023

bors r+

@lnicola
Copy link
Member

lnicola commented Feb 14, 2023

@bors r+

@bors
Copy link
Contributor

bors commented Feb 14, 2023

📌 Commit 9fdcf57 has been approved by lnicola

It is now in the queue for this repository.

@bors
Copy link
Contributor

bors commented Feb 14, 2023

⌛ Testing commit 9fdcf57 with merge 31486a6...

@bors
Copy link
Contributor

bors commented Feb 14, 2023

☀️ Test successful - checks-actions
Approved by: lnicola
Pushing 31486a6 to master...

@bors bors merged commit 31486a6 into rust-lang:master Feb 14, 2023
@matklad matklad deleted the utf-32 branch February 14, 2023 11:09
@matklad
Copy link
Member Author

matklad commented Feb 14, 2023

(I assume that utf-32 is native to emacs based on joaotavora/eglot#916 (comment))

@lnicola lnicola changed the title Support UTF-32 position encoding feat: Support UTF-32 position encoding Feb 14, 2023
//!
//! Currently, we use oorandom instead, but it misses these two utilities.
//! Perhaps we should switch to `fastrand`, or our own small prng, it's not like
//! we need anything move complicatied that xor-shift.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is already merged but it looks like there are at least 3 typos in this line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two typos, would you mind pointing out the third one?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "move complicatied that" should be "more complicated than".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the second one, what's the third? https://github.com/rust-lang/rust-analyzer/pull/14154/files

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. move -> more
  2. complicatied -> complicated
  3. that -> than

🤔 maybe I didn't express myself clearly... your PR seems to fix everything in one go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I missed than, but matklad kindly pointed it out.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great, if you don't mind there's a staarts typo in https://blog.dend.ro/introducing-sd-notify/ (very nice crate btw!).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants