Tracking Issue for unicode and escape codes in literals #116907

Open
4 tasks
traviscross opened this issue Oct 18, 2023 · 16 comments
Labels
B-RFC-approved Blocker: Approved by a merged RFC but not yet implemented. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC F-mixed_utf8_literals #![feature(mixed_utf8_literals)] I-lang-nominated Nominated for discussion during a lang team meeting. T-lang Relevant to the language team, which will review and decide on the PR/issue.

Comments

@traviscross
Contributor

This is a tracking issue for the RFC 3349 (rust-lang/rfcs#3349).

The feature gate for the issue is #![feature(mixed_utf8_literals)].

From the RFC:

Relax the restrictions on which characters and escape codes are allowed in string, char, byte string, and byte literals.

Most importantly, this means we accept the exact same characters and escape codes in "…" and b"…" literals. That is:

  • Allow unicode characters, including \u{…} escape codes, in byte string literals. E.g. b"hello\xff我叫\u{1F980}"
  • Also allow non-ASCII \x… escape codes in regular string literals, as long as they are valid UTF-8. E.g. "\xf0\x9f\xa6\x80"
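For comparison, here is roughly what the first example has to look like on stable Rust today. This is only a sketch: it assumes (as I read the RFC) that Unicode characters and `\u{…}` escapes in a byte string contribute their UTF-8 encoding, and the function name `mixed_bytes_today` is made up for illustration.

```rust
/// Builds the bytes that the RFC's `b"hello\xff我叫\u{1F980}"` would denote,
/// assuming non-ASCII characters contribute their UTF-8 encoding.
/// (Hypothetical helper; not part of the RFC or of rustc.)
fn mixed_bytes_today() -> Vec<u8> {
    let mut bytes = Vec::new();
    // ASCII text plus a raw 0xff byte: already allowed in b"…" today.
    bytes.extend_from_slice(b"hello\xff");
    // Unicode characters and \u{…}: only allowed in "…" today, so go via &str.
    bytes.extend_from_slice("我叫\u{1F980}".as_bytes());
    bytes
}
```

The result is not valid UTF-8 as a whole (because of the `0xff` byte), which is fine for a `[u8]` but is exactly why the same content could not be a `str`.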

About tracking issues

Tracking issues are used to record the overall progress of implementation. They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions. A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature. Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.

Steps

Unresolved Questions

  • Should concat!("\xf0\x9f", "\xa6\x80") work? (The string literals are not valid UTF-8 individually, but are valid UTF-8 after being concatenated.)
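
To make the question concrete, the same bytes can be inspected on stable by writing the halves as byte strings. A small sketch (the constant and function names are invented for this example):

```rust
/// The two halves from the unresolved question, written as byte strings,
/// which already accept \x80-\xff on stable.
const FIRST_HALF: &[u8] = b"\xf0\x9f";
const SECOND_HALF: &[u8] = b"\xa6\x80";

/// Returns true if `bytes` is well-formed UTF-8.
fn is_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Joins the halves, standing in for what `concat!` would have to validate.
fn joined() -> Vec<u8> {
    [FIRST_HALF, SECOND_HALF].concat()
}
```

Each half fails `is_utf8` on its own, while `joined()` passes and decodes to U+1F980. So the answer depends on whether validation happens per literal or after concatenation.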
@traviscross traviscross added the C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC label Oct 18, 2023
@traviscross
Contributor Author

@rustbot labels +T-lang

@rustbot rustbot added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Oct 18, 2023
@traviscross
Contributor Author

@rustbot labels +B-rfc-approved

@nnethercote
Contributor

I would like to take this one.

@nnethercote
Contributor

nnethercote commented Dec 13, 2023

I have a partial implementation of this RFC working locally (EDIT: now at #120286). The RFC proposes five changes to literal syntax. I think three of them are good, and two of them aren't necessary.

b"": add unicode chars

Adding them fixes the first of two cases where b"" syntax isn't a superset of "" syntax. This is good, and facilitates "conventionally UTF-8" string literals.

br"": add unicode chars

Adding them fixes the one case where br"" syntax isn't a superset of r"" syntax. After this, br"" syntax and r"" syntax are the same. This is good, and also facilitates "conventionally UTF-8" string literals.

b"": add \u{NN} escapes

Adding them fixes the second of two cases where b"" syntax isn't a superset of "" syntax, and fits well with adding unicode chars. This is good.

Note: After adding this, the one thing b"" syntax has that "" syntax does not is \x80-\xff bytes.

"": add \x80-\xff

Is this necessary? What useful new functionality does this provide?

It would make "" and b"" syntax identical, but strings and byte strings aren't identical types, so that identicalness isn't needed.

The RFC says "Allowing all characters and all known escape codes in both types of string literals reduces the complexity of the language. We'd no longer have different escape codes for different literal types. We'd only require regular string literals to be valid UTF-8." So it has just traded one exception for another. IMO that's not a simplification.

It's odd that it would be possible to write a "" that isn't valid UTF-8... both conceptually, and in the implementation. For the latter you can no longer start with an empty String and append chars one at a time knowing it'll be valid UTF-8 the whole way, which is how it's currently handled. Instead you need to start with a Vec<u8>, append chars as byte sequences, and then UTF-8 validate at the end. It's not that difficult, but it's not needed for any other literal kind, and it's weird enough that, combined with the other points above, it makes me question the change.
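
A minimal sketch of that Vec<u8>-then-validate shape, to show what it involves. This is not rustc's actual code: `unescape_str_body` is a hypothetical helper that handles only `\xNN` and `\\`, and it reports errors as plain strings.

```rust
/// Unescapes the body of a string literal that may contain `\xNN` escapes
/// above 0x7f. Only `\xNN` and `\\` are handled; everything else a real
/// lexer does (`\n`, `\u{…}`, error spans, ...) is omitted.
/// Hypothetical helper, not rustc's actual code.
fn unescape_str_body(src: &str) -> Result<String, String> {
    // Bytes are accumulated first, because an intermediate prefix such as
    // `\xf0\x9f` is not valid UTF-8 and so cannot live in a `String`.
    let mut buf: Vec<u8> = Vec::new();
    let mut chars = src.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Ordinary characters are appended as their UTF-8 byte sequence.
            let mut tmp = [0u8; 4];
            buf.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
            continue;
        }
        match chars.next() {
            Some('\\') => buf.push(b'\\'),
            Some('x') => {
                // Exactly two hex digits must follow `\x`.
                let hi = chars.next().and_then(|d| d.to_digit(16));
                let lo = chars.next().and_then(|d| d.to_digit(16));
                match (hi, lo) {
                    (Some(hi), Some(lo)) => buf.push((hi * 16 + lo) as u8),
                    _ => return Err("malformed \\x escape".to_string()),
                }
            }
            other => return Err(format!("unsupported escape: {:?}", other)),
        }
    }
    // The single UTF-8 validation happens at the very end.
    String::from_utf8(buf).map_err(|e| e.to_string())
}
```

With this, `unescape_str_body(r"\xf0\x9f\xa6\x80")` yields the crab emoji, while `unescape_str_body(r"\xf0\x9f")` is rejected only by the final validation step.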

Not doing this keeps "" syntax consistent with '', which makes sense given that "" and '' are both unicode-oriented rather than byte-oriented. This is another refutation of the complexity argument above.

Not doing this was suggested in the "Alternatives" section of the RFC.

Not doing this also renders moot the unresolved question of what to do with concat!("\xf0\x9f", "\xa6\x80").

b'': add \u{00}-\u{7f}

Is this necessary? It doesn't provide any useful new functionality.

The \x syntax is strictly more powerful, covering the range \x00-\xff. And supporting just the ASCII subset of \u escapes doesn't match behaviour of any of the other literal syntaxes. Byte literals are about a single byte, why introduce Unicode-related stuff?

The quote from the RFC I mentioned above about complexity applies again, but again, it's just trading one exception for another.
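
For reference, this is how the same values are spelled on stable today. A small sketch; the constant and function names are invented for the example.

```rust
/// The BEL control character, written as a byte literal with `\x`.
const BEL_BYTE: u8 = b'\x07';
/// The same code point written as a char literal with `\u{…}`.
const BEL_CHAR: char = '\u{07}';
/// `\x` in a byte literal reaches values that the proposed
/// `\u{00}`-`\u{7f}` range would not.
const HIGH_BYTE: u8 = b'\xff';

/// Converts the char to a byte; lossless here because BEL is ASCII.
fn bel_char_as_byte() -> u8 {
    BEL_CHAR as u8
}
```

`BEL_BYTE` and `bel_char_as_byte()` are both 7, so within the ASCII range the two spellings denote the same value.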

cc @rust-lang/lang @m-ou-se

@nnethercote
Contributor

nnethercote commented Dec 13, 2023

Here's an alternative version of the table that I've been using and found helpful. It shows all the escapes directly instead of grouping them by name, it shows the changes proposed by the RFC (affected literal kinds have two lines connected by a -->, where the second line shows what changed), and it includes C string literals. The proposed changes I don't like are marked with ?.

        chars    escapes                                        mixed utf8
        -----    -------                                        ----------
- ''    unicode  \' \" \n \r \t \\ \0 \x00-\x7f \u{..}          no
    
- b''   ascii    \' \" \n \r \t \\ \0 \x00-\xff                 no    
  -->                                           \u{0}..\u{7f}?  yes?
    
- ""    unicode  \' \" \n \r \t \\ \0 \x00-\x7f \u{..}          no
  -->                                 \x00-\xff?                yes?

- r""   unicode  N/A                                            no

- b""   ascii    \' \" \n \r \t \\ \0 \x00-\xff                 no
  -->   unicode                                 \u{..}          yes
    
- br""  ascii    N/A                                            no
  -->   unicode
  
- c""   unicode  \' \" \n \r \t \\ __ \x01-\xff \u{..}          yes

- cr""  unicode  N/A                                            no

This makes it easier to see things like adding \x80-\xff to "" syntax would make it identical to b"" syntax, but also make "" syntax different to '' syntax.

@nnethercote
Contributor

nnethercote commented Dec 13, 2023

BTW, I have implemented the first three changes. They were pretty easy, and piggy-backed naturally off the existing support for mixed utf8 in C string literals, requiring only minor changes.

I haven't implemented the last two. They would both have required new kinds of checks, somewhat annoying to implement, which is what got me thinking about whether they are necessary.

@nnethercote
Contributor

BTW, I have implemented the first three changes

A complete draft implementation is now at #120286.

bors added a commit to rust-lang-ci/rust that referenced this issue Jan 25, 2024
…, r=<try>

Implement RFC 3349, mixed utf8 literals

RFC: rust-lang/rfcs#3349
Tracking issue: rust-lang#116907

r? `@ghost`
@nnethercote nnethercote added the I-lang-nominated Nominated for discussion during a lang team meeting. label Jan 26, 2024
@nnethercote
Contributor

Nominated for lang-team discussion for this comment above.

@joshtriplett
Member

cc @m-ou-se, who may want to provide input/responses to the above.

@joshtriplett
Member

@nnethercote FWIW, I do feel like having \u{00}-\u{7f} in b'...' is a clear win: if we allow it in b"...", we should also allow it in b'...' as well.

@nnethercote
Contributor

@nnethercote FWIW, I do feel like having \u{00}-\u{7f} in b'...' is a clear win: if we allow it in b"...", we should also allow it in b'...' as well.

Is this a consistency argument? Consider the table. Currently some literals don't support \u escapes at all, while some support \u escapes fully. The proposal is to add a third category, \u{00}..\u{7f}, which would only apply to b''. I don't think that's a consistency improvement!

Or maybe it's a Postel's law style "we should accept anything that makes sense" argument? If so, I would immediately ask why? The \xx form is inherently superior for a literal that defines a single byte, because (a) it's shorter, (b) it covers the full range 0-255, (c) it's naturally byte-oriented and therefore a better conceptual fit than a unicode-oriented escape.

@nnethercote
Contributor

Nominated for lang-team discussion for this comment above.

Still waiting for a lang-team response here, ten months later :(

@m-ou-se
Member

m-ou-se commented Nov 19, 2024

"": add \x80-\xff

Is this necessary? What useful new functionality does this provide?

It seems nice to me that if you have a b"\xf0\x9f\xa6\x80" with valid UTF-8 and you refactor your program to use str instead of [u8], you can just strip the b and use "\xf0\x9f\xa6\x80".
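
As a point of comparison, a sketch of what that refactor looks like on stable today, where stripping the b is not enough. The names are made up for illustration.

```rust
/// Before the refactor: a byte string holding valid UTF-8.
const CRAB_BYTES: &[u8] = b"\xf0\x9f\xa6\x80";

/// After the refactor on stable today: the `\x` escapes have to be rewritten
/// as a `\u{…}` escape (or the character itself), because `"\xf0…"` is rejected.
const CRAB_STR: &str = "\u{1F980}";

/// Alternative that keeps the original bytes and validates them at run time.
fn crab_from_bytes() -> &'static str {
    std::str::from_utf8(CRAB_BYTES).expect("CRAB_BYTES is valid UTF-8")
}
```

Both routes produce the same string; the difference is only whether the escapes get rewritten by hand or the check moves to run time.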

b'': add \u{00}-\u{7f}

Same argument the other way around: It seems nice that if you have a '\u{07}' you can change it for a b'\u{07}' later, without having to switch to a different way to escape the same character.

@m-ou-se
Member

m-ou-se commented Nov 19, 2024

Is this a consistency argument? Consider the table. Currently some literals don't support \u escapes at all, while some support \u escapes fully. The proposal is to add a third category, \u{00}..\u{7f}, which would only apply to b''. I don't think that's a consistency improvement!

I think it is consistent. I don't see \u{00}..\u{7f} as a new category. I just see it as the same \u escape as in other literals. The fact that a byte literal can only contain a single byte is a separate requirement, just like we reject b'aa' or b'🦀'.
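
One way to picture that "separate requirement" view is as a length check applied after the escape has been decoded. A sketch only: `byte_literal_value` is a hypothetical helper, and the real check in rustc may be structured differently.

```rust
/// Sketch of the "single byte" rule as a separate check: a `\u{…}` escape in
/// a byte literal is acceptable only if its UTF-8 encoding is one byte long.
/// Hypothetical helper; rustc's real check may be structured differently.
fn byte_literal_value(c: char) -> Option<u8> {
    if c.len_utf8() == 1 {
        // One-byte UTF-8 encodings are exactly the ASCII range,
        // so the cast below is lossless.
        Some(c as u8)
    } else {
        None
    }
}
```

Under this reading `'\u{07}'` and `'\u{7f}'` pass, while `'\u{80}'` and `'🦀'` are rejected for the same reason as b'🦀' is today: they need more than one byte.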

@nnethercote
Contributor

It seems nice to me that if you have a b"\xf0\x9f\xa6\x80" with valid UTF-8 and you refactor your program to use str instead of [u8], you can just strip the b and use "\xf0\x9f\xa6\x80".

Same argument the other way around: It seems nice that if you have a '\u{07}' you can change it for a b'\u{07}' later, without having to switch to a different way to escape the same character.

These cases seem like they would be extremely rare, and in each case they would only avoid modifying a tiny number of characters.

In general I'm interested in new syntax that enables new programs to be written, far more than (a) new syntax that makes potential refactorings marginally easier, or (b) new syntax that (arguably) addresses obscure inconsistencies in the language that don't cause problems in practice.

@traviscross
Contributor Author

We talked about this in our triage call today. We ended up wanting to read the thread here more carefully, though, so we'll be talking about this again.

Our general sentiment was that things like this are why we do partial stabilizations, and that probably the harder bits of this shouldn't block the more straightforward bits of this from moving forward.
