Escape fewer Unicode codepoints in `Debug` impl of `str` #34485
r? @brson (rust_highfive has picked a reviewer for you, use r? to override) |
Looks like run-pass/ifmt.rs is failing on travis. |
Is this changing |
It is changing that function. Why is it a breaking change? |
It's a stable function and this will break people's code which relies on the current behaviour. |
/cc @rust-lang/libs and @rust-lang/lang On Jun 26, 2016, 18:04 -0400, Oliver Middleton (notifications@github.com) wrote: |
The documentation explicitly states that any character that is not in the printable ASCII range |
@ranma42 Agreed. I don't think we can change this. It's very clearly changing the contract of the function. |
@ranma42 OK, assume for now that we don't change that function, but only the |
Didn't these exact conversations happen before? Was a previous attempt abandoned? |
I don't know, I don't remember any. |
I can't find it. Might have been |
@tbu- , yes, I think that should work. We might also want to expose it as a function on |
Not sure what I think of this. The medium that |
Actually, the output of I would guess that the reason for this is that the people implementing it didn't need non-ASCII characters, and I mean if you don't need them they're just a nuisance. But if you're implementing a non-English program, then it basically makes the If you write to a device that doesn't support UTF-8, you should just escape these characters later, when writing to said device -- like the |
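The suggestion above, escaping only at the point where text is written to a device that cannot handle UTF-8, could be sketched roughly as follows. `write_ascii_only` is a hypothetical helper written for this illustration, not a function in the standard library or in this PR; it relies only on `char::escape_default`, which escapes everything outside printable ASCII.

```rust
use std::io::{self, Write};

// Hypothetical helper (not part of std): escape non-ASCII and control
// characters at the output boundary, just before writing to an
// ASCII-only sink, instead of escaping inside `Debug`.
fn write_ascii_only<W: Write>(out: &mut W, text: &str) -> io::Result<()> {
    for c in text.chars() {
        if c.is_ascii() && !c.is_ascii_control() {
            // Printable ASCII passes through unchanged.
            write!(out, "{}", c)?;
        } else {
            // Everything else becomes an escape such as \u{e9} or \n.
            write!(out, "{}", c.escape_default())?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut sink = Vec::new();
    write_ascii_only(&mut sink, "café")?;
    let written = String::from_utf8(sink).expect("output is ASCII");
    assert_eq!(written, "caf\\u{e9}");
    println!("{}", written);
    Ok(())
}
```

With this split, the in-memory representation stays readable and the cost of escaping is paid only by sinks that actually need it.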
A possible objection is that
which seems to imply that it should not be exposed to the users, but rather to tools or developers. |
@tbu- not really, in that case he is the user of the |
@ranma42 It's a runtime error provided by the operating system, encountered while programming. EDIT: Also, you could probably look into the PEP, they also give a longer motivation in there. |
I'd rather they don't do these (mostly useless) format/escape for me (a programmer). They @tbu- Maybe we can't change |
@tbu- It is a runtime error provided by the operating system, encountered by |
If this was the reason, we should also implement |
That'd be a big breaking change. Not all |
@liigo Yes, that would be a major breaking change (it would change the constraints on the |
To me, the major advantage of the current implementation of Of course this does not mean that it should be used for everything. Specifically, I would only use #34318 shows an example where using Even though Rust does not (yet) have its own localised error messages, it would not be hard to imagine the same issue affecting other types of output, so it might be a good idea to think of a more general solution to ensure a way forward in this direction. |
@ranma42 If you want to see the exact code points, why only make an exception for English? That's very English-centric. :) EDIT: Imagine the We should probably provide a function that does the same as |
This is not an advantage for non-ASCII text. It just makes unreadable noise ( |
@bors: r- |
f9bf85d to 3d09b4a
@alexcrichton It should be fixed now. |
Escape fewer Unicode codepoints in `Debug` impl of `str`

Use the same procedure as Python to determine whether a character is printable, described in [PEP 3138]. In particular, this means that the following character classes are escaped:

- Cc (Other, Control)
- Cf (Other, Format)
- Cs (Other, Surrogate), even though they can't appear in Rust strings
- Co (Other, Private Use)
- Cn (Other, Not Assigned)
- Zl (Separator, Line)
- Zp (Separator, Paragraph)
- Zs (Separator, Space), except for the ASCII space `' '` `0x20`

This allows for user-friendly inspection of strings that are not English (e.g. compare `"\u{e9}\u{e8}\u{ea}"` to `"éèê"`).

Fixes #34318. CC #34422.

[PEP 3138]: https://www.python.org/dev/peps/pep-3138/
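As a rough illustration of the behaviour the commit message describes, the sketch below compares the `Debug` rendering of a string with the ASCII-only rendering from `str::escape_default`. It assumes a toolchain in which this change has landed and in which `str::escape_default` is available; `debug_repr` and `ascii_repr` are hypothetical helper names used only for this example.

```rust
// Hypothetical helper: the `Debug` rendering of a string, including
// the surrounding double quotes.
fn debug_repr(s: &str) -> String {
    format!("{:?}", s)
}

// Hypothetical helper: the rendering produced by `str::escape_default`,
// which escapes every character outside printable ASCII.
fn ascii_repr(s: &str) -> String {
    s.escape_default().to_string()
}

fn main() {
    let s = "éèê";
    println!("{}", debug_repr(s));
    println!("{}", ascii_repr(s));

    // Printable non-ASCII letters are left as-is by `Debug`.
    assert_eq!(debug_repr(s), "\"éèê\"");
    // `escape_default` still produces the fully escaped form.
    assert_eq!(ascii_repr(s), "\\u{e9}\\u{e8}\\u{ea}");
    // Control characters (class Cc) remain escaped in `Debug` output.
    assert_eq!(debug_repr("\u{7f}"), "\"\\u{7f}\"");
}
```

Code that needs the old all-ASCII output can keep calling `escape_default` explicitly, while `{:?}` stays readable for non-English text.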
I’m very late to say this, but this adds 2102 bytes of static data to libcore, whereas previously all large Unicode tables were in the |
@SimonSapin I opened #39492 for this issue, and to propose a general policy. |
I’d like to know what the code does precisely (what are |
@ariasuni The commit message and PR message give a list of the Unicode categories of characters that are escaped, but yes, if it’s not there already, this list should also be in some doc-comment in the code. I agree that I’d prefer to have these tables in |
@ariasuni The high-level view is that you have to store The low-level view of this particular implementation seems to have changed since I implemented it; you can find some notes on the new one in 44bcd26. |
@SimonSapin We can split |
@ariasuni even if we did that, that doesn’t solve the situation that a |
Sorry, I probably overthought it. The idea is to put the code for Unicode categories elsewhere so that we can use it in |
I don’t understand how this would reduce code size, I may be missing something. |
If I understand correctly, |