Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape combining characters in char::Debug #49283

Merged
merged 13 commits into from
May 22, 2018

Conversation

varkor
Copy link
Member

@varkor varkor commented Mar 22, 2018

Although combining characters are technically printable, they make little sense to print on their own with Debug: it'd be better to escape them like non-printable characters.

This is a breaking change, but I imagine the fact escape_debug is rare and almost certainly primarily used for debugging that this is an acceptable change.
Resolves #41922.

r? @alexcrichton
cc @clarcharr

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 22, 2018
@clarfonthey
Copy link
Contributor

LGTM although the character ranges should probably be generated with a script instead of hard-coded, as they can change across unicode versions.

@varkor
Copy link
Member Author

varkor commented Mar 23, 2018

@clarcharr: have you got any ideas for how to detect ranges programmatically? Apart from parsing the Unicode symbol names (which does not seem robust), I couldn't find any reasonable suggestion for how to do this automatically.

@kennytm kennytm added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Mar 23, 2018
@alexcrichton
Copy link
Member

r? @SimonSapin

@SimonSapin
Copy link
Contributor

This PR is modifying the char::escape_debug. When calling this method directly or through impl Debug for char, there is only one code point being printed so the idea of escaping code points that do not make much sense on their own seems fine.

However this method is also used in impl Debug for str and in str::escape_debug. #41922 itself does not expect the behavior to change there and I agree. However that means making them based on something other than char::escape_debug. Should that other thing be exposed in new public API?


Regarding this being a breaking change, I think I’d be ok with documenting that the exact set of code points being escaped or not is not stable outside of the ASCII range.

The current doc-comment is already out of sync with the implementation: it claims that anything non-ASCII is escaped, which hasn’t been the case since #34485.


I also agree with not hard-coding code point ranges, and instead extracting more Unicode data in src/libstd_unicode/unicode.py. However it’s unclear to me what exactly should be included. Canonical_Combining_Class is about reordering during normalization and probably not what we want. The Diacritic property is defined as:

Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics.

Then there’s the *Mark properties:

Abbr Long Description
Mn Nonspacing_Mark a nonspacing combining mark (zero advance width)
Mc Spacing_Mark a spacing combining mark (positive advance width)
Me Enclosing_Mark an enclosing combining mark
M Mark Mn | Mc | Me

However #41922 suggests that only nonspacing marks are problematic?

@clarfonthey
Copy link
Contributor

I'd be fine with just going for nonspacing marks.

As for str::escape_debug goes, I'd personally opt for escaping any combination of nonspacing marks at the beginning of the string, although it's unclear if that would actually solve the problem. I'm fine with just leaving it with the current behaviour for now.

@varkor
Copy link
Member Author

varkor commented Mar 26, 2018

However that means making them based on something other than char::escape_debug. Should that other thing be exposed in new public API?

Yeah, I'm not sure about this. It'd be nicer if there was an extra argument (single: bool), so the behaviour could be different for individual characters, but we obviously can't change the signature. A specific public extra method that just special-cases non-spacing combining characters seems a little specialised, though.

The current doc-comment is already out of sync with the implementation: it claims that anything non-ASCII is escaped, which hasn’t been the case since #34485.

Which doc comment are you referring to here?

I also agree with not hard-coding code point ranges, and instead extracting more Unicode data in src/libstd_unicode/unicode.py

I have a change that uses Mn instead of hard-coding ranges, pending the decision about the method / effect on str impls.

@varkor
Copy link
Member Author

varkor commented Apr 4, 2018

@SimonSapin: as far as I can tell, there's an issue using src/libstd_unicode/unicode.py here, as it's not available for core, which is where the formatting trait is defined. Is it worth moving it, or taking a different approach?

@clarfonthey
Copy link
Contributor

Wasn't the goal of having libstd_unicode separate from libcore to avoid having to include Unicode tables in libcore? Which makes it suddenly interesting considering how the changes to escape_debug make that less necessary. Or, I'm probably missing something.

Personally I really dislike the libstd_unicode façade, as it tends to get way less attention than libcore, liballoc, and libstd, which leads to bugs like #44303. I'm still confused why it exists and I don't think I've actually seen it explained somewhere.

@SimonSapin
Copy link
Contributor

#49698 merges std_unicode into core.

It might be interesting to replace printable.py (formerly char_private.py) and unicode.py with something based on https://github.com/BurntSushi/ucd-generate.

@bors
Copy link
Contributor

bors commented Apr 12, 2018

☔ The latest upstream changes (presumably #49698) made this pull request unmergeable. Please resolve the merge conflicts.

@varkor varkor force-pushed the combining-chars-escape_debug branch from fb39348 to a1c6deb Compare April 12, 2018 11:25
@varkor
Copy link
Member Author

varkor commented Apr 12, 2018

@SimonSapin: #49698 makes everything much more pleasant :)

I've rebased on top of it (and also cleaned up unicode.py whilst I was at it, because it had some slightly annoying behaviour before).

@SimonSapin
Copy link
Contributor

@varkor Thanks for your work on this. However I’m still not sure what the right thing to do here is, in terms of what characters exactly to escape or not in which context. Do you know what other languages do for debug-printing Unicode strings and characters?

@varkor
Copy link
Member Author

varkor commented Apr 12, 2018

@SimonSapin: Swift does something similar to the proposed change here. I haven't tried working out exactly which characters they choose to escape, but it seems reasonable to assume they choose either the same, or a similar category.

@varkor
Copy link
Member Author

varkor commented Apr 12, 2018

Ah, so Swift actually escapes any non-ASCII character, which is what Rust used to do before #24588:
https://github.com/apple/swift/blob/c35d508600516b892732e2fd3f0f0a17ca4562ba/stdlib/public/core/UnicodeScalar.swift#L197-L252
So this is new territory, and it makes sense to go with what we think is most sensible: in this case, I think the suggestion of Mn is a reasonable choice.

@varkor varkor force-pushed the combining-chars-escape_debug branch 2 times, most recently from 672b39f to 59513ad Compare April 12, 2018 23:05
@rust-lang rust-lang deleted a comment from rust-highfive Apr 12, 2018
@varkor varkor force-pushed the combining-chars-escape_debug branch from 59513ad to c85cc88 Compare April 13, 2018 17:37
@rust-lang rust-lang deleted a comment from rust-highfive Apr 13, 2018
@varkor varkor force-pushed the combining-chars-escape_debug branch from c85cc88 to e96a115 Compare April 13, 2018 20:26
@rust-lang rust-lang deleted a comment from rust-highfive Apr 13, 2018
@TimNN
Copy link
Contributor

TimNN commented Apr 17, 2018

@SimonSapin: Friendly triage ping :)

@rust-lang rust-lang deleted a comment from rust-highfive Apr 17, 2018
@SimonSapin SimonSapin added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 21, 2018
@varkor varkor force-pushed the combining-chars-escape_debug branch from 3591ecd to 2fa22ef Compare May 21, 2018 18:00
@SimonSapin
Copy link
Contributor

@bors r+

@bors
Copy link
Contributor

bors commented May 21, 2018

📌 Commit b653937 has been approved by SimonSapin

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels May 21, 2018
@bors
Copy link
Contributor

bors commented May 21, 2018

⌛ Testing commit b653937 with merge 65a16c0...

bors added a commit that referenced this pull request May 21, 2018
Escape combining characters in char::Debug

Although combining characters are technically printable, they make little sense to print on their own with `Debug`: it'd be better to escape them like non-printable characters.

This is a breaking change, but I imagine the fact `escape_debug` is rare and almost certainly primarily used for debugging that this is an acceptable change.
Resolves #41922.

r? @alexcrichton
cc @clarcharr
@bors
Copy link
Contributor

bors commented May 22, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: SimonSapin
Pushing 65a16c0 to master...

@bors bors merged commit b653937 into rust-lang:master May 22, 2018
@varkor varkor deleted the combining-chars-escape_debug branch May 22, 2018 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.