This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Initial commit for System.Text.Rune #20935

Merged

GrabYourPitchforks merged 5 commits into dotnet:master from GrabYourPitchforks:rune

Nov 14, 2018

Member

GrabYourPitchforks commented Nov 10, 2018

The Rune type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive); see https://unicode.org/glossary/#unicode_scalar_value for more information.

This PR introduces the basic elemental type for this. There are APIs on this to mirror some of (but not all of) the APIs on System.Char and System.Text.CharUnicodeInfo. For example, APIs dealing with surrogate values or IConvertible do not exist. APIs are added to read a Rune from a string or to write a Rune to a UTF-16 output buffer.
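
Editorial sketch (not code from this PR): the description above mentions writing a scalar value to a UTF-16 output buffer. The standard UTF-16 encoding rule can be illustrated as follows. The class and method names are hypothetical and are not the Rune API surface.

```csharp
using System;

public static class Utf16EncodeSketch
{
    // Hypothetical helper: writes one Unicode scalar value as UTF-16 and
    // returns the number of chars written (1 or 2).
    public static int EncodeToUtf16(uint scalar, Span<char> destination)
    {
        // Surrogate code points and values above U+10FFFF are not scalar values.
        if ((scalar >= 0xD800u && scalar <= 0xDFFFu) || scalar > 0x10FFFFu)
        {
            throw new ArgumentOutOfRangeException(nameof(scalar));
        }

        if (scalar <= 0xFFFFu)
        {
            // Basic Multilingual Plane: a single UTF-16 code unit.
            destination[0] = (char)scalar;
            return 1;
        }

        // Supplementary plane: split the 20-bit offset into a surrogate pair.
        uint offset = scalar - 0x10000u;
        destination[0] = (char)(0xD800u + (offset >> 10));    // high surrogate
        destination[1] = (char)(0xDC00u + (offset & 0x3FFu)); // low surrogate
        return 2;
    }
}
```

For example, U+1F600 encodes to the surrogate pair D83D DE00.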

The API surface in this PR is not complete, and additional APIs will come in future PRs. Examples of future APIs are more powerful inspection of string and ReadOnlySpan<char> data, including the ability to enumerate Rune elements in a UTF-16 buffer and the ability to read from the end of a buffer rather than solely from the front of a buffer. (This will eventually become important for UTF-8 string trimming.) The Rune API surface will also eventually be enlightened with UTF-8 support as that lights up in the framework over the next several months.

See also the UnicodeScalar proposal at https://github.com/dotnet/corefx/issues/30503 (and the corresponding API review at dotnet/apireviews#76) and the original Rune issue at https://github.com/dotnet/corefx/issues/24093 for further context.

There is some overlap in the logic for this type, System.Char, and System.Text.CharUnicodeInfo. Some of this logic cannot be reconciled due to behavioral differences (see https://github.com/dotnet/coreclr/issues/19706 for an example). However, I'm hoping eventually to consolidate much of the logic (including the bit-twiddling tricks done as performance optimizations) into the internal UnicodeUtility helper class which is part of this PR, and eventually to have Rune, System.Char, and System.Text.CharUnicodeInfo all share the One True Implementation(tm) where appropriate. I didn't do that as part of this PR because I want to minimize risk to existing code for now, and I didn't want this PR to balloon in size across many different files.

GrabYourPitchforks requested review from ahsonkhan, bartonjs and tarekgh

November 10, 2018 23:59

GrabYourPitchforks mentioned this pull request

System.Text.Rune ref APIs and unit tests dotnet/corefx#33395

Merged

Member Author

GrabYourPitchforks commented Nov 11, 2018

Corresponding corefx PR with reference APIs and unit tests is at dotnet/corefx#33395.

tarekgh reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs

+                      }
+                      // non-validating ctor
+                      private Rune(uint scalarValue, bool unused)

Member

tarekgh Nov 11, 2018

bool unused

Why do we have unused? Is it only there to distinguish this constructor from the other one?

Member Author

GrabYourPitchforks Nov 11, 2018

Yeah. I guess we could make that parameter bool skipValidation instead, but would need to check the JIT output to make sure the branches are being properly elided by code gen.
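
Editorial sketch (not code from this PR): the exchange above concerns a private constructor whose extra bool parameter exists only to give it a different signature from the validating constructor. A minimal illustration of that pattern, using a hypothetical ScalarSketch type rather than the real Rune:

```csharp
using System;

public readonly struct ScalarSketch
{
    private readonly uint _value;

    // Validating constructor: rejects surrogates and out-of-range values.
    public ScalarSketch(uint value)
    {
        bool isScalar = value < 0xD800u || (value >= 0xE000u && value <= 0x10FFFFu);
        if (!isScalar)
        {
            throw new ArgumentOutOfRangeException(nameof(value));
        }
        _value = value;
    }

    // Non-validating constructor. The bool is never read; it exists only so
    // that overload resolution can tell this constructor from the one above.
    private ScalarSketch(uint value, bool unused)
    {
        _value = value;
    }

    // The caller already knows U+FFFD is a valid scalar value, so the
    // validation step can be skipped.
    public static ScalarSketch ReplacementChar => new ScalarSketch(0xFFFDu, false);

    public uint Value => _value;
}
```

Because the parameter is never read, this form has no branch for the JIT to elide, which is the concern raised about a skipValidation flag.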

tarekgh reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs

tarekgh reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs

+                          return charsWritten;
+                      }
+                      public override bool Equals(object obj) => (obj is Rune other) && this.Equals(other);

Member

tarekgh Nov 11, 2018

(obj is Rune other)

I learned from Stephen that (obj is Rune other) produces more complicated IL than doing the cast manually. We don't know the perf impact, though.

Member Author

GrabYourPitchforks Nov 12, 2018

It will definitely produce less optimized IL if the type T on the right side of the is keyword is a reference type. I'm not sure what the behavior is since this is a value type. To be fair I didn't look at this too much since Equals(object) isn't really a highly optimized code path and I was going for readability more than anything else.
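
Editorial sketch (not code from this PR): the two forms being compared in this thread, shown side by side on a hypothetical CodePointSketch struct. Both return the same results; the sketch makes no claim about the IL or the performance of either form.

```csharp
using System;

public readonly struct CodePointSketch : IEquatable<CodePointSketch>
{
    public uint Value { get; }

    public CodePointSketch(uint value)
    {
        Value = value;
    }

    public bool Equals(CodePointSketch other) => Value == other.Value;

    // Pattern-matching form, as written in the PR's Equals(object).
    public override bool Equals(object obj) => (obj is CodePointSketch other) && Equals(other);

    // Manual form: type test followed by an explicit unboxing cast.
    // The method name is hypothetical and exists only for comparison.
    public bool EqualsWithManualCast(object obj) => (obj is CodePointSketch) && Equals((CodePointSketch)obj);

    public override int GetHashCode() => (int)Value;
}
```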

jkotas reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs Outdated

jkotas reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs Outdated

itsamelambda commented Nov 11, 2018

Oh no, "Rune" is going in, worst API name ever

ahsonkhan approved these changes

src/System.Private.CoreLib/shared/System/Text/Rune.cs Outdated

src/System.Private.CoreLib/shared/System/Text/Rune.cs Outdated

src/System.Private.CoreLib/shared/System/Text/Rune.cs

+                      public static Rune ToUpperInvariant(Rune value)
+                      {
+                          // Handle the most common case (ASCII data) first. Within the common case, we expect
+                          // that there'll be a mix of lowercase & uppercase chars, so make the conversion branchless.

ahsonkhan Nov 12, 2018

How is it branchless?
value.ValueUnsigned ^ ((isLowerAlpha) ? 0x20u : 0)

Member Author

GrabYourPitchforks Nov 12, 2018

I'm assuming the JIT will eventually get its (bool) ? <power of two> : 0 optimization, and then we'll pick it up for free. It didn't seem worthwhile for me to use bitwise cleverness here when this is really the JIT's job. See also #16156 and https://github.com/dotnet/coreclr/issues/7447.

Member Author

GrabYourPitchforks Nov 12, 2018

I benchmarked this, and even with optimized codegen the bit twiddling tricks are faster. C'est la vie.
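
Editorial sketch (not code from this PR): one way to write the ASCII uppercase conversion without a conditional, next to the ternary form quoted above. The names are hypothetical, the bit trick is a commonly used one and may differ from what was actually committed, and the branchless version assumes its input is already known to be ASCII (0x00..0x7F).

```csharp
using System;

public static class AsciiUpperSketch
{
    // Ternary form, equivalent to the expression quoted in the review.
    public static uint ToUpperTernary(uint value)
    {
        bool isLowerAlpha = (value - 0x61u) <= 25u; // 'a'..'z'; relies on unsigned wraparound
        return value ^ (isLowerAlpha ? 0x20u : 0u);
    }

    // Branchless form. Precondition: value <= 0x7F.
    public static uint ToUpperBranchless(uint value)
    {
        uint lowerIndicator = value + 0x80u - 0x61u; // bit 7 set when value >= 'a'
        uint upperIndicator = value + 0x80u - 0x7Bu; // bit 7 set when value > 'z'
        uint combined = lowerIndicator ^ upperIndicator; // bit 7 set only for 'a'..'z'
        uint mask = (combined & 0x80u) >> 2;             // 0x20 for 'a'..'z', else 0
        return value ^ mask;                             // flipping bit 5 maps 'a' to 'A'
    }
}
```

Whether the second form is actually faster depends on the JIT and the hardware; the comment above reports that it was in the author's benchmark.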

stephentoub reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs

+                  /// assuming that the underlying <see cref="Rune"/> instance is well-formed.
+                  /// </remarks>
+                  [DebuggerDisplay("{DebuggerDisplay,nq}")]
+                  public readonly struct Rune : IComparable<Rune>, IEquatable<Rune>

Member

stephentoub Nov 12, 2018 (edited)

I still don't love the name and would have preferred something like UnicodeScalar. The fact that the XML comments state "Represents a Unicode scalar" just reinforces for me that such a descriptive name would be better for this type, over "Rune" which as far as I can tell is just gibberish, and for which as far as I can tell the best argument is that both Go and Swift use it, albeit apparently each with a slightly different actual meaning. That said, I understand there are two entrenched naming camps here; seems like there are a few very passionate folks in favor of Rune and lots of less passionate folks against but not so much that we're willing to fall on a sword for it. And I understand this is what was agreed to in API review. Just noting my concerns :)

Member

bartonjs Nov 12, 2018

I'm definitely moving more toward "passionate" that it should (have) be(en) UnicodeScalar. While it's wordier, it's more descriptive, and the things like "GetRunes/EnumerateRunes" just feel... wrong.

"and for which as far as I can tell the best argument for it is that both Go and Swift use it"

Swift actually uses unicode scalar. (https://developer.apple.com/documentation/swift/unicode/scalar)

"seems like there are a few very passionate folks in favor of Rune and lots of less passionate folks against but not so much that we're willing to fall on a sword for it. And I understand this is what was agreed to in API review."

I think that @terrajobst preferred Rune over UnicodeScalar due to the shorter name. And I think I led the majority with "I don't like Rune, but I won't go to the mat over this" and we essentially decided/agreed to not care.

So I'll move to a stronger "I think Rune is a bad name"; but it's possibly too late to matter, and probably still being outweighed by the ones who prefer Rune.

The closest to an FDG rule I can find is "DO favor readability over brevity". I hereby assert that UnicodeScalar is more readable (in that it's more descriptive).

Member

tarekgh Nov 12, 2018

I want to add that Rune does not even show up in the Unicode glossary https://www.unicode.org/glossary/. Also, the dictionary gives these definitions:

noun
any of the characters of certain ancient alphabets, as of a script used for writing the Germanic languages, especially of Scandinavia and Britain, from c200 to c1200, or a script used for inscriptions in a Turkic language of the 6th to 8th centuries from the area near the Orkhon River in Mongolia.
something written or inscribed in such characters.
an aphorism, poem, or saying with mystical meaning or for use in casting a spell.

It will be hard to map the Unicode scalar definition onto the Rune definition.

I vote for UnicodeScalar

stephentoub reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs

+                      /// <summary>
+                      /// Creates a <see cref="Rune"/> without performing validation on the input.
+                      /// </summary>
+                      [MethodImpl(MethodImplOptions.AggressiveInlining)]

Member

stephentoub Nov 12, 2018

Is this needed? I'm surprised this isn't otherwise inlined.

Member Author

GrabYourPitchforks Nov 12, 2018

TBH I didn't measure that specifically. It was just a force of habit that I wrote it. :/

stephentoub approved these changes

bartonjs reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs

+                      // - bottom 5 bits are the UnicodeCategory of the character
+                      private static ReadOnlySpan<byte> AsciiCharInfo => new byte[]
+                      {
+                          0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x8E, 0x8E, 0x8E, 0x8E, 0x8E, 0x0E, 0x0E,

Member

bartonjs Nov 12, 2018

While ASCII's information probably won't change, it would be nice to see this derived from the same source as the rest of the Unicode categorization, instead of being an independent blob.
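
Editorial sketch (not code from this PR): one way such a table could be derived from the framework's existing categorization data instead of being written out by hand. The diff comment says the bottom 5 bits hold the UnicodeCategory. The meaning of the 0x80 flag is an assumption here, inferred from the visible row in which U+0009..U+000D carry 0x8E: it is treated as a white-space flag. The class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

public static class AsciiCharInfoSketch
{
    // Builds a 128-entry table: low 5 bits = UnicodeCategory,
    // high bit (0x80) = white space (assumed meaning, see lead-in).
    public static byte[] Build()
    {
        var table = new byte[128];
        for (int i = 0; i < table.Length; i++)
        {
            char c = (char)i;
            byte category = (byte)CharUnicodeInfo.GetUnicodeCategory(c);
            byte whiteSpaceFlag = char.IsWhiteSpace(c) ? (byte)0x80 : (byte)0;
            table[i] = (byte)(category | whiteSpaceFlag);
        }
        return table;
    }
}
```

Under that assumption, the first sixteen entries reproduce the row shown in the diff: control characters have category 14 (0x0E), and tab through carriage return additionally carry the 0x80 flag.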

bartonjs reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs Outdated

bartonjs reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs

bartonjs reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs

tarekgh reviewed

src/System.Private.CoreLib/shared/System/Text/UnicodeUtility.cs

tarekgh reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs Outdated

tarekgh reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs Outdated

tarekgh reviewed

src/System.Private.CoreLib/shared/System/Text/UnicodeUtility.cs

jkotas reviewed

src/System.Private.CoreLib/shared/System/Text/Rune.cs Outdated

Member Author

GrabYourPitchforks commented Nov 13, 2018

The latest commit in the PR is just in case we change the name Rune back to UnicodeScalar. I wanted to have the commit ready to go pending any further API review.

GrabYourPitchforks added 5 commits

November 13, 2018 15:23


Initial commit for System.Text.Rune

e671dfe

PR feedback

Move GetRuneAt / TryGetRuneAt to System.String

dabb9f1

Move GetRuneAt / TryGetRuneAt back to Rune

d9102e9

Other PR feedback

Doc comment fixup

GrabYourPitchforks force-pushed the rune branch from ffd5060 to 9714422

November 13, 2018 23:29

GrabYourPitchforks merged commit 7fcd8a8 into dotnet:master

dotnet-maestro-bot pushed a commit to dotnet-maestro-bot/corefx that referenced this pull request


Initial commit for System.Text.Rune (dotnet/coreclr#20935)

This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

dotnet-maestro-bot pushed a commit to dotnet-maestro-bot/corert that referenced this pull request


Initial commit for System.Text.Rune (dotnet/coreclr#20935)

a46ffd4

GrabYourPitchforks deleted the rune branch

November 14, 2018 01:37

jkotas pushed a commit to dotnet/corefx that referenced this pull request


Initial commit for System.Text.Rune (dotnet/coreclr#20935)

77ef1ff

jkotas pushed a commit to dotnet/corert that referenced this pull request


Initial commit for System.Text.Rune (dotnet/coreclr#20935)

ea4567a

jlennox pushed a commit to jlennox/corefx that referenced this pull request


Initial commit for System.Text.Rune (dotnet/coreclr#20935)

8f9c10a

This was referenced Jan 31, 2020

[Tracking] UnicodeScalar: Does any platform's case conversion routine allow for full folding? dotnet/runtime#11454

Closed

Flow System.Text.Rune through more APIs dotnet/runtime#27912

Open

picenka21 pushed a commit to picenka21/runtime that referenced this pull request


Initial commit for System.Text.Rune (dotnet/coreclr#20935)

17c7414

Commit migrated from dotnet/coreclr@7fcd8a8

Reviewers

stephentoub approved these changes

jkotas left review comments

bartonjs left review comments

tarekgh left review comments

ahsonkhan approved these changes
