Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Initial commit for System.Text.Rune #20935

Merged
merged 5 commits into from
Nov 14, 2018

Conversation

GrabYourPitchforks
Copy link
Member

The Rune type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive); see https://unicode.org/glossary/#unicode_scalar_value for more information.

This PR introduces the basic elemental type for this. There are APIs on this to mirror some of (but not all of) the APIs on System.Char and System.Text.CharUnicodeInfo. For example, APIs dealing with surrogate values or IConvertible do not exist. APIs are added to read a Rune from a string or to write a Rune to a UTF-16 output buffer.

The API surface in this PR is not complete, and additional APIs will come in future PRs. Examples of future APIs are more powerful inspection of string and ReadOnlySpan<char> data, including the ability to enumerate Rune elements in a UTF-16 buffer and the ability to read from the end of a buffer rather than solely from the front of a buffer. (This will eventually become important for UTF-8 string trimming.) The Rune API surface will also eventually be enlightened with UTF-8 support as that lights up in the framework over the next several months.

See also the UnicodeScalar proposal at https://github.com/dotnet/corefx/issues/30503 (and the corresponding API review at dotnet/apireviews#76) and the original Rune issue at https://github.com/dotnet/corefx/issues/24093 for further context.

There is some overlap in the logic for this type, System.Char and System.Text.CharUnicodeInfo. Some of this logic cannot be reconciled due to behavioral differences (see https://github.com/dotnet/coreclr/issues/19706 for an example). However, much of the logic (including the bit-twiddling tricks done as performance optimizations) I'm hoping eventually to consolidate to the internal UnicodeUtility helper class which is part of this PR and eventually have Rune, System.Char, and System.Text.CharUnicodeInfo all share the One True Implementation(tm) where appropriate. I didn't do that as part of this PR because I want to minimize risk to existing code for now and I didn't want this PR to blow up in size spread across many different files.

@GrabYourPitchforks
Copy link
Member Author

Corresponding corefx PR with reference APIs and unit tests is at dotnet/corefx#33395.

}

// non-validating ctor
private Rune(uint scalarValue, bool unused)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool unused [](start = 39, length = 11)

why we have unused? is this only to distinguish it with the other constructor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I guess we could make that parameter bool skipValidation instead, but would need to check the JIT output to make sure the branches are being properly elided by code gen.

return charsWritten;
}

public override bool Equals(object obj) => (obj is Rune other) && this.Equals(other);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(obj is Rune other) [](start = 51, length = 19)

I learned from Stephen (obj is Rune other) produce more complicated IL than if you manually doing the cast. we don't know the perf impact though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will definitely produce less optimized IL if the type T on the right side of the is keyword is a reference type. I'm not sure what the behavior is since this is a value type. To be fair I didn't look at this too much since Equals(object) isn't really a highly optimized code path and I was going for readability more than anything else.

@itsamelambda
Copy link

Oh no, "Rune" is going in, worst API name ever

public static Rune ToUpperInvariant(Rune value)
{
// Handle the most common case (ASCII data) first. Within the common case, we expect
// that there'll be a mix of lowercase & uppercase chars, so make the conversion branchless.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it branchless?
value.ValueUnsigned ^ ((isLowerAlpha) ? 0x20u : 0)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the JIT will eventually gets its (bool) ? <power of two> : 0 optimization, and then we'll pick it up for free. It didn't seem worthwhile for me to use bitwise cleverness here when this is really the JIT's job. See also #16156 and https://github.com/dotnet/coreclr/issues/7447.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I benchmarked this, and even with optimized codegen the bit twiddling tricks are faster. C'est la vie.

/// assuming that the underlying <see cref="Rune"/> instance is well-formed.
/// </remarks>
[DebuggerDisplay("{DebuggerDisplay,nq}")]
public readonly struct Rune : IComparable<Rune>, IEquatable<Rune>
Copy link
Member

@stephentoub stephentoub Nov 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't love the name and would have preferred something like UnicodeScalar. The fact that the XML comments states "Represents a Unicode scalar" just reinforces for me that such a descriptive name would be better for this type, over "Rune" which as far as I can tell is just gibberish, and for which as far as I can tell the best argument for it is that both Go and Swift use it, albeit apparently each with a slightly different actual meaning. That said, I understand there are two entrenched naming camps here; seems like there are a few very passionate folks in favor of Rune and lots of less passionate folks against but not so much that we're willing to fall on a sword for it. And I understand this is what was agreed to in API review. Just noting my concerns :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm definitely moving more toward "passionate" that it should (have) be(en) UnicodeScalar. While it's wordier, it's more descriptive, and the things like "GetRunes/EnumerateRunes" just feel... wrong.

and for which as far as I can tell the best argument for it is that both Go and Swift use it

Swift actually uses unicode scalar. (https://developer.apple.com/documentation/swift/unicode/scalar)

seems like there are a few very passionate folks in favor of Rune and lots of less passionate folks against but not so much that we're willing to fall on a sword for it. And I understand this is what was agreed to in API review.

I think that @terrajobst preferred Rune over UnicodeScalar due to the shorter name. And I think I led the majority with "I don't like Rune, but I won't go to the mat over this" and we essentially decided/agreed to not care.

So I'll move to a stronger "I think Rune is a bad name"; but it's possibly too late to matter, and probably still being outweighed by the ones who prefer Rune.

The closest to an FDG rule I can find is "DO favor readability over brevity". I hereby assert that UnicodeScalar is more readable (in that it's more descriptive).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to add, Rune is not even showing up in Unicode glossary https://www.unicode.org/glossary/ Also, looking at the dictionary, it has the definitions:

noun
any of the characters of certain ancient alphabets, as of a script used for writing the Germanic languages, especially of Scandinavia and Britain, from c200 to c1200, or a script used for inscriptions in a Turkic language of the 6th to 8th centuries from the area near the Orkhon River in Mongolia.
something written or inscribed in such characters.
an aphorism, poem, or saying with mystical meaning or for use in casting a spell.

It will be hard to map Unicode Scalar definition to Rune definition.

I vote for UnicodeScalar

/// <summary>
/// Creates a <see cref="Rune"/> without performing validation on the input.
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? I'm surprised this isn't otherwise inlined.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I didn't measure that specifically. It was just a force of habit that I wrote it. :/

// - bottom 5 bits are the UnicodeCategory of the character
private static ReadOnlySpan<byte> AsciiCharInfo => new byte[]
{
0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x8E, 0x8E, 0x8E, 0x8E, 0x8E, 0x0E, 0x0E,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While ASCII's information probably won't change, it would be nice to see this derived from the same source as the rest of the Unicode categorization, instead of being an independent blob.

@GrabYourPitchforks
Copy link
Member Author

The latest commit in the PR is just in case we change the name Rune back to UnicodeScalar. I wanted to have the commit ready to go pending any further API review.

@GrabYourPitchforks GrabYourPitchforks merged commit 7fcd8a8 into dotnet:master Nov 14, 2018
dotnet-maestro-bot pushed a commit to dotnet-maestro-bot/corefx that referenced this pull request Nov 14, 2018
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
dotnet-maestro-bot pushed a commit to dotnet-maestro-bot/corert that referenced this pull request Nov 14, 2018
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas pushed a commit to dotnet/corefx that referenced this pull request Nov 14, 2018
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas pushed a commit to dotnet/corert that referenced this pull request Nov 14, 2018
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jlennox pushed a commit to jlennox/corefx that referenced this pull request Dec 16, 2018
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points.

Commit migrated from dotnet/coreclr@7fcd8a8
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants