-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Conversation
Corresponding corefx PR with reference APIs and unit tests is at dotnet/corefx#33395. |
} | ||
|
||
// non-validating ctor | ||
private Rune(uint scalarValue, bool unused) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool unused [](start = 39, length = 11)
why we have unused? is this only to distinguish it with the other constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I guess we could make that parameter bool skipValidation
instead, but would need to check the JIT output to make sure the branches are being properly elided by code gen.
return charsWritten; | ||
} | ||
|
||
public override bool Equals(object obj) => (obj is Rune other) && this.Equals(other); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(obj is Rune other) [](start = 51, length = 19)
I learned from Stephen (obj is Rune other) produce more complicated IL than if you manually doing the cast. we don't know the perf impact though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will definitely produce less optimized IL if the type T
on the right side of the is keyword is a reference type. I'm not sure what the behavior is since this is a value type. To be fair I didn't look at this too much since Equals(object) isn't really a highly optimized code path and I was going for readability more than anything else.
Oh no, "Rune" is going in, worst API name ever |
public static Rune ToUpperInvariant(Rune value) | ||
{ | ||
// Handle the most common case (ASCII data) first. Within the common case, we expect | ||
// that there'll be a mix of lowercase & uppercase chars, so make the conversion branchless. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is it branchless?
value.ValueUnsigned ^ ((isLowerAlpha) ? 0x20u : 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming the JIT will eventually gets its (bool) ? <power of two> : 0
optimization, and then we'll pick it up for free. It didn't seem worthwhile for me to use bitwise cleverness here when this is really the JIT's job. See also #16156 and https://github.com/dotnet/coreclr/issues/7447.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I benchmarked this, and even with optimized codegen the bit twiddling tricks are faster. C'est la vie.
/// assuming that the underlying <see cref="Rune"/> instance is well-formed. | ||
/// </remarks> | ||
[DebuggerDisplay("{DebuggerDisplay,nq}")] | ||
public readonly struct Rune : IComparable<Rune>, IEquatable<Rune> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't love the name and would have preferred something like UnicodeScalar. The fact that the XML comments states "Represents a Unicode scalar" just reinforces for me that such a descriptive name would be better for this type, over "Rune" which as far as I can tell is just gibberish, and for which as far as I can tell the best argument for it is that both Go and Swift use it, albeit apparently each with a slightly different actual meaning. That said, I understand there are two entrenched naming camps here; seems like there are a few very passionate folks in favor of Rune and lots of less passionate folks against but not so much that we're willing to fall on a sword for it. And I understand this is what was agreed to in API review. Just noting my concerns :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm definitely moving more toward "passionate" that it should (have) be(en) UnicodeScalar. While it's wordier, it's more descriptive, and the things like "GetRunes/EnumerateRunes" just feel... wrong.
and for which as far as I can tell the best argument for it is that both Go and Swift use it
Swift actually uses unicode scalar. (https://developer.apple.com/documentation/swift/unicode/scalar)
seems like there are a few very passionate folks in favor of Rune and lots of less passionate folks against but not so much that we're willing to fall on a sword for it. And I understand this is what was agreed to in API review.
I think that @terrajobst preferred Rune over UnicodeScalar due to the shorter name. And I think I led the majority with "I don't like Rune, but I won't go to the mat over this" and we essentially decided/agreed to not care.
So I'll move to a stronger "I think Rune is a bad name"; but it's possibly too late to matter, and probably still being outweighed by the ones who prefer Rune.
The closest to an FDG rule I can find is "DO favor readability over brevity". I hereby assert that UnicodeScalar is more readable (in that it's more descriptive).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to add, Rune is not even showing up in Unicode glossary https://www.unicode.org/glossary/ Also, looking at the dictionary, it has the definitions:
noun
any of the characters of certain ancient alphabets, as of a script used for writing the Germanic languages, especially of Scandinavia and Britain, from c200 to c1200, or a script used for inscriptions in a Turkic language of the 6th to 8th centuries from the area near the Orkhon River in Mongolia.
something written or inscribed in such characters.
an aphorism, poem, or saying with mystical meaning or for use in casting a spell.
It will be hard to map Unicode Scalar definition to Rune definition.
I vote for UnicodeScalar
/// <summary> | ||
/// Creates a <see cref="Rune"/> without performing validation on the input. | ||
/// </summary> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? I'm surprised this isn't otherwise inlined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I didn't measure that specifically. It was just a force of habit that I wrote it. :/
// - bottom 5 bits are the UnicodeCategory of the character | ||
private static ReadOnlySpan<byte> AsciiCharInfo => new byte[] | ||
{ | ||
0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x0E, 0x8E, 0x8E, 0x8E, 0x8E, 0x8E, 0x0E, 0x0E, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While ASCII's information probably won't change, it would be nice to see this derived from the same source as the rest of the Unicode categorization, instead of being an independent blob.
The latest commit in the PR is just in case we change the name |
ffd5060
to
9714422
Compare
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
This type represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; and [ U+E000..U+10FFFF ], inclusive). The primary scenario is for having a consistent representation of Unicode data regardless of the underlying input encoding type, including abstracting away surrogate code points. Commit migrated from dotnet/coreclr@7fcd8a8
The
Rune
type represents a Unicode scalar value ([U+0000..U+D7FF
], inclusive; and [U+E000..U+10FFFF
], inclusive); see https://unicode.org/glossary/#unicode_scalar_value for more information.This PR introduces the basic elemental type for this. There are APIs on this to mirror some of (but not all of) the APIs on
System.Char
andSystem.Text.CharUnicodeInfo
. For example, APIs dealing with surrogate values orIConvertible
do not exist. APIs are added to read a Rune from a string or to write a Rune to a UTF-16 output buffer.The API surface in this PR is not complete, and additional APIs will come in future PRs. Examples of future APIs are more powerful inspection of
string
andReadOnlySpan<char>
data, including the ability to enumerate Rune elements in a UTF-16 buffer and the ability to read from the end of a buffer rather than solely from the front of a buffer. (This will eventually become important for UTF-8 string trimming.) The Rune API surface will also eventually be enlightened with UTF-8 support as that lights up in the framework over the next several months.See also the
UnicodeScalar
proposal at https://github.com/dotnet/corefx/issues/30503 (and the corresponding API review at dotnet/apireviews#76) and the originalRune
issue at https://github.com/dotnet/corefx/issues/24093 for further context.There is some overlap in the logic for this type,
System.Char
andSystem.Text.CharUnicodeInfo
. Some of this logic cannot be reconciled due to behavioral differences (see https://github.com/dotnet/coreclr/issues/19706 for an example). However, much of the logic (including the bit-twiddling tricks done as performance optimizations) I'm hoping eventually to consolidate to the internalUnicodeUtility
helper class which is part of this PR and eventually haveRune
,System.Char
, andSystem.Text.CharUnicodeInfo
all share the One True Implementation(tm) where appropriate. I didn't do that as part of this PR because I want to minimize risk to existing code for now and I didn't want this PR to blow up in size spread across many different files.