[API Proposal]: MemoryExtensions.CommonPrefixLength<T> #64271
Tagging subscribers to this area: @dotnet/area-system-memory
Issue Details
Background and motivation
We have MemoryExtensions.SequenceEqual, which reports true or false as to whether two inputs contain all the same contents, but we don't have a mechanism for reporting back the first place a difference occurred. In multiple places we end up open-coding such a loop, e.g.
runtime/src/libraries/System.Text.Json/src/System/Text/Json/Reader/Utf8JsonReader.MultiSegment.cs
Lines 629 to 650 in d910ce3
Lines 109 to 117 in d910ce3
We should have a helper that a) allows someone to write this in a single line rather than writing their own loops and b) does so as efficiently as possible, e.g. vectorized, as SequenceEqual is for many types.
API Proposal
namespace System
{
public static class MemoryExtensions
{
+ public static int SequenceEqualUntil<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other);
+ public static int SequenceEqualUntil<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);
}
}
API Usage
// Find prefix overlap between multiple strings
ReadOnlySpan<char> prefix = strings[0];
for (int i = 1; i < strings.Length && !prefix.IsEmpty; i++)
{
prefix = prefix.Slice(0, prefix.SequenceEqualUntil(strings[i]));
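}
Alternative Designs
We could add overloads of the existing SequenceEqual that have an additional out int argument, which would be 0 in the case of returning true and the position of the first difference in the case of returning false.
Risks
No response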
|
Are there better names? If I read that, I'd expect an API taking a length (pointless for span) or some other 'until' parameter, like SequenceEqual until newline. Maybe more in the direction of FindFirstDifference or FindIndexOfFirstDifference? |
What value would be returned if there is no match even for the first item? A Sorry if these questions are dumb, I just want to get a good understanding of how it would be consumed by the end users. I like the |
Yes, 0.
0
The length of the shorter span. Pseudo-code implementation would be like:
public static int SequenceEqualUntil<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null)
{
    // Fall back to the default comparer when none is supplied.
    comparer ??= EqualityComparer<T>.Default;
    int length = Math.Min(span.Length, other.Length);
    for (int i = 0; i < length; i++)
    {
        if (!comparer.Equals(span[i], other[i]))
            return i; // first mismatch: exactly i elements matched
    }
    return length; // the shorter input is a prefix of the longer one
}
So the return value indicates how much was the same between them. If it returns 0, no leading elements were the same. If it returns 1, the first element of each was the same. If it returns 2, the first two elements of each were the same. Etc.
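The return values described above can be checked with a small self-contained sketch. `PrefixSketch` and `LongestCommonPrefix` are hypothetical names used only for illustration, and the helper is a plain static method rather than the proposed extension method; it is not the vectorized implementation the proposal asks for.

```csharp
using System;
using System.Collections.Generic;

// How much of each pair of inputs matches from the start.
Console.WriteLine(PrefixSketch.CommonPrefixLength<char>("abc", "xyz"));    // 0: first elements differ
Console.WriteLine(PrefixSketch.CommonPrefixLength<char>("abc", "abd"));    // 2: "ab" is shared
Console.WriteLine(PrefixSketch.CommonPrefixLength<char>("abc", "abcdef")); // 3: length of the shorter input
Console.WriteLine(PrefixSketch.CommonPrefixLength<char>("", "abc"));       // 0: an empty input shares nothing
Console.WriteLine(PrefixSketch.LongestCommonPrefix("interview", "internet", "interval")); // inter

// Hypothetical helper type; the names here are assumptions for illustration.
static class PrefixSketch
{
    // Simple scalar loop with the semantics described in the thread.
    public static int CommonPrefixLength<T>(ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null)
    {
        comparer ??= EqualityComparer<T>.Default;
        int length = Math.Min(span.Length, other.Length);
        for (int i = 0; i < length; i++)
        {
            if (!comparer.Equals(span[i], other[i]))
                return i;
        }
        return length;
    }

    // Mirrors the "API Usage" loop: shrink the prefix against each string in turn.
    public static string LongestCommonPrefix(params string[] strings)
    {
        if (strings.Length == 0)
            return string.Empty;

        ReadOnlySpan<char> prefix = strings[0];
        for (int i = 1; i < strings.Length && !prefix.IsEmpty; i++)
        {
            prefix = prefix.Slice(0, CommonPrefixLength<char>(prefix, strings[i]));
        }
        return prefix.ToString();
    }
}
```

Because the result is a length rather than an index, it can be passed straight to `Slice(0, n)` even when one input is empty.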
I don't love that naming because it suggests the result might be a valid index in both spans, but it wouldn't be in the case of an empty span. |
How about "CommonSuffixLength"? Or is it too string-oriented? |
Presumably it'd be "CommonPrefixLength" 😄 I'd be ok with that or some variation thereof. |
I've written similar methods in my own code previously, but I chose the name
if (!spanOne.IsPrefixedBy(spanTwo, out var firstDifference))
{
    _logger.LogDebug("No prefix match (first deviance at index {index})", firstDifference);
}
This way, |
Alternative name: |
IndexOfFirstDifference was suggested earlier. My concern with the IndexOfDifference is the same: other IndexOf methods always return a valid index or -1. That wouldn't be the case here if one or both inputs were empty. In fact it's not actually returning an index, but a length. |
@stephentoub I am supportive of this idea in the current shape and I've marked it as ready for review. Would you like to present it in the API proposal meeting? |
Sure, thanks. |
We should consider having specific overloads for string ( Code which performs UTF-8 text processing would need similar logic to prevent errors or data corruption in its output. |
For the cited examples, Regex specifically doesn't consider surrogates; while that might be considered "wrong", it wouldn't be able to use this if the input did and it had no way to avoid it. Are the other cited uses in the opening issue already wrong then? Or for them are there other mitigating reasons why going byte-by-byte or char-by-char is correct? I want to make sure that if we "fix" the implementation to handle surrogates and the like that we're not then making it so that no one can actually use it. |
It could call the
I don't know what the other cited code samples are used for. But in general, if you're splitting a string (UTF-16 or UTF-8) at arbitrary indexes and treating the resulting splits as binary data, party on. Nobody's going to care what you do with the data. If you're splitting a string (UTF-16 or UTF-8) at arbitrary indexes and attempting to process those splits as text, bad things will happen. You'll almost certainly end up with data corruption for some inputs.
As a concrete example, consider the two French terms pêcheur ("fisherman") and pécheur ("sinner"). They differ only in the type of accent over the first e character. The UTF-8 representations of these texts are:
[ 70 C3 AA 63 68 65 75 72 ] = "pêcheur"
[ 70 C3 A9 63 68 65 75 72 ] = "pécheur"
The proposed CommonPrefixLength API would return 2 when presented with these inputs. Again, if you're treating this as binary data, whatever. But if you're attempting to do anything string-like with these APIs, such as case conversion or UTF-8 <-> UTF-16 conversion (for writing to i/o), then these strings once split will turn into:
"p�" + "�cheur"
"p�" + "�cheur"
And hey, look at that! In the "best" case they're corrupted and the meaning has changed. In the "worst" case the data corruption has caused them to have identical contents once reconstituted, which can cause your application to run down a code path you never intended. |
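The pêcheur/pécheur split above can be reproduced with a short sketch. `Utf8SplitSketch` and its methods are hypothetical names for illustration, and the sketch assumes the default `Encoding.UTF8` behavior of substituting U+FFFD for invalid byte sequences.

```csharp
using System;
using System.Text;

// \u00EA is ê and \u00E9 is é; escapes keep the source file encoding-neutral.
var (prefixLength, rejoinedA, rejoinedB) =
    Utf8SplitSketch.SplitAtCommonPrefix("p\u00EAcheur", "p\u00E9cheur");

Console.WriteLine(prefixLength);           // 2: the bytes 70 C3 are shared
Console.WriteLine(rejoinedA == rejoinedB); // True: both become "p\uFFFD" + "\uFFFDcheur"

// Hypothetical helper type; names are assumptions for illustration.
static class Utf8SplitSketch
{
    // Byte-wise common prefix with the semantics discussed in the thread.
    public static int CommonPrefixLength(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        int length = Math.Min(a.Length, b.Length);
        for (int i = 0; i < length; i++)
        {
            if (a[i] != b[i])
                return i;
        }
        return length;
    }

    // Splits both inputs at the byte-level common prefix, decodes each half
    // on its own (as text-processing code might), and glues the halves back.
    public static (int PrefixLength, string RejoinedA, string RejoinedB) SplitAtCommonPrefix(string a, string b)
    {
        byte[] bytesA = Encoding.UTF8.GetBytes(a);
        byte[] bytesB = Encoding.UTF8.GetBytes(b);
        int n = CommonPrefixLength(bytesA, bytesB);

        string rejoinedA = Encoding.UTF8.GetString(bytesA, 0, n)
                         + Encoding.UTF8.GetString(bytesA, n, bytesA.Length - n);
        string rejoinedB = Encoding.UTF8.GetString(bytesB, 0, n)
                         + Encoding.UTF8.GetString(bytesB, n, bytesB.Length - n);
        return (n, rejoinedA, rejoinedB);
    }
}
```

Comparing or slicing the raw bytes is harmless; the damage only appears once each half is decoded independently.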
It feels like a real pit of failure that
ReadOnlySpan<char> span = ...;
SomeMethod(span)
and
ReadOnlySpan<char> span = ...;
SomeMethod<char>(span);
would have different semantics. |
IMHO, if the user wishes for the implementation to be Unicode-aware, they should pass another parameter indicating as such. The most likely parameter that would indicate that, to me, would be a StringComparison value. That is:
ReadOnlySpan<char> span = ...;
SomeMethod(span)
and
ReadOnlySpan<char> span = ...;
SomeMethod(span, StringComparison.OrdinalIgnoreCase)
have different semantics, and it becomes obvious why: with the second overload you specified that you wanted the input treated "stringly", rather than as a sequence of char-size values |
You're not wrong. 😄 But that's why I think we really need to think long and hard about the scenarios that this API is intended for. My concern is that the API as currently proposed is already a pit of failure.
When processing raw binary data, the behavior as proposed is just fine. Everything is a bit-by-bit comparison and we're good to go.
When processing variable-length data formats like UTF-8/16, the behavior as proposed makes this API appropriate only for trie-like data structures and other scenarios where the caller never treats the prefix or suffix as anything other than opaque binary data. The only operations that would ever be valid for that opaque data are bit-for-bit equality comparisons. If you wanted to do anything "interesting" with the data (such as interpret it in a meaningful fashion other than as a binary blob), you'd have to concatenate it back into its whole, otherwise you risk misinterpreting the data.
The scenarios listed in the issue description strongly indicate that we expect callers to pass stringy data in here. This makes it very easy for them to fall into a trap where they unwittingly perform contextual processing of the data.
var dict = new Dictionary<string, object>(StringComparer.OrdinalIgnoreCase);
int commonPrefixLength = GetCommonPrefixLength(str1, str2);
string str3 = str1.Substring(0, commonPrefixLength);
dict[str3] = new object(); // <-- !! BUG !! - indirect contextual processing of substring (via StringComparer.OrdinalIgnoreCase.GetHashCode)
If we're still ok with this and don't expect that developers will commonly run afoul of this, let's march forward as-is. But at minimum this deserves a mention in the docs.
Not sure what you mean by "Unicode-aware". If you're talking about case insensitivity or culture-aware (such as en-US or de-DE), you'd need new overloads anyway, as the current proposed overloads wouldn't suffice. I wouldn't recommend adding culture awareness anyway, since both the API surface and the implementation get very messy very quickly. |
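The substring trap described above also shows up in UTF-16 when two strings differ only in the low surrogate of a pair. The sketch below illustrates it, together with one possible caller-side mitigation. `Utf16PrefixSketch` and `TrimToScalarBoundary` are hypothetical names, and the mitigation is an assumption for illustration, not something proposed in the thread.

```csharp
using System;

string a = "\U0001F600!"; // U+1F600 is stored as the surrogate pair D83D DE00
string b = "\U0001F601!"; // U+1F601 is stored as the surrogate pair D83D DE01

int raw = Utf16PrefixSketch.CommonPrefixLength(a, b);
int safe = Utf16PrefixSketch.TrimToScalarBoundary(a, raw);

Console.WriteLine(raw);  // 1: only the high surrogate D83D is shared
Console.WriteLine(safe); // 0: backing off avoids a lone surrogate
Console.WriteLine(char.IsHighSurrogate(a.Substring(0, raw)[0])); // True: the raw prefix is ill-formed text

// Hypothetical helper type; names are assumptions for illustration.
static class Utf16PrefixSketch
{
    // char-wise common prefix with the semantics discussed in the thread.
    public static int CommonPrefixLength(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
    {
        int length = Math.Min(a.Length, b.Length);
        for (int i = 0; i < length; i++)
        {
            if (a[i] != b[i])
                return i;
        }
        return length;
    }

    // If the cut would land between a high and a low surrogate, move it back
    // by one char so the prefix ends on a whole code point.
    public static int TrimToScalarBoundary(ReadOnlySpan<char> text, int length)
    {
        bool splitsPair = length > 0
            && length < text.Length
            && char.IsHighSurrogate(text[length - 1])
            && char.IsLowSurrogate(text[length]);
        return splitsPair ? length - 1 : length;
    }
}
```

This only guards against splitting a surrogate pair; it does nothing for combining sequences or culture-sensitive comparisons.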
I meant more so "dealing with the way UTF-8 or UTF-16 encodes text" rather than "dealing with a sequence of bytes/similar" when I said "Unicode-aware". Sorry for the confusion 😅 |
Understood. :) If we doc the behavior as "don't use this API over structured data (like UTF-* text)" then this should be ok. But otherwise it's hard to separate char from the concept of Unicode, as we literally define the Separately, I've been pondering for a while whether concerns like this are best handled by analyzers. For example, |
Had a chance to take another look at the three consumers mentioned previously. The JSON usage is fine because it's performing binary (not string-contextual) comparisons, and it's not doing anything with the data other than saying "here's where the error occurred." The Regex usage I assume is by design because that type already explicitly disavows supplementary character support. The Path logic is problematic for a handful of reasons, many of which aren't relevant to this issue. (Fun example: the strings |
namespace System;
public static class MemoryExtensions
{
public static int CommonPrefixLength<T>(this Span<T> span, ReadOnlySpan<T> other) where T: IEquatable<T>;
public static int CommonPrefixLength<T>(this Span<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);
public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other) where T: IEquatable<T>;
public static int CommonPrefixLength<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> other, IEqualityComparer<T>? comparer = null);
} |
Background and motivation
We have MemoryExtensions.SequenceEqual, which reports true or false as to whether two inputs contain all the same contents, but we don't have a mechanism for reporting back the first place a difference occurred. In multiple places we end up open-coding such a loop, e.g.
runtime/src/libraries/System.Text.Json/src/System/Text/Json/Reader/Utf8JsonReader.MultiSegment.cs
Lines 629 to 650 in d910ce3
runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs
Lines 109 to 117 in d910ce3
runtime/src/libraries/Common/src/System/IO/PathInternal.cs
Lines 53 to 77 in 09c1a1f
We should have a helper that a) allows someone to write this in a single line rather than writing their own loops and b) does so as efficiently as possible, e.g. vectorized, as SequenceEqual is for many types.
API Proposal
API Usage
Alternative Designs
We could add overloads of the existing SequenceEqual that have an additional out int argument, which would be 0 in the case of returning true and the position of the first difference in the case of returning false.
Risks
No response
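For comparison, the alternative design mentioned above (a SequenceEqual overload with an extra out int) might look like the following sketch. `AlternativeSketch` is a hypothetical name, and the method is a plain static method written only to illustrate the described contract; no such overload is confirmed by the thread.

```csharp
using System;
using System.Collections.Generic;

Console.WriteLine(AlternativeSketch.SequenceEqual<char>("abc", "abd", out int diff)); // False
Console.WriteLine(diff);                                                              // 2
Console.WriteLine(AlternativeSketch.SequenceEqual<char>("abc", "abc", out int same)); // True
Console.WriteLine(same);                                                              // 0

// Hypothetical helper type; names are assumptions for illustration.
static class AlternativeSketch
{
    // Returns true when both inputs have the same length and contents.
    // firstDifference is 0 on success, otherwise the position of the first mismatch
    // (which is the shorter length when one input is a strict prefix of the other).
    public static bool SequenceEqual<T>(ReadOnlySpan<T> span, ReadOnlySpan<T> other, out int firstDifference)
    {
        EqualityComparer<T> comparer = EqualityComparer<T>.Default;
        int length = Math.Min(span.Length, other.Length);

        int i = 0;
        while (i < length && comparer.Equals(span[i], other[i]))
        {
            i++;
        }

        bool equal = i == span.Length && i == other.Length;
        firstDifference = equal ? 0 : i;
        return equal;
    }
}
```

One consequence of this shape is that a result of 0 is only meaningful together with the bool, since it is reported both for equal inputs and for inputs whose first elements differ.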