Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize InternableString.GetHashCode #6816

Merged
merged 7 commits into from
Sep 14, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions src/StringTools.UnitTests/SpanBasedStringBuilder_Tests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,28 @@ public void ReferenceEqualsReturnsExpectedValue()
internableString.ReferenceEquals(str).ShouldBeFalse();
}

[Theory]
[InlineData("012345678")] // odd number of characters
[InlineData("0123456789")] // even number of characters
public void GetHashCodeIsStableRegardlessOfSpanLength(string testString)
{
int hashCode = new InternableString(testString).GetHashCode();

// Chop the string into 2-3 parts and verify that the hash code is unchanged.
for (int i = 0; i < testString.Length - 1; i++)
{
for (int j = i + 1; j < testString.Length; j++)
{
SpanBasedStringBuilder stringBuilder = new SpanBasedStringBuilder();
stringBuilder.Append(testString.Substring(0, i));
stringBuilder.Append(testString.Substring(i, j - i));
stringBuilder.Append(testString.Substring(j));
InternableString internableString = new InternableString(stringBuilder);
internableString.GetHashCode().ShouldBe(hashCode);
}
}
}

[Theory]
[MemberData(nameof(TestData))]
public void AppendAppendsString(InterningTestData.TestDatum datum)
Expand Down
4 changes: 2 additions & 2 deletions src/StringTools.UnitTests/WeakStringCache_Tests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -84,8 +84,8 @@ private void AddStringsWithSameHashCode(int numberOfStrings)

for (int i = 0; i < numberOfStrings; i++)
{
string strPart2 = "1" + String.Concat(Enumerable.Repeat("4428939786", i));
hashCodes[i] = AddString("Random string ", strPart2, (string cachedString) =>
string strPart2 = string.Concat(Enumerable.Repeat("100570862200", i + 2));
hashCodes[i] = AddString(string.Empty, strPart2, (string cachedString) =>
{
_cache.GetDebugInfo().ShouldBe(new WeakStringCache.DebugInfo()
{
Expand Down
78 changes: 67 additions & 11 deletions src/StringTools/InternableString.cs
Original file line number Diff line number Diff line change
Expand Up @@ -299,33 +299,89 @@ public override string ToString()
}

/// <summary>
/// Implements the simple yet very decently performing djb2 hash function (xor version).
/// Implements the simple yet very decently performing djb2-like hash function (xor version) as inspired by
/// https://github.com/dotnet/runtime/blob/6262ae8e6a33abac569ab6086cdccc470c810ea4/src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs#L810-L840
/// </summary>
/// <returns>A stable hashcode of the string represented by this instance.</returns>
/// <remarks>
/// Unlike the BCL method, this implementation works only on two characters at a time to cut on the complexity with
/// characters that feed into the same operation but straddle multiple spans. Note that it must return the same value for
/// a given string regardless of how it's split into spans (e.g. { "AB" } and { "A", "B" } have the same hash code).
/// </remarks>
public override unsafe int GetHashCode()
ladipro marked this conversation as resolved.
Show resolved Hide resolved
{
int hashCode = 5381;
uint hash = (5381 << 16) + 5381;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly off-topic: Wouldn't caching this string's hashcode result in a net improvement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetHashCode runs only once for a given string unless the caller calls SpanBasedStringBuilder.ToString() multiple times on the same instance without mutating it between the calls. Same as calling StringBuilder.ToString() multiple times, it is technically possible but enough of an anti-pattern that the implementation does not cache the result.

More on running GetHashCode only once: When a string is added to the weak cache, its hash code is used as a key in a dictionary. So it's not calculated on each look-up because the look-up is done based on the hash code and not the string itself. When we're looking for a string in the weak cache, we calculate its hash code once per lookup and, as argued above, there should not be more than one look-up for the same string.

bool hashedOddNumberOfCharacters = false;

fixed (char* charPtr = _inlineSpan)
{
for (int i = 0; i < _inlineSpan.Length; i++)
{
hashCode = unchecked(hashCode * 33 ^ charPtr[i]);
}
GetHashCodeHelper(charPtr, _inlineSpan.Length, ref hash, ref hashedOddNumberOfCharacters);
}
if (_spans != null)
{
foreach (ReadOnlyMemory<char> span in _spans)
{
fixed (char* charPtr = span.Span)
{
for (int i = 0; i < span.Length; i++)
{
hashCode = unchecked(hashCode * 33 ^ charPtr[i]);
}
GetHashCodeHelper(charPtr, span.Length, ref hash, ref hashedOddNumberOfCharacters);
}
}
}
return hashCode;
return (int)(hash);
}

/// <summary>
/// Hashes a memory block specified by a pointer and length.
/// </summary>
/// <param name="charPtr">Pointer to the first character.</param>
/// <param name="length">Number of characters at <paramref name="charPtr"/>.</param>
/// <param name="hash">The running hash code.</param>
/// <param name="hashedOddNumberOfCharacters">True if the incoming <paramref name="hash"/> was calculated from an odd number of characters.</param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void GetHashCodeHelper(char* charPtr, int length, ref uint hash, ref bool hashedOddNumberOfCharacters)
ladipro marked this conversation as resolved.
Show resolved Hide resolved
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid using ref variables in tight loops. JIT cant optimize it into registry.
I recommend to change signature to private static unsafe uint GetHashCodeHelper(char* charPtr, int length, uint hash, ref bool hashedOddNumberOfCharacters) and call it hash = GetHashCodeHelper(charPtr, span.Length, hash, ref hashedOddNumberOfCharacters);

In my micro benchmark, this simple change makes it about 2x faster.

Copy link
Member Author

@ladipro ladipro Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this must be the reason why it got slower after the last update. Confirming your results, it really is more than 2x faster after eliminating the ref parameter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it also be faster using out parameters? Having an input and out parameter that happen to match. Or maybe returning a tuple?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Forgind Using non ref local variable as running hash in tight loop and than copy it to out parameter would most probably render about same benefit. However, returning integer value from procedure is something highly optimized by calling conventions. In particular it returns value in registry eax. This is significantly faster than exchanging return values by copying it into stack memory which both value tuple and out variables does. By significant I mean about 1 us slower for modern CPUs, so in practical world it rarely matters.

if (hashedOddNumberOfCharacters && length > 0)
{
// If the number of characters hashed so far is odd, the first character of the current block completes
// the calculation done with the last character of the previous block.
hash ^= BitConverter.IsLittleEndian ? ((uint)*charPtr << 16) : *charPtr;
length--;
charPtr++;
hashedOddNumberOfCharacters = false;
}

// The loop hashes two characters at a time.
uint* ptr = (uint*)charPtr;
while (length >= 2)
{
length -= 2;
hash = (RotateLeft(hash, 5) + hash) ^ *ptr;
ptr += 1;
}

if (length > 0)
{
hash = (RotateLeft(hash, 5) + hash) ^ (BitConverter.IsLittleEndian ? *((char*)ptr) : ((uint)*((char*)ptr) << 16));
ladipro marked this conversation as resolved.
Show resolved Hide resolved
hashedOddNumberOfCharacters = true;
}
}

/// <summary>
/// Rotates an integer by the specified number of bits.
/// </summary>
/// <param name="value">The value to rotate.</param>
/// <param name="offset">The number of bits.</param>
/// <returns>The rotated value.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint RotateLeft(uint value, int offset)
{
#if NETCOREAPP
return System.Numerics.BitOperations.RotateLeft(value, offset);
#else
// Copied from System\Numerics\BitOperations.cs in dotnet/runtime as the routine is not available on .NET Framework.
// The JIT recognized the pattern and generates efficient code, e.g. the rol instruction on x86/x64.
return (value << offset) | (value >> (32 - offset));
#endif
}
}
}