Optimize InternableString.GetHashCode #6816

ladipro · 2021-09-03T22:27:37Z

Context

The current straightforward implementation is slow, especially on 64-bit Framework CLR. It is causing a measurable regression when evaluating large projects.

Changes Made

Rewrote the routine using an approach similar to this BCL method. It processes only 2 characters at a time rather than 4, though, to reduce complexity, since our implementation works on a list of spans and not just one string. The additional perf benefit of going 4 at a time would be relatively small.
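To make the "2 characters at a time over a list of spans" idea concrete, here is a minimal sketch in safe C#. It is not the code from this PR: the class name SegmentedHashSketch, the ReadOnlyMemory<char> segment type, and the pending-character bookkeeping are assumptions made for illustration. The point it tries to show is that a character left over at the end of one span has to be paired with the first character of the next span, so the result does not depend on where the string is split.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; names and structure are not taken from the PR.
public static class SegmentedHashSketch
{
    // Rotate left by `count` bits (what BitOperations.RotateLeft does on newer runtimes).
    private static uint RotateLeft(uint value, int count) =>
        (value << count) | (value >> (32 - count));

    // Mix one 32-bit block (up to two UTF-16 characters) into the running hash.
    private static uint Mix(uint hash, uint block) =>
        (RotateLeft(hash, 5) + hash) ^ block;

    public static int Hash(IEnumerable<ReadOnlyMemory<char>> segments)
    {
        uint hash = (5381 << 16) + 5381;
        int pending = -1; // first character of an incomplete pair, or -1 if none

        foreach (ReadOnlyMemory<char> segment in segments)
        {
            ReadOnlySpan<char> span = segment.Span;
            int i = 0;

            // Complete a pair that was started at the end of the previous segment.
            if (pending >= 0 && span.Length > 0)
            {
                hash = Mix(hash, (uint)pending | ((uint)span[0] << 16));
                pending = -1;
                i = 1;
            }

            // Consume the rest of the segment two characters per mixing step.
            for (; i + 1 < span.Length; i += 2)
            {
                hash = Mix(hash, (uint)span[i] | ((uint)span[i + 1] << 16));
            }

            // Remember a trailing odd character for the next segment.
            if (i < span.Length)
            {
                pending = span[i];
            }
        }

        // A single leftover character is mixed in on its own.
        if (pending >= 0)
        {
            hash = Mix(hash, (uint)pending);
        }

        return (int)hash;
    }
}
```

With this sketch, hashing the segments "ab" + "cde" gives the same value as hashing "abcde" as a single segment, which is the property a hash over a list of spans needs.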

Testing

A new unit test to verify correctness.
Micro-benchmark showing 1.6x boost on x86 and 2x boost on x64.
Evaluation perf traces showing significant reduction in the time spent in GetHashCode, saving 110 ms (~6%) per evaluation of the Unreal Engine C++ project on 64-bit.

Therzok · 2021-09-04T03:48:37Z

src/StringTools/InternableString.cs

@@ -304,28 +304,59 @@ public override string ToString()
        /// <returns>A stable hashcode of the string represented by this instance.</returns>
        public override unsafe int GetHashCode()
        {
-            int hashCode = 5381;
+            uint hash = (5381 << 16) + 5381;


slightly off-topic: Wouldn't caching this string's hashcode result in a net improvement?

ladipro · 2021-09-06

GetHashCode runs only once for a given string unless the caller calls SpanBasedStringBuilder.ToString() multiple times on the same instance without mutating it between the calls. As with calling StringBuilder.ToString() multiple times, this is technically possible but enough of an anti-pattern that the implementation does not cache the result.

More on running GetHashCode only once: When a string is added to the weak cache, its hash code is used as a key in a dictionary. So it's not recalculated by the dictionary on each lookup, because the lookup is done based on the hash code and not the string itself. When we're looking for a string in the weak cache, we calculate its hash code once per lookup and, as argued above, there should not be more than one lookup for the same string.
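The lookup pattern described above can be sketched as follows. This is a simplified, hypothetical model and not the actual weak cache in StringTools: the class name HashKeyedWeakCacheSketch, the Func<string, int> hash delegate, and the replace-on-collision policy are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the real StringTools weak cache.
public sealed class HashKeyedWeakCacheSketch
{
    // The dictionary key is the precomputed hash code, not the string itself,
    // so the dictionary never asks the string to hash itself again.
    private readonly Dictionary<int, WeakReference<string>> _entries =
        new Dictionary<int, WeakReference<string>>();

    private readonly Func<string, int> _hash;

    public HashKeyedWeakCacheSketch(Func<string, int> hash)
    {
        _hash = hash;
    }

    public string GetOrAdd(string candidate)
    {
        // The hash code is computed exactly once per lookup.
        int key = _hash(candidate);

        if (_entries.TryGetValue(key, out WeakReference<string> weak) &&
            weak.TryGetTarget(out string existing) &&
            string.Equals(existing, candidate, StringComparison.Ordinal))
        {
            return existing; // reuse the already interned instance
        }

        // Assumption of this sketch: a missing, collected, or colliding entry is replaced.
        _entries[key] = new WeakReference<string>(candidate);
        return candidate;
    }
}
```

Because the key is an int, caching the hash code inside the string wrapper would only pay off if the same instance were looked up repeatedly.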

rainersigwald

Looks reasonable to me but since hashing is scary just wanted to check on the algo and constants and things.

I also note in the linked impl that it's "when collisions aren't a problem", which . . . seems right for us but have you thought about that in depth?

ladipro · 2021-09-07T21:21:59Z

> Looks reasonable to me but since hashing is scary just wanted to check on the algo and constants and things.

Indeed, I have a major bug there because << is a bit shift and I need bit rotation 🤦‍♂️

The plan:

Fix the code to use BitOperations.RotateLeft.
Re-run the benchmark.
Fix the unit tests that explicitly use colliding strings.
Check for collisions on a representative set of strings as a sanity check.
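One way to see why shift versus rotate matters when two characters are packed into each 32-bit block is sketched below. The helper names are hypothetical, and the sketch only handles even-length strings to stay short. With a plain shift, bits only ever move upward, so the low 16 bits of the hash never see the second character of any pair. With a rotation, the top bits wrap around into the low bits.

```csharp
using System;

// Hypothetical illustration only; simplified to even-length strings.
public static class ShiftVersusRotateSketch
{
    private static uint RotateLeft(uint value, int count) =>
        (value << count) | (value >> (32 - count));

    // Hashes two characters per step; a trailing odd character is ignored in this sketch.
    public static uint Hash(string text, bool rotate)
    {
        uint hash = (5381 << 16) + 5381;
        for (int i = 0; i + 1 < text.Length; i += 2)
        {
            uint block = (uint)text[i] | ((uint)text[i + 1] << 16);
            // The only difference between the two variants is this line.
            uint spread = rotate ? RotateLeft(hash, 5) : hash << 5;
            hash = (spread + hash) ^ block;
        }
        return hash;
    }

    public static uint Low16(uint hash) => hash & 0xFFFF;
}
```

For example, "abcd" and "a\u4E2Dcd" differ only in the second character of the first pair. In the shift variant their hashes agree in the low 16 bits, while in the rotate variant the low 16 bits differ.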

> I also note in the linked impl that it's "when collisions aren't a problem", which . . . seems right for us but have you thought about that in depth?

This was discussed when the current implementation was being developed and we convinced ourselves that since you can ask MSBuild to wipe the drive or do anything the current user can do, we don't really worry about DoS'ing it with crafted strings. In other words, strings processed by MSBuild are not considered user input that we should be protecting against because the whole build process has to be trusted by design.

ladipro · 2021-09-13T19:58:11Z

The perf win on my VM is now lower: it was 4.5x and is now only 2x. I see the same thing without the new commits, so it's likely because the physical CPU hosting my VM is now different. I have updated the description with the new numbers.

rokonec · 2021-09-14T13:44:47Z

src/StringTools/InternableString.cs

+        /// <param name="hashedOddNumberOfCharacters">True if the incoming <paramref name="hash"/> was calculated from an odd number of characters.</param>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static unsafe void GetHashCodeHelper(char* charPtr, int length, ref uint hash, ref bool hashedOddNumberOfCharacters)
+        {


Please avoid using ref variables in tight loops. The JIT can't keep them in registers.
I recommend changing the signature to private static unsafe uint GetHashCodeHelper(char* charPtr, int length, uint hash, ref bool hashedOddNumberOfCharacters) and calling it as hash = GetHashCodeHelper(charPtr, span.Length, hash, ref hashedOddNumberOfCharacters);

In my micro benchmark, this simple change makes it about 2x faster.

ladipro · 2021-09-14

Wow, this must be the reason why it got slower after the last update. Confirming your results, it really is more than 2x faster after eliminating the ref parameter.

Forgind · 2021-09-14

Would it also be faster using out parameters, that is, having an input parameter and an out parameter that happen to match? Or maybe returning a tuple?

rokonec · 2021-09-14

@Forgind Using a non-ref local variable as the running hash in the tight loop and then copying it to an out parameter would most probably yield about the same benefit. However, returning an integer value from a procedure is highly optimized by calling conventions; in particular, the value is returned in the eax register. This is significantly faster than passing results back through stack memory, which is what both value tuples and out parameters do. By significant I mean about 1 us slower on modern CPUs, so in practice it rarely matters.
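The two shapes discussed in this thread can be sketched side by side. The names below are hypothetical and the loop is simplified to one character per step. The sketch only shows that both signatures compute the same value; the actual speed difference depends on the JIT and has to be measured.

```csharp
using System;

// Hypothetical illustration only; simplified to one character per mixing step.
public static class RunningHashSketch
{
    private static uint RotateLeft(uint value, int count) =>
        (value << count) | (value >> (32 - count));

    // Variant A: the running hash is a ref parameter, so every iteration
    // reads and writes through the reference.
    public static void MixByRef(string text, ref uint hash)
    {
        for (int i = 0; i < text.Length; i++)
        {
            hash = (RotateLeft(hash, 5) + hash) ^ text[i];
        }
    }

    // Variant B: the running hash is passed by value, updated as a local,
    // and handed back as the return value.
    public static uint MixByValue(string text, uint hash)
    {
        for (int i = 0; i < text.Length; i++)
        {
            hash = (RotateLeft(hash, 5) + hash) ^ text[i];
        }
        return hash;
    }
}
```

A caller of variant B writes hash = RunningHashSketch.MixByValue(text, hash); for each segment, which mirrors the call shape suggested in the review comment above.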

Optimize InternableString.GetHashCode (4.5x faster on Framework x64)

961a860

cdmihai reviewed Sep 4, 2021

Therzok reviewed Sep 4, 2021

ladipro added 2 commits September 6, 2021 15:47

Fix unit tests (hash code values are now different)

0c0b04d

PR feedback: Add MethodImplOptions.AggressiveInlining

184bf70

rainersigwald approved these changes Sep 7, 2021

Forgind reviewed Sep 10, 2021

ladipro and others added 4 commits September 13, 2021 13:00

Apply suggestions from code review

60e6082

Co-authored-by: Forgind <Forgind@users.noreply.github.com>

Rotate bits, don't just shift

68cb95e

Fix unit tests (hash code values are now different, again)

4adc629

PR feedback: Update comment

4d660e9

Forgind approved these changes Sep 13, 2021

Forgind added the merge-when-branch-open label Sep 13, 2021

AR-May merged commit 4ceb3f8 into dotnet:main Sep 14, 2021

rokonec reviewed Sep 14, 2021
