Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Use a hash table for HW intrinsic name -> ID lookups #103763

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions src/coreclr/jit/compiler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#include "lower.h"
#include "stacklevelsetter.h"
#include "patchpointinfo.h"
#include "hwintrinsic.h"
#include "jitstd/algorithm.h"

extern ICorJitHost* g_jitHost;
Expand Down Expand Up @@ -1393,6 +1394,8 @@ void Compiler::compShutdown()
DisplayNowayAssertMap();
#endif // MEASURE_NOWAY

HWIntrinsicInfo::onJitShutdown();

/* Shut down the emitter */

emitter::emitDone();
Expand Down
142 changes: 117 additions & 25 deletions src/coreclr/jit/hwintrinsic.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -469,6 +469,113 @@ CorInfoType Compiler::getBaseJitTypeFromArgIfNeeded(NamedIntrinsic intrins
return (diffInsCount >= 2);
}

struct IntrinsicKey
{
CORINFO_InstructionSet Isa;
int NumArgs;
const char* Name;

IntrinsicKey(CORINFO_InstructionSet isa, int numArgs, const char* name)
: Isa(isa)
, NumArgs(numArgs)
, Name(name)
{
}
};

struct IntrinsicKeyFuncs
{
static bool Equals(const IntrinsicKey& lhs, const IntrinsicKey& rhs)
{
return (lhs.Isa == rhs.Isa) && (lhs.NumArgs == rhs.NumArgs) && (strcmp(lhs.Name, rhs.Name) == 0);
}

static int GetHashCode(const IntrinsicKey& key)
{
uint32_t hash = static_cast<uint32_t>(key.Isa);
hash ^= key.NumArgs + 0x9e3779b9 + (hash << 19) + (hash >> 13);
hash ^= HashStringA(key.Name) + 0x9e3779b9 + (hash << 19) + (hash >> 13);
return static_cast<int>(hash);
}
};

struct MallocAllocator
{
template <typename T>
T* allocate(size_t count)
{
void* p = malloc(count * sizeof(T));
if (p == nullptr)
{
NOMEM();
}

return static_cast<T*>(p);
}

void deallocate(void* p)
{
free(p);
}
};

typedef JitHashTable<IntrinsicKey, IntrinsicKeyFuncs, NamedIntrinsic, MallocAllocator> IntrinsicsHashTable;

static IntrinsicsHashTable* volatile s_intrinsicsHashTable;

//------------------------------------------------------------------------
// getIntrinsicsHashTable: Get the hash table for name -> ID lookups.
//
// Return Value:
// Hash table.
//
static IntrinsicsHashTable* getIntrinsicsHashTable()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to measure the cost of this as compared to the binary search...

Whenever we do a lookup the first thing we do is lookupIsa and then early exit if there's no chance it could be a hwintrinsic. This is doing some unnecessary work by not checking for the common case of enclosingClassName == nullptr first which should make the early exit path much faster. It then could likewise optimize the class name comparison a bit more, but it's already doing some early exit checks here.

This is helped by the fact that lookupId is only called for classes in the System.Runtime.Intrinsics and System.Runtime.Intrinsics.X86 namespace, but we could pass down which of these it is to help filter the queries a bit more as well.

After we get the ISA there's some up front checks around IsHardwareAccelerated and IsSupported that could be handled better, ideally isolating them to be handled after the main lookup instead.

The main lookup then simpler iterates through all possible IDs from NI_HWINTRINSIC_START to NI_HWINTRINSIC_END and this is doing the bulk of the unnecessary work. Since we already have the ISA at this point we should be able to have a very small static table that is just the first/last intrinsic entry per CORINFO_InstructionSet, this should particularly help since the longest range is 129 intrinsics, but the entire set is around 1192 intrinsics (for xarch at least, it's 270 out of 870 for Arm64 and growing rapidly due to SVE).

Within a given ISA, the intrinsics are already almost ordered alphabetically and it'd only require a little bit of extra work to guarantee that and assert it remains true over time, so we could bisect over the first character to get down to no more than 8 comparisons to find the exact entry.

I imagine that this would be overall cheaper and less memory intensive than having to allocate a hash table for all intrinsics and hash the strings every single time (which will themselves require 1-9 steps to compute, assuming 32-bit hashes, since the intrinsic names are between 2 and around 40 characters or so)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already looked at the profile (see my original comment) and we spend minimal time in lookupId now, so I'm not sure I see the need to do anything more complicated. I doubt a few hash computations for every intrinsic we recognize is going to have any measurable impact.

If you want to work on this feel free to commit to this PR or just open an alternative PR. I'm fine with closing this PR if you'd prefer doing this in the alternative way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put up a PR here: #103778

I think it would be worth comparing the JIT throughput numbers of both approaches here and just picking whatever we think is overall better long term (across perf, complexity, memory overhead, and guarantees they provide).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like both PRs have nearly identical TP characteristics. 103778 has a slight edge on x64 while this PR has one on Arm64.

This PR has the downside in that increases memory overhead and requires cross threading considerations for correctness. The benefit is that it is overall simpler code to implement and get correct.

Inversely, 103778 has the downside in that it requires the hwintrinsic table to stay sorted. But, does enable more interesting downstream considerations once that guarantee is in place.

I don't have a preference for which approach we use, but I do think there's some parts from 103778 that would be good to take if we opted for this PR to be the one to go in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is the binary searching solution. I do not think that maintaining the tables sorted is significant burden. Less cycles spent during startup and consuming less memory is general goodness.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me, let's do the binary search one.

{
if (s_intrinsicsHashTable != nullptr)
{
return s_intrinsicsHashTable;
}

MallocAllocator mallocator;
char* mem = mallocator.allocate<char>(sizeof(IntrinsicsHashTable));

IntrinsicsHashTable* hashTable = new (mem, jitstd::placement_t()) IntrinsicsHashTable(mallocator);

hashTable->Reallocate(NI_HW_INTRINSIC_END - NI_HW_INTRINSIC_START - 1);

for (int i = 0; i < (NI_HW_INTRINSIC_END - NI_HW_INTRINSIC_START - 1); i++)
{
const HWIntrinsicInfo& intrinsicInfo = hwIntrinsicInfoArray[i];
hashTable->Set(IntrinsicKey(static_cast<CORINFO_InstructionSet>(intrinsicInfo.isa), intrinsicInfo.numArgs,
intrinsicInfo.name),
intrinsicInfo.id);
}

IntrinsicsHashTable* observed = InterlockedCompareExchangeT(&s_intrinsicsHashTable, hashTable, nullptr);
if (observed != nullptr)
{
hashTable->~IntrinsicsHashTable();
mallocator.deallocate(hashTable);
return observed;
}

return hashTable;
}

//------------------------------------------------------------------------
// onJitShutdown: Free the intrinsics hash table on JIT shutdown.
//
void HWIntrinsicInfo::onJitShutdown()
{
IntrinsicsHashTable* hashTable = s_intrinsicsHashTable;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other threads can be still JITing methods. This will lead to intermittent crashes during shutdown. The shutdown callback is only a notification that we are about to shutdown. It does not stop other threads running code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We just leak the memory in cases like these instead of trying to cleanup.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you can create the hashtable at build time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I can just remove the free given that we expect the shutdown to happen during process shutdown anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you can create the hashtable at build time.

Is there some prior art to this? Ideally this is what we would do, but it seemed like it would require a lot of infrastructure.

Copy link
Member

@am11 am11 Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have https://gist.github.com/am11/22eaa6584de55483d988d9831899bcd3 (which I was going to use for src/native/minipal/UnicodeDataGenerator but haven't so far as it increases the binary size by ~33K for ~2300 three-values records). Note that it's a perfect hash function (not a minimal perfect hash function MPHF) based on chm3 algorithm implementation of NetBSD nbperf (under 2-clause BSD License). The C# code generates a standalone C file with a lookup function. I can try to wire it up for the named intrinsic lookup.

if (hashTable != nullptr)
{
hashTable->~IntrinsicsHashTable();
free(hashTable);
s_intrinsicsHashTable = nullptr;
}
}

//------------------------------------------------------------------------
// lookupId: Gets the NamedIntrinsic for a given method name and InstructionSet
//
Expand Down Expand Up @@ -621,35 +728,20 @@ NamedIntrinsic HWIntrinsicInfo::lookupId(Compiler* comp,
}
#endif

for (int i = 0; i < (NI_HW_INTRINSIC_END - NI_HW_INTRINSIC_START - 1); i++)
IntrinsicsHashTable* hashTable = getIntrinsicsHashTable();
NamedIntrinsic ni;
if (hashTable->Lookup(IntrinsicKey(isa, sig->numArgs, methodName), &ni) ||
hashTable->Lookup(IntrinsicKey(isa, -1, methodName), &ni))
{
const HWIntrinsicInfo& intrinsicInfo = hwIntrinsicInfoArray[i];

if (isa != hwIntrinsicInfoArray[i].isa)
{
continue;
}

int numArgs = static_cast<unsigned>(intrinsicInfo.numArgs);

if ((numArgs != -1) && (sig->numArgs != static_cast<unsigned>(intrinsicInfo.numArgs)))
#if defined(TARGET_XARCH)
// on AVX1-only CPUs we only support a subset of intrinsics in Vector256
if (isLimitedVector256Isa && !AvxOnlyCompatible(ni))
{
continue;
return NI_Illegal;
}

if (strcmp(methodName, intrinsicInfo.name) == 0)
{
NamedIntrinsic ni = intrinsicInfo.id;

#if defined(TARGET_XARCH)
// on AVX1-only CPUs we only support a subset of intrinsics in Vector256
if (isLimitedVector256Isa && !AvxOnlyCompatible(ni))
{
return NI_Illegal;
}
#endif
return ni;
}

return ni;
}

// There are several helper intrinsics that are implemented in managed code
Expand Down
2 changes: 2 additions & 0 deletions src/coreclr/jit/hwintrinsic.h
Original file line number Diff line number Diff line change
Expand Up @@ -508,6 +508,8 @@ struct HWIntrinsicInfo
int8_t numArgs; // 1-byte
HWIntrinsicCategory category; // 1-byte

static void onJitShutdown();

static const HWIntrinsicInfo& lookup(NamedIntrinsic id);

static NamedIntrinsic lookupId(Compiler* comp,
Expand Down
Loading