Better default for L3 cache size on win-arm64 and lin-arm64 #64645
Conversation
Tagging subscribers to this area: @dotnet/gc
(The full issue details are quoted at the end of this thread.)
This looks good for now, but based on your measurements, should we default to at least 4MB? @Maoni0?
logicalCPUs * Math.Min(1536, Math.Max(256, (int)logicalCPUs * 128)) * 1024 on a 64 proc machine would return 96mb.. that's way too large. Why are we doing the leading logicalCPUs * multiplication?
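Spelling that arithmetic out in a tiny standalone sketch (illustrative only, not runtime code):

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    // 64 procs: max(256, 64*128) = 8192; min(1536, 8192) = 1536 KB.
    // Scaled by core count: 64 * 1536 KB = 96 MB -- the value objected to above.
    int logicalCPUs = 64;
    long long cacheSize =
        (long long)logicalCPUs * std::min(1536, std::max(256, logicalCPUs * 128)) * 1024;
    printf("%lld bytes = %lld MB\n", cacheSize, cacheSize / (1024 * 1024));
    return 0;
}
```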
Added database-fortunes aspnet benchmark. Will run more.
I think it's just a heuristic like "if a cpu has more than 8 cores it's most likely something powerful"
Oops, in my formula in the issue I forgot about the leading logicalCPUs * multiplier.
@Maoni0 I've just changed the formula; the predicted cache size is now capped at 4Mb, so the max gen0 size is 7.5Mb (for systems with > 30 cores). Let me know if you want a smaller value.
I am currently running all the aspnet benchmarks we have in PerfLab via crank; so far the best (or rather optimal) results are when Gen0 is between 6Mb and 16Mb.
I'm not familiar with the logic here, so apologies in advance if this has already been considered... Why are we predicting the L3 size rather than just getting it from the OS? There is a range of hardware and configurations here, and as core counts and layouts increase there are a lot more interesting details than just "how much L3 exists". So it seems we are potentially missing loads of important information by not pulling the relevant info from the OS/hardware. For example, let's consider just two CPUs: the AMD Ryzen 9 3950X and the Ryzen 9 5950X.
The Ryzen CPUs are composed of CCX modules where each CCX has its own share of the cores, L1/L2/L3 cache, etc. The CCXs are technically distinct units and communicate with each other over the Infinity Fabric. While communicating over the Infinity Fabric is possible and fast, it's also slower than accessing resources on the same CCX. Likewise, while two separate cores on the same CCX can communicate, it is slower than accessing the resources that are directly meant for that core. And finally, hyperthreading works by basically splitting the resources of a single core in half, with each thread getting roughly half of the resources available to it, so this can be important to consider as well.

So while both of these CPUs provide 64MB of L3 cache and both have 16 cores / 32 threads, the performance and considerations for the L3 cache are quite a bit different. In both setups, each core has roughly 4MB of L3 to itself and each thread roughly 2MB. However, on the 3950X each core has access to an additional 12MB of L3 at a "medium" speed and the other 48MB of L3 over Infinity Fabric at an additional cost ("slow" speed). The 5950X, on the other hand, has access to 28MB of L3 in the same CCX at a "medium" speed and the other 32MB over Infinity Fabric at an additional cost ("slow" speed).

There have been several articles and deep dives on the Ryzen architecture, including https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5, which shows some of the profiled core-to-core latencies: same-core access is extremely fast (~6ns), accessing other cores on the same CCX is about 2-3x slower (~17ns), and accessing over Infinity Fabric is up to 4-5x slower than that (~80ns).
There are similar considerations in Intel's Alder Lake technology with its performance/efficiency core split, and in other upcoming CPUs like the Zen "3D" parts which will have up to 192MB of L3 accessible. With the introduction of the P/E-core split and CPUs with many cores/threads, there are also a lot of considerations that come into play around thread scheduling that I think it would be good for us to be considering and designing around. https://www.intel.com/content/www/us/en/developer/articles/guide/alder-lake-developer-guide.html goes a bit in depth on some of these. That specific article is somewhat game-focused, but many of the rules/guidelines are reiterated in the Intel and AMD optimization guides and are more generally applicable. It calls out a lot of things that I don't believe we are accounting for today: how caches are split/accessible by resources (called out above), or how hyper-threads share resources and so scheduling threads to the main thread of each core before scheduling to the secondary threads is important (some of which is expected to be handled by the OS, but which advanced usage scenarios may also take advantage of or provide additional hints around).
I agree that the L3 size alone is a questionable metric without additional context like how many cores share it, etc., but the current problem is that on Windows-ARM64 and Linux-ARM64 there is no way (that we're aware of) to get any information about L3 at all. E.g. on Windows, GetLogicalProcessorInformation only reports L1-L2 (the Windows team is helping us atm), and the same happens on Linux. On macOS we have everything we need from sysctl: L3 size, how many performance cores share it, etc.
Sorry, are you saying that GetLogicalProcessorInformation doesn't report L3 on ARM64? I'm notably not seeing the same on any of my 3 ARM64 devices (Surface Pro X, Samsung GalaxyBook2, or the Qualcomm ECS Liva dev box). A simple C++ app using that API reports the L3 cache info for me.
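For context, a probe along those lines might look like this (a minimal sketch using the plain GetLogicalProcessorInformation API mentioned above; it prints one entry per cache per group of cores):

```cpp
#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    // First call with a null buffer just asks for the required size.
    DWORD len = 0;
    GetLogicalProcessorInformation(nullptr, &len);
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(info.data(), &len))
        return 1;
    // Print every cache relationship the OS reports; on the machines discussed
    // in this thread, the L3 entries are missing entirely.
    for (const auto& e : info)
        if (e.Relationship == RelationCache)
            printf("L%u cache: %lu bytes\n", e.Cache.Level, (unsigned long)e.Cache.Size);
    return 0;
}
```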
Exactly, it doesn't report L3 on our Windows11-arm64 machines with lots-of-cores hardware; the Windows team is aware. So it's a reasonable workaround till we find a 100% reliable way to get the cache size or switch to some other method to calculate Gen0 size.
More aspnet/TechEmpower benchmarks from PerfLab (vertical axis: RPS or P90 latency in ms). So far the optimal results are between 6Mb and 16Mb for Gen0, so the 7.5Mb this PR proposes sounds like a good default while we're looking for a better solution. @Maoni0 does it look good now? Plaintext-MVC baseline vs this PR (tested binaries):
What is the actual L3 size on those machines? If it is 32 MiB, then our 5/8 factor may be inadequate even if we remove the 3x scaling. I am afraid you are optimizing for very specific hardware. To change the formula, we need to run tests on more than one type of hardware.
The machines have 32Mb of L3; the heuristic reports 4Mb, which results in 7.5Mb for Gen0 (the max possible size for this heuristic). For these benchmarks on this CPU it produces the best "RPS/working set size" ratio. It can be decreased down to 2Mb L3 (3.75Mb Gen0) without losing much benefit (~10%) if 7.5Mb is too much. This PR is not a scientific paper; it just tries to use a reasonable default which is much better than what we have now - 256Kb (480Kb gen0). It noticeably improves all GC-intensive benchmarks, even for desktop scenarios. I propose we merge it so we have better ground for the upcoming Preview 2; the L3 cache issue was found ~3 months ago.
This PR increases working set from ~170Mb to ~370Mb, while the same benchmark reports 440Mb on our Xeon. Values past 8Mb gen0 dramatically increase working set without much benefit (e.g. Gen0=28Mb == 1Gb of working set).
How do we know this formula change does not negatively affect other types of hardware? For instance, in the case of 8 cores the reported L3 size is changed by this PR from 8 MiB to just 1 MiB, which is a significant reduction. Your finding that the optimal Gen0 size is between 1/5 and 1/2 of the L3 size (instead of the currently used 3*5/8 factor) for this particular hardware is quite interesting; however, I think we should also test some other types of hardware before changing the general formula.
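To make those 8-core numbers concrete, here is a sketch of the before/after values (assuming, per the discussion above, that the updated heuristic drops the core-count scaling and caps the prediction at 4096 KB -- the shape that yields the 7.5 MB max Gen0 quoted earlier):

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    int logicalCPUs = 8;
    // Old fallback: scaled by core count -> 8 * 1024 KB = 8192 KB (8 MiB).
    long oldKB = (long)logicalCPUs * std::min(1536, std::max(256, logicalCPUs * 128));
    // Updated heuristic (assumed shape): no scaling, capped -> 1024 KB (1 MiB).
    long newKB = std::min(4096, std::max(256, logicalCPUs * 128));
    printf("old = %ld KB, new = %ld KB\n", oldKB, newKB);
    return 0;
}
```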
I think we currently never use that formula at all and just rely on whatever comes from the API, which is mostly something small (100% small on Linux-arm64).
Where is the logic for reading the cache hierarchy from sysfs? I only see logic that queries a single cache-size value.
I'd expect the logic to actually enumerate the per-CPU cache directories and read the level, size, and sharing information of each cache.
For reference, this is all documented here: https://github.com/torvalds/linux/blob/master/Documentation/ABI/testing/sysfs-devices-system-cpu
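A minimal sketch of what such sysfs enumeration could look like, assuming the standard layout from the document linked above (single CPU, cpu0, for brevity):

```cpp
#include <cstdio>
#include <fstream>
#include <string>

int main()
{
    // Walk /sys/devices/system/cpu/cpu0/cache/indexN until an index is missing.
    for (int i = 0; ; i++)
    {
        std::string dir = "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(i);
        std::ifstream level(dir + "/level");
        std::ifstream size(dir + "/size");
        if (!level || !size)
            break; // no more cache indices
        std::string lvl, sz;
        level >> lvl; // e.g. "3"
        size >> sz;   // e.g. "32768K"
        printf("L%s: %s\n", lvl.c_str(), sz.c_str());
    }
    return 0;
}
```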
The problem is that L3 is simply not reported there on the hardware we're targeting.
And we can't rely on it not being reported as the heuristic that says "do the fallback"? This is another case where on my own boxes (both WSL and directly running Linux natively -- Ubuntu 20.04.3 LTS) I am seeing the numbers accurately reported for ARM64.
Also noting that there are some chips, such as the Raspberry Pi, which have no L3, and assuming one exists may also be incorrect/suboptimal.
Interesting, what kind of hardware do you use for it? Also, if it reports L3 correctly then the heuristic won't be used; it is highly unlikely its value will be bigger than the real one (maybe +/- 0.5Mb).
Let's not keep the current value for the sake of the Raspberry Pi ;-)
Raspberry Pi (booting Ubuntu) - reports no L3, because it doesn't have an L3.
Thanks for the data. I assume all of them (except the Pi) use popular Qualcomm chips where the cache is reported via a special register accessible by the kernel; I even have a snippet somewhere with raw arm asm. Meanwhile, we mostly care about custom server/cloud hardware in this issue. The heuristic won't hurt any of the devices you listed - if L3 is reported correctly then it will be bigger than what the heuristic predicts.
@EgorBo Have you been testing server GC only? I am wondering whether the optimal range for workstation GC might be different.
@EgorBo does this need more thought or is it ready to merge?
cacheSize = logicalCPUs * std::min(1536, std::max(256, (int)logicalCPUs * 128)) * 1024;
}
// It is currently expected to be missing cache size info
A lot of information in this comment is no longer relevant. Could you please update this comment to only include what is still relevant?
Closing since, apparently, this was overtaken by #71029.
While we're trying to address the L3 cache issue (#60166) for both Win-arm64 and Linux-arm64 (osx-arm64 is fine already #64576), I think it makes sense to at least use the existing heuristic as a good default (based on the logical core count) if it's bigger than what we found (e.g. L2). This heuristic looks like this:
int predictedCacheSize = Math.Min(1536, Math.Max(256, (int)logicalCPUs * 128)) * 1024;
(it's not mine, it existed in the code for cases when we can't even get any cache info at all)
Same heuristic but visualized:

It doesn't predict 32Mb for our 30-core eMAG, but it's better than the current default; it won't hurt small CPUs, and it won't report something gigantic for some 80-core CPU, etc.
Gen0 size is currently calculated like this:
(L3size * 3) * 5 / 8
where 5/8 is a general heuristic and the * 3 is arm64-specific; see runtime/src/coreclr/vm/gcenv.os.cpp lines 649 to 652 in 51056be.
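To make the arithmetic concrete, a standalone sketch combining the heuristic above with this Gen0 formula (illustrative only, not the actual runtime code path):

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    int logicalCPUs = 12; // any count >= 12 hits the 1536 KB cap
    // Heuristic from above: predicted cache size in bytes.
    int predictedCacheSize = std::min(1536, std::max(256, logicalCPUs * 128)) * 1024;
    // Gen0 budget: (L3size * 3) * 5 / 8.
    long gen0 = (long)predictedCacheSize * 3 * 5 / 8;
    // 1536 KB predicted -> 2880 KB, i.e. the "~2.8Mb Gen0" green line quoted below.
    printf("cache = %d KB, gen0 = %ld KB\n", predictedCacheSize / 1024, gen0 / 1024);
    return 0;
}
```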
Also, here is a graph for Gen0 size -- RPS for Plaintext-MVC benchmark (the most GC-bound we have in PerfLab currently):

The red line is where we're now: 256Kb L3 -> 480Kb Gen0 -> 380k RPS.
The green line is what we'll have with this heuristic: 1.5Mb L3 -> ~2.8Mb Gen0 -> 920k RPS
For this specific benchmark the best RPS (1086k RPS) corresponds to ~16Mb Gen0 (the L3 cache function should report something between 12Mb and 16Mb).
database-fortunes benchmark:

I also ran a couple of simple micro-benchmarks locally on a workstation GC, and it seems like performance gets steady once gen0 is at least 4Mb.
@Maoni0 @mangod9 @jkotas