Optimize write barriers for server gc on arm64 #106934
Tagging subscribers to this area: @mangod9 |
Looks like we should rather port the Regions-specific write barriers to ARM64 (#67389 did it only for x64).
@dotnet/gc is porting |
we have not scheduled any work item for it. if you have cycles, you are of course welcome to work on it. |
It doesn't look too hard to implement to me, so I might try. Closing this PR since we're not going to need it if we implement the Regions-specific WBs.
Implementing precise WB was the initial impetus around the write barrier investigation recently. But that showed that there might be other improvements which are required as well. Does your recent experimentation show that might not be the case? |
I got the impression that the investigation was about raw performance of the current write barrier rather than making it more precise. Correct me if I am wrong, but the region-based write barrier can technically be a bit slower. E.g. I see that the x64 PR had a lot of regressions in microbenchmarks attached (while making good GC pause improvements for real-world apps).
Yeah, the investigation was about improving overall arm64 WB perf, since the measurements were showing that it's apparently an order of magnitude slower than x64. Hoping to have a concentrated effort to improve it in 10, but the challenge is whether its perf would have a material impact on real-world scenarios.
I'm confused - the bit/byte versions are the regions specific WBs. they do not apply to segments. last I heard the WB perf on arm is still significantly slower than on x64 so it'd be good to figure that out before implementing the bit/byte versions. |
Should we spend effort on making segments faster? My main motivation is to improve Linux-arm64 on big 1P apps, which, presumably, use Regions.
I think there are two things currently:
|
Are arm64 jump stubs as efficient as possible on modern arm64 hardware? |
Can't say for sure, but I had pretty confusing results recently. I compared two benchmarks: one where we were able to preallocate the loader heap next to libcoreclr (hence, direct calls without jump stubs) and the default with a jump stub. For some reason the direct call was notably slower 😕
adding @janvorli here as well, since he had done some experimentation around jump stub perf during W^X work I think. Phoebe's results also showed a few anomalous results, so there seems to be a question over accuracy of perf tooling on arm64. |
Introduce Server GC-specific write barriers on arm64. The only difference from the existing barriers is that the "is gen0" check is removed, since Server GC has multiple heaps and this check doesn't work there. This is copied from x64, which has JIT_WriteBarrier_SVR64.
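To illustrate the difference described above, here is a minimal C sketch of a card-marking write barrier with and without the ephemeral ("is gen0") range check. It is not the actual arm64 assembly from this PR: the globals, the card shift of 11, the toy card table size, and the function names are all assumptions made for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for GC globals; names and sizes are illustrative. */
#define CARD_SHIFT 11            /* assumed: one card byte covers 2 KB */
#define CARD_COUNT 1024          /* toy table size for this sketch only */

uintptr_t g_ephemeral_low;       /* assumed start of the young range */
uintptr_t g_ephemeral_high;      /* assumed end of the young range (exclusive) */
uint8_t   g_card_table[CARD_COUNT];

/* Map a slot address to a card index. The modulo only keeps the toy table
   small; a real barrier would index a pre-offset table without it. */
size_t card_index(void **dst) {
    return (size_t)(((uintptr_t)dst >> CARD_SHIFT) % CARD_COUNT);
}

/* Mark the card covering dst, skipping the store if it is already set. */
void mark_card(void **dst) {
    uint8_t *card = &g_card_table[card_index(dst)];
    if (*card != 0xFF) {
        *card = 0xFF;
    }
}

/* Barrier with the "is gen0" check: the card is marked only when the
   stored reference points into the single ephemeral range. */
void write_barrier_wks(void **dst, void *ref) {
    *dst = ref;
    uintptr_t r = (uintptr_t)ref;
    if (r < g_ephemeral_low || r >= g_ephemeral_high) {
        return;                  /* not a young object: no card needed */
    }
    mark_card(dst);
}

/* Server-style barrier: with multiple heaps there is no single ephemeral
   range to compare against, so the check is dropped and the card is
   always marked. */
void write_barrier_svr(void **dst, void *ref) {
    *dst = ref;
    mark_card(dst);
}
```

The trade-off in this sketch is that the server-style barrier does less work per store but marks more cards, which the GC then has to scan.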
I'll add Windows and NAOT support if this is approved as the right thing to do.