Set a fixed number of worker threads in debug mode to make race conditions more reproducible #72577
Conversation
I'm not sure this is the right way to cause "multi-threading stress." The idea may have merit, but enabling something like that by default, even if dev-only, can also hinder debugging when one is interested in the default behavior, which is the case most of the time. I'd suggest, in case there's consensus among core & rendering developers (so maybe a proposal about this should be opened), that there's a new build setting that enables a define (say, …).
Yeah, I agree. There's a similar change in https://github.com/godotengine/godot/pull/72346/files#diff-f34c21fb879df1303b79da30fb0cbb9801e17cb0dbbb9e429249603e50d60603R102 which is also only intended for stress testing of the implementation, and shouldn't force all contributors to pay that price when working on unrelated areas. Usually for this kind of thing we have custom defines in the file which can be enabled manually (e.g. …). If that's too cumbersome or not applicable for the purpose here (automated benchmarking and regression testing?), then we could indeed have another SCons compile option that enables one define common to all these automated testing code branches.
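The manual-define pattern described above could look roughly like the following sketch. All identifiers here (`DEBUG_STRESS_WORKER_THREADS`, `get_worker_thread_count`) are hypothetical illustrations, not actual Godot code; the point is that the stress-testing override compiles out entirely unless enabled by hand or via a build flag:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical define: uncomment, or pass -DDEBUG_STRESS_WORKER_THREADS
// via a build option, to enable the stress-testing behavior.
// #define DEBUG_STRESS_WORKER_THREADS

static uint32_t get_worker_thread_count(uint32_t hardware_threads) {
#ifdef DEBUG_STRESS_WORKER_THREADS
	// Force a fixed, high thread count so scheduling varies less between
	// machines and race conditions surface more often.
	return 16;
#else
	// Default behavior: match the hardware, with a floor of one thread.
	return std::max<uint32_t>(1, hardware_threads);
#endif
}
```

With the define disabled, contributors working on unrelated areas see the default hardware-matched behavior and pay no cost.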
Additionally, this is just trading one issue for another: this code would make it so that only the multithreaded codepath gets tested, while the single-threaded codepath would almost never run.
For context, the following code block from `RendererSceneCull`:

```cpp
void RendererSceneCull::_scene_cull_threaded(uint32_t p_thread, CullData *cull_data) {
	uint32_t cull_total = cull_data->scenario->instance_data.size();
	uint32_t total_threads = WorkerThreadPool::get_singleton()->get_thread_count();
	uint32_t cull_from = p_thread * cull_total / total_threads;
	uint32_t cull_to = (p_thread + 1 == total_threads) ? cull_total : ((p_thread + 1) * cull_total / total_threads);

	_scene_cull(*cull_data, scene_cull_result_threads[p_thread], cull_from, cull_to);
}

...

if (cull_to > thread_cull_threshold) {
	//multiple threads
	for (InstanceCullResult &thread : scene_cull_result_threads) {
		thread.clear();
	}
	WorkerThreadPool::GroupID group_task = WorkerThreadPool::get_singleton()->add_template_group_task(this, &RendererSceneCull::_scene_cull_threaded, &cull_data, scene_cull_result_threads.size(), -1, true, SNAME("RenderCullInstances"));
	WorkerThreadPool::get_singleton()->wait_for_group_task_completion(group_task);
	for (InstanceCullResult &thread : scene_cull_result_threads) {
		scene_cull_result.append_from(thread);
	}
} else {
	//single threaded
	_scene_cull(cull_data, scene_cull_result, cull_from, cull_to);
}
```

Essentially, a heuristic determines whether to just use the single-threaded codepath, or to use … Right now, …
In my testing, the overhead of context switching due to thread oversubscription appears to be minimal. I'm seeing a 5.3% slowdown from running 64 threads on my 2c4t computer compared to 4 threads; see the average frame times in milliseconds below:

```diff
--- /dev/fd/63	2023-02-02 19:45:13.387513929 -0500
+++ /dev/fd/62	2023-02-02 19:45:13.390847272 -0500
@@ -79,13 +79,13 @@
 "category": "Rendering > Lights And Meshes",
 "name": "Omni 100",
 "results": {
 "idle": 0,
 "physics": 0,
-"render_cpu": 107.37,
-"render_gpu": 17.14,
+"render_cpu": 113.12,
+"render_gpu": 17.67,
 }
 },
 {
 "category": "Rendering > Lights And Meshes",
 "name": "Speed Fast",
```

Given the reproducibility benefits of having this enabled by default for everyone, I'd argue that 5.3% is a very small price to pay.
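The oversubscription scenario being benchmarked, launching far more threads than there are hardware cores and letting the OS time-slice them, can be sketched as below. This is illustrative only (`run_oversubscribed` is a hypothetical helper; the 5.3% figure comes from the benchmark diff above, not from this snippet):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Spawn `num_threads` threads regardless of the hardware core count
// (oversubscription), run a trivial stand-in task on each, and return
// the number of tasks that completed. The OS scheduler time-slices the
// excess threads across the available cores.
int run_oversubscribed(int num_threads) {
	std::atomic<int> done{0};
	std::vector<std::thread> pool;
	pool.reserve(num_threads);
	for (int i = 0; i < num_threads; i++) {
		pool.emplace_back([&done] {
			// Trivial stand-in for a per-thread cull task.
			done.fetch_add(1, std::memory_order_relaxed);
		});
	}
	for (std::thread &t : pool) {
		t.join();
	}
	return done.load();
}
```

All 64 threads complete correctly even on a 4-thread machine; oversubscription costs some context-switching overhead but is functionally safe.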
I've changed the implementation so that …
@RandomShaper See #72689
After some more testing and deliberation, I've reduced the default number of worker threads from 64 to 16 to be more in line with typical systems, according to Steam's Hardware Survey (80% of users fall between 4c8t and 8c16t). The theory is that if a performance drop is observed in a piece of multithreaded code, it probably has limited scalability, and 16 is a reasonable standard to use for testing that.
Makes multithreading behavior more consistent, see here for an example of an issue that was made more difficult to reproduce due to differing behavior between machines with different core counts.
Having a higher number of worker threads by default also makes race condition bugs easier for ordinary users to detect, at almost no overhead and without having to compile a custom build with ThreadSanitizer enabled. This is especially helpful on low-core-count systems, where such bugs would otherwise come up much less often, since race conditions manifest more frequently as the number of threads increases.
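To illustrate why more threads surface races sooner: many threads hammering shared state widen the window for lost updates. The sketch below (hypothetical demo code, not Godot's) uses `std::atomic` so the count is correct; replacing the atomic with a plain `long long` would intermittently lose increments, and it would do so far more often with 16 threads than with 2:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Increment a shared counter from `num_threads` threads, `iterations`
// times each. With std::atomic the result is always exact; with an
// unsynchronized plain integer, concurrent read-modify-write cycles
// would race and the final count would fall short nondeterministically.
long long hammer_counter(int num_threads, int iterations) {
	std::atomic<long long> counter{0};
	std::vector<std::thread> pool;
	pool.reserve(num_threads);
	for (int i = 0; i < num_threads; i++) {
		pool.emplace_back([&counter, iterations] {
			for (int j = 0; j < iterations; j++) {
				counter.fetch_add(1, std::memory_order_relaxed);
			}
		});
	}
	for (std::thread &t : pool) {
		t.join();
	}
	return counter.load();
}
```

This is the same effect ThreadSanitizer catches analytically; raising the default thread count just makes the buggy interleavings statistically likelier in ordinary use.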
Note that this does not affect any decisions on when to take the single- vs. multithreaded paths, and that the number of worker threads can still be overridden with `threading/worker_pool/max_threads` when necessary.

Also note that this PR has been edited to incorporate feedback and address concerns, so most points raised below are outdated.