
Set a fixed number of worker threads in debug mode to make race conditions more reproducible #72577

Closed
myaaaaaaaaa wants to merge 1 commit

Conversation

@myaaaaaaaaa (Contributor) commented Feb 2, 2023

Makes multithreading behavior more consistent; see here for an example of an issue that was made harder to reproduce by behavior that differed between machines with different core counts.

Having a higher number of threads by default also makes race condition bugs easier for normal users to detect, at almost no overhead and without having to resort to compiling a custom build with ThreadSanitizer enabled. This is especially helpful on low-core-count systems, where such bugs come up much less often under the default behavior, since race conditions manifest more often as the number of threads increases.

Note that this does not affect any decisions on when to take the single- vs. multithreaded paths, and that the number of worker threads can still be overridden with threading/worker_pool/max_threads when necessary.

Also note that this has been edited to incorporate feedback and address concerns, so most points raised below are outdated.

@myaaaaaaaaa requested review from a team as code owners February 2, 2023 08:48
@RandomShaper (Member)
I'm not sure this is the right way to cause "multi-threading stress."

The idea may have merit, but enabling something like that by default, even if dev-only, can also hinder debugging when one is interested in the default behavior, which is the case most of the time.

I'd suggest, in case there's consensus among core & rendering developers (so maybe a proposal about this should be opened), that there be a new build setting that enables a define (say, MT_STRESS_ENABLED) which tweaks defaults in such a way, maybe even covering other areas of the engine. Or maybe a new command-line switch that, again, is only available in dev builds and lets you change the setting without rebuilding. Now that I think of it, a setting that does the very opposite (restricting the worker pool to one thread) may also make sense. At least, I've often found myself doing so manually while debugging some piece of code that would otherwise be running 16 times simultaneously.

@akien-mga (Member)

Yeah, I agree; there's a similar change in https://github.com/godotengine/godot/pull/72346/files#diff-f34c21fb879df1303b79da30fb0cbb9801e17cb0dbbb9e429249603e50d60603R102, which is also only intended for stress testing of the implementation and shouldn't force all contributors to pay that price when working on unrelated areas.

Usually for this kind of stuff we have custom defines in the file which can be enabled manually (e.g. //#define STRESS_TEST, uncomment before compiling to run that logic).
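
A minimal sketch of that pattern (the define name and thread counts here are illustrative, not from the engine source):

```cpp
//#define STRESS_TEST // Uncomment before compiling to enable the stress-test path.

#include <cstdio>

#ifdef STRESS_TEST
static const int WORKER_THREADS = 64; // Oversubscribe to make races more likely.
#else
static const int WORKER_THREADS = -1; // -1: fall back to the system core count.
#endif

int main() {
	printf("worker threads: %d\n", WORKER_THREADS);
	return 0;
}
```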

If that's too cumbersome / not applicable for the purpose here (automated benchmarking and regression testing?), then we could indeed have another SCons compile option that enables one define common to all these automated testing code branches.

@clayjohn removed this from the 4.0 milestone Feb 2, 2023
@clayjohn (Member) commented Feb 2, 2023

Additionally, this just trades one issue for another: with this change, only the multithreaded codepath would be tested, and the single-threaded codepath would almost never run.

@myaaaaaaaaa (Contributor, author) commented Feb 3, 2023

For context, the following code block from servers/rendering/renderer_scene_cull.cpp illustrates how WorkerThreadPool is used in Godot in practice:

```cpp
void RendererSceneCull::_scene_cull_threaded(uint32_t p_thread, CullData *cull_data) {
	uint32_t cull_total = cull_data->scenario->instance_data.size();
	uint32_t total_threads = WorkerThreadPool::get_singleton()->get_thread_count();
	uint32_t cull_from = p_thread * cull_total / total_threads;
	uint32_t cull_to = (p_thread + 1 == total_threads) ? cull_total : ((p_thread + 1) * cull_total / total_threads);

	_scene_cull(*cull_data, scene_cull_result_threads[p_thread], cull_from, cull_to);
}
```

...

```cpp
	if (cull_to > thread_cull_threshold) {
		//multiple threads
		for (InstanceCullResult &thread : scene_cull_result_threads) {
			thread.clear();
		}

		WorkerThreadPool::GroupID group_task = WorkerThreadPool::get_singleton()->add_template_group_task(this, &RendererSceneCull::_scene_cull_threaded, &cull_data, scene_cull_result_threads.size(), -1, true, SNAME("RenderCullInstances"));
		WorkerThreadPool::get_singleton()->wait_for_group_task_completion(group_task);

		for (InstanceCullResult &thread : scene_cull_result_threads) {
			scene_cull_result.append_from(thread);
		}

	} else {
		//single threaded
		_scene_cull(cull_data, scene_cull_result, cull_from, cull_to);
	}
```

Essentially, a heuristic determines whether to just use the single-threaded codepath, or to use WorkerThreadPool to run multiple tasks that each call the single-threaded codepath on a different subset of the workload. Of course, it would be best if WorkerThreadPool had a function that could handle this automatically,[^1] but that's a separate issue.
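
As a sketch of what such a function might look like (hypothetical: run_maybe_parallel, RangeJob, and _run_slice are illustrative names, not engine API, although the WorkerThreadPool calls themselves are real):

```cpp
#include "core/object/worker_thread_pool.h"

// Job description shared with the worker tasks (hypothetical helper type).
struct RangeJob {
	uint32_t total = 0;
	uint32_t threads = 0;
	void (*work)(uint32_t p_from, uint32_t p_to) = nullptr;
};

// Each task computes its own slice, mirroring _scene_cull_threaded() above.
static void _run_slice(void *p_userdata, uint32_t p_thread) {
	RangeJob *job = (RangeJob *)p_userdata;
	uint32_t from = p_thread * job->total / job->threads;
	uint32_t to = (p_thread + 1 == job->threads) ? job->total : ((p_thread + 1) * job->total / job->threads);
	job->work(from, to);
}

static void run_maybe_parallel(uint32_t p_total, uint32_t p_threshold, void (*p_work)(uint32_t, uint32_t)) {
	WorkerThreadPool *pool = WorkerThreadPool::get_singleton();
	if (p_total <= p_threshold || pool->get_thread_count() <= 1) {
		p_work(0, p_total); // Below the threshold: stay on the calling thread.
		return;
	}
	RangeJob job = { p_total, (uint32_t)pool->get_thread_count(), p_work };
	WorkerThreadPool::GroupID gid = pool->add_native_group_task(&_run_slice, &job, job.threads);
	pool->wait_for_group_task_completion(gid);
}
```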

Right now, WorkerThreadPool only ever runs when the multithreaded codepath is desired, and it defaults to the number of threads on the system. The idea behind increasing this default is that any bugs caused by multithreading become easier for normal users to find and reproduce,[^2] since multithreading bugs are more likely to surface as the number of threads increases.[^3]
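
As a self-contained illustration of that last point (generic C++, not Godot code): a deliberately unsynchronized counter loses more updates as the thread count grows, so the bug reproduces more readily.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

int counter = 0; // Intentionally unsynchronized: this is the data race.

int main() {
	const int threads = 16; // Raising this makes lost updates more likely.
	std::vector<std::thread> pool;
	for (int t = 0; t < threads; t++) {
		pool.emplace_back([] {
			for (int i = 0; i < 100000; i++) {
				counter++; // Unprotected read-modify-write.
			}
		});
	}
	for (std::thread &t : pool) {
		t.join();
	}
	printf("expected %d, got %d\n", threads * 100000, counter);
	return 0;
}
```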

> Usually for this kind of stuff we have custom defines in the file which can be enabled manually (e.g. //#define STRESS_TEST, uncomment before compiling to run that logic).

In my testing, the overhead of context switching due to thread oversubscription appears to be minimal: I'm seeing a 5.3% slowdown running 64 threads on my 2c4t computer compared to 4 threads. See the average frame times in milliseconds below:

```diff
--- /dev/fd/63	2023-02-02 19:45:13.387513929 -0500
+++ /dev/fd/62	2023-02-02 19:45:13.390847272 -0500
@@ -79,13 +79,13 @@
       "category": "Rendering > Lights And Meshes",
       "name": "Omni 100",
       "results": {
         "idle": 0,
         "physics": 0,
-        "render_cpu": 107.37,
-        "render_gpu": 17.14,
+        "render_cpu": 113.12,
+        "render_gpu": 17.67,
       }
     },
     {
       "category": "Rendering > Lights And Meshes",
       "name": "Speed Fast",

Given the reproducibility benefits of having this enabled by default for everyone, I'd argue that 5.3% is a very small price to pay.

> Now that I think of it, a setting that does the very opposite (restricting the worker pool to one thread) may also make sense. At least, I've often found myself doing so manually while debugging some piece of code that would otherwise be running 16 times simultaneously.

I've changed the implementation so that threading/worker_pool/max_threads takes priority if it's set; does this work for you?

Footnotes

[^1]: Ideally with a better way of determining whether to use the single-threaded or multithreaded codepath than adding a project setting every time WorkerThreadPool is used, along with a project setting to always force single or multiple threads.

[^2]: In other words, it should actually be enabled in all debug builds rather than just developer builds; I've updated the PR to reflect this.

[^3]: As an example, the issues Vulkan: Random meshes vanish in large scenes with lights that have a projector texture #68274 and Vulkan: Models/surfaces randomly stop rendering if too many lights are present #70406 turned from visual glitches into outright crashes for me once this patch was enabled.

@Chaosus added this to the 4.x milestone Feb 3, 2023
@myaaaaaaaaa (Contributor, author)

> Or maybe a new command-line switch that, again, is only available in dev builds and lets you change the setting without rebuilding. Now that I think of it, a setting that does the very opposite (restricting the worker pool to one thread) may also make sense. At least, I've often found myself doing so manually while debugging some piece of code that would otherwise be running 16 times simultaneously.

@RandomShaper See #72689

@myaaaaaaaaa changed the title "Increase number of worker threads in developer mode to aid debugging" to "Increase default number of worker threads in debug mode to make race conditions more reproducible" Feb 8, 2023
@myaaaaaaaaa changed the title "Increase default number of worker threads in debug mode to make race conditions more reproducible" to "Set a fixed number of worker threads in debug mode to make race conditions more reproducible" Mar 2, 2023
@myaaaaaaaaa (Contributor, author) commented Mar 2, 2023

After some more testing and deliberation, I've reduced the default number of worker threads from 64 to 16 to be more in line with typical systems, according to Steam's Hardware Survey (80% of users fall between 4c8t and 8c16t).

The theory is that if a performance drop is observed in a piece of multithreaded code, that code probably has limited scalability, and 16 threads is a reasonable standard to test that against.

@myaaaaaaaaa deleted the task-splitter branch August 21, 2023 15:11
@AThousandShips removed this from the 4.x milestone Aug 22, 2023