MT: few fixes and improvements #14089
Conversation
This is still a lot of fibers, and it would take over 24 days to reach when spawning a fiber every millisecond, but it can happen.
It used to call:
1. `Fiber.yield`
2. `Crystal::Scheduler.yield`
3. `Thread.current.scheduler.yield`
4. `sleep(0.seconds)`
5. `Crystal::Scheduler.sleep`
6. `Thread.current.scheduler.sleep`

Now it's a little more direct:
1. `Fiber.yield`
2. `Crystal::Scheduler.yield`
3. `Thread.current.scheduler.sleep(0.seconds)`
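A hedged sketch of what the shortened path could look like (illustrative only, not the actual stdlib source; signatures are simplified):

```crystal
# Sketch: the public entry point now delegates straight to the current
# thread's scheduler, instead of going through sleep(0.seconds) and two
# further indirections.
class Fiber
  def self.yield
    Crystal::Scheduler.yield
  end
end

module Crystal::Scheduler
  def self.yield
    Thread.current.scheduler.sleep(0.seconds)
  end
end
```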
It's simpler to call `#resume(fiber)` than going through:
1. `Fiber#resume`
2. `Crystal::Scheduler.resume(fiber)`
3. `Thread.current.scheduler.resume(fiber)`
This is mostly for readability, since `Thread.current` is usually a `@[ThreadLocal]` variable and the result may be cached locally by LLVM to avoid multiple resolutions, though that won't be the case for targets that don't support `@[ThreadLocal]` and for which we must use thread-specific data storage (e.g. Android, OpenBSD).
This avoids manipulating `fiber.@current_thread` directly, which isn't very pretty, and moves the responsibility for manipulating it to `Crystal::Scheduler`. `Crystal::Scheduler` is also responsible for making sure a fiber will always be enqueued or resumed on the thread it's been associated to. Lastly, we remove the current_thread store that always replaced any previous value on context swap, which sadly doesn't seem to improve performance...
The stack pool was only used to create new stacks when MT was enabled. The stacks were then pushed to each scheduler's free stacks, and eventually dropped on reschedule (after the context swap, so we're sure to only ever deallocate stacks that are no longer in use). As a result, stacks were never reused with MT: they were only created, then deallocated.

This patch changes the behavior to have a stack pool running on each scheduler, and to use it to create and collect the stacks, reusing them when possible. It also drops the mutex, since the stack pool can never be accessed in parallel (in fact it never was), and starts a collecting fiber on each thread.

This may only lead to better performance when different fibers, running on multiple threads, are spawning fibers. It won't have much (or any) impact if there is only one fiber spawning other fibers (e.g. an HTTP::Server), since the stacks of fibers that run on another thread won't be reused (different stack pool).
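A minimal sketch of the per-scheduler pool idea. This is illustrative only: `checkout`, `release`, and `allocate_stack` are hypothetical names, and the real implementation deals with memory-mapped stacks and guard pages.

```crystal
class Fiber::StackPool
  def initialize
    # No mutex: each scheduler (thread) owns exactly one pool,
    # so the pool is never accessed in parallel.
    @deque = Deque(Void*).new
  end

  # Reuse a collected stack when available, otherwise allocate a new one
  # (`allocate_stack` stands in for the real allocation).
  def checkout : Void*
    @deque.pop? || allocate_stack
  end

  # Called after the context swap, once the stack is no longer in use.
  def release(stack : Void*) : Nil
    @deque.push(stack)
  end
end
```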
There's an error on the AArch64 stdlib test suite, but that seems unrelated?
This will force updates to my code, but I still love these changes, and most of them are total no-brainers. There is so much needless indirection in these parts, and MT really needs polish.
I'm not surprised. These all look like minor things (from a performance perspective), except maybe the stack pool change if you create a bazillion fibers at once. Things like stack switching or sending fibers over the fiber channel would probably dominate most of these by far. How do you benchmark it? I've used https://gist.github.com/yxhuvud/9eec4be4f85fd700ed768965bb731389 back when I tried to implement work stealing (I never finished). Perhaps 87f414e is the reason it locked up for bigger slices (Go and Zig handled a lot bigger ones). Or perhaps it is some interaction with the fiber channels.
As requested, I'm closing this PR in favor of individual pull requests:
@yxhuvud Silly heavy-load benchmarks, either CPU bound (Blake3) or IO bound (HTTP::Server). Thanks for the links 👀
FWIW: your qsort benchmark leads to interesting numbers:
Which says:
I ran each test at least 8 times, retried them later, and they always led to the same values.
Well, rewrite it to not use channels or spawns and see where it lands 🙈
This however looks like a really nice win.
First, a couple of fixes: an issue in the MT scheduler (math overflow when spawning Int32::MAX + 1 fibers), and a workaround to avoid #13211.
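For context on the overflow: Crystal's arithmetic is checked by default, so an `Int32` counter incremented past `Int32::MAX` raises `OverflowError` instead of silently wrapping. A sketch (not the scheduler's actual code):

```crystal
counter = Int32::MAX
counter &+ 1   # wrapping add: -2147483648 (Int32::MIN)
# counter + 1  # checked add: raises OverflowError at runtime
```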
The scheduler always replaces the current thread for a fiber (which is inefficient), and it breaks the invariant that a fiber will only ever live on the same thread. I refactored how fibers are associated to a thread, and enforced `#resume(fiber)` and `#yield(fiber)` to verify that the fiber we're trying to resume has indeed been associated with the current scheduler. The two methods are barely used (only `#resume` in specs), so there shouldn't be any impact.

Note: we may want to deprecate `Fiber#resume` and `Crystal::Scheduler#yield(fiber)`, which try to immediately resume the fiber in the current thread, in favor of `Fiber#enqueue`, which will properly dispatch. Again, `#resume` is only used in the stdlib specs, and `#yield(fiber)` isn't used at all.

I simplified a couple of indirections, for example `Fiber.yield`, which went through 5 inner calls when it could be simplified to only 2, or the scheduler calling `Fiber#resume` when it could call `#resume(fiber)` directly.

I also tried to improve the design, for example having one `Fiber::StackPool` per scheduler so we can (safely) recycle stacks, and moving `#set_current_thread` from the scheduler to `Fiber`, instead of a couple of systems (`Fiber::StackPool` that only creates stacks, and `Scheduler#free_stacks` that does some cleanup on `#reschedule`).

Sadly, all these changes don't seem to improve the performance of MT 😞
See all the individual commits and their description for more details.