
MT: few fixes and improvements #14089

Closed
wants to merge 8 commits into from

Conversation

ysbaddaden
Contributor

First, a couple of fixes: an issue in the MT scheduler (arithmetic overflow when spawning Int32::MAX + 1 fibers), and a workaround to avoid #13211.

The scheduler always replaces the current thread for a fiber (which is inefficient) and it breaks the invariant that "a fiber will only ever live on the same thread". I refactored how fibers are associated with a thread, and made #resume(fiber) and #yield(fiber) verify that the fiber we're trying to resume has indeed been associated with the current scheduler. The two methods are barely used (only #resume in specs), so there shouldn't be any impact.
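
A minimal sketch of the kind of check described above, assuming a hypothetical `current_thread` getter on `Fiber` (illustrative only, not the actual stdlib code):

```crystal
class Crystal::Scheduler
  # Illustrative sketch: `fiber.current_thread` is an assumed accessor and
  # the real method body differs.
  protected def resume(fiber : Fiber) : Nil
    unless fiber.current_thread == Thread.current
      raise "BUG: tried to resume a fiber associated with another thread"
    end
    # ... proceed with the context swap ...
  end
end
```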

Note: we may want to deprecate Fiber#resume and Crystal::Scheduler#yield(fiber), which try to immediately resume the fiber in the current thread, in favor of Fiber#enqueue, which will properly dispatch it. Again, #resume is only used in the stdlib specs, and #yield(fiber) isn't used at all.

I simplified a couple of indirections: for example, Fiber.yield went through 5 inner calls when it could be simplified to only 2, and the scheduler called Fiber#resume when it could call #resume(fiber) directly.

I also tried to improve the design: for example, having one Fiber::StackPool per scheduler so we can (safely) recycle stacks, instead of a couple of separate systems (a Fiber::StackPool that only creates stacks, and Scheduler#free_stacks that does some cleanup on #reschedule), and moving #set_current_thread from the scheduler to Fiber.

Sadly, all these changes don't seem to improve the performance of MT 😞

See all the individual commits and their description for more details.

This is still a lot of fibers, and can take over 24 days if spawning a
fiber every millisecond, but it can happen.
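
For scale, a quick back-of-the-envelope in plain Crystal (not the scheduler code):

```crystal
# A 32-bit counter wraps after Int32::MAX increments:
counter = Int32::MAX
counter &+ 1                        # => -2147483648 (wrapping add)
# `counter + 1` would raise OverflowError at runtime instead.

# At one fiber per millisecond, reaching Int32::MAX takes roughly 24.9 days:
Int32::MAX.milliseconds.total_days  # => 24.855...
```
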
It used to call:

1. Fiber.yield
2. Crystal::Scheduler.yield
3. Thread.current.scheduler.yield
4. sleep(0.seconds)
5. Crystal::Scheduler.sleep
6. Thread.current.scheduler.sleep()

Now it's a little more direct:

1. Fiber.yield
2. Crystal::Scheduler.yield
3. Thread.current.scheduler.sleep(0.seconds)
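
In code, the shorter path looks roughly like this (simplified sketch, not the exact stdlib source):

```crystal
class Fiber
  def self.yield : Nil
    Crystal::Scheduler.yield
  end
end

class Crystal::Scheduler
  def self.yield : Nil
    # Call the current thread's scheduler directly instead of bouncing
    # through the top-level sleep(0.seconds) helper.
    Thread.current.scheduler.sleep(0.seconds)
  end
end
```
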
It's simpler to call `#resume(fiber)` than going through:

1. Fiber#resume
2. Crystal::Scheduler.resume(fiber)
3. Thread.current.scheduler.resume(fiber)
This is mostly for readability: `Thread.current` is usually backed by a
`@[ThreadLocal]` variable and the result may be cached locally by LLVM to
avoid multiple resolutions, though that won't be the case for targets that
don't support ThreadLocal, for which we must use thread data storage
(e.g. Android, OpenBSD).
This avoids manipulating `fiber.@current_thread` directly, which isn't very
pretty, and moves the responsibility for manipulating it to
Crystal::Scheduler. Crystal::Scheduler is also responsible for making sure
a fiber will always be enqueued or resumed on the thread it's been
associated with.

Lastly, we remove the current_thread store that always replaced any
previous value on context swap, which sadly doesn't seem to improve
performance...
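
A minimal sketch of what such an accessor could look like on Fiber, assuming a plain nilable field (the names and the lack of synchronization are simplifications, not the actual stdlib code):

```crystal
class Fiber
  # Simplified: the real field and its synchronization differ, and only
  # Crystal::Scheduler is supposed to call the setter.
  @current_thread : Thread?

  # Associate the fiber with a thread (defaults to the calling thread).
  def set_current_thread(thread : Thread = Thread.current) : Thread
    @current_thread = thread
  end

  def current_thread : Thread?
    @current_thread
  end
end
```
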
The stack pool was only used to create new stacks when MT was enabled.
The stacks were then pushed to each scheduler's free stacks, and
eventually dropped on reschedule (after the context swap, so we're sure to
only ever deallocate stacks that are no longer in use). As a result,
stacks were never reused with MT: only created, then deallocated.

This patch changes the behavior: each scheduler now runs its own stack
pool, which it uses to create and collect stacks, and to reuse them when
possible. It also drops the mutex since the stack pool can never be
accessed in parallel (in fact it never was).

Also starts a collecting fiber on each thread.

It may only lead to better performance if there are different fibers,
running on multiple threads, that spawn fibers. It won't have much
(or any) impact if there is only one fiber spawning other fibers (e.g. an
HTTP::Server), as the stacks of fibers that run on another thread won't be
reused (different stack pool).
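
A minimal sketch of the per-scheduler pool idea, assuming a simple deque of stack pointers and GC-allocated memory in place of real stack allocation (names, sizes, and the API are illustrative, not the actual Fiber::StackPool):

```crystal
# Toy pool owned by a single scheduler, so no mutex is required.
class ToyStackPool
  STACK_SIZE = 8 * 1024 * 1024

  def initialize
    @deque = Deque(Void*).new
  end

  # Reuse a previously collected stack when available, else allocate one.
  def checkout : Void*
    @deque.pop? || Pointer(UInt8).malloc(STACK_SIZE).as(Void*)
  end

  # Give a stack back once it is safe to do so (after the context swap).
  def release(stack : Void*) : Nil
    @deque.push(stack)
  end

  # What a per-thread collecting fiber would call periodically:
  # drop half of the idle stacks to bound memory usage.
  def collect : Nil
    (@deque.size // 2).times { @deque.shift? }
  end
end
```

Because each pool belongs to exactly one scheduler, a stack released on one thread can only ever be reused by fibers spawned on that same thread, which is the limitation described above.
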
@ysbaddaden
Contributor Author

There's an error in the AArch64 stdlib test suite, but that seems unrelated?

@yxhuvud
Contributor

yxhuvud commented Dec 14, 2023

This will force updates to my code, but I still love these changes and most of them are total no-brainers. There is so much needless indirection in these parts, and MT really needs polish.

Sadly, all these changes don't seem to improve the performance of MT

I'm not surprised. These all look like minor things (from a performance perspective) except maybe the stackpool change if you create a bazillion fibers at once. Things like stack switching or sending fibers over the fiber channel would probably dominate most of these by far.

How do you benchmark it? I've used https://gist.github.com/yxhuvud/9eec4be4f85fd700ed768965bb731389 back when I tried to implement work stealing (I never finished). Perhaps 87f414e is the reason it locked up for bigger slices (Go and Zig handled a lot bigger ones). Or perhaps it is some interaction with the fiber channels.

@ysbaddaden
Contributor Author

ysbaddaden commented Dec 15, 2023

As requested, I'm closing this PR in favor of individual pull requests:

@yxhuvud Silly heavy-load benchmarks, either CPU-bound (Blake3) or IO-bound (HTTP::Server). Thanks for the links 👀

@ysbaddaden ysbaddaden closed this Dec 15, 2023
@ysbaddaden
Contributor Author

ysbaddaden commented Dec 18, 2023

FWIW: your qsort benchmark leads to interesting numbers:

  • master (no MT): 104-109ms
  • branch (no MT): 104-108ms
  • master (MT:1): 95-97ms
  • branch (MT:1): 95-98ms
  • master (MT:4): 78-91ms
  • branch (MT:4): 72-82ms (fastest)

Which says:

  • MT:4 is slightly faster with the changes on this branch
  • MT:1 is always faster than no MT 🤨

I ran each test at least 8 times, retried them later, and they always led to the same values.

@yxhuvud
Contributor

yxhuvud commented Dec 18, 2023

MT:1 is always faster than no MT 🤨

Well, rewrite it to not use channels or spawns and see where it lands 🙈

MT:4 is slightly faster with the changes on this branch

This however looks like a really nice win.

@ysbaddaden ysbaddaden deleted the fix/mt-issues branch September 6, 2024 10:57