
MT: few fixes and improvements #14089

Closed
wants to merge 8 commits into from

Conversation

ysbaddaden
Contributor

First, a couple of fixes: an issue in the MT scheduler (arithmetic overflow when spawning Int32::MAX + 1 fibers), and a workaround to avoid #13211.

The scheduler always replaces the current thread for a fiber (which is inefficient) and it breaks the invariant that "a fiber will only ever live on the same thread". I refactored how fibers are associated with a thread, and made #resume(fiber) and #yield(fiber) verify that the fiber we're trying to resume has indeed been associated with the current scheduler. The two methods are barely used (only #resume in specs), so there shouldn't be any impact.
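
A minimal sketch of the kind of check described above, assuming a hypothetical `current_thread` getter on `Fiber` (illustrative only, not the actual stdlib code):

```crystal
class Crystal::Scheduler
  # Illustrative sketch: `fiber.current_thread` is an assumed accessor and
  # the real method body differs.
  protected def resume(fiber : Fiber) : Nil
    unless fiber.current_thread == Thread.current
      raise "BUG: tried to resume a fiber associated with another thread"
    end
    # ... proceed with the context swap ...
  end
end
```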

Note: we may want to deprecate Fiber#resume and Crystal::Scheduler#yield(fiber), which try to immediately resume the fiber in the current thread, in favor of Fiber#enqueue, which will properly dispatch it. Again, #resume is only used in the stdlib specs, and #yield(fiber) isn't used at all.

I simplified a couple of indirections: for example, Fiber.yield went through 5 inner calls when it could be simplified to only 2, and the scheduler called Fiber#resume when it could call #resume(fiber) directly.

I also tried to improve the design: for example, having one Fiber::StackPool per scheduler so we can (safely) recycle stacks, instead of a couple of separate systems (a Fiber::StackPool that only creates stacks, and Scheduler#free_stacks that does some cleanup on #reschedule), and moving #set_current_thread from the scheduler to Fiber.

Sadly, all these changes don't seem to improve the performance of MT 😞

See all the individual commits and their description for more details.

This is still a lot of fibers, and can take over 24 days if spawning a
fiber every millisecond, but it can happen.
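
For scale, a quick back-of-the-envelope in plain Crystal (not the scheduler code):

```crystal
# A 32-bit counter wraps after Int32::MAX increments:
counter = Int32::MAX
counter &+ 1                        # => -2147483648 (wrapping add)
# `counter + 1` would raise OverflowError at runtime instead.

# At one fiber per millisecond, reaching Int32::MAX takes roughly 24.9 days:
Int32::MAX.milliseconds.total_days  # => 24.855...
```
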
It used to call:

1. Fiber.yield
2. Crystal::Scheduler.yield
3. Thread.current.scheduler.yield
4. sleep(0.seconds)
5. Crystal::Scheduler.sleep
6. Thread.current.scheduler.sleep()

Now it's a little more direct:

1. Fiber.yield
2. Crystal::Scheduler.yield
3. Thread.current.scheduler.sleep(0.seconds)
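
In code, the shorter path looks roughly like this (simplified sketch, not the exact stdlib source):

```crystal
class Fiber
  def self.yield : Nil
    Crystal::Scheduler.yield
  end
end

class Crystal::Scheduler
  def self.yield : Nil
    # Call the current thread's scheduler directly instead of bouncing
    # through the top-level sleep(0.seconds) helper.
    Thread.current.scheduler.sleep(0.seconds)
  end
end
```
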
It's simpler to call `#resume(fiber)` than going through:

1. Fiber#resume
2. Crystal::Scheduler.resume(fiber)
3. Thread.current.scheduler.resume(fiber)
This is mostly for readability: `Thread.current` is usually backed by a
`@[ThreadLocal]` variable and the result may be cached locally by LLVM to
avoid multiple resolutions, though that won't be the case for targets that
don't support ThreadLocal, for which we must use thread data storage
(e.g. Android, OpenBSD).
This avoids manipulating `fiber.@current_thread` directly, which isn't very
pretty, and moves the responsibility for manipulating it to
Crystal::Scheduler. Crystal::Scheduler is also responsible for making sure
a fiber will always be enqueued or resumed on the thread it's been
associated with.

Lastly, we remove the current_thread store that always replaced any
previous value on context swap, which sadly doesn't seem to improve
performance...
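
A minimal sketch of what such an accessor could look like on Fiber, assuming a plain nilable field (the names and the lack of synchronization are simplifications, not the actual stdlib code):

```crystal
class Fiber
  # Simplified: the real field and its synchronization differ, and only
  # Crystal::Scheduler is supposed to call the setter.
  @current_thread : Thread?

  # Associate the fiber with a thread (defaults to the calling thread).
  def set_current_thread(thread : Thread = Thread.current) : Thread
    @current_thread = thread
  end

  def current_thread : Thread?
    @current_thread
  end
end
```
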
The stack pool was only used to create new stacks when MT was enabled.
The stacks were then pushed to each scheduler's free stacks, and
eventually dropped on reschedule (after the context swap, so we're sure to
only ever deallocate stacks that are no longer in use). As a result,
stacks were never reused with MT: only created, then deallocated.

This patch changes the behavior: each scheduler now runs its own stack
pool, which it uses to create and collect stacks, and to reuse them when
possible. It also drops the mutex since the stack pool can never be
accessed in parallel (in fact it never was).

Also starts a collecting fiber on each thread.

It may only lead to better performance if there are different fibers,
running on multiple threads, that spawn fibers. It won't have much
(or any) impact if there is only one fiber spawning other fibers (e.g. an
HTTP::Server), as the stacks of fibers that run on another thread won't be
reused (different stack pool).
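
A minimal sketch of the per-scheduler pool idea, assuming a simple deque of stack pointers and GC-allocated memory in place of real stack allocation (names, sizes, and the API are illustrative, not the actual Fiber::StackPool):

```crystal
# Toy pool owned by a single scheduler, so no mutex is required.
class ToyStackPool
  STACK_SIZE = 8 * 1024 * 1024

  def initialize
    @deque = Deque(Void*).new
  end

  # Reuse a previously collected stack when available, else allocate one.
  def checkout : Void*
    @deque.pop? || Pointer(UInt8).malloc(STACK_SIZE).as(Void*)
  end

  # Give a stack back once it is safe to do so (after the context swap).
  def release(stack : Void*) : Nil
    @deque.push(stack)
  end

  # What a per-thread collecting fiber would call periodically:
  # drop half of the idle stacks to bound memory usage.
  def collect : Nil
    (@deque.size // 2).times { @deque.shift? }
  end
end
```

Because each pool belongs to exactly one scheduler, a stack released on one thread can only ever be reused by fibers spawned on that same thread, which is the limitation described above.
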
@ysbaddaden
Contributor Author

There's an error in the AArch64 stdlib test suite, but that seems unrelated?

@yxhuvud
Contributor

yxhuvud commented Dec 14, 2023

This will force updates to my code, but I still love these changes and most of them are total no-brainers. There is so much needless indirection in these parts, and MT really needs polish.

Sadly, all these changes don't seem to improve the performance of MT

I'm not surprised. These all look like minor things (from a performance perspective) except maybe the stackpool change if you create a bazillion fibers at once. Things like stack switching or sending fibers over the fiber channel would probably dominate most of these by far.

How do you benchmark it? I've used https://gist.github.com/yxhuvud/9eec4be4f85fd700ed768965bb731389 back when I tried to implement work stealing (I never finished). Perhaps 87f414e is the reason it locked up for bigger slices (Go and Zig handled a lot bigger ones). Or perhaps it is some interaction with the fiber channels.

@ysbaddaden
Contributor Author

ysbaddaden commented Dec 15, 2023

As requested, I'm closing this PR in favor of individual pull requests:

@yxhuvud Silly heavy-load benchmarks, either CPU-bound (Blake3) or IO-bound (HTTP::Server). Thanks for the links 👀

@ysbaddaden ysbaddaden closed this Dec 15, 2023
@ysbaddaden
Contributor Author

ysbaddaden commented Dec 18, 2023

FWIW: your qsort benchmark leads to interesting numbers:

  • master (no MT): 104-109ms
  • branch (no MT): 104-108ms
  • master (MT:1): 95-97ms
  • branch (MT:1): 95-98ms
  • master (MT:4): 78-91ms
  • branch (MT:4): 72-82ms (fastest)

Which says:

  • MT:4 is slightly faster with the changes on this branch
  • MT:1 is always faster than no MT 🤨

I ran each test at least 8 times, retried them later, and they always led to the same values.

@yxhuvud
Contributor

yxhuvud commented Dec 18, 2023

MT:1 is always faster than no MT 🤨

Well, rewrite it to not use channels or spawns and see where it lands 🙈

MT:4 is slightly faster with the changes on this branch

This however looks like a really nice win.

@ysbaddaden ysbaddaden deleted the fix/mt-issues branch September 6, 2024 10:57