-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fiber context switch, Stack pools, Safe select #7407
Fiber context switch, Stack pools, Safe select #7407
Conversation
edbf15f
to
3c8b861
Compare
The fiber context holds the current stack top pointer (for a saved stack), as well as a resumable flag that tells whether the fiber can be resumed (its context was fully saved) or is currently running, or is dead (its proc returned). The resumable flag is required to prevent a fiber to be resumed while it's still running. This should usually not happen with the monothreaded scheduler, but still useful to find a buggy synchronisation primitive that would enqueue the same fiber multiple times (e.g. the current `select` implementation). This will be very frequent with a multithreaded scheduler where a thread B could try to resume a fiber that has just been enqueued by thread A but didn't fully store its context yet. The fiber status, running or dead, is monitored using an `@alive` boolean. We can't set `context.resumable` to another value, because the scheduler eventually swaps the context to another fiber, which would overwrite `context.resumable`. The scheduler now aborts with a fatal error whenever it would resume a running or dead fiber, which would otherwise lead to a segfault. It doesn't just ignore the fiber since to enqueue a fiber twice or a dead fiber is an error that must be fixed.
Keeps the pool of free stacks to recycle out of Fiber itself, and makes it thread-safe using simple thread mutexes. A better algorithm could eventually be used (e.g. nonblocking or flat-combining) to speed up spawning fibers in parallel.
In some scenarios, for example concurrent send actions to buffered channels, it was possible to enqueue the current fiber *many* times which led to segfaults, trying to resume a dead fiber or to switch to the currently running fiber. All channels actions are now performed inside a dedicated fiber, so the wait operation is safe (on a single thread). Still not MT-safe: select actions need to compete on an atomic flag when ready, and be resilient to a failed execution (e.g. another thread already sent/receive from the channel).
3c8b861
to
824a2dd
Compare
Hi @ysbaddaden, ch1 = Channel(Int32).new(1)
ch2 = Channel(Int32).new
spawn do
select
when ch1.send(10)
when ch2.send(20)
end
select
when ch2.send(21)
when ch1.send(11)
end
end
select
when x = ch1.receive
puts x
when y = ch2.receive
puts y
end
puts ch2.receive Expected output
|
@forksaber thanks for testing! It sure hangs, but it's because
Once a clause is ready, we immediately cancel/unwait all other operations/fibers. Since we're single threaded, there is no race condition —with MT we'll need an atomic flag (easy) and be resilient to the execute operation failing (complex). |
I reported it only because the above code doesn't block with current version of crystal. I can assume there is slight behaviour change with your implementation. Whether that is considered a bug or not is debatable. Anyways here's another test case: ch1 = Channel(Int32).new(1)
ch2 = Channel(Int32).new
spawn do
sleep 0
ch1.send(10)
ch2.send(21)
end
select
when x = ch1.receive
puts x
when y = ch2.receive
puts y
end
puts ch2.receive Output
Any thoughts on my approach (#7414 ) to solving this issue? |
Oh, good catch: because |
Rework of #6990 that introduces
Fiber::Context
to hold the stack pointer along with aresumable
flag, used to determine whether the fiber is currently running (nice for MT), along with analive
flag on the fiber to know when it's dead and musn't be resumed anymore.The scheduler now only resumes a resumable fiber, and will abort with a fatal error whenever trying to resume a running or dead fibers, that would lead to segfaults. It doesn't just ignore these fibers because they're proof of synchronization issues.
This change makes it obvious that
select
is buggy (see #3900 and #7348). I propose a full rewrite ofChannel.select
that spawns a fiber for each clause.It's probably much slower than the current implementation, but it's concurrency safe —thought we can probably speed things up by checking if any clause is ready, avoiding to spawn fibers until we must wait.I didn't benchmark, but I expect this implementation to be slower, since unless a clause is ready or there is anelse
branch, it now spawns N fibers just to cancel them eventually.Notes: