
Fiber context switch, Stack pools, Safe select #7407

Conversation

@ysbaddaden ysbaddaden commented Feb 10, 2019

Rework of #6990 that introduces Fiber::Context to hold the stack pointer along with a resumable flag, used to determine whether the fiber is currently running (nice for MT), plus an alive flag on the fiber to know when it's dead and mustn't be resumed anymore.

The scheduler now only resumes a resumable fiber, and aborts with a fatal error whenever asked to resume a running or dead fiber, which would otherwise lead to a segfault. It doesn't just ignore these fibers, because they're proof of synchronization issues.

This change makes it obvious that select is buggy (see #3900 and #7348). I propose a full rewrite of Channel.select that spawns a fiber for each clause. I didn't benchmark, but I expect this implementation to be slower than the current one: unless a clause is already ready or there is an else branch, it now spawns N fibers just to cancel all but one of them eventually. It is concurrency safe, though, and we can probably speed things up by checking whether any clause is ready before spawning fibers, so we only pay that cost when we actually must wait.
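The "check before spawning" idea can be sketched with plain procs standing in for the real select actions (Clause, ready, and execute are illustrative names here, not the actual API):

```crystal
# Illustrative fast path: only fall back to spawning one fiber per
# clause when no clause is immediately ready. The procs stand in for
# the real per-clause ready/execute operations.
record Clause, ready : Proc(Bool), execute : Proc(Int32)

def try_fast_path(clauses : Array(Clause)) : {Int32, Int32}?
  clauses.each_with_index do |clause, i|
    # Execute the first ready clause without spawning anything.
    return {i, clause.execute.call} if clause.ready.call
  end
  nil # every clause would block: the caller must spawn fibers and wait
end
```

With, say, a buffered channel that still has capacity, a send clause is immediately ready, so select would short-circuit here instead of spawning N fibers.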

Notes:

@ysbaddaden ysbaddaden force-pushed the core/fiber-context-switch-stack-pools-and-safe-select branch from edbf15f to 3c8b861 Compare February 10, 2019 10:35
The fiber context holds the current stack top pointer (for a saved
stack), as well as a resumable flag that tells whether the fiber can
be resumed (its context was fully saved), is currently running, or
is dead (its proc returned).

The resumable flag is required to prevent a fiber from being resumed
while it's still running. This should usually not happen with the
monothreaded scheduler, but it's still useful to find a buggy
synchronisation primitive that enqueues the same fiber multiple
times (e.g. the current `select` implementation). It will be very
frequent with a multithreaded scheduler, where a thread B could try
to resume a fiber that has just been enqueued by thread A but hasn't
fully stored its context yet.

The fiber status, running or dead, is monitored using an `@alive`
boolean. We can't set `context.resumable` to another value, because
the scheduler eventually swaps the context to another fiber, which
would overwrite `context.resumable`.
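As a sketch, the context type described above has roughly this shape (shown standalone here for illustration; in the PR it is Fiber::Context, and the exact field types may differ):

```crystal
# Sketch of the fiber context: the saved stack top pointer plus a
# resumable flag. An extern struct, because a low-level context
# switch reads and writes these fields at fixed offsets.
@[Extern]
struct FiberContext
  property stack_top : Void*
  property resumable : LibC::Long

  def initialize(@stack_top : Void* = Pointer(Void).null, @resumable : LibC::Long = 0)
  end
end
```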

The scheduler now aborts with a fatal error whenever it would resume
a running or dead fiber, which would otherwise lead to a segfault.
It doesn't just ignore the fiber, since enqueueing a fiber twice,
or enqueueing a dead fiber, is an error that must be fixed.
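A minimal sketch of that check, using a stand-in type instead of the real Fiber (the field names mirror the description above; the actual scheduler code differs):

```crystal
# Illustrative only: a stand-in for Fiber with the two flags the
# scheduler consults before switching contexts.
class FiberState
  property resumable = false
  property alive = true
end

# Abort instead of segfaulting: resuming a running fiber would trash
# its live stack; resuming a dead fiber would jump into freed memory.
def check_resumable!(fiber : FiberState) : Nil
  return if fiber.resumable
  if fiber.alive
    raise "FATAL: tried to resume a running fiber"
  else
    raise "FATAL: tried to resume a dead fiber"
  end
end
```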
Moves the pool of free stacks to recycle out of Fiber itself, and
makes it thread-safe using simple thread mutexes. A better algorithm
(e.g. nonblocking or flat-combining) could eventually be used to
speed up spawning fibers in parallel.
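A sketch of such a mutex-guarded pool (StackPool, checkout, and release are illustrative names; real stack allocation also involves mmap and guard pages, which this skips):

```crystal
# Hypothetical sketch: recycle fiber stacks through a pool guarded by
# a plain mutex so multiple threads can spawn fibers safely.
class StackPool
  STACK_SIZE = 8 * 1024 * 1024

  def initialize
    @stacks = Deque(Pointer(UInt8)).new
    @mutex = Mutex.new
  end

  # Reuse a previously released stack when one is available,
  # otherwise allocate a fresh one.
  def checkout : Pointer(UInt8)
    @mutex.synchronize { @stacks.pop? } || Pointer(UInt8).malloc(STACK_SIZE)
  end

  def release(stack : Pointer(UInt8)) : Nil
    @mutex.synchronize { @stacks.push(stack) }
  end
end
```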
In some scenarios, for example concurrent send actions to buffered
channels, it was possible to enqueue the current fiber *many* times
which led to segfaults, trying to resume a dead fiber or to switch
to the currently running fiber.

All channel actions are now performed inside a dedicated fiber, so
the wait operation is safe (on a single thread).

Still not MT-safe: select actions need to compete on an atomic flag
when ready, and be resilient to a failed execution (e.g. another
thread already sent to or received from the channel).
@ysbaddaden ysbaddaden force-pushed the core/fiber-context-switch-stack-pools-and-safe-select branch from 3c8b861 to 824a2dd Compare February 11, 2019 09:21
forksaber commented Feb 12, 2019

Hi @ysbaddaden,
I found an issue with this PR, caused by not calling unwait immediately before resume/enqueue. The following example hangs, but it shouldn't:

ch1 = Channel(Int32).new(1)
ch2 = Channel(Int32).new
spawn do
  select
  when ch1.send(10)
  when ch2.send(20)
  end

  select
  when ch2.send(21)
  when ch1.send(11)
  end
end

select
when x = ch1.receive
  puts x
when y = ch2.receive
  puts y
end
puts ch2.receive

Expected output

10
21


ysbaddaden commented Feb 12, 2019

@forksaber thanks for testing!

It sure hangs, but that's because ch2 is unbuffered (synchronous) while ch1 is buffered (asynchronous), and ch2 never gets a chance to actually send a value: it's always canceled, so it's empty, and the last ch2.receive blocks forever. The flow goes like this:

ch1.send(10) # => success
ch1.send(11) # => waiting...
ch1.receive  # => 10
ch1.send(11) # => success
ch2.receive  # => waiting...

Once a clause is ready, we immediately cancel/unwait all other operations/fibers. Since we're single-threaded, there is no race condition. With MT we'll need an atomic flag (easy) and resilience to a failed execute operation (complex).
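The atomic flag part can be sketched like this (SelectWinner and try_win? are hypothetical names, not part of the PR):

```crystal
# Hypothetical sketch for MT: every clause that becomes ready races
# on a compare-and-set; exactly one caller sees success and executes,
# the losers must back off (and handle their operation failing).
class SelectWinner
  def initialize
    @state = Atomic(Int32).new(0)
  end

  # Returns true for exactly one caller, even across threads.
  def try_win? : Bool
    _old, success = @state.compare_and_set(0, 1)
    success
  end
end
```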

forksaber commented Feb 12, 2019

I reported it only because the above code doesn't block with the current version of Crystal, so I assume there is a slight behaviour change in your implementation. Whether that's considered a bug or not is debatable.

Anyways here's another test case:

ch1 = Channel(Int32).new(1)
ch2 = Channel(Int32).new
spawn do
  sleep 0
  ch1.send(10)
  ch2.send(21)
end

select
when x = ch1.receive
  puts x
when y = ch2.receive
  puts y
end
puts ch2.receive

Output

FATAL: tried to resume a dead fiber: #<Fiber:0x7fbb43997d80: 0>

Any thoughts on my approach (#7414 ) to solving this issue?

@ysbaddaden commented
Oh, good catch: because ch1.send is buffered, it enqueues the fiber of the ch1.receive select clause; but because ch2.send is unbuffered, it immediately resumes the fiber of the ch2.receive clause, which unwaits the other clauses, thus enqueuing the first fiber a second time.

@bcardiff bcardiff mentioned this pull request Mar 14, 2019
@ysbaddaden ysbaddaden closed this Mar 14, 2019
@ysbaddaden ysbaddaden deleted the core/fiber-context-switch-stack-pools-and-safe-select branch November 25, 2024 10:14