
Fiber context switch, Stack pools, Safe select #7407

Conversation

@ysbaddaden ysbaddaden commented Feb 10, 2019

Rework of #6990 that introduces Fiber::Context to hold the stack pointer along with a resumable flag, used to determine whether the fiber is currently running (nice for MT), plus an alive flag on the fiber to know when it's dead and mustn't be resumed anymore.

The scheduler now only resumes a resumable fiber, and aborts with a fatal error whenever asked to resume a running or dead fiber, which would otherwise lead to a segfault. It doesn't just ignore these fibers, because they're proof of synchronization issues.

This change makes it obvious that select is buggy (see #3900 and #7348). I propose a full rewrite of Channel.select that spawns a fiber for each clause. I didn't benchmark, but I expect this implementation to be slower than the current one: unless a clause is already ready or there is an else branch, it now spawns N fibers just to cancel all but one of them eventually. It is concurrency safe, though, and we can probably speed things up by checking whether any clause is ready before spawning fibers, so we only pay that cost when we actually must wait.
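The "check before spawning" idea can be sketched with plain procs standing in for the real select actions (Clause, ready, and execute are illustrative names here, not the actual API):

```crystal
# Illustrative fast path: only fall back to spawning one fiber per
# clause when no clause is immediately ready. The procs stand in for
# the real per-clause ready/execute operations.
record Clause, ready : Proc(Bool), execute : Proc(Int32)

def try_fast_path(clauses : Array(Clause)) : {Int32, Int32}?
  clauses.each_with_index do |clause, i|
    # Execute the first ready clause without spawning anything.
    return {i, clause.execute.call} if clause.ready.call
  end
  nil # every clause would block: the caller must spawn fibers and wait
end
```

With, say, a buffered channel that still has capacity, a send clause is immediately ready, so select would short-circuit here instead of spawning N fibers.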

Notes:

@ysbaddaden ysbaddaden force-pushed the core/fiber-context-switch-stack-pools-and-safe-select branch from edbf15f to 3c8b861 Compare February 10, 2019 10:35
The fiber context holds the current stack top pointer (for a saved
stack), as well as a resumable flag that tells whether the fiber can
be resumed (its context was fully saved), is currently running, or
is dead (its proc returned).

The resumable flag is required to prevent a fiber from being resumed
while it's still running. This should usually not happen with the
monothreaded scheduler, but it's still useful to find a buggy
synchronisation primitive that enqueues the same fiber multiple
times (e.g. the current `select` implementation). It will be very
frequent with a multithreaded scheduler, where a thread B could try
to resume a fiber that has just been enqueued by thread A but hasn't
fully stored its context yet.

The fiber status, running or dead, is monitored using an `@alive`
boolean. We can't set `context.resumable` to another value, because
the scheduler eventually swaps the context to another fiber, which
would overwrite `context.resumable`.
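As a sketch, the context type described above has roughly this shape (shown standalone here for illustration; in the PR it is Fiber::Context, and the exact field types may differ):

```crystal
# Sketch of the fiber context: the saved stack top pointer plus a
# resumable flag. An extern struct, because a low-level context
# switch reads and writes these fields at fixed offsets.
@[Extern]
struct FiberContext
  property stack_top : Void*
  property resumable : LibC::Long

  def initialize(@stack_top : Void* = Pointer(Void).null, @resumable : LibC::Long = 0)
  end
end
```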

The scheduler now aborts with a fatal error whenever it would resume
a running or dead fiber, which would otherwise lead to a segfault.
It doesn't just ignore the fiber, since enqueueing a fiber twice,
or enqueueing a dead fiber, is an error that must be fixed.
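A minimal sketch of that check, using a stand-in type instead of the real Fiber (the field names mirror the description above; the actual scheduler code differs):

```crystal
# Illustrative only: a stand-in for Fiber with the two flags the
# scheduler consults before switching contexts.
class FiberState
  property resumable = false
  property alive = true
end

# Abort instead of segfaulting: resuming a running fiber would trash
# its live stack; resuming a dead fiber would jump into freed memory.
def check_resumable!(fiber : FiberState) : Nil
  return if fiber.resumable
  if fiber.alive
    raise "FATAL: tried to resume a running fiber"
  else
    raise "FATAL: tried to resume a dead fiber"
  end
end
```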
Moves the pool of free stacks to recycle out of Fiber itself, and
makes it thread-safe using simple thread mutexes. A better algorithm
(e.g. nonblocking or flat-combining) could eventually be used to
speed up spawning fibers in parallel.
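A sketch of such a mutex-guarded pool (StackPool, checkout, and release are illustrative names; real stack allocation also involves mmap and guard pages, which this skips):

```crystal
# Hypothetical sketch: recycle fiber stacks through a pool guarded by
# a plain mutex so multiple threads can spawn fibers safely.
class StackPool
  STACK_SIZE = 8 * 1024 * 1024

  def initialize
    @stacks = Deque(Pointer(UInt8)).new
    @mutex = Mutex.new
  end

  # Reuse a previously released stack when one is available,
  # otherwise allocate a fresh one.
  def checkout : Pointer(UInt8)
    @mutex.synchronize { @stacks.pop? } || Pointer(UInt8).malloc(STACK_SIZE)
  end

  def release(stack : Pointer(UInt8)) : Nil
    @mutex.synchronize { @stacks.push(stack) }
  end
end
```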
In some scenarios, for example concurrent send actions to buffered
channels, it was possible to enqueue the current fiber *many* times
which led to segfaults, trying to resume a dead fiber or to switch
to the currently running fiber.

All channel actions are now performed inside a dedicated fiber, so
the wait operation is safe (on a single thread).

Still not MT-safe: select actions need to compete on an atomic flag
when ready, and be resilient to a failed execution (e.g. another
thread already sent to or received from the channel).
@ysbaddaden ysbaddaden force-pushed the core/fiber-context-switch-stack-pools-and-safe-select branch from 3c8b861 to 824a2dd Compare February 11, 2019 09:21
forksaber commented Feb 12, 2019

Hi @ysbaddaden,
I found an issue with this PR, caused by not calling unwait immediately before resume/enqueue. The following example hangs, but it shouldn't:

ch1 = Channel(Int32).new(1)
ch2 = Channel(Int32).new
spawn do
  select
  when ch1.send(10)
  when ch2.send(20)
  end

  select
  when ch2.send(21)
  when ch1.send(11)
  end
end

select
when x = ch1.receive
  puts x
when y = ch2.receive
  puts y
end
puts ch2.receive

Expected output

10
21


ysbaddaden commented Feb 12, 2019

@forksaber thanks for testing!

It sure hangs, but that's because ch2 is unbuffered (synchronous) while ch1 is buffered (asynchronous), and ch2 never gets a chance to actually send a value: it's always canceled, so it's empty, and the last ch2.receive blocks forever. The flow goes like this:

ch1.send(10) # => success
ch1.send(11) # => waiting...
ch1.receive  # => 10
ch1.send(11) # => success
ch2.receive  # => waiting...

Once a clause is ready, we immediately cancel/unwait all other operations/fibers. Since we're single-threaded, there is no race condition. With MT we'll need an atomic flag (easy) and resilience to a failed execute operation (complex).
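The atomic flag part can be sketched like this (SelectWinner and try_win? are hypothetical names, not part of the PR):

```crystal
# Hypothetical sketch for MT: every clause that becomes ready races
# on a compare-and-set; exactly one caller sees success and executes,
# the losers must back off (and handle their operation failing).
class SelectWinner
  def initialize
    @state = Atomic(Int32).new(0)
  end

  # Returns true for exactly one caller, even across threads.
  def try_win? : Bool
    _old, success = @state.compare_and_set(0, 1)
    success
  end
end
```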

forksaber commented Feb 12, 2019

I reported it only because the above code doesn't block with the current version of Crystal, so I assume there is a slight behaviour change in your implementation. Whether that's considered a bug or not is debatable.

Anyways here's another test case:

ch1 = Channel(Int32).new(1)
ch2 = Channel(Int32).new
spawn do
  sleep 0
  ch1.send(10)
  ch2.send(21)
end

select
when x = ch1.receive
  puts x
when y = ch2.receive
  puts y
end
puts ch2.receive

Output

FATAL: tried to resume a dead fiber: #<Fiber:0x7fbb43997d80: 0>

Any thoughts on my approach (#7414 ) to solving this issue?

@ysbaddaden commented
Oh, good catch: because ch1.send is buffered, it enqueues the fiber of the ch1.receive select clause; but because ch2.send is unbuffered, it immediately resumes the fiber of the ch2.receive clause, which unwaits the other clauses, thus enqueuing the first fiber a second time.

@bcardiff bcardiff mentioned this pull request Mar 14, 2019
@ysbaddaden ysbaddaden closed this Mar 14, 2019
@ysbaddaden ysbaddaden deleted the core/fiber-context-switch-stack-pools-and-safe-select branch November 25, 2024 10:14