
Fiber context & Stack pool #6990

Closed

Conversation

ysbaddaden
Contributor

Here are my last refactors to Fiber (before MT experiments):

  1. Extract a Fiber::StackPool to hold & collect stacks that can be recycled, and make it thread-safe;
  2. Create a Fiber::Context struct that is passed to the arch-specific ASM;
  3. Add a resumable property to the fiber context that tells whether the fiber is resumable (its context is saved), running, or dead;
  4. Since a fiber can be enqueued before its context is saved (which should never happen with a single thread, unless a concurrency primitive is wrong), the scheduler waits for the fiber to become resumable, and raises if the fiber is dead.

Note: these changes will impact performance, but hopefully not by much. With a single thread, mutexes should never be contended, and most checks will always succeed unless something is wrong.

The fiber context holds the current stack top pointer (for a saved stack), as well as a resumable flag that tells whether the fiber can be resumed (its context was fully saved), is currently running, or is dead (its proc returned).

The resumable flag is required to prevent a fiber from being resumed while it is still running. This should normally not happen with the single-threaded scheduler, unless it tries to resume a terminated fiber. It will be very frequent with a multithreaded scheduler, where a thread B could try to resume a fiber that has just been enqueued by thread A but hasn't fully stored its context yet.

The resumable flag is also used to mark a fiber as dead, which means
that it can't be resumed anymore. In that case the scheduler should
raise an exception.
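
For illustration only, a minimal sketch of what such a context struct could look like (field names and the integer encoding of the flag are assumptions, not necessarily the exact code in this PR):

```crystal
class Fiber
  # Sketch: the arch-specific ASM only needs somewhere to store the
  # stack top and a machine-word flag it can flip once all registers
  # have been saved.
  struct Context
    property stack_top : Void*
    property resumable : LibC::Long # e.g. 1 = context saved, 0 = still running

    def initialize(@stack_top = Pointer(Void).null)
      @resumable = 0
    end
  end
end
```

How the dead state is encoded (another value of the same field, or a separate flag on the fiber) is left open here; the commit message only says the flag is also used to mark the fiber as dead.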
Keeps the pool of free stacks to recycle outside of Fiber itself, and makes it thread-safe using simple thread mutexes. A better algorithm (e.g. nonblocking or flat-combining) could eventually be used to speed up spawning fibers in parallel.
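
As a rough sketch of the idea (method names and the use of a Deque are assumptions; allocating and freeing the actual stack memory are left out):

```crystal
class Fiber
  # Sketch of a thread-safe pool of recyclable stacks, guarded by a
  # plain thread mutex as described above.
  class StackPool
    def initialize
      @deque = Deque(Void*).new
      @mutex = Thread::Mutex.new
    end

    # Pops a previously released stack, or returns nil so the caller
    # allocates a fresh one (e.g. with mmap).
    def checkout? : Void*?
      @mutex.synchronize { @deque.pop? }
    end

    # Hands a finished fiber's stack back so a future fiber can reuse it.
    def release(stack : Void*) : Nil
      @mutex.synchronize { @deque.push(stack) }
    end
  end
end
```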
@Heaven31415
Contributor

@ysbaddaden I have a question; it's a bit unrelated, but very important to consider for MT. Some libraries that are crucial for game and application development (like OpenGL) store thread-local information that is used during their execution, thus forcing you to know which thread you are currently running on.

Will there be a mechanism to lock execution of exactly one fiber to a specific thread (which would allow proper usage of those libraries)?

This problem already happened in golang. You can read about their solution here and here.

@vladfaust
Contributor

I'm a simple man, I see thread safe, you got my thumb up

@waj
Member

waj commented Oct 26, 2018

Do you know how big the impact is within the same thread? Also, when is that spin wait actually meaningful? Why would a running (and still runnable) fiber be rescheduled onto another thread instead of being kept running on the same thread?

I’m probably missing details on the big picture that you have in mind for the MT scheduler and I’d like to understand more about it.

@ysbaddaden
Contributor Author

I'll do more testing, and I can replace the spin wait with a simpler resumable / dead check, which would reveal an erroneous synchronization primitive, such as a wrong assumption in a channel or mutex, or a manual resume somewhere. It's better to detect this than to crash :)

Resumable and the spin lock in the scheduler are there because of MT with stealing schedulers, where thread A enqueues the current fiber and thread B steals it and resumes it... before thread A has saved its context. Oops. Segfault.
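
Roughly, the guard in the resume path amounts to something like this (a sketch; `resumable?` and `dead?` stand for the predicates this PR introduces, exact names assumed):

```crystal
# Sketch: executed by the scheduler right before switching to `fiber`.
private def validate_resumable(fiber : Fiber) : Nil
  until fiber.resumable?
    if fiber.dead?
      raise "BUG: tried to resume a dead fiber: #{fiber.inspect}"
    end
    # Busy-wait: another thread enqueued this fiber but hasn't finished
    # saving its context yet. That window is a few instructions, so
    # spinning is fine; on a single thread this loop never iterates.
  end
end
```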

@ysbaddaden
Contributor Author

Even with a dispatch scheduler that would push enqueued fibers to any thread, there is a time window between the moment we enqueue the current fiber and the moment its context is saved, so there is a risk that another thread tries to resume the fiber before its context is fully saved.

@waj
Member

waj commented Oct 26, 2018

I see, but I wonder: if the current fiber is about to be resumed in another thread, then it's runnable (not waiting for IO, a timer or channel data), so why reschedule it on a different thread in the first place instead of keeping it running on the current one? Mm... maybe there is a Fiber.yield call. Is there any other case?

@ysbaddaden
Contributor Author

Yes, furious yields with more schedulers than available fibers is what led me to crashes, until I understood the race condition and fixed it with a resumable flag.

But still, when creating an IO event in thread A, maybe the IO is immediately ready and the event can be resumed or enqueued quickly from thread B, while thread A got suspended for some reason and didn't have enough time to swap contexts.

The probability of such an event is very low and it should never happen... but it may happen sporadically and lead to weird crashes. Working with threads all summer made me aware that if a race can happen, even with the slightest chance, it will happen :D

Anyway, I'll run some benchmarks when I can to measure the impact on context switches. I still had incredible performance in muco, so I'm not too worried. Still, if it's noticeably slower, I'll make this patch the first of an MT branch!
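
For reference, the kind of stress test that exercises this is just a handful of fibers doing nothing but yielding; under the single-threaded scheduler it is harmless, but with more scheduler threads than fibers it hammers the enqueue/resume window (sketch, counts are arbitrary):

```crystal
done = Channel(Nil).new

4.times do
  spawn do
    100_000.times { Fiber.yield }
    done.send(nil)
  end
end

4.times { done.receive }
```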

@bcardiff
Member

We could use a compile-time flag to apply the proposed lock only when MT is enabled. This would still leave the fiber context refactor and the assembly code setting it, but that shouldn't really matter performance-wise.
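
For example, reopening the StackPool sketched above, its release path could be gated on a compile-time flag (the `:mt` flag name is hypothetical, enabled with something like `crystal build -Dmt`):

```crystal
class Fiber
  class StackPool
    def release(stack : Void*) : Nil
      {% if flag?(:mt) %}
        # Only pay for the mutex when building with MT enabled.
        @mutex.synchronize { @deque.push(stack) }
      {% else %}
        @deque.push(stack)
      {% end %}
    end
  end
end
```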

@ysbaddaden
Contributor Author

I'll use this branch as a basis for MT schedulers.

TBH, this resumable flag and the spin lock checks have been insignificant in muco. They always succeed on a single thread, and provide incredible safety otherwise.
