Introducing Quiescent State-Based Reclamation to Chapel #8182
Conversation
runtime/src/chpl-privatization.c
```c
// Determines current instance. (MUST BE ATOMIC)
atomic_int_least8_t currentInstanceIdx;

chpl_priv_block_t chpl_priv_block_create() {
```
I never remember this, but in C, 0-argument functions should take "void". In fact, I just messed this up yesterday: #8168
Fixed, although I'm not 100% certain I understand the actual need (something about it being possible for someone to pass arguments to a no-argument function and mess with the stack contents... kinda interested in what happens in that case, actually...)
For compatibility with old C, a no-argument prototype actually means "don't assume or check anything about the args", whereas a void signature means "this is a 0-argument function" -- https://stackoverflow.com/questions/693788/is-it-better-to-use-c-void-arguments-void-foovoid-or-not-void-foo
runtime/include/chpl-privatization.h
```c
@@ -27,12 +27,7 @@ void chpl_privatization_init(void);

void chpl_newPrivatizedClass(void*, int64_t);

// Implementation is here for performance: getPrivatizedClass can be called
// frequently, so putting it in a header allows the backend to fully optimize.
```
This was moved in #6212, and it had a pretty significant impact on the performance of prk stencil. We should definitely do some performance analysis of this PR before merging.
Unfortunately, it's going to result in a minor performance regression for chpl_getPrivatizedClass in any case, though hopefully not by too much; on the plus side, both chpl_newPrivatizedClass (when it doesn't need to allocate more space) and chpl_clearPrivatizedClass will now be on par with it... hopefully. This is bleeding-edge stuff, after all.
Overall, I can say that there is one way that would likely counter any regression, but that involves a complete overhaul of privatization as a whole... that'd be an interesting GSoC student project, though :)
@LouisJenkinsCS - I have some feedback for you about this.
May I ask in what way these look like reader-writer locks? While writers do require mutual exclusion, readers are not blocked waiting for a writer to complete, nor by other readers. (Perhaps you could open a code review listing which parts I need to elaborate on?)
RCU is all about atomics for readers, normally in the form of memory barriers (such as the fact that reads must first atomically read the current instance using
I agree with you - it's just that I didn't see anything I recognized as rcu_ terms.
I'm having some trouble figuring out what's going on in liburcu, but how would it compare with what you have done here? What would make us use liburcu instead of this mechanism, or vice versa? How is the performance different? Would liburcu help with your distributed ideas?
Truth be told, the major differences boil down to optimization. Keep in mind that my algorithm was developed Chapel-side, where I had to work around problems with abstraction and the lack of certain features (ahem, task-local storage), and it was ported to C in the span of a single day. My algorithm was also built around maintenance over an entire cluster, while LibURCU's is built around maintenance over a single SMP system. Lastly, my algorithm was devised to solve a single problem (which it did), while LibURCU was meant to be reused anywhere (although my algorithm can, apparently, also be used similarly). Performance-wise, I have not attempted to produce benchmarks between the two (nor had the time to do so), but I'd imagine mine runs at a small fraction of LibURCU's speed on a single node, merely due to optimization. One plus is that my code is significantly smaller and less complex, making it easier to implement in languages that lack certain features (ahem). Finally, LibURCU wouldn't help much for my purposes, in that all I needed was the basic premise (read-side critical sections, wait-for-readers, single-writer, etc.); I've gotten all I could out of that concept.
I don't really follow - are you saying you couldn't use LibURCU for the distributed case? I seem to be confused about something here. Anyway, for the specific matter of this PR - the privatization arrays - I think we'll need a sense of the performance impact of this change in order to decide whether we proceed with it or not.
LibURCU is, AFAIK, for an SMP system and so wouldn't have usage outside of a single node (at least not with the implementation I saw)... but now that I think about it, readers on each node could use LibURCU, and we could elect one writer over the entire cluster to perform a LibURCU write/update on each node.
Naturally, re-using liburcu (or any other single-node RCU implementation) has the advantage that we get better within-node performance (since these implementations are tuned, etc.). Does having an RCU across multiple locales necessarily mean we have to start from scratch? Let's think about that some more.
It depends on the application... for the case of distributed arrays that are both indexable and resizable, no: the issue of recycling memory has been addressed (making resizing possible), and loosening the classification of 'reads' to include writes to returned references is what leads to significant performance improvements. The RCU itself is just used for memory management (again, if Chapel had garbage collection, we wouldn't really need this at all). LibURCU can do the job for that. I guess what I'm trying to say is that the RCU itself should be seen as just a memory management tool.
As well, if the goal here is to implement LibURCU, that'd be a nightmare of a time. It requires Thread-Local Storage, and I mean it, and each implementation makes assumptions about it that could be disastrous (for example, the classic build uses

I understand if this won't be accepted (didn't really expect it to, but sorry to disappoint), but I want to focus on the application it was originally purposed for: Global Atomic Objects (or in this case, distributed indexable resizable arrays).
I did some quick perf testing, and unfortunately it looks like this adds a significant amount of overhead (2000x slowdown for prk-stencil):

```
cd $CHPL_HOME/test/studies/prk/Stencil/optimized/
chpl stencil-opt.chpl --fast --set iterations=3 --set order=8000 --no-local
./stencil-opt
```

For master:

For RCU-Privatization:
Does the benchmark do more read or write operations? I'll investigate it myself, but 2000x slower is higher than I expected; I would have suspected no worse than 10x, unless it's nearly all writes, in which case it'd be expected.
Okay, I did a bit of profiling... I added an atomic counter for read and write operations respectively... the number of reads I'm seeing is really large, like wow.

Privatized Reads: 1214851089, Writes: 0

That's for one iteration... that's 1.2B reads... I see now why you guys had

I think I'd also be interested in helping out with it more.
It should be almost entirely reads and very, very few writes. The part that slows down is https://github.com/chapel-lang/chapel/blob/master/test/studies/prk/Stencil/optimized/stencil-opt.chpl#L151-L164. With param unfolding, I think that will be ~10 calls to getPrivatizedCopy per loop iteration. Even in the fast path for reading, I think your code still does at least 2 atomic operations, which I think is just going to be way too much overhead. Some rough numbers:
Something like #6184 would alleviate the number of times getPrivatizedCopy is called, but I don't think we're going to get to that any time soon (and even if we did, LICM can't always run, so I'm not sure we could pay this kind of cost).
Yeah, as you're seeing, there are a lot of calls to

The code we generate (especially with the param unfolding) starts to get pretty unwieldy, but it's worth noting that our performance is on par with the reference MPI+OpenMP version up to at least 256 locales.
It is actually interesting... there's another hit taken due to the amount of indirection needed (basically from void ** -> void ****), but that's required to make the algorithm work as a whole (having 2 instances, segmenting data into blocks rather than contiguous memory, etc.) and probably has some significant impact too. Also, I didn't know that each index into the array is a call to
I wonder... Do you think that under the FIFO tasking layer, it would be safe to use TLS? I'm thinking of trying it out.
Okay, I have another idea to make this work... What if we disable preemption when you call

I believe threads must be registered before use, but that can be performed during
We don't multiplex tasks for fifo, so using thread local storage should be fine. For qthreads you might be able to use task local storage. Note that chpl_task_getId() does use task local storage for qthreads.

Also note that qthreads does not have preemptive scheduling; qthreads is a cooperative scheduler. If t1 and t2 are scheduled on pthread1, the only way for the tasks to switch is either an explicit call to chpl_task_yield()/qthread_yield() or some higher-level call that will end up calling them.
I'm almost satisfied with the fact that the data structure managed to perform ~30M ops/sec (honestly, I don't think it's possible yet for any data structure to compare to an unprotected read). I have revised it yet again to make use of TLS, and interestingly the benchmark time didn't change much, if at all. I rebuilt the runtime too (and had to correct a few errors here and there), so I know it's running the new version. Right now, RCU readers perform absolutely zero RMW atomic operations (just 2 atomic reads), and do only a volatile write to their own thread-specific node; I maintain a global table (similar to Hazard Pointers) and use a lock-free approach to append a new TLS node to it once (the first time it's used), which allows the writer to see all 'thread-specific' data. The changes I made did yield a better runtime on swan, from 110 seconds to 70 seconds, but that means the issue has to be due to the added indirection inherent in the design of my data structure, and at this point there isn't anything else I can do.
Actually... perhaps there is another thing... the reason for the indirection is so that indexes can count as 'reads' and so that updates from one instance carry into the other. However, if the most important thing is
I believe I have done enough research to say that the type of RCU-like memory reclamation I was performing was epoch-based (without even knowing it), but there is another, more efficient scheme without any actual need for memory barriers, called Quiescent State-Based Reclamation. Unlike the epoch-based approach, where we declare the critical section in which we are making use of memory, here we instead inject 'checkpoints' which declare that we are not using the memory. The only place where this would be appropriate, I believe, is

The significance is that we do not require any memory barriers, but TLS is still required (good thing I've managed to handle this myself). This will actually allow us to place the

Edit: I have another idea... I believe a reader-writer lock for

Man, I really need to write this stuff down in a journal rather than polluting the pull request.
I've done it: there is now zero-overhead RCU implemented based on Quiescent State-Based Reclamation, and it passes both tests/distributions/privatization/* and that 'stencil-ppk' or whatever it's called, within the same amount of time as before. I injected calls for the quiescent state
Quick misc portability notes:
Okay, so I see now that while there is no regression in reads, writes are hit too hard right now (I also think I may be deadlocking on writers right now), and I'm beginning to see that I need to insert checkpoints in more places. @ronawho Since you're on, do you know if there is a particular callback for when a task finishes? It seems that when a task is finished,
There is a callback interface that you can find at

If you just want to play around, you could add calls to the task shims (like you did with chpl_task_yield). See chapel_wrapper in qthreads (or search for the chpl_task_cb_event_kind_end sentinel to look for places where tasks finish).
Now passes all of test/distributions/privatization and aces the stencil-ppk benchmarks. Although I haven't tested for memory leakage, it should be apparent that leakage is impossible, since writers always finish and they always delete the previous instance before doing so. I think this is 100% successful @mppf.
Output of Stencil...
@LouisJenkinsCS - now that we are doing Quiescent State-Based Reclamation, it feels like real RCU to me. What would it take to generalize this into an RCU interface available to the C runtime? Is that possible, so we could use RCU in other places in the C runtime or from Chapel code? Or is this necessarily a one-off solution for some reason? (Note, I haven't dug into the code yet.)
…matically register the main thread
… value is zero or non-zero
…rnings of unused variables
…emented Michael's suggested changes; made adjustments to runtime's makefile so that it will add the appropriate variables without overwriting any of the others.
43c3259 to 50307e7
```chapel
@@ -0,0 +1,16 @@
use PrivatizationWrappers;
```
This test is missing a .good file, could you add it & check that start_test on it passes?
Meant to add a 'notest' file for that, sorry.
The quickstart / fifo configuration seems to cause core dumps, e.g. [Error matching program output for release/examples/hello3-datapar]. Could you have a look?
GASNet (local) testing failed 1 test with an apparent core dump:

Does this program use a lot of memory or something? The core dump happened during a parallel test run, but I'm not reproducing it in 100 trials.
… qthread.h so craycc stops complaining about it
…_test; it was added to profile the average time for privatization
I'm manually running quickstart myself to ensure it's fixed. I've been using GASNet + Qthreads local the entire time, so my guess is that the issue is specific to not clobbering third-party/qthread and rebuilding it? I'll look into it anyway.
Wait, when you said "I can't reproduce it in 100 trials", did you mean it's a race condition?
I don't know what it is, but it's some sort of intermittent failure. It might be that the test will only fail on a loaded system.

(I'll have to deal with it later, working on a paper right now.)
Passed a gasnet testing run twice.

Passed uGNI testing for test/release/example/primers.
Comment out unused get_defer_list in chpl-qsbr.c

It's unused and that causes compilation errors in some configurations. This is a follow-on to PR #8182. Trivial and not reviewed.
I see that this has been reverted already, but FWIW there were some outstanding issues, and ones that should be addressed prior to remerging this. I don't see an open issue for this, but if you have a list somewhere please add:
I introduce Quiescent State-Based Reclamation, a memory reclamation algorithm that can be used from the confines of the runtime and potentially from Chapel user code (with some GOTCHAS). The memory reclamation algorithm comes with very little performance regression and can ensure the eventual cleanup of memory so long as checkpoints are periodically called from all threads (although their placement is up for debate).
QSBR can come in handy for any performance-critical data structure, and it is currently used in Chapel's privatization table. Future uses could include making Chapel's callback system thread-safe, and perhaps more dynamic task-local storage. In the future, a separate QSBR table is planned for users, one that will not be able to interfere with the runtime.
Potential Uses of QSBR:
There are many other potential uses, the sky is the limit.
Reviewed by @mppf, @gbtitus, @ronawho.
Testing: