
Introducing Quiescent State-Based Reclamation to Chapel #8182

Merged: 75 commits merged into chapel-lang:master on Mar 15, 2018

Conversation

@LouisJenkinsCS (Member) commented Jan 11, 2018

I introduce Quiescent State-Based Reclamation (QSBR), a memory reclamation algorithm that can be used from the confines of the runtime and potentially from Chapel user code (with some GOTCHAS). The algorithm comes with very little performance regression and ensures the eventual cleanup of memory so long as checkpoints are periodically called from all threads (although their placement is up for debate).

QSBR can come in handy for any performance-critical data structure, and it is currently used in Chapel's privatization table. Future uses could include making Chapel's callback system thread-safe and perhaps more dynamic task-local storage. In the future, a separate QSBR table is planned for users, which will not be able to interfere with the runtime.

Potential Uses of QSBR:

  1. Non-Blocking Data Structures: It is safe to defer the deletion of anything logically removed from the data structure in question. For example, a lock-free queue that removes a node from the head can safely defer its deletion, as the removing thread is the only one that will do so (only one thread's CAS on the head will succeed). A sketch follows this list.
  2. Multiword Compare-And-Swap: This also applies to anything that uses descriptors as a helping mechanism, such as a K-word compare-and-swap, where an object holding enough state to describe an in-progress operation is used to claim ownership of a location. The descriptor can be accessed by concurrent threads, so it is safe to defer its deletion until after the owning thread/task completes its operation.
  3. Software Transactional Memory: While reads and writes are trivial to implement, a huge problem is the deletion of data during a transaction, and even after committing it. Deleting the object immediately can result in a segmentation fault for tasks that have not yet aborted their transaction, so deferring deletion is a solution, so long as no checkpoint is invoked within a transaction.

There are many other potential uses; the sky is the limit.
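
To make the first use case concrete, here is a minimal sketch of a deferred-deletion pop (a Treiber stack rather than a full queue, for brevity). The chpl_qsbr_defer_deletion and chpl_qsbr_checkpoint names are placeholders for whatever QSBR interface the runtime ends up exposing, not the actual functions in this PR:

#include <stdatomic.h>
#include <stdlib.h>

/* Placeholder QSBR hooks -- illustrative names only. */
extern void chpl_qsbr_defer_deletion(void *p); /* free p once every thread has passed a checkpoint */
extern void chpl_qsbr_checkpoint(void);        /* called periodically elsewhere: "I hold no stale references" */

typedef struct node {
  void *value;
  struct node *next;
} node_t;

static _Atomic(node_t *) top;

/* Pop the head node. Only the thread whose CAS succeeds unlinks the node, so it
 * alone defers the deletion; QSBR guarantees the node is not reclaimed while any
 * thread is between checkpoints, so the head->next read below is safe. */
void *pop(void) {
  node_t *head, *next;
  do {
    head = atomic_load(&top);
    if (head == NULL) return NULL;
    next = head->next;
  } while (!atomic_compare_exchange_weak(&top, &head, next));
  void *value = head->value;
  chpl_qsbr_defer_deletion(head); /* reclaimed only after all threads checkpoint */
  return value;
}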

Reviewed by @mppf, @gbtitus, @ronawho.

Testing:

  • 10000 trials of test/distributions/privatization/runtime/ in the standard configuration
  • full local testing
  • ugni testing for test/release/example/primers
  • full quickstart testing
  • full gasnet testing
  • full gasnet+fifo testing

// Determines current instance. (MUST BE ATOMIC)
atomic_int_least8_t currentInstanceIdx;

chpl_priv_block_t chpl_priv_block_create() {
Contributor

I never remember this, but in C 0 argument functions should take "void". In fact, I just messed this up yesterday: #8168

Member Author

Fixed, although I'm not 100% certain I understand the actual need (something about it being possible for someone to pass arguments to a no-argument function and mess with the stack contents... kinda interested in what happens in that case, actually...)

Contributor

For compatibility with old C, a no-argument prototype actually means "don't assume or check anything about the args", whereas a void signature means "this is a 0-argument function" -- https://stackoverflow.com/questions/693788/is-it-better-to-use-c-void-arguments-void-foovoid-or-not-void-foo
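
A two-line illustration of the difference:

void f();     /* old-style declaration: says nothing about the arguments, so f(1, 2, 3) still compiles */
void g(void); /* prototype: exactly zero arguments, so g(1) is rejected at compile time */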

@@ -27,12 +27,7 @@ void chpl_privatization_init(void);

void chpl_newPrivatizedClass(void*, int64_t);

// Implementation is here for performance: getPrivatizedClass can be called
// frequently, so putting it in a header allows the backend to fully optimize.
Contributor

This was moved in #6212, and it had a pretty significant impact on performance of prk stencil. We should definitely do some performance analysis of this PR before merging.

Member Author

Unfortunately, it's going to result in a minor performance regression for chpl_getPrivatizedClass in any case, though definitely not by too much, and now both chpl_newPrivatizedClass (when it doesn't need to allocate more space) and chpl_clearPrivatizedClass will be on par with it... hopefully. This is bleeding-edge stuff, after all.

Member Author

Overall, I can say that there is one approach that would likely counter any regression, but it involves a complete overhaul of privatization as a whole... that'd be an interesting GSoC student project though :)

@mppf (Member) commented Jan 12, 2018

@LouisJenkinsCS - I have some feedback for you about this.
First, I was expecting you to create some sort of generally useful RCU mechanism. This looks to me more like manually re-written reader-writer locks. I was hoping a generally-useful RCU mechanism could serve as a building block for your future work.
Second, if it makes Chapel benchmarks slower, we'll probably keep the memory leak instead of doing these changes.
If I understand correctly, full-on RCU implementations can avoid the atomic operations on the "read" path. I think in this case the "reads" (get privatized object ID) are so much more frequent than the "writes" (create a new privatized object) that we're likely to care about that.

@LouisJenkinsCS (Member Author)

This looks to me more like manually re-written reader-writer locks.

May I ask in what way these look like reader-writer locks? While writers do require mutual exclusion, readers are neither blocked waiting for a writer to complete nor blocked by other readers. (Perhaps you could open a code review listing which parts I need to elaborate on?)

If I understand correctly, full-on RCU implementations can avoid the atomic operations on the "read" path.

RCU is all about atomics for readers, normally in the form of memory barriers (for example, reads must first atomically read the current instance using rcu_dereference, or must enter a read-side critical section using rcu_read_lock and rcu_read_unlock, as per the official API). My acquireRead and releaseRead serve that role.
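
For reference, the read/update pattern in liburcu that acquireRead/releaseRead mirror looks roughly like this (a sketch against the stock single-node liburcu API, not code from this PR):

#include <urcu.h>   /* default liburcu flavor */
#include <stdlib.h>

struct cfg { int value; };
static struct cfg *global_cfg;

void reader_thread(void) {
  rcu_register_thread();                        /* each reader thread registers once */
  rcu_read_lock();                              /* begin read-side critical section ("acquireRead") */
  struct cfg *c = rcu_dereference(global_cfg);  /* atomically fetch the current instance */
  if (c) { /* ... read fields of c ... */ }
  rcu_read_unlock();                            /* end read-side critical section ("releaseRead") */
  rcu_unregister_thread();
}

void updater(struct cfg *newc) {                /* single writer */
  struct cfg *old = global_cfg;
  rcu_assign_pointer(global_cfg, newc);         /* publish the new instance */
  synchronize_rcu();                            /* wait for all pre-existing readers to finish */
  free(old);                                    /* now safe to reclaim */
}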

@mppf (Member) commented Jan 12, 2018

May I ask in what way these look like reader-writer locks? While writers do require mutual exclusion, readers are neither blocked waiting for a writer to complete nor blocked by other readers. (Perhaps you could open a code review listing which parts I need to elaborate on?)

I agree with you - it's just that I didn't see anything I recognized as rcu_* terms.

RCU is all about atomics for readers, normally in the form of memory barriers (for example, reads must first atomically read the current instance using rcu_dereference, or must enter a read-side critical section using rcu_read_lock and rcu_read_unlock, as per the official API). My acquireRead and releaseRead serve that role.

I'm having some trouble figuring out what's going on in liburcu, but how would it compare with what you have done here? What would make us use liburcu instead of this mechanism, or vice versa? How is the performance different? Would liburcu help with your distributed ideas?

@LouisJenkinsCS (Member Author)

Truth be told, the major differences boil down to optimization. Keep in mind that my algorithm was developed Chapel-side, where I had to work around problems with abstraction and the lack of certain features (ahem, task-local storage), and it was revised to C in the span of a single day. My algorithm was also built around maintenance over an entire cluster, while LibURCU's is built around maintenance of a single SMP system. Lastly, my algorithm was devised to solve a single problem (which it did), while LibURCU was meant to be reused anywhere (although my algorithm can, apparently, also be used similarly).

Performance-wise, I have not attempted to benchmark the two against each other (nor had the time to do so), but I'd imagine mine performs at a small fraction of LibURCU's on a single node, merely due to optimization. One plus is that my code is significantly smaller and less complex, making it easier to implement in languages that lack certain features (ahem). Finally, LibURCU wouldn't help much for my purposes, in that all I needed was the basic premise (read-side critical sections, wait-for-readers, single-writer, etc.); I've gotten all I could out of that concept.

@mppf (Member) commented Jan 12, 2018

Finally, LibURCU wouldn't help much for my purposes, in that all I needed was the basic premise (read-side critical sections, wait-for-readers, single-writer, etc.); I've gotten all I could out of that concept.

I don't really follow - are you saying you couldn't use LibURCU for the distributed case? I seem to be confused about something here.

Anyway, for the specific matter of this PR - the privatization arrays - I think we'll need a sense of the performance impact of this change in order to decide whether to proceed with it.

@LouisJenkinsCS (Member Author)

I don't really follow - are you saying you couldn't use LibURCU for the distributed case?

LibURCU is, AFAIK, for an SMP system and so wouldn't have usage outside of a single node (at least not with the implementation I saw)... but now that I think about it, readers on each node could use LibURCU, and one writer could be elected over the entire cluster to perform a LibURCU write/update on each node.

@mppf (Member) commented Jan 12, 2018

Naturally, re-using liburcu (or any other single-node RCU implementation) has the advantage that we can have better within-node performance (since these implementations are tuned, etc.). Does having an RCU across multiple locales necessarily mean we have to start from scratch? Let's think about that some more.

@LouisJenkinsCS (Member Author)

It depends on the application... for the case of distributed arrays that are both indexable and resizable, no; the issue of recycling memory has been addressed (making resizing possible), and loosening the classification of 'reads' to include writes to returned references is what leads to significant performance improvements. The RCU itself is just used for memory management (again, if Chapel had garbage collection, we wouldn't really need this at all). LibURCU can do the job for that. I guess what I'm trying to say is that the RCU itself should be seen as just a memory management tool.

@LouisJenkinsCS (Member Author)

As well, if the goal here is to incorporate LibURCU, that'd be a nightmare of a time. It requires Thread-Local Storage, and I mean it: each implementation makes assumptions about it that could be disastrous (for example, the classic build uses pthread_getspecific, but in a tasking layer like qthreads, multiplexed tasks on the same thread would be subject to undefined behavior), and I don't have a clue how we're going to ensure this can be incorporated into the tasking layer itself. That's a GSoC student project for sure though.

I understand if this won't be accepted (I didn't really expect it to be, but sorry to disappoint), but I want to focus on the application it was originally intended for: Global Atomic Objects (or in this case, distributed indexable resizable arrays).

@ronawho (Contributor) commented Jan 12, 2018

I did some quick perf testing, and unfortunately it looks like this adds a significant amount of overhead (2000x slowdown for prk-stencil):

cd $CHPL_HOME/test/studies/prk/Stencil/optimized/
chpl stencil-opt.chpl --fast --set iterations=3 --set order=8000 --no-local
./stencil-opt

For master:

Rate (MFlops/s): 56398.417036  Avg time (s): 0.0215393
stencil time = 0.0129623
increment time = 0.00855267
comm time = 2.26667e-05

For RCU-Privatization:

Rate (MFlops/s): 29.297862  Avg time (s): 41.4632
stencil time = 41.4546
increment time = 0.00866033
comm time = 2.6e-05

@LouisJenkinsCS (Member Author)

Does the benchmark do more read or write operations? I'll investigate it myself, but 2000x slower is higher than I expected; I would have suspected no worse than 10x, unless it's nearly all writes, in which case it'd be expected.

@LouisJenkinsCS (Member Author)

Okay, I did a bit of profiling... I added an atomic counter for read and write operations respectively... the number of reads I'm seeing is really large, like wow.

Privatized Reads: 1214851089, Writes: 0

That's for one iteration... that's 1.2B reads... I see now why you guys had chpl_getPrivatizedObject inlined into the header, that's insane. Hm, with that much, it's likely the issue of having too much traffic on the read counter (necessary due to the lack of TLS). In reality, the only additional code over what was in master boils down to 1 Fetch-And-Add, 1 Fetch-And-Sub, and 2 Atomic Reads. The Fetch-And-Add and Fetch-And-Sub would cause a load of bus traffic on the cache line they're on. Definitely would need LibURCU for single-node.

I think I'd also be interested in helping out with it more.

@ronawho (Contributor) commented Jan 13, 2018

It should be almost entirely reads and very very few writes. The part that slows down is https://github.com/chapel-lang/chapel/blob/master/test/studies/prk/Stencil/optimized/stencil-opt.chpl#L151-L164

With param unfolding I think that will be ~10 calls to getPrivatizedCopy per loop iteration. Even in the fast path for reading, I think your code still does at least 2 atomic operations, which I think is just going to be way too much overhead.

Some rough numbers:

  • atomic operations are relatively slow -- 1,000,000 uncontested (serial) atomic adds take ~0.01 seconds (this depends on the machine and other factors, but it's a rough estimate).
  • For this small problem size (8,000 x 8,000 matrix) that's going to be at least ~1,280,000,000 atomic ops (8,000**2 iters * 10 getPrivatizedCopy calls per iter * 2 atomics per getPrivatizedCopy), and these atomic ops won't be serial, they'll be concurrent. So at a minimum that would be ~10 seconds just for the atomic ops, and from the benchmark we see it's more like 40 seconds (the arithmetic is spelled out below).
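
Spelling out that estimate with the numbers above (same rounding):

8,000 * 8,000 iterations               ≈ 64,000,000 iterations
* ~10 getPrivatizedCopy calls per iter ≈ 640,000,000 calls
* 2 atomic ops per call                ≈ 1,280,000,000 atomic ops
/ ~100,000,000 atomic ops per second   ≈ 13 seconds spent in atomic ops alone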

Something like #6184 would reduce the number of times getPrivatizedCopy is called, but I don't think we're going to get to that any time soon (and even if we did, LICM can't always run, so I'm not sure we could pay this kind of cost).

@ronawho (Contributor) commented Jan 13, 2018

Yeah, as you're seeing there are a lot of calls to getPrivatizedCopy for this benchmark, basically one per array index operation, and there's a ton of them. Ideally we would hoist them or something, but that's hard, so for now we took the "easy" way out of having a really fast implementation that can be fully inlined and optimized by the backend.

The code we generate (especially with the param unfolding) starts to get pretty unwieldy, but it's worth noting that our performance is on par with the reference MPI+OpenMP version up to at least 256 locales

@LouisJenkinsCS (Member Author)

It is actually interesting... there's another hit taken due to the amount of indirection needed (basically going from void ** to void ****), but that's required to make the algorithm work as a whole (having 2 instances, segmenting data into blocks rather than contiguous memory, etc.) and probably has some significant impact too.

I also didn't know that each index into the array is a call to getPrivatizedCopy; so many subtle details here and there...

@LouisJenkinsCS (Member Author)

I wonder... Do you think that under the FIFO tasking layer, it would be safe to use TLS? I'm thinking of trying it out.

@LouisJenkinsCS (Member Author)

Okay, I have another idea to make this work...

What if we disable preemption when chpl_getPrivatizedCopy is called? The primary issue I'm seeing is that tasks multiplexed on the same thread would have issues sharing the same TLS, but what if we make it so that only a single task per thread can use any of the privatized runtime calls? (As in, disable preemption before the call to rcu_read_lock and enable it after rcu_read_unlock.) Doing so means that if multiple tasks on a given thread request chpl_getPrivatizedCopy, they just become serial (for that thread), but you can still have parallelism from other threads (plus this only becomes an issue with oversubscription anyway...).

I believe threads must be registered before use, but that can be performed during chpl_privatization_init, right? If so, combined with toggling preemption during privatization calls into the runtime, we might be able to let LibURCU work its magic.

@ronawho (Contributor) commented Jan 13, 2018

We don't multiplex tasks for fifo, so using thread local storage should be fine. For qthreads you might be able to use task local storage. Note that chpl_task_getId() does use task local storage for qthreads

Also note that qthreads does not have preemptive scheduling, qthreads is a cooperative scheduler. If t1 and t2 are scheduled on pthread1, the only way for the tasks to switch is either with an explicit call to chpl_task_yield()/qthread_yield() or some higher level call that will end up calling them.

@LouisJenkinsCS (Member Author)

I'm almost satisfied with the fact that the data structure managed to perform ~30M ops/sec (honestly, I don't think it's possible yet for any data structure to compare to an unprotected read). I have revised it yet again to make use of TLS, and interestingly the benchmark time didn't change much, if at all. I rebuilt the runtime too (and had to correct a few errors here and there), so I know it's running the new version. Right now, RCU readers perform absolutely zero RMW atomic operations (just 2 atomic reads) and do a volatile write to their own thread-specific node; I maintain a global table (similar to Hazard Pointers) and use a lock-free approach to append a new TLS node to it once (the first time it is used), which allows the writer to see all of the 'thread-specific' data.

The changes I made did yield a better runtime on swan, from 110 seconds to 70 seconds, which means the remaining issue has to be the added indirection inherent in the design of my data structure, but at this point there isn't much else I can do.
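
Roughly, the reader path I'm describing looks like this (illustrative names and layout, not the actual runtime code; real code needs more careful memory ordering than this sketch shows):

#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* One node per thread; the writer walks the global list to observe every reader. */
typedef struct reader_node {
  _Atomic int_least8_t in_read;       /* written only by the owning thread, no RMW needed */
  struct reader_node *next;
} reader_node_t;

static _Atomic(reader_node_t *) reader_list;    /* global registry, grow-only, lock-free prepend */
static _Atomic int_least8_t currentInstanceIdx; /* which of the two instances is live */
static _Atomic(void **) instances[2];           /* the two privatization table instances */
static __thread reader_node_t *my_node;         /* registered lazily, once per thread */

static void register_reader(void) {
  my_node = calloc(1, sizeof(*my_node));
  reader_node_t *head;
  do {                                           /* lock-free prepend, done once per thread */
    head = atomic_load(&reader_list);
    my_node->next = head;
  } while (!atomic_compare_exchange_weak(&reader_list, &head, my_node));
}

void *get_privatized(int64_t pid) {
  if (my_node == NULL) register_reader();
  atomic_store_explicit(&my_node->in_read, 1, memory_order_relaxed); /* announce the read */
  int_least8_t idx = atomic_load(&currentInstanceIdx);               /* atomic read #1 */
  void **table = atomic_load(&instances[idx]);                       /* atomic read #2 */
  void *obj = table[pid];
  atomic_store_explicit(&my_node->in_read, 0, memory_order_relaxed); /* done reading */
  return obj;
}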

@LouisJenkinsCS (Member Author)

Actually... perhaps there is another thing... the reason for the indirection is so that indexing can count as a 'read' and so that updates to one instance carry into the other. However, if the most important thing is chpl_getPrivatizedCopy, then I can make it so that all of the other methods perform the extremely heavy write operations. If I do this, I can go back to something similar to how it was before, with just a single void ** array. I can probably look into epoch-based or snapshot-based memory reclamation.

@LouisJenkinsCS (Member Author)

LouisJenkinsCS commented Jan 14, 2018

I believe I have done enough research to say that the type of RCU-like memory reclamation I was performing was epoch-based (without even knowing it), but there is another, more efficient one with no actual need for memory barriers, called Quiescent State-Based Reclamation. Unlike the epoch-based scheme, where we declare the critical section in which we are making use of memory, here we instead inject 'checkpoints' which declare that we are not using the memory. The only place where this would be appropriate, I believe, is chpl_task_yield, or whichever handler handles preemption (so long as preemption never occurs inside chpl-privatization.c).

The significance is that we do not require any memory barriers, but TLS is still required (good thing I've managed to handle this myself). This will actually allow us to place the extern chpl_privatizedObjects and chpl_getPrivatizedCopy back into the header file with zero overhead. This is actually interesting, in that the QSBR reclamation strategy can work for any application where we can call that 'checkpoint' repeatedly, so while its usage is limited, it fits our needs perfectly. I'll see if I can get this done by tomorrow.
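
A minimal sketch of what I mean by 'checkpoints', with made-up names (the real thing would live in something like chpl-qsbr.c and be hooked into chpl_task_yield; per-thread registration is the same lock-free-list idea as before, and the actual change defers deletions to a list rather than blocking the writer like this sketch does):

#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <sched.h>

/* Per-thread record of the last global epoch this thread has observed. */
typedef struct qsbr_thread {
  _Atomic uint64_t observed_epoch;
  struct qsbr_thread *next;
} qsbr_thread_t;

static qsbr_thread_t *qsbr_threads;        /* all registered threads (registration omitted) */
static _Atomic uint64_t global_epoch;
static __thread qsbr_thread_t *qsbr_self;  /* this thread's node */

/* Called from chpl_task_yield (or similar): "I hold no references obtained before
 * this point."  This store is the entire read-side cost of the scheme. */
void qsbr_checkpoint(void) {
  atomic_store(&qsbr_self->observed_epoch, atomic_load(&global_epoch));
}

/* Writer: after unlinking old_instance, wait until every registered thread has
 * passed a checkpoint in the new epoch; only then is the old instance unreachable. */
void qsbr_retire(void *old_instance) {
  uint64_t target = atomic_fetch_add(&global_epoch, 1) + 1;
  for (qsbr_thread_t *t = qsbr_threads; t != NULL; t = t->next)
    while (atomic_load(&t->observed_epoch) < target)
      sched_yield();                       /* or chpl_task_yield inside the runtime */
  free(old_instance);
}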

Edit:

I have another idea... I believe a reader-writer lock for chpl_newPrivatizedCopy and chpl_clearPrivatizedCopy is in order... chpl_getPrivatizedCopy can still use zero-overhead Quiescent State-Based Reclamation, where we inject a call to some 'checkpoint' to update the current epoch, but for the other two we can easily allow concurrent writes to the current snapshot of the chpl_privatizedObjects array, so we can have that be the 'read' portion of the reader-writer lock and have resizing be part of the 'write' portion. This way, reads have zero overhead and privatization only pays a performance toll during the rare times of resizing.

Man, I really need to write this stuff down in a journal rather than polluting the pull request.

@LouisJenkinsCS (Member Author)

@mppf

I've done it: there is now zero-overhead RCU implemented based on Quiescent State-Based Reclamation, and it passes both test/distributions/privatization/* and that 'stencil-prk' benchmark (or whatever it's called), within the same amount of time as before, after injecting the quiescent-state checkpoint calls:

Parallel Research Kernels Version 2.17
Serial stencil execution on 2D grid
Grid size            = 8000
Radius of stencil    = 2
Type of stencil      = star
Data type            = real(64)
Number of iterations = 3
Distribution         = Stencil
Solution validates
Rate (MFlops/s): 8326.713990  Avg time (s): 0.14589
stencil time = 0.0893613
increment time = 0.0565093
comm time = 1.6e-05

@ronawho (Contributor) commented Jan 15, 2018

Quick misc portability notes:

  • Avoid using __sync primitives directly, and instead use the chpl-atomics wrappers.
  • Can you use runtime/include/chpl-thread-local-storage.h instead of pthread_getspecific() and friends?
  • Try to avoid direct calls to chpl_malloc and friends, and instead use the chpl_mem_allocMany wrappers (which hook into our memory tracking interface)

@LouisJenkinsCS (Member Author)

Okay, so I see now that while there is no regression in reads, writes are hit too hard right now (also, I think I may be deadlocking on writers), and I'm beginning to see that I need to insert checkpoints in more places.

@ronawho Since you're on, do you know if there is a particular callback for when a task finishes? It seems that when a task finishes, chpl_task_yield is not called (which makes sense), so my checkpoint isn't either, meaning the writer is blocked waiting forever. In fact, I see now that the only way I can make this work is to register threads that have at least one task, and unregister threads without any, so we don't wait for them.

@ronawho (Contributor) commented Jan 15, 2018

There is a callback interface that you can find at runtime/include/chpl-tasks-callbacks.h, but note that this is mostly intended for debug or profiling tools. We've optimized the 0-callback case, but there will probably be a non-trivial amount of overhead added to task create/begin/end if any callbacks are registered, so it's probably not appropriate for something that will be used for "fast" code.

If you just want to play around you could add calls to the task shims (like you did with chpl_task_yield). See chapel_wrapper in qthreads (or search for the chpl_task_cb_event_kind_end sentinel to look for places where tasks finish)

@LouisJenkinsCS (Member Author)

Now passes all of test/distributions/privatization and aces the stencil-prk benchmark. Although I haven't tested memory leakage, it should be apparent that leakage is impossible, since writers always finish and they always delete the previous instance before doing so. I think this is 100% successful, @mppf.

@LouisJenkinsCS (Member Author)

Output of Stencil...

Parallel Research Kernels Version 2.17
Serial stencil execution on 2D grid
Grid size            = 8000
Radius of stencil    = 2
Type of stencil      = star
Data type            = real(64)
Number of iterations = 1
Distribution         = Stencil
Solution validates
Rate (MFlops/s): 15427.595586  Avg time (s): 0.078741
stencil time = 0.059808
increment time = 0.018887
comm time = 4.3e-05

@mppf (Member) commented Jan 15, 2018

@LouisJenkinsCS - now if we are doing Quiescent State-Based Reclamation, it feels like real RCU to me. What would it take to generalize this into an RCU interface available to the C runtime? Is that possible, so we could use RCU in other places in the C runtime or from Chapel code? Or is this necessarily a one-off solution for some reason? (Note, I haven't dug into the code yet.)

@@ -0,0 +1,16 @@
use PrivatizationWrappers;
Member

This test is missing a .good file, could you add it & check that start_test on it passes?

Member Author

Meant to add a 'notest' file for that, sorry.

@mppf (Member) commented Mar 13, 2018

quickstart / fifo configuration seems to cause core dumps, e.g.

[Error matching program output for release/examples/hello3-datapar]
[Error matching program output for release/examples/hello4-datapar-dist]

Could you have a look?

@mppf (Member) commented Mar 13, 2018

GASNet (local) testing failed 1 test with an apparent core dump:
[Error matching program output for memory/qsbr/serial_deferDeletion]

Does this program use a lot of memory or something? The core dump happened during a parallel test run but I'm not reproducing it in 100 trials.

@LouisJenkinsCS (Member Author)

I'm manually running quickstart myself to ensure it's fixed. I've been using GASNet + Qthreads (local) the entire time, so my guess is that the issue is specific to not clobbering third-party/qthread and rebuilding it? I'll look into it anyway.

@LouisJenkinsCS (Member Author)

Wait, when you said "I can't reproduce it in 100 trials", do you mean it's a race condition?

@mppf (Member) commented Mar 14, 2018

Wait, when you said "I can't reproduce it in 100 trials", do you mean it's a race condition?

I don't know what it is, but it's some sort of intermittent failure. It might be that the test will only fail on a loaded system.

@LouisJenkinsCS (Member Author)

(I'll have to deal with it later, working on paper right now)

@mppf (Member) commented Mar 14, 2018

Passed a gasnet testing run twice.

@LouisJenkinsCS (Member Author)

Passed uGNI test for test/release/example/primers

mppf merged commit 2adbbb4 into chapel-lang:master on Mar 15, 2018
mppf added a commit that referenced this pull request Mar 15, 2018
Comment out unused get_defer_list in chpl-qsbr.c

It's unused and that causes compilation errors in some configurations.

This is a follow-on to PR #8182.

Trivial and not reviewed.
mppf added a commit to mppf/chapel that referenced this pull request Mar 16, 2018
…rivatization"

This reverts commit 2adbbb4, reversing changes made to 1b8a7a7.
mppf mentioned this pull request Mar 16, 2018
mppf added a commit that referenced this pull request Mar 16, 2018
Revert QSBR PR #8182 

This PR reverts the QSBR PR #8182. The QSBR work is valuable but it is not ready yet to be on the master branch. Once the issues noted in this PR are addressed, QSBR can be added back in again.

Passed full local testing.
LouisJenkinsCS added a commit to LouisJenkinsCS/chapel that referenced this pull request Mar 16, 2018
@ronawho (Contributor) commented Mar 17, 2018

I see that this has been reverted already, but FWIW there were some outstanding issues that should be addressed prior to re-merging this. I don't see an open issue for this, but if you have a list somewhere, please add:

  • We should contribute the qthreads-core changes upstream. There might be a few more schedulers not in the release, and we'd need to come up with some sort of standalone tests for this new feature (and I'd probably want to spend a little time looking at your changes). Note that I can help with contributing upstream
  • At least to experiment, I would try just using __thread, and later we can figure out how to use the portable CHPL_TLS wrappers.

ben-albrecht pushed a commit to ben-albrecht/chapel that referenced this pull request Mar 23, 2018
…rivatization"

This reverts commit 2adbbb4, reversing changes made to 1b8a7a7.