Support arbitrary numbers of threads for memory allocation. #1739
Conversation
Hmm... I'm not sure why Windows suddenly can't find ::Tls*.
Checking if this is a problem with the CI - local Linux builds show no errors so far.
I wonder if the Windows build issue comes simply from using a C++ idiom in a plain C file (which would also imply the problem was in your earlier code as well, just masked by the default use of compiler TLS).
Aha, you're probably right. I've removed the ::; let's see what happens. In the meantime I can reproduce the test failures and am actively debugging.
Yep, that was it. I've fixed the tests, too. The failures are make timeouts (45 minutes). Is there a way I can restart them?
Not sure if you as the submitter already have the right to restart individual jobs, or whether you would need to set up a free account with travis. (If you can restart jobs, you should see a symbol with two concentric circles to the right of the time column in the travis jobs table; clicking it restarts the job.) I have now restarted the three jobs in question, but the issue with the timeouts could only be resolved by changing the limit in the .travis.yml file. However, these three jobs used to take around twenty minutes like the others, so "make timeout" here probably means a lockup in the tests. (We will see...)
I've tried all three locally and everything was fine... not sure what's happening with travis.
I cannot reproduce this either. The Travis status blog has a few recent entries about jobs that require "sudo: true" intermittently failing or hanging. It is just worrying that these three jobs should fail repeatedly...
Yeah, it is. Is there any way to get more verbose output of those jobs?
I could remove the settings for "quiet make" from the .travis.yml, but ironically travis used to abort jobs for creating too much output without it.
Trying some things in my fork now. But without "QUIET", travis tends to kill the job as it exceeds 4MB of output (mostly from warnings in the LAPACK part)...
Travis also kills the absolutely silent jobs...
Turns out it is the fork safety utest that is hanging.
The Linux timeout command can do it; there is no POSIX equivalent, but the particular test is Linux-only...
@brada4 this does not answer the question why the fork test fails with this change.
Just to fix up CI
The hang is reproducible with (a vm running) the same version of Ubuntu locally, so probably related to either glibc 2.15 or gcc 4.6.3. gdb shows one thread in __libc_thread_freeres() called from clone() via start_thread, with the other thread in pthread_join called from blas_thread_shutdown at line 982 of blas_server.c. Not sure yet why this would happen only after this change, when the only relevant difference appears to be in blas_memory_cleanup that happens after the call to blas_thread_shutdown.
helgrind sees a race between calls to blas_memory_alloc from memory.c:1071 (reading with no lock held, called from dgemm, gemm.c:403) and memory.c:1097 (writing with a lock held), and another between get_memory_table at memory.c:489 and pthread_key_create called from memory.c:1031
It seems access to the newly introduced local_storage_key variable is racy, but adding another lock to protect it does not help.
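For context on the kind of race being described here, the fragment below is a minimal sketch of one conventional way to make one-time key creation race-free, using pthread_once. The names local_storage_key and get_memory_table come from the comments above; blas_memory_destructor and the table size are illustrative placeholders, and this is not the actual memory.c code.

```c
/* Minimal sketch (not the actual memory.c code): race-free one-time
 * creation of the thread-local storage key using pthread_once.
 * local_storage_key and get_memory_table are names from the discussion;
 * blas_memory_destructor and the table size are illustrative. */
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t  local_storage_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

/* Runs in each exiting thread that has a non-NULL value stored for the key. */
static void blas_memory_destructor(void *table) {
  free(table);
}

static void make_key(void) {
  pthread_key_create(&local_storage_key, blas_memory_destructor);
}

static void *get_memory_table(void) {
  /* pthread_once guarantees make_key runs exactly once even if many
   * threads enter here concurrently, so reading the key afterwards
   * involves no data race on its creation. */
  pthread_once(&key_once, make_key);

  void *table = pthread_getspecific(local_storage_key);
  if (table == NULL) {
    table = calloc(1, 4096);                 /* placeholder table size */
    pthread_setspecific(local_storage_key, table);
  }
  return table;
}
```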
I believe I fixed the race conditions in the "mem" branch of my fork (at the expense of another mutex lock), but the issue with travis looks like a glibc < 2.23 bug. If I move the CI to Ubuntu xenial, the utest passes.
I've added a tiny bit of locking that ought to resolve things... we'll see. I suppose it depends on what the issue with glibc is. The change in logic here might be enough to avoid the issue altogether (crosses fingers).
My impression was that it was somehow related to using malloc in the threads, but I do not have the bug ids handy now. We'll see what travis says now, and I'll also give it a spin with the ubuntu vm later when I have time.
Does not look good unfortunately; I see local segfaults (at memory.c:1128) even on relatively current systems. I must admit I am beginning to think we should revert to the 0.3.0 memory.c for 0.3.3 to give users a stable release to work with, and reintroduce the rewrite when it is stabilized and tested.
@martin-frbg what tests are you running?
Normal build (on opensuse leap 42.2, gcc-4.8.5, glibc 2.22) was enough to trigger a segfault in the sblat2 test with your latest version...
All I have is 2.24 but I can't reproduce any crashes. Everything seems to run fine, which makes all of this more complicated. I'm trying to see if there's a good way for me to get something where I can reproduce this, but if you have any ideas, I'd be grateful.
Virtualbox with a bunch of VMs from osboxes.org or similar? Also running
BTW the sblat2 test crashes with alloc_table=0x0 (making alloc_table[0] an illegal access) in line 1128 of memory.c
Seems to me the crucial problem is that your revised code is missing a call to blas_tls_init() at the start of blas_memory_alloc() (or the
Actually it gets called from blas_memory_init, from blas_memory_alloc. And somehow that line is gone. I'll add it back in.
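As a rough illustration of the fix being discussed, the sketch below shows a defensive entry path for blas_memory_alloc that (a) performs the initialization before the per-thread table is touched and (b) never indexes alloc_table when the table pointer itself is NULL. blas_memory_init, blas_memory_alloc, get_memory_table and alloc_table are names taken from the thread; the declarations and the body are simplified stand-ins, not the real memory.c.

```c
#include <stddef.h>

/* Illustrative declarations; the real ones live in memory.c. */
void  blas_memory_init(void);   /* idempotent TLS/key setup */
void *get_memory_table(void);   /* returns this thread's allocation table */

void *blas_memory_alloc(int procpos) {
  void *alloc_table;

  /* The missing line discussed above: make sure thread-local storage is
   * initialized before any access to the per-thread allocation table. */
  blas_memory_init();

  alloc_table = get_memory_table();
  if (alloc_table == NULL)        /* guard against the table pointer itself being NULL */
    return NULL;

  /* ... scan the table for a free slot and map memory for the caller ... */
  (void)procpos;
  return NULL;                    /* placeholder for the real allocation logic */
}
```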
Branch updated from e9ced31 to 0b0177b.
Seems the timeouts/lockups in the "fork" utest are still there on older systems. My quick test (with the Ubuntu CI setup updated to avoid the perceived glibc "fork test" problem) passed everything except the mingw cross-build. (Not sure yet what happened there, could be the same glibc problem). Still not quite happy with putting this in 0.3.3; at least the previous version did not lock up on older systems.
Hmm... I have to admit it is rather frustrating that I can run something locally and have everything pass and run fine (no segfaults), and end up with failures on Travis. I don't envy the huge portability aspect of pulling this project together. I do wish that the logs at least would have more details in the case of failure (it seems like only one job gives verbose output).
I have hacked my .travis.yml into your branch now, but the problem with the old Ubuntu "Precise" version will probably need to be debugged in a local installation. I believe the earlier problems with segfaults were a direct consequence of invoking undefined behaviour (by omitting the TLS key initialization in that one code path), so they did not affect all Linux flavors in the CI and may not have occurred on whatever you use for local testing.
Checking "my" version locally, I "only" get the fork hang with the old Ubuntu systems, but not the segfaults of "your" version. Note the version in my fork has added safeguards against alloc_table itself being NULL, where your original only checked alloc_table[position]...
I have now put my modified version of the code into #1742, which would again make the original memory.c the default unless compiled with -DUSE_TLS (and on a system with a sufficiently capable libc).
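To make that opt-in arrangement concrete, here is a hypothetical sketch of a build-time switch of the kind described: one code path is compiled in only when USE_TLS is defined (for example via -DUSE_TLS), and the original path is used otherwise. blas_memory_backend and the strings are purely illustrative; only the preprocessor mechanism reflects what is described in the comment.

```c
/* Hypothetical sketch of a -DUSE_TLS build-time switch; the function and
 * the strings are illustrative, only the selection mechanism matters. */
#include <stdio.h>

#if defined(USE_TLS)
/* New path: per-thread allocation tables cleaned up via TLS destructors. */
static const char *blas_memory_backend(void) { return "thread-local tables"; }
#else
/* Default path: the original single allocation table protected by a lock. */
static const char *blas_memory_backend(void) { return "single locked table"; }
#endif

int main(void) {
  /* cc demo.c           && ./a.out -> "single locked table"
   * cc -DUSE_TLS demo.c && ./a.out -> "thread-local tables"  */
  printf("allocator backend: %s\n", blas_memory_backend());
  return 0;
}
```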
When I originally refactored memory.c to reduce locking, I made the (incorrect) assumption that all threads were managed by OpenBLAS. The recent Issues we've seen (#1735) show that really, any caller can make its own threads and call into OpenBLAS; they don't all come from blas_server. Thus we have to be able to support an arbitrary number of threads that can come in at any time. The original implementation (before my changes) dealt with this by having a single allocation table, and everyone had to lock to get access to and update it, which was expensive. Moving to thread-local allocation tables was much faster, but now we have to deal with the fact that thread-local storage might not be cleaned up.
This change gives each thread its own local allocation table, and completely does away with the global table. We clean up allocations using pthreads' key destructor and Win32's DLL_THREAD_DETACH.
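As a rough sketch of those two cleanup hooks (not the actual implementation), the fragment below registers a pthread key destructor on POSIX systems and frees per-thread state from DllMain on DLL_THREAD_DETACH on Windows. blas_tls_init, blas_memory_cleanup and local_storage_key follow the names used earlier in the thread; the bodies are placeholders.

```c
/* Sketch of the two cleanup mechanisms named above; bodies are placeholders. */
#ifdef _WIN32
#include <windows.h>

/* Win32: the loader calls DllMain with DLL_THREAD_DETACH for every thread
 * that exits, which is where that thread's allocation table can be freed. */
BOOL WINAPI DllMain(HINSTANCE instance, DWORD reason, LPVOID reserved) {
  if (reason == DLL_THREAD_DETACH) {
    /* free the exiting thread's allocation table here */
  }
  return TRUE;
}

#else
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t local_storage_key;

/* POSIX: the destructor registered with pthread_key_create runs in each
 * exiting thread that stored a non-NULL value for the key. */
static void blas_memory_cleanup(void *table) {
  free(table);
}

void blas_tls_init(void) {
  pthread_key_create(&local_storage_key, blas_memory_cleanup);
}
#endif
```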
This change also removes compiler TLS, which, in the end, wasn't really worth it given the issues with the glibc implementation (#1720). The overall performance impact was < 1%, anyway. Removing it also simplifies the code.
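"Compiler TLS" here refers to variables the compiler and runtime make per-thread automatically, as opposed to the explicit pthread-key/Win32 calls sketched above. A one-line illustration (the variable name is made up):

```c
/* GCC/Clang compiler TLS; C11 spells it _Thread_local.  This is the kind of
 * declaration the change removes in favour of explicit TLS key calls. */
static __thread void *per_thread_alloc_table;   /* illustrative name */
```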