-
Notifications
You must be signed in to change notification settings - Fork 6.8k
libc getenv is not threadsafe #13438
Comments
@samskalicky Thanks for identifying the thread safety issue |
@mxnet-label-bot add [Backend, Bug] |
FYI @KellenSunderland and @larroy |
@samskalicky I don't think the problem is running different processes, one process shouldn't affect the other in terms of environment and per separation of the process space. The problem is using getenv / setenv from the same process in different threads, as is the case from the code you linked. |
Could you decode the stack trace via c++filt? |
Looks like the evidence points to where we suspected, initialize.cc We should rework that code. |
|
What if we set a process wide mutex here: https://github.com/dmlc/dmlc-core/blob/600bc28bc476a1b7866cfe9f444c91b6d49760b7/include/dmlc/parameter.h#L1062 Should fix the issue right? No deadlocks? We'd have to make sure the mutex wasn't threadpool scoped as it's getting set in a new threadpool |
as you said, what about libraries like openmp which might access the environment variable without a lock?. We could try monkey patching getenv with our own wrapper at link time or runtime. Best is not to use setenv after we have multiple threads running. |
I don't think we need any mutex. We can just pass any config to the thread as data if we need. Definitely we shouldn't be using the environment to config workers. |
In an attempt to reproduce the issue (assuming its coming from simultaneous processes initializing at the same time) I created a cpp engine unit test by modifying tests/cpp/engine/threaded_engine_test.cc and adding the following two functions:
The handler attempts to capture the segfault and print the stack trace, and the envsegfault test creates a threaded engine object and repeatedly stops and starts it to trigger the SetEnv call in initialize.cc and GetEnv call in threaded_engine_perdevice.cc. Then I wrote this python script to run multiple independent processes doing this:
I was able to trigger a segfault, but I hadnt added the handler yet to print the stack trace. I need to run again and try to reproduce again. |
Do you think the multiple process have any impact? Env is not shared across processes. |
process != thread. In this case I think you mean thread. Other than that I agree with your analysis. I think we need to understand if not using this variable as in #13472 makes unwanted threads to be spawn inside the engine workers, because it might actually not be the case. Depending on that, we need to see if we need to use a different mechanism in the engine to control OMP threads, then run performance regression tests. As you see from my PR, not changing the environment is functionally correct, but I don't know if it has performance impact because of the reason explained above. |
I would suggest using an atomic counter in pthread_atfork that is incremented whenever we create a thread. In this way see how many threads we are creating and printing it out and comparing two runs with, and without the variables that I removed to see if there's any difference, that would be a quick check. |
Heres the relevant code from the version of libc (2.17) that was being used in the failure: It looks like the issue is the __environ variable being iterated over in lines 15, 32, 55. This variable is defined in setenv.c:
So there is this nasty global pointer to the environment. While setenv has a lock around changing the environment, theres no lock for getenv so (as mentioned in the block post linked in the description) :
So I think @KellenSunderland's suggestion of adding a lock/mutex in dmls parameter.h's GetEnv and SetEnv should prevent any failure due to MXNet's various threads accessing the environment simultaneously. Although without a reproducible test case we're unable to validate this. Anybody have any ideas for reproducing something similar to this code from the blog post with mxnet's engine tests so that we have a reproducible test case?
I tried simplifying the start_stop test https://github.com/apache/incubator-mxnet/blob/master/tests/cpp/engine/threaded_engine_test.cc#L124 for just the threaded_engine_perdevice and stop/started it a few hundred times and it didnt cause a segfault. |
We should not use locking in the handlers in pthread_atfork, it will likely have negative impacts in terms of lock contention and performance and architecturally is not a good solution. We should find a solution using good architecture to pass the number of threads in the engine. threads get a data pointer where you can put configuration or pass any other data at creation time. |
@larroy We were already using locks inside the handlers. setenv uses a lock in libc around modifying the environment. This doesnt change anything. Are you suggesting that we dont use an EnvVar to pass the OMP_NUM_THREADS=1 to subsequent processes? What do you think the impact will be the other processes that are using MXNet using more threads? Was there a conscious decision to use an EnvVar to limit OMP_NUM_THREADS in subsequent processes? |
The problem as provided in the article linked in this issue and related article here: https://rachelbythebay.com/w/2017/01/30/env/ is that if the main thread spawns another thread, which calls setenv and while we call setenv the process is forked, the mutex is currently in locked state in the child process and it will never be unlocked since there is no thread to release the lock which causes it to hang.
Run the following script:
Now run the script, you will be able to see the process hangs.
which means it is stuck trying to acquire the lock: https://github.com/lattera/glibc/blob/master/stdlib/setenv.c#L133 I checked the mxnet codebase to see if we are calling SetEnv anywhere else and we dont seem to be calling it anywhere except here. Also, pthread_at_fork statement calls |
Good analysis @anirudh2290 agree with it and makes sense. I would suggest we don't use environment for this and have a propper configuration object that we can pass when needed, or in the worst case, can be a singleton, what do you think? |
@larroy Can you elaborate more on the singleton solution . For example even if its a singleton its still possible that we may fork during setenv call inside a thread and run across the same issue. Removing env variables and changing it to config would be a breaking change. I dont think we can do it before 2.0. |
@anirudh2290 good one! in prepare method, we should try to set bogus env variable, which will unlock the lock and then child will get that lock in unlocked state. |
Thanks for the suggestion @Vikas89 . I think with your suggested change, it may be less likely to be encountered, but it is possible that mutex is in locked state in child process, if the thread calls setenv after the prepare step and before fork. |
If there is only 2 places that too after fork where we setEnv(initialize.cc), why are we setting env ? Can we just keep these values in memory or in some file tied to this mxnet processID ? |
I was able to run the user's test script and reproduce in GDB. Looks like its failing when calling getenv("MXNET_EXEC_BULK_EXEC_INFERENCE"). This only seems to happen here: This matches the original stack trace where we found it failing in GraphExecutor::InitOpSegs |
So far, we've been able to verify that compiling mxnet without openmp does not trigger the segfault. This validates that the problem is between one thread in mxnet running openmp and another thread in mxnet calling getenv. Thus the current workaround (to avoid the segfault) is to recompile mxnet without openmp. This will cause performance degredation for any operators that used to run using openmp (ie. those that are not using BLAS/LAPACK routines or MKLDNN). |
@Vikas89 and @anirudh2290 have discussed a couple of options for implementing a permanent solution:
None of these seem like the best options, any other comments or thoughts? |
I think to arrive at the right solution need to understand more on the openmp behavior and why it calls the library initialize within an operator execution when an omp for loop is called. Need to also confirm if this happens only during forking but also outside forking and what specific openmp versions the customer has and if this can be reproduced on other openmp versions. |
Good analysis. Has anyone checked to see this problem is reproducible with all versions of OMP? I'm wondering if another workaround that may not have a perf impact would be to use a different implementation than what is used by default. |
Regarding option 1. I actually think this is a good design, but several tests are currently relying on the fact that we can change behaviour via env var changes during runtime. However, we could probably fix this be reinit'ing the library whenever we set an env var in a test. I think we should avoid reading env vars at runtime in any case for performance reasons, so it makes sense to read these once at startup and not again after. |
I would go for option 1. |
Thanks @larroy @KellenSunderland @anirudh2290, just found out that the problem was due to using Intel OpenMP rather than the LLVM OpenMP submodule. Upon switching back to using LLVM OpenMP the problem did not occur. I think the problem is effectively avoided since we're not advocating for using Intel OpenMP. @TaoLv @pengzhao-intel any interest in seeing MXNet switch to Intel OpenMP and investigating this issue? Are there any perceived benefits? |
thanks @samskalicky It's a long thread, we need to warm up and understand the background :) @TaoLv could you help take a look for the questions? |
@samskalicky Can you explain more about how to use Intel OMP in MXNet and how to switch back to LLVM OMP? I think LLVM OMP is using the same runtime as Intel OMP. |
@TaoLv according to this page: https://software.intel.com/en-us/forums/intel-c-compiler/topic/793552 they should be the same but may differ slightly as they have different release schedules. So this is a possibility that the functionality differs between a particular versions of LLVM/Intel OMP. As for how to use it, I didnt do the building myself but I imagine that those who did simply modified the makefile to link against the Intel OMP binary rather than building the LLVM OMP submodule. Unfortunately I dont have step-by-step instructions for you. I was just hoping to see what your thoughts were on the current way MXNet is architected in regards to OMP and if what you thought could be improved (possibly by using Intel OMP). And if you had any interest in seeing MXNet switch to Intel OMP and what any benefits might be. Of course we'd still need to maintain a build without Intel OMP for other architectures. |
@samskalicky Yes. I would expect ICC + Intel OMP should have better performance and I know there is an effort ongoing for enabling them for MXNet. Currently MXNet cannot be successfully built with ICC. |
…cal concurrency crashes. (apache#15762) * Refactor LibraryInitializer so it's thread safe. Fixes apache#13438 Fixes apache#14979 * Refactor around lib loading * Fix lint * CR * Add option to choose between OMP implementations * Fix bug * Fix from CR
Description
getenv() calls in libc are not threadsafe according to:
https://rachelbythebay.com/w/2017/01/30/env/
and
OpenMathLib/OpenBLAS#716
There are indirect calls to dmlc::GetEnv() all across the mxnet codebase, here are a few:
https://github.com/apache/incubator-mxnet/blob/266de6bef4da5769431557288d41fab2a02e52ca/src/engine/threaded_engine_perdevice.cc#L79
or
https://github.com/apache/incubator-mxnet/blob/5a83b6b563211f430688e41eab4752c6de4ecf22/src/executor/graph_executor.cc#L1194
Error Message:
/lib64/libc.so.6(+0x35250) [0x7fdc5b99b250]
/lib64/libc.so.6(getenv+0xad) [0x7fdc5b99e0cd]
/opt/amazon/lib/libmxnet.so(ZN4dmlc6GetEnvIbEET_PKcS1+0x1b) [0x7fdc4a1bedab]
/opt/amazon/lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor10InitOpSegsEv+0x1cb) [0x7fdc4a1b6e0b]
/opt/amazon/lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor15FinishInitGraphEN4nnvm6SymbolENS2_5GraphEPNS_8ExecutorERKSt13unordered_mapINS2_9NodeEntryENS_7NDArrayENS2_13NodeEntryHashENS2_14NodeEntryEqualESaISt4pairIKS8_S9_EEE+0x71b) [0x7fdc4a1b760b]
/opt/amazon/lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS4_St4lessISsESaISt4pairIKSsS4_EEERKSt6vectorIS4_SaIS4_EESL_SL_RKSt13unordered_mapISsNS2_6TShapeESt4hashISsESt8equal_toISsESaISA_ISB_SN_EEERKSM_ISsiSP_SR_SaISA_ISB_iEEES11_RKSH_INS_9OpReqType
ESaIS12_EERKSt13unordered_setISsSP_SR_SaISsEEPSH_INS_7NDArrayESaIS1C_EES1F_S1F_PSM_ISsS1C_SP_SR_SaISA_ISB_S1C_EEEPNS_8ExecutorERKSM_INS2_9NodeEntryES1C_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISA_IKS1M_S1C_EEE+0x75d) [0x7fdc4a1b958d]
/opt/amazon/lib/libmxnet.so(ZN5mxnet8Executor10SimpleBindEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS3_St4lessISsESaISt4pairIKSsS3_EEERKSt6vectorIS3_SaIS3_EESK_SK_RKSt13unordered_mapISsNS1_6TShapeESt4hashISsESt8equal_toISsESaIS9_ISA_SM_EEERKSL_ISsiSO_SQ_SaIS9_ISA_iEEES10_RKSG_INS_9OpReqTypeESaI
S11_EERKSt13unordered_setISsSO_SQ_SaISsEEPSG_INS_7NDArrayESaIS1B_EES1E_S1E_PSL_ISsS1B_SO_SQ_SaIS9_ISA_S1B_EEEPS0+0x1a6) [0x7fdc4a1b9f46]
/opt/amazon/lib/libmxnet.so(MXExecutorSimpleBind+0x1e38) [0x7fdc4a1031a8]
/opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7fdc5af9b858]
Minimum reproducible example
TODO
Steps to reproduce
TODO
The text was updated successfully, but these errors were encountered: