Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calling openblas_set_num_threads in the main process of Sage can lead to segfaults #26585

Closed
ClementPernet opened this issue Oct 28, 2018 · 20 comments

Comments

@ClementPernet
Copy link
Contributor

Note: This does not happen currently in Sage, but does result from attempts to add multi-threading support to fflas-ffpack.

Multiplying a 1000x1000 matrix over GF(11) causes a segfault on a 32 core server.

sage: a=random_matrix(GF(11),1000)
sage: b = a*a
Erreur de segmentation

gdb says it comes from OpenBLAS thread manager.

Thread 48 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fe738c31700 (LWP 18829)]
0x00007fffef552d0e in blas_thread_server ()
   from /home/pernet/sage/local/lib/libopenblas.so.0
(gdb) where
#0  0x00007fffef552d0e in blas_thread_server ()
   from /home/pernet/sage/local/lib/libopenblas.so.0
#1  0x00007ffff7d73f2a in start_thread (arg=0x7fe738c31700)
    at pthread_create.c:463
#2  0x00007ffff7b08f0f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

but in valgrind mode, everything works just fine.

Upstream PR: https://github.com/xianyi/OpenBLAS/pull/1837/files

Upstream: Fixed upstream, but not in a stable release.

CC: @embray

Component: linear algebra

Author: Erik Bray

Branch: 9dc0772

Reviewer: Erik Bray

Issue created by migration from https://trac.sagemath.org/ticket/26585

@embray
Copy link
Contributor

embray commented Oct 28, 2018

comment:2

I have a suspicion that this is an openblas bug but it's not clear yet.

@embray
Copy link
Contributor

embray commented Oct 28, 2018

comment:3

Somehow, it seems like, the blas_thread_init routine is being called multiple times, when it shouldn't be.

@embray
Copy link
Contributor

embray commented Oct 28, 2018

comment:4

Hmm; I think I see the problem.
When the openblas library is first initialized it runs through a number of initialization routines, including a function called blas_thread_init. By default the blas_num_threads is 1, so nothing very interesting happens in blas_thread_init. Nevertheless, it concludes by setting a global variable blas_server_avail = 1. This prevents any subsequent calls to blas_thread_init from doing anything.

Normally the only way blas_server_avail gets reset to zero is if the function blas_thread_shutdown is called. This can be called manually, I think, though there is usually any reason to. But there is one case in which it does get called transparently: When openblas is initialized it registers a pthread_atfork handler to call blas_thread_shutdown prior to forking (so that the child process gets a clean threading state). In the child process the thread manager is then re-enabled (blas_thread_init). But in the parent it is not immediately re-initialized.

Rather, most functions in openblas that use threads (e.g. blas_exec_async) check blas_server_avail and call blas_thread_init if it is zero. However, openblas_set_num_threads does not perform such a check. But if it is called to increase the number of threads, it does go ahead and create new threads without checking blas_server_avail at all. The end result is as if blas_thread_init were called (more or less, with some exceptions), but openblas doesn't think threads have been initialized. Then when blas_thread_init does get called it starts mutating global data (such as mutexes) that the now running threads are already using, leading to segfaults in those threads.

@embray
Copy link
Contributor

embray commented Oct 28, 2018

comment:5

Confirmed. Simply adding if (unlikely(blas_server_avail == 0)) blas_thread_init(); at the beginning of openblas_set_num_threads was good enough to fix it.

@embray
Copy link
Contributor

embray commented Oct 29, 2018

comment:6

Added upstream pull request; if someone wants to review it I can also add the patch to Sage's openblas package.

@embray
Copy link
Contributor

embray commented Oct 29, 2018

Upstream: Reported upstream. No feedback yet.

@embray

This comment has been minimized.

@videlec
Copy link
Contributor

videlec commented Oct 29, 2018

comment:7

Erik openblas patch is available for testing at branch public/26585-openblas_patch.

@embray
Copy link
Contributor

embray commented Oct 29, 2018

Branch: public/26585-openblas_patch

@embray
Copy link
Contributor

embray commented Oct 29, 2018

Commit: 9dc0772

@embray
Copy link
Contributor

embray commented Oct 29, 2018

comment:8

Cool, might as well add it to the ticket. If we need to update it later nbd.


New commits:

9dc0772patch for openblas

@embray
Copy link
Contributor

embray commented Oct 31, 2018

Author: Erik Bray

@embray
Copy link
Contributor

embray commented Oct 31, 2018

Changed upstream from Reported upstream. No feedback yet. to Fixed upstream, but not in a stable release.

@embray
Copy link
Contributor

embray commented Oct 31, 2018

Reviewer: Erik Bray

@embray
Copy link
Contributor

embray commented Oct 31, 2018

comment:9

Fix accepted upstream as-is.

@videlec
Copy link
Contributor

videlec commented Oct 31, 2018

comment:11

Good news. Thanks Erik for taking care of it!

@embray
Copy link
Contributor

embray commented Oct 31, 2018

comment:12

Updated the title of the ticket to reflect the broader problem (if I need to look for this later I won't remember the exact example of multiplying matrices in GF(11) :)

@embray embray changed the title Matrix product over GF(11) segfaults on multicore server Calling openblas_set_num_threads in the main process of Sage can lead to segfaults Oct 31, 2018
@vbraun
Copy link
Member

vbraun commented Nov 1, 2018

Changed branch from public/26585-openblas_patch to 9dc0772

@embray
Copy link
Contributor

embray commented Nov 2, 2018

comment:14

I just realized this probably should have increased the patch version of the SPKG. Otherwise it won't get updated in incremental builds.

It's not clear to me that this patch is immediately needed though...

@embray
Copy link
Contributor

embray commented Nov 2, 2018

Changed commit from 9dc0772 to none

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants