
numpy problem #990

Closed
1a1a11a opened this issue Oct 20, 2016 · 16 comments

Comments

@1a1a11a

1a1a11a commented Oct 20, 2016

See numpy/numpy#8185.

@martin-frbg
Collaborator

martin-frbg commented Oct 20, 2016

Which version of OpenBLAS are you using with numpy? As noted in the numpy issue, multithreading may actually slow down the code for small matrix sizes. Recent versions are a bit better at limiting thread creation, but you could also try setting the environment variable OMP_NUM_THREADS to some low value. If you choose to build your own OpenBLAS from the develop branch here, you can also try replacing the sched_yield() in the definition of YIELDING (in file common.h) with usleep(1), as suggested in #923.
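(For illustration, that change amounts to redefining the YIELDING macro in common.h roughly as follows; the exact platform #ifdef guards vary, so treat this as a sketch rather than a ready-made patch.)

/* common.h (sketch): let a waiting thread sleep briefly instead of calling sched_yield() */
#include <unistd.h>   /* for usleep() */

#ifdef YIELDING
#undef YIELDING
#endif
#define YIELDING usleep(1)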

@brada4
Contributor

brada4 commented Oct 22, 2016

Thread 1 "python" hit Breakpoint 1, dgesdd (jobz=..., m=5000, n=5000, a=..., lda=5000, s=..., u=..., ldu=5000, vt=..., ldvt=5000, work=..., lwork=-1, iwork=..., info=6, _jobz=1662181904) at dgesdd.f:216
Thread 1 "python" hit Breakpoint 1, dgesdd (jobz=..., m=5000, n=5000, a=..., lda=5000, s=..., u=..., ldu=5000, vt=..., ldvt=5000, work=..., lwork=75035000, iwork=..., info=1, _jobz=1823445212) at dgesdd.f:216
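(The two breakpoint hits above show the usual LAPACK two-call pattern: a workspace query with lwork = -1 followed by the actual factorization with the returned workspace size. For reference, a minimal standalone C sketch of that calling sequence, with the dimensions taken from the trace, might look like the following; it is illustrative only, not the attached 990.txt, and uses a simplified prototype for the Fortran dgesdd_ symbol exported by OpenBLAS, omitting the hidden string-length argument.)

#include <stdio.h>
#include <stdlib.h>

/* LAPACK dgesdd as exported by OpenBLAS (Fortran calling convention, simplified prototype). */
extern void dgesdd_(const char *jobz, const int *m, const int *n,
                    double *a, const int *lda, double *s,
                    double *u, const int *ldu, double *vt, const int *ldvt,
                    double *work, const int *lwork, int *iwork, int *info);

int main(void) {
    const int n = 5000, lda = 5000;
    double *a  = malloc((size_t)n * n * sizeof(double));   /* input matrix, column-major */
    double *s  = malloc((size_t)n * sizeof(double));        /* singular values */
    double *u  = malloc((size_t)n * n * sizeof(double));    /* left singular vectors */
    double *vt = malloc((size_t)n * n * sizeof(double));    /* right singular vectors (transposed) */
    int *iwork = malloc(8 * (size_t)n * sizeof(int));
    int info, lwork = -1;
    double wkopt;

    for (size_t i = 0; i < (size_t)n * n; i++)
        a[i] = (double)rand() / RAND_MAX;

    /* First call: workspace query (lwork = -1); the optimal size comes back in wkopt. */
    dgesdd_("A", &n, &n, a, &lda, s, u, &lda, vt, &lda, &wkopt, &lwork, iwork, &info);
    lwork = (int)wkopt;
    double *work = malloc((size_t)lwork * sizeof(double));

    /* Second call: the actual SVD. */
    dgesdd_("A", &n, &n, a, &lda, s, u, &lda, vt, &lda, work, &lwork, iwork, &info);
    printf("info = %d, s[0] = %g\n", info, s[0]);
    return 0;
}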

and dgesdd complexity is approximately m^3 + n^3; it spends only about 1% of its time yielding all in all (it would be immoral to cite the computational complexity obtained here in your homework).

I don't see how this is a problem. For me it takes about 3:20 min with single-threaded OpenBLAS (0.2.18) and 1:50 min on 2 threads, and I stopped waiting after half an hour for netlib BLAS.

A properly formatted reproducer sample, recovered after chasing your links, is attached:
990.txt

Please post your timings and CPU ID, in addition to your OpenBLAS and system versions.

@1a1a11a
Author

1a1a11a commented Oct 23, 2016

The running time decreases a little when I reduce the number of threads (from 8 to 4 to 2). My numpy is 1.11.0, on Python 2.7/3.4 and Ubuntu 14.04, tested on two machines running the same environment. I don't know how to check the OpenBLAS version, as I have never used OpenBLAS directly; maybe you can point me in the right direction.

@brada4
Contributor

brada4 commented Oct 23, 2016

Can you either provide a repeatable test case or stop blaming OpenBLAS?

Timings:
OPENBLAS_NUM_THREADS=1 /usr/bin/time python test.py
...and likewise with =2, =3, =4, up to the number of cores.
Please post at least the first line of each output.
For comparison you can also post results for the Ubuntu-built ATLAS and Netlib BLAS.

CPUID (from /proc/cpuinfo)

OpenBLAS version: the 0.2.8 included with your Ubuntu is unsupported by Ubuntu (check ubuntu-support-status on your system), and it also lacks support (CPUID detection or optimized code) for CPUs released after August 2013 (i.e. newer than 3 years). You can build your own OpenBLAS if you find this arrangement unfair.

System version: uname -a and /etc/lsb-release (at least tell us whether you run an LTS or HWE kernel, and a 32-bit or 64-bit system).

@juliantaylor

juliantaylor commented Oct 24, 2016

The issue is not performance, that is fine. The problem is that OpenBLAS is extremely wasteful of CPU resources for small matrices. This is due to a poor choice of the number of threads and poor polling on Linux systems. For example, the Linux pthread mutex is already a lightweight mutex, which on capable CPUs can also use elided locks, while sched_yield triggers a full requeueing in the kernel scheduler, which costs a lot of CPU. It used to be an acceptable function, but that changed many years ago, which caused its share of performance regressions (e.g. in PostgreSQL); OpenBLAS simply was never adapted to that change.
FWIW, on a 2-socket, 16-core Sandy Bridge machine with git-master OpenBLAS, removing the yield and simply taking the lock directly reduces CPU usage by 50% and does not degrade performance at all.
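(A toy illustration of that contrast, not OpenBLAS code: time the two loops below with perf stat or strace -c, and the uncontended mutex stays in user space while sched_yield() enters the kernel scheduler on every iteration.)

#include <pthread.h>
#include <sched.h>

int main(void) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    for (long i = 0; i < 10000000; i++) {
        pthread_mutex_lock(&m);     /* uncontended: a user-space atomic, no syscall */
        pthread_mutex_unlock(&m);
    }
    for (long i = 0; i < 10000000; i++)
        sched_yield();              /* a syscall and a scheduler requeue every time */
    return 0;
}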

@juliantaylor

juliantaylor commented Oct 24, 2016

Wasting CPU cycles on polling is not a big deal on machines dedicated to a single task, but OpenBLAS is not only used in such scenarios. As it is the best free BLAS that is not cumbersome to install and deploy, it is used in many projects where a machine may be dedicated to more than one task, or in programs with multiple levels of parallelism.
It would be great if there were an option to make OpenBLAS less greedy with resources for those use cases.

@martin-frbg
Collaborator

@juliantaylor what are you using for the YIELDING define - usleep(1), asm("pause"), or just asm("nop")? I've been meaning to create a PR to use usleep(1), as that seemed to be the outcome of the discussions in #923 (and nobody spoke up in favor of sched_yield() for some reason :-) )

@juliantaylor

I removed YIELDING completely and unconditionally went into pthread_mutex_lock in the pthread variant.
usleep is a poor option, as it may cause unnecessary context switches on uncontested locks.
pause and nop are redundant; a (glibc) pthread mutex is already lightweight and will not switch if there is no contention.

@martin-frbg
Collaborator

Interesting point - I think just removing YIELDING completely did not figure in previous discussions. As you mention pthreads specifically, would the situation be different for OpenMP?

@juliantaylor

Usually OpenMP handles the thread pools in a somewhat sane way (GOMP's spin count is a little aggressive, but it can at least be tuned with GOMP_SPINCOUNT), so it probably does not have this issue. A quick grep does not show any OpenMP lock or critical calls.

Btw, I think the pthread thread pool has a couple of races: running my patched build a few times sometimes deadlocks, because assignments to the shared thread status are not locked properly in all places. The slow yielding probably just makes this very unlikely. The whole thing probably needs an overhaul.

@martin-frbg
Collaborator

Probably warrants its own issue report - do you happen to have any pointers or a simple reproducer? (I think there are a few open issues that allude to similar problems, but IIRC typically with large and/or unknown codes, and/or the submitter dropped out of the thread rather quickly.)

@juliantaylor

On the 16-core machine I only get it with the numpy test case, but on another 4-core machine I get it from the 2-thread sblat2 test.
Both with this change applied, which should not be the cause of the deadlock by itself, as it just short-circuits the busy wait:

--- a/driver/others/blas_server.c
+++ b/driver/others/blas_server.c
@@ -313,7 +313,7 @@ static void* blas_thread_server(void *arg){

        YIELDING;

-       if ((unsigned int)rpcc() - last_tick > thread_timeout) {
+       if (1 || (unsigned int)rpcc() - last_tick > thread_timeout) {

          pthread_mutex_lock  (&thread_status[cpu].lock);

I don't really know the flow of the code, but, for example, the code that is fenced with WMB on weakly ordered architectures looks sketchy, and async_wait also has no synchronization. I wouldn't be surprised if there were a load/store reordering issue.

@brada4
Contributor

brada4 commented Oct 24, 2016

In an ideal world there is no such ordering race, i.e. each read/write argument chunk is read, computed on, and written back as a result by the same thread exactly once. There is room for one-off issues, which show up as a numeric problem and a performance problem at the same time (the first is more easily spotted)...
i.e. complete avoidance of competing accesses ensures that no memory locking primitives are needed.
PS: that's just my understanding of how it works...

@martin-frbg
Collaborator

helgrind (from the valgrind suite) finds a number of potential races with the sblat2 test and an unmodified OpenBLAS - some read/write conflicts in thread initialization, where the code basically looks like "read shared variable, then if the value is not to our liking, acquire the lock and change the value". Probably more serious is a write/write conflict in alloc_mmap (memory.c:411), where the "release_pos" index into the "release_info" array is modified without a lock held in either thread. Time to split this off into a separate issue, I guess?
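(Sketch of the kind of fix the alloc_mmap finding suggests: serialize updates to the shared release bookkeeping. This is a hypothetical, simplified illustration, not actual memory.c code; the struct is reduced to the relevant fields, and the wrapper function and lock are invented for the example.)

#include <pthread.h>

struct release_t { void *address; void (*func)(struct release_t *); };

#define NUM_BUFFERS 1024
static struct release_t release_info[NUM_BUFFERS];
static int release_pos = 0;
static pthread_mutex_t release_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called from alloc_mmap-style code in several threads: take a lock around
 * the shared index instead of updating it unsynchronized. */
static void record_release(void *map_address, void (*func)(struct release_t *)) {
  pthread_mutex_lock(&release_lock);
  release_info[release_pos].address = map_address;
  release_info[release_pos].func    = func;
  release_pos++;
  pthread_mutex_unlock(&release_lock);
}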

@martin-frbg
Collaborator

@1a1a11a I suspect this issue (and the associated numpy one) is solved for you by the coding suggestions you received on Stack Overflow?

@1a1a11a
Author

1a1a11a commented Nov 6, 2016

Kind of, thank you! @martin-frbg
