numpy problem #990
Comments
Which version of OpenBLAS are you using in numpy? As noted in the numpy issue, multithreading may actually slow down the code with small matrix sizes. Recent versions are a bit better at limiting thread creation, but you could also try setting the environment variable OMP_NUM_THREADS to some low value. If you choose to build your own OpenBLAS from the develop branch here, you can also try replacing the sched_yield() in the definition of YIELDING (in file common.h) with usleep(1) as suggested in #923.
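For illustration, a minimal, hypothetical sketch of the change being suggested; the actual #define in OpenBLAS's common.h is platform-dependent and this is not a copy of it:

```c
/* Hypothetical sketch of the YIELDING change proposed in #923, not the literal
 * common.h contents: idle worker threads execute YIELDING in a loop, and
 * replacing sched_yield() with usleep(1) turns constant scheduler round-trips
 * into short sleeps. */
#include <unistd.h>

/* current style of definition (simplified): */
/* #define YIELDING  sched_yield() */

/* proposed replacement: */
#define YIELDING  usleep(1)

int main(void) {
    /* a worker's idle loop conceptually looks like this: */
    for (int i = 0; i < 1000; i++)
        YIELDING;          /* now sleeps ~1 microsecond per pass instead of yielding */
    return 0;
}
```

Setting OMP_NUM_THREADS=1 before running the numpy script remains the quickest way to check whether threading overhead is the culprit.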
And dgesdd complexity is approximately m^3 + n^3; it spends about 1% of its time yielding all in all (it would be immoral to cite computation complexity obtained here in your homework), so I don't see how it is a problem. For me it takes about 3:20 min with 1-thread OpenBLAS (0.2.18) and 1:50 min on 2 threads, and I stopped waiting after half an hour for netlib BLAS. A properly formatted reproducer sample, recovered after chasing your links, is attached. Please post your timings and CPUID in addition to your OpenBLAS and system versions.
The running time decreases a little when I reduce the number of threads (from 8 to 4 to 2). My numpy is 1.11.0, Python 2.7/3.4, Ubuntu 14.04, tested on two machines, both running the same environment. I don't know how to check the OpenBLAS version, as I have never used OpenBLAS directly; maybe you can point me in the right direction.
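As a hedged aside (not from the thread itself): reasonably recent OpenBLAS builds export a small extension API that reports which build you are actually linked against, and numpy users can also inspect numpy.__config__.show() to see which BLAS numpy was compiled with. A minimal C sketch, assuming a version new enough to provide these functions:

```c
/* Minimal sketch: query the linked OpenBLAS build and cap its thread count.
 * openblas_get_config() and openblas_set_num_threads() are declared in
 * OpenBLAS's cblas.h; availability depends on the OpenBLAS version.
 * Build with e.g.:  cc check_openblas.c -lopenblas -o check_openblas */
#include <stdio.h>

extern char *openblas_get_config(void);       /* returns version/build string */
extern void  openblas_set_num_threads(int);   /* limits worker thread count */

int main(void) {
    printf("OpenBLAS config: %s\n", openblas_get_config());
    openblas_set_num_threads(2);   /* runtime analogue of OPENBLAS_NUM_THREADS=2 */
    return 0;
}
```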
Can you elaborate on a repeatable test case, XOR stop blaming OpenBLAS. Please provide:
- Timings
- CPUID (from /proc/cpuinfo)
- OpenBLAS version: the 0.2.8 included with your Ubuntu is unsupported by Ubuntu (check ubuntu-support-status on your system), and it also has no support (CPUID detection or optimal code) for CPUs released after August 2013 (i.e. newer than 3 years). You can build your own OpenBLAS if you find this arrangement unfair.
- System version: uname -a and /etc/lsb-release (at least tell us whether you run an LTS or HWE kernel and a 32-bit or 64-bit system)
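A hedged sketch of the kind of standalone, repeatable test case being asked for: an SVD through LAPACKE_dgesdd with wall-clock timing. The matrix dimensions and link line here are placeholders, not the reporter's actual workload or system:

```c
/* Sketch of a minimal SVD timing reproducer (placeholder sizes, not the
 * reporter's workload).  Build with something like:
 *   cc svd_bench.c -llapacke -lopenblas -o svd_bench   (link flags vary by distro) */
#include <lapacke.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const lapack_int m = 2000, n = 1000;      /* placeholder problem size */
    double *a  = malloc((size_t)m * n * sizeof(double));
    double *s  = malloc((size_t)n * sizeof(double));
    double *u  = malloc((size_t)m * n * sizeof(double));  /* economy-size U */
    double *vt = malloc((size_t)n * n * sizeof(double));
    if (!a || !s || !u || !vt) return 1;

    for (lapack_int i = 0; i < m * n; i++)
        a[i] = (double)rand() / RAND_MAX;     /* random input matrix */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    lapack_int info = LAPACKE_dgesdd(LAPACK_ROW_MAJOR, 'S', m, n,
                                     a, n, s, u, n, vt, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("dgesdd info=%d, wall time %.3f s\n", (int)info,
           (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    free(a); free(s); free(u); free(vt);
    return info != 0;
}
```

Running it with OMP_NUM_THREADS set to 1, 2, 4 and 8 would give exactly the timings requested above.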
The issue is not performance; that is fine. The problem is that OpenBLAS is extremely wasteful of CPU resources for small matrices. This is due to a poor choice of the number of threads and poor polling on Linux systems. For example, the Linux pthread mutex is already a lightweight mutex which, on CPUs capable of it, can also use elided locks, while sched_yield triggers a full requeueing in the kernel scheduler, which costs a lot of CPU. sched_yield used to be an acceptable function to call, but that changed many years ago and caused its share of performance regressions elsewhere (e.g. in Postgres); OpenBLAS simply was never adapted to that change.
Wasting CPU cycles on polling is not a big deal on machines dedicated to a single task, but OpenBLAS is not only used in such scenarios. As it is the best free BLAS that is not cumbersome to install and deploy, it is used in many projects where a machine may be dedicated to more than one task, or in programs with multiple levels of parallelism.
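A generic illustration of the difference described in the last two comments, assuming nothing about OpenBLAS internals: a worker that idles by looping on sched_yield() versus one that blocks on a pthread condition variable (compile with -pthread):

```c
/* Sketch of the two idle-waiting strategies discussed above (not OpenBLAS code).
 * Yield-based spinning re-enters the kernel scheduler on every iteration and
 * burns CPU while idle; blocking on a condition variable costs nothing until
 * the thread is actually handed work. */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static atomic_bool have_work;                 /* zero-initialized: no work yet */

/* spin-wait variant: roughly what a YIELDING loop amounts to */
static void wait_spinning(void) {
    while (!atomic_load(&have_work))
        sched_yield();                        /* full scheduler round-trip per pass */
}

/* blocking variant: sleep until signalled */
static void *worker_blocking(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!atomic_load(&have_work))
        pthread_cond_wait(&wake, &lock);
    pthread_mutex_unlock(&lock);
    puts("worker: got work");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker_blocking, NULL);

    /* hand over "work": set the flag under the lock and wake the worker */
    pthread_mutex_lock(&lock);
    atomic_store(&have_work, true);
    pthread_cond_signal(&wake);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    wait_spinning();                          /* returns immediately now that work exists */
    return 0;
}
```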
@juliantaylor what are you using for the YIELDING define - usleep(1), asm("pause"), or just asm("nop")? I've been meaning to create a PR to use usleep(1), as that seemed to be the result of the discussions in #923 (and nobody came out in favor of sched_yield() for some reason :-) )
I removed YIELDING completely and unconditionally went into pthread_mutex_lock in the pthread variant.
Interesting point - I think just removing YIELDING completely did not figure in previous discussions. As you mention pthreads specifically, would the situation be different for OpenMP?
Usually OpenMP handles its thread pools in a somewhat sane way (GOMP's spin count is a little aggressive but can at least be tuned with GOMP_SPINCOUNT), so it probably does not have this issue. A quick grep does not show any OpenMP lock or critical calls. By the way, I think the pthread thread pool has a couple of races: running my patched build a couple of times sometimes deadlocks, due to assignments to the shared thread status not being properly locked in all places. The slow yielding probably just makes this very unlikely to trigger. The whole thing probably needs an overhaul.
Probably warrants its own issue report - do you happen to have any pointers or a simple reproducer?
On the 16-core machine I only get it with the numpy testcase, but on another 4-core machine I get it from the 2-thread sblat2 test.

```diff
--- a/driver/others/blas_server.c
+++ b/driver/others/blas_server.c
@@ -313,7 +313,7 @@ static void* blas_thread_server(void *arg){
       YIELDING;
-      if ((unsigned int)rpcc() - last_tick > thread_timeout) {
+      if (1 || (unsigned int)rpcc() - last_tick > thread_timeout) {
	pthread_mutex_lock (&thread_status[cpu].lock);
```

I don't really know the flow of the code, but e.g. the code that is fenced with WMB on weakly ordered arches is sketchy, and async_wait also has no synchronization. I wouldn't be surprised if there is a load/store reordering issue.
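For what the worry about WMB and weak ordering means in general terms, a generic C11 sketch (not the actual blas_server.c code): a write barrier on the producer side only helps if the consumer side has a matching acquire, otherwise loads can be reordered on weakly ordered CPUs. Compile with -pthread:

```c
/* Generic publish/subscribe sketch (not OpenBLAS code): the release store acts
 * like "WMB then store"; the consumer needs the matching acquire load, or it
 * may see the flag set while still reading a stale payload on ARM/POWER. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;            /* plain shared data */
static atomic_int ready;       /* publication flag, zero-initialized */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                              /* write data first */
    atomic_store_explicit(&ready, 1, memory_order_release);    /* then publish */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                      /* acquire pairs with the release */
    printf("consumer read %d\n", payload);                     /* guaranteed to see 42 */
    pthread_join(t, NULL);
    return 0;
}
```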
In an ideal world there is no such ordering race, i.e. each RW argument chunk is read, calculated, and put back as a result in the same thread just once. There is room for one-off issues, which exhibit as a numeric problem and a performance problem at the same time (the first is easier to spot)...
helgrind (from the valgrind suite) finds a number of potential races with the sblat2 test and an unmodified OpenBLAS - some read/write conflicts in thread initialization where the code basically looks like "read shared variable, then if its value is not to our liking, acquire the lock and change it". Probably more serious is a write/write conflict in alloc_mmap (memory.c:411), where the "release_pos" index into the "release_info" array is modified without a lock held in either thread. Time to split this off into a separate issue, I guess?
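To make the flagged patterns concrete, a generic sketch (not the actual thread-init or memory.c code) of the "check outside the lock, update inside it" shape helgrind reports, together with fully locked alternatives:

```c
/* Generic sketch of the races described above and their locked counterparts
 * (illustrative only, not OpenBLAS source). */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int initialized;        /* shared state consulted during thread init */
static int next_pos;           /* stand-in for an index bumped from several threads */

/* racy: the first read of `initialized` happens outside the lock,
 * which is the read/write conflict helgrind flags */
void init_racy(void) {
    if (!initialized) {
        pthread_mutex_lock(&lock);
        initialized = 1;
        pthread_mutex_unlock(&lock);
    }
}

/* race-free: both the check and the update happen under the lock */
void init_locked(void) {
    pthread_mutex_lock(&lock);
    if (!initialized)
        initialized = 1;
    pthread_mutex_unlock(&lock);
}

/* race-free index bump: two threads can never claim the same slot
 * or lose an update, unlike an unprotected `pos++` */
int bump_pos(void) {
    pthread_mutex_lock(&lock);
    int pos = next_pos++;
    pthread_mutex_unlock(&lock);
    return pos;
}

int main(void) {
    init_locked();
    return bump_pos();         /* 0 on the first call */
}
```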
@1a1a11a I suspect this issue (and the associated numpy one) is solved for you by the coding suggestions you received on Stack Overflow?
Kind of, thank you! @martin-frbg
See here: numpy/numpy#8185.