
numpy problem #990

Closed
1a1a11a opened this issue Oct 20, 2016 · 16 comments

Comments

@1a1a11a

1a1a11a commented Oct 20, 2016

See numpy/numpy#8185.

@martin-frbg
Collaborator

martin-frbg commented Oct 20, 2016

Which version of OpenBLAS are you using with numpy? As noted in the numpy issue, multithreading may actually slow down the code for small matrix sizes. Recent versions are a bit better at limiting thread creation, but you could also try setting the environment variable OMP_NUM_THREADS to some low value. If you choose to build your own OpenBLAS from the develop branch here, you can also try replacing the sched_yield() in the definition of YIELDING (in file common.h) with usleep(1), as suggested in #923.
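(For illustration, that change amounts to redefining the YIELDING macro in common.h roughly as follows; the exact platform #ifdef guards vary, so treat this as a sketch rather than a ready-made patch.)

/* common.h (sketch): let a waiting thread sleep briefly instead of calling sched_yield() */
#include <unistd.h>   /* for usleep() */

#ifdef YIELDING
#undef YIELDING
#endif
#define YIELDING usleep(1)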

@brada4
Contributor

brada4 commented Oct 22, 2016

Thread 1 "python" hit Breakpoint 1, dgesdd (jobz=..., m=5000, n=5000, a=..., lda=5000, s=..., u=..., ldu=5000, vt=..., ldvt=5000, work=..., lwork=-1, iwork=..., info=6, _jobz=1662181904) at dgesdd.f:216
Thread 1 "python" hit Breakpoint 1, dgesdd (jobz=..., m=5000, n=5000, a=..., lda=5000, s=..., u=..., ldu=5000, vt=..., ldvt=5000, work=..., lwork=75035000, iwork=..., info=1, _jobz=1823445212) at dgesdd.f:216
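(The two breakpoint hits above show the usual LAPACK two-call pattern: a workspace query with lwork = -1 followed by the actual factorization with the returned workspace size. For reference, a minimal standalone C sketch of that calling sequence, with the dimensions taken from the trace, might look like the following; it is illustrative only, not the attached 990.txt, and uses a simplified prototype for the Fortran dgesdd_ symbol exported by OpenBLAS, omitting the hidden string-length argument.)

#include <stdio.h>
#include <stdlib.h>

/* LAPACK dgesdd as exported by OpenBLAS (Fortran calling convention, simplified prototype). */
extern void dgesdd_(const char *jobz, const int *m, const int *n,
                    double *a, const int *lda, double *s,
                    double *u, const int *ldu, double *vt, const int *ldvt,
                    double *work, const int *lwork, int *iwork, int *info);

int main(void) {
    const int n = 5000, lda = 5000;
    double *a  = malloc((size_t)n * n * sizeof(double));   /* input matrix, column-major */
    double *s  = malloc((size_t)n * sizeof(double));        /* singular values */
    double *u  = malloc((size_t)n * n * sizeof(double));    /* left singular vectors */
    double *vt = malloc((size_t)n * n * sizeof(double));    /* right singular vectors (transposed) */
    int *iwork = malloc(8 * (size_t)n * sizeof(int));
    int info, lwork = -1;
    double wkopt;

    for (size_t i = 0; i < (size_t)n * n; i++)
        a[i] = (double)rand() / RAND_MAX;

    /* First call: workspace query (lwork = -1); the optimal size comes back in wkopt. */
    dgesdd_("A", &n, &n, a, &lda, s, u, &lda, vt, &lda, &wkopt, &lwork, iwork, &info);
    lwork = (int)wkopt;
    double *work = malloc((size_t)lwork * sizeof(double));

    /* Second call: the actual SVD. */
    dgesdd_("A", &n, &n, a, &lda, s, u, &lda, vt, &lda, work, &lwork, iwork, &info);
    printf("info = %d, s[0] = %g\n", info, s[0]);
    return 0;
}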

and dgesdd complexity is approximately m^3 + n^3; it spends only about 1% of its time yielding all in all (it would be immoral to cite the computational complexity obtained here in your homework).

I don't see how this is a problem. For me it takes about 3:20 min with single-threaded OpenBLAS (0.2.18) and 1:50 min on 2 threads, and I stopped waiting after half an hour for netlib BLAS.

A properly formatted reproducer sample, recovered after chasing your links, is attached:
990.txt

Please post your timings and CPU ID, in addition to your OpenBLAS and system versions.

@1a1a11a
Author

1a1a11a commented Oct 23, 2016

The running time decreases a little when I reduce the number of threads (from 8 to 4 to 2). My numpy is 1.11.0, on Python 2.7/3.4 and Ubuntu 14.04, tested on two machines running the same environment. I don't know how to check the OpenBLAS version, as I have never used OpenBLAS directly; maybe you can point me in the right direction.

@brada4
Contributor

brada4 commented Oct 23, 2016

Can you either provide a repeatable test case or stop blaming OpenBLAS?

Timings:
OPENBLAS_NUM_THREADS=1 /usr/bin/time python test.py
...and likewise with =2, =3, =4, up to the number of cores.
Please post at least the first line of each output.
For comparison you can also post results for the Ubuntu-built ATLAS and Netlib BLAS.

CPUID (from /proc/cpuinfo)

OpenBLAS version: the 0.2.8 included with your Ubuntu is unsupported by Ubuntu (check ubuntu-support-status on your system), and it also lacks support (CPUID detection or optimized code) for CPUs released after August 2013 (i.e. newer than 3 years). You can build your own OpenBLAS if you find this arrangement unfair.

System version: uname -a and /etc/lsb-release (at least tell us whether you run an LTS or HWE kernel, and a 32-bit or 64-bit system).

@juliantaylor

juliantaylor commented Oct 24, 2016

The issue is not performance, that is fine. The problem is that OpenBLAS is extremely wasteful of CPU resources for small matrices. This is due to a poor choice of the number of threads and poor polling on Linux systems. For example, the Linux pthread mutex is already a lightweight mutex, which on capable CPUs can also use elided locks, while sched_yield triggers a full requeueing in the kernel scheduler, which costs a lot of CPU. It used to be an acceptable function, but that changed many years ago, which caused its share of performance regressions (e.g. in PostgreSQL); OpenBLAS simply was never adapted to that change.
FWIW, on a 2-socket, 16-core Sandy Bridge machine with git-master OpenBLAS, removing the yield and simply taking the lock directly reduces CPU usage by 50% and does not degrade performance at all.
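(A toy illustration of that contrast, not OpenBLAS code: time the two loops below with perf stat or strace -c, and the uncontended mutex stays in user space while sched_yield() enters the kernel scheduler on every iteration.)

#include <pthread.h>
#include <sched.h>

int main(void) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    for (long i = 0; i < 10000000; i++) {
        pthread_mutex_lock(&m);     /* uncontended: a user-space atomic, no syscall */
        pthread_mutex_unlock(&m);
    }
    for (long i = 0; i < 10000000; i++)
        sched_yield();              /* a syscall and a scheduler requeue every time */
    return 0;
}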

@juliantaylor

juliantaylor commented Oct 24, 2016

Wasting CPU cycles on polling is not a big deal on machines dedicated to a single task, but OpenBLAS is not only used in such scenarios. As it is the best free BLAS that is not cumbersome to install and deploy, it is used in many projects where a machine may be dedicated to more than one task, or in programs with multiple levels of parallelism.
It would be great if there were an option to make OpenBLAS less greedy with resources for those use cases.

@martin-frbg
Collaborator

@juliantaylor what are you using for the YIELDING define - usleep(1), asm("pause"), or just asm("nop")? I've been meaning to create a PR to use usleep(1), as that seemed to be the outcome of the discussions in #923 (and nobody spoke up in favor of sched_yield() for some reason :-) )

@juliantaylor

I removed YIELDING completely and unconditionally went into pthread_mutex_lock in the pthread variant.
usleep is a poor option, as it may cause unnecessary context switches on uncontested locks.
pause and nop are redundant; a (glibc) pthread mutex is already lightweight and will not switch if there is no contention.

@martin-frbg
Collaborator

Interesting point - I think just removing YIELDING completely did not figure in previous discussions. As you mention pthreads specifically, would the situation be different for OpenMP?

@juliantaylor

Usually OpenMP handles the thread pools in a somewhat sane way (GOMP's spin count is a little aggressive, but it can at least be tuned with GOMP_SPINCOUNT), so it probably does not have this issue. A quick grep does not show any OpenMP lock or critical calls.

Btw, I think the pthread thread pool has a couple of races: running my patched build a few times sometimes deadlocks, because assignments to the shared thread status are not locked properly in all places. The slow yielding probably just makes this very unlikely. The whole thing probably needs an overhaul.

@martin-frbg
Collaborator

Probably warrants its own issue report - do you happen to have any pointers or a simple reproducer? (I think there are a few open issues that allude to similar problems, but IIRC typically with large and/or unknown codes, and/or the submitter dropped out of the thread rather quickly.)

@juliantaylor

On the 16-core machine I only get it with the numpy test case, but on another 4-core machine I get it from the 2-thread sblat2 test.
Both with this change applied, which should not be the cause of the deadlock by itself, as it just short-circuits the busy wait:

--- a/driver/others/blas_server.c
+++ b/driver/others/blas_server.c
@@ -313,7 +313,7 @@ static void* blas_thread_server(void *arg){

        YIELDING;

-       if ((unsigned int)rpcc() - last_tick > thread_timeout) {
+       if (1 || (unsigned int)rpcc() - last_tick > thread_timeout) {

          pthread_mutex_lock  (&thread_status[cpu].lock);

I don't really know the flow of the code, but, for example, the code that is fenced with WMB on weakly ordered architectures looks sketchy, and async_wait also has no synchronization. I wouldn't be surprised if there were a load/store reordering issue.

@brada4
Contributor

brada4 commented Oct 24, 2016

In an ideal world there is no such ordering race, i.e. each read/write argument chunk is read, computed on, and written back as a result by the same thread exactly once. There is room for one-off issues, which show up as a numeric problem and a performance problem at the same time (the first is more easily spotted)...
i.e. complete avoidance of competing accesses ensures that no memory locking primitives are needed.
PS: that's just my understanding of how it works...

@martin-frbg
Collaborator

helgrind (from the valgrind suite) finds a number of potential races with the sblat2 test and an unmodified OpenBLAS - some read/write conflicts in thread initialization, where the code basically looks like "read shared variable, then if the value is not to our liking, acquire the lock and change the value". Probably more serious is a write/write conflict in alloc_mmap (memory.c:411), where the "release_pos" index into the "release_info" array is modified without a lock held in either thread. Time to split this off into a separate issue, I guess?
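(Sketch of the kind of fix the alloc_mmap finding suggests: serialize updates to the shared release bookkeeping. This is a hypothetical, simplified illustration, not actual memory.c code; the struct is reduced to the relevant fields, and the wrapper function and lock are invented for the example.)

#include <pthread.h>

struct release_t { void *address; void (*func)(struct release_t *); };

#define NUM_BUFFERS 1024
static struct release_t release_info[NUM_BUFFERS];
static int release_pos = 0;
static pthread_mutex_t release_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called from alloc_mmap-style code in several threads: take a lock around
 * the shared index instead of updating it unsynchronized. */
static void record_release(void *map_address, void (*func)(struct release_t *)) {
  pthread_mutex_lock(&release_lock);
  release_info[release_pos].address = map_address;
  release_info[release_pos].func    = func;
  release_pos++;
  pthread_mutex_unlock(&release_lock);
}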

@martin-frbg
Collaborator

@1a1a11a I suspect this issue (and the associated numpy one) is solved for you by the coding suggestions you received on Stack Overflow?

@1a1a11a
Author

1a1a11a commented Nov 6, 2016

Kind of, thank you! @martin-frbg
