opencl acceleration for csc and/or encoding #422

Closed
totaam opened this issue Aug 26, 2013 · 51 comments

Comments

@totaam
Copy link
Collaborator

totaam commented Aug 26, 2013

Issue migrated from trac ticket # 422

component: core | priority: major | resolution: fixed

2013-08-26 08:45:01: totaam created the issue


References:

@totaam
Copy link
Collaborator Author

totaam commented Aug 26, 2013

2013-08-26 15:43:17: totaam uploaded file add-csc-opencl.patch (13.7 KiB)

stub opencl csc module

@totaam
Copy link
Collaborator Author

totaam commented Aug 27, 2013

2013-08-27 17:26:14: totaam uploaded file add-csc-opencl-v3.patch (19.7 KiB)

minor tweaks

@totaam
Copy link
Collaborator Author

totaam commented Aug 27, 2013

2013-08-27 17:33:36: totaam changed status from new to assigned

@totaam
Copy link
Collaborator Author

totaam commented Aug 27, 2013

2013-08-27 17:33:36: totaam changed owner from antoine to totaam

@totaam
Copy link
Collaborator Author

totaam commented Aug 27, 2013

2013-08-27 17:33:36: totaam commented


More kernels we may be able to use:

@totaam
Copy link
Collaborator Author

totaam commented Aug 28, 2013

2013-08-28 08:05:10: totaam commented


Testing with the plain x264 command line (run a couple of times to check that the values are consistent, and they are):

  • OpenCL enabled:
$ time ./x264 --opencl  -o opencl.x264  video.mp4 
lavf [info]: 720x404p 0:1 @ 24000/1001 fps (vfr)
x264 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT
x264 [info]: OpenCL acceleration enabled with NVIDIA Corporation GeForce GTS 450 
x264 [info]: profile High, level 3.0
x264 [info]: frame I:364   Avg QP:15.09  size: 37254                           
x264 [info]: frame P:10936 Avg QP:20.31  size:  5108
x264 [info]: frame B:19868 Avg QP:23.11  size:   772
x264 [info]: consecutive B-frames: 10.2% 11.5%  8.4% 69.9%
x264 [info]: mb I  I16..4: 29.4% 17.4% 53.2%
x264 [info]: mb P  I16..4:  2.0%  2.6%  3.3%  P16..4: 11.9%  6.5%  4.6%  0.0%  0.0%    skip:69.2%
x264 [info]: mb B  I16..4:  0.1%  0.1%  0.2%  B16..8:  8.5%  2.2%  0.8%  direct: 0.7%  skip:87.4%  L0:48.4% L1:45.2% BI: 6.5%
x264 [info]: 8x8 transform intra:28.0% inter:27.9%
x264 [info]: coded y,uvDC,uvAC intra: 37.6% 57.8% 45.3% inter: 3.7% 4.7% 2.0%
x264 [info]: i16 v,h,dc,p: 64% 27%  8%  2%
x264 [info]: i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 19% 15% 58%  1%  1%  1%  1%  1%  2%
x264 [info]: i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 30% 22% 23%  4%  4%  4%  5%  4%  4%
x264 [info]: i8c dc,h,v,p: 51% 24% 22%  4%
x264 [info]: Weighted P-Frames: Y:1.1% UV:1.0%
x264 [info]: ref P L0: 64.6%  7.0% 17.6% 10.7%  0.1%
x264 [info]: ref B L0: 79.6% 17.1%  3.3%
x264 [info]: ref B L1: 95.0%  5.0%
x264 [info]: kb/s:521.59

encoded 31168 frames, 175.77 fps, 521.59 kb/s

real	2m57.650s
user	10m12.278s
sys	0m36.051s
  • without OpenCL:
$ time ./x264  -o no-opencl.x264  video.mp4 
lavf [info]: 720x404p 0:1 @ 24000/1001 fps (vfr)
x264 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT
x264 [info]: profile High, level 3.0
x264 [info]: frame I:373   Avg QP:16.18  size: 36484                           
x264 [info]: frame P:12582 Avg QP:20.97  size:  4720
x264 [info]: frame B:18213 Avg QP:23.12  size:   681
x264 [info]: consecutive B-frames: 17.9% 10.8%  5.7% 65.7%
x264 [info]: mb I  I16..4: 23.1% 24.5% 52.5%
x264 [info]: mb P  I16..4:  1.6%  2.4%  2.8%  P16..4: 11.8%  6.5%  4.5%  0.0%  0.0%    skip:70.5%
x264 [info]: mb B  I16..4:  0.1%  0.1%  0.2%  B16..8:  7.6%  1.9%  0.7%  direct: 0.6%  skip:88.8%  L0:47.1% L1:46.3% BI: 6.6%
x264 [info]: 8x8 transform intra:31.5% inter:27.4%
x264 [info]: coded y,uvDC,uvAC intra: 36.9% 56.1% 43.1% inter: 3.9% 4.9% 2.1%
x264 [info]: i16 v,h,dc,p: 61% 29%  8%  2%
x264 [info]: i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 20% 15% 58%  1%  1%  1%  1%  1%  2%
x264 [info]: i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 30% 22% 22%  4%  4%  4%  5%  4%  4%
x264 [info]: i8c dc,h,v,p: 51% 23% 22%  4%
x264 [info]: Weighted P-Frames: Y:0.8% UV:0.7%
x264 [info]: ref P L0: 64.9%  6.8% 17.7% 10.5%  0.0%
x264 [info]: ref B L0: 78.6% 18.1%  3.3%
x264 [info]: ref B L1: 95.4%  4.6%
x264 [info]: kb/s:525.55

encoded 31168 frames, 186.50 fps, 525.55 kb/s

real	2m47.235s
user	10m10.138s
sys	0m6.067s

Resulting files:

$ du -sk *opencl.x264
83404	no-opencl.x264
82776	opencl.x264

Unfortunately, this doesn't look like it makes much of a difference (at least on my GTS 450). If anything, it is a tad slower.

The one thing where this may still be useful is for motion detection, where we could increase the search diameter without incurring too much more CPU usage.

Enabling it looks simple enough, in x264.h:

int b_opencl;            /* use OpenCL when available */

(assuming that x264 is built with opencl support)

@totaam
Copy link
Collaborator Author

totaam commented Aug 28, 2013

2013-08-28 09:16:23: totaam edited the issue description

@totaam
Copy link
Collaborator Author

totaam commented Aug 28, 2013

2013-08-28 09:16:23: totaam commented


For the record, this is what I had to do to get pyopencl to build on Fedora 19 with the nvidia SDK to avoid this error at import time:

ImportError: /usr/lib/python2.7/dist-packages/pyopencl/_cl.so: \
    symbol clRetainDevice, version OPENCL_1.2 not defined in file libOpenCL.so.1 with link time reference

The existing headers look like this:

$ ls -la /usr/include/CL
lrwxrwxrwx. 1 root root 32 Aug 28 12:39 /usr/include/CL -> /etc/alternatives/opencl-headers

Edit: Just downgrading the version of opencl-headers to 1.1 is enough.


Alternatively, we can move the headers to a version specific directory and add the OpenCL 1.1 headers:

cd /etc/alternatives/
mv opencl-headers opencl-headers-1.2
mkdir opencl-headers-1.1
ln -sf opencl-headers-1.1 opencl-headers
cd opencl-headers-1.1

wget http://www.khronos.org/registry/cl/api/1.1/cl_gl_ext.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_ext.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_gl.h
wget http://www.khronos.org/registry/cl/api/1.1/cl.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_platform.h
wget http://www.khronos.org/registry/cl/api/1.1/opencl.h

Then we need to ensure pyopencl will be built against 1.1, so siteconf.py contains:

CL_PRETEND_VERSION = '1.1'
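
To check whether the headers PyOpenCL was built against are newer than what a device reports, a small sketch like the one below may help. `pyopencl.get_cl_header_version()` is a PyOpenCL function; the helper names `parse_cl_version` and `headers_newer_than_device` are made up for this illustration.

```python
import re

def parse_cl_version(version_string):
    """Extract (major, minor) from a string such as 'OpenCL 1.1 CUDA'."""
    match = re.search(r"OpenCL\s+(?:C\s+)?(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("not an OpenCL version string: %r" % version_string)
    return int(match.group(1)), int(match.group(2))

def headers_newer_than_device(header_version, device_version_string):
    """True if the build headers are newer than the device's OpenCL version."""
    return tuple(header_version) > parse_cl_version(device_version_string)

if __name__ == "__main__":
    try:
        import pyopencl
    except ImportError as e:
        # this is also where the clRetainDevice symbol error would surface
        print("cannot import pyopencl: %s" % e)
    else:
        print("built against headers: %s" % (pyopencl.get_cl_header_version(),))
```

With 1.2 headers and the NVIDIA device string from this ticket ("OpenCL 1.1 CUDA"), `headers_newer_than_device((1, 2), ...)` returns True, which is the mismatch described above.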

@totaam
Copy link
Collaborator Author

totaam commented Aug 28, 2013

2013-08-28 09:19:16: totaam commented


Having installed freeocl, I now have 3 providers available:

$ LD_LIBRARY_PATH=/opt/cuda/lib64/ XPRA_SWSCALE_DEBUG=0 PYTHONPATH=. python ./tests/xpra/codecs/test_csc_opencl.py 
PyOpenCL OpenGL support: True
found 3 OpenCL platforms:
* FreeOCL (FreeOCL developers) - 1 devices:
 + CPU: AMD Phenom(tm) II X4 945 Processor (OpenCL 1.2 FreeOCL-0.3.6 / OpenCL C 1.2)
* NVIDIA CUDA (NVIDIA Corporation) - 1 devices:
 + GPU: GeForce GTS 450 (OpenCL 1.1 CUDA / OpenCL C 1.1 )
* Intel(R) OpenCL (Intel(R) Corporation) - 1 devices:
 + CPU: AMD Phenom(tm) II X4 945 Processor (OpenCL 1.2 (Build 67279) / OpenCL C 1.2 )
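
A listing like the one above can be produced with a few PyOpenCL calls. This is only a sketch and not the code from test_csc_opencl.py; the function names are hypothetical, while `get_platforms()`, `get_devices()`, `device_type.to_string()` and the `name`, `vendor`, `version` and `opencl_c_version` attributes are PyOpenCL's own.

```python
def describe_device(type_name, name, version, c_version):
    """Format one device line in the style of the listing above."""
    return " + %s: %s (%s / %s)" % (
        type_name, name.strip(), version.strip(), c_version.strip())

def list_opencl_platforms():
    """Return printable lines describing every OpenCL platform and device."""
    # imported lazily: importing pyopencl loads the OpenCL ICD loader
    import pyopencl as cl
    lines = []
    for platform in cl.get_platforms():
        devices = platform.get_devices()
        lines.append("* %s (%s) - %d devices:" % (
            platform.name, platform.vendor, len(devices)))
        for device in devices:
            lines.append(describe_device(
                cl.device_type.to_string(device.type),
                device.name, device.version, device.opencl_c_version))
    return lines
```

Calling `print("\n".join(list_opencl_platforms()))` on a machine with working ICDs should give output of the same shape as the listing above.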

@totaam
Copy link
Collaborator Author

totaam commented Aug 28, 2013

2013-08-28 16:55:46: totaam uploaded file add-csc-opencl-v6.patch (22.7 KiB)

works ok but only one format so far: YUV420P to RGB

@totaam
Copy link
Collaborator Author

totaam commented Aug 28, 2013

2013-08-28 17:26:18: totaam commented


Please try the patch above and report on performance.
You may need to adjust some env vars for finding the libraries in the cuda paths and for selecting the opencl platform/device:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/cuda/lib64/
export PYTHONPATH=.
XPRA_OPENCL_DEVICE_TYPE=GPU python ./tests/xpra/codecs/test_csc_opencl.py
XPRA_OPENCL_DEVICE_TYPE=CPU python ./tests/xpra/codecs/test_csc_opencl.py 

Note: be careful with LD_LIBRARY_PATH. Putting cuda ahead of the regular libraries can cause some serious problems (conflicts with libopencl versions, for example).

Results deleted (those figures were wrong because of a bug)

The results aren't as bad as they look for nvidia:

  • cpu csc is already very fast since it is such a simple operation
  • hopefully the difference will be more noticeable when we add scaling
  • the gfx card is quite slow by modern standards (we'll see if faster ones help - not guaranteed it will make a huge difference here since the cost is mostly memory bandwidth)
  • most of the cpu time is spent copying buffers to and from the gfx card and on modern cpus that is slightly better than doing fpu or more general instruction decoding

Even then, I think there is room for improvement since we copy the pixels in and out and we may not need to (we just need a buffer interface).

Interestingly, the performance varies widely depending on the picture size. I will need to look into the worksize/localsize settings.
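
One reason picture size can matter: in OpenCL 1.x the global work size must be a multiple of the local work size in each dimension, so odd picture sizes need padding (and the kernel then has to skip out-of-range pixels). A minimal sketch of that rounding, assuming a 16x16 local size, which is an arbitrary choice for illustration and not the value used by the patch:

```python
def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple

def global_work_size(width, height, local_size=(16, 16)):
    """Pad the picture dimensions so each is divisible by the local size."""
    return (round_up(width, local_size[0]), round_up(height, local_size[1]))

if __name__ == "__main__":
    for w, h in ((1920, 1080), (720, 404)):
        print("%sx%s -> %s" % (w, h, global_work_size(w, h)))
```

For 1920x1080 only the height needs padding (1080 becomes 1088), whereas other sizes may waste a larger share of work items.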

@totaam
Copy link
Collaborator Author

totaam commented Aug 28, 2013

2013-08-28 17:27:30: totaam uploaded file add-csc-opencl-v7.patch (23.0 KiB)

updated patch - fix crash with swscale

@totaam
Copy link
Collaborator Author

totaam commented Aug 28, 2013

2013-08-28 17:45:37: smo commented


Here are the results on Nvidia K1 (Nvidia) OpenCL

At 1920x1080
191 MPixels/s
223 MPixels/s
161 MPixels/s
184 MPixels/s
172 MPixels/s
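
To put those figures in perspective, they can be converted to frames per second at the tested resolution. This sketch assumes that one MPixel means 10^6 pixels and that each frame is converted exactly once:

```python
results_mpixels = [191, 223, 161, 184, 172]  # the five figures above

def mean(values):
    """Arithmetic mean, as a float."""
    return sum(values) / float(len(values))

def frames_per_second(mpixels_per_second, width, height):
    """Frames per second achievable at the given throughput and frame size."""
    return mpixels_per_second * 1e6 / (width * height)

if __name__ == "__main__":
    average = mean(results_mpixels)
    print("average: %.1f MPixels/s" % average)
    print("about %.0f fps at 1920x1080" % frames_per_second(average, 1920, 1080))
```

The average works out to roughly 186 MPixels/s, or about 90 frames per second at 1920x1080.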

@totaam
Copy link
Collaborator Author

totaam commented Aug 29, 2013

2013-08-29 17:18:43: totaam uploaded file add-csc-opencl-v10.patch (17.3 KiB)

working version with all yuv formats as input and both BGRX and RGBX as output

@totaam
Copy link
Collaborator Author

totaam commented Aug 29, 2013

2013-08-29 17:22:05: totaam changed status from assigned to new

@totaam
Copy link
Collaborator Author

totaam commented Aug 29, 2013

2013-08-29 17:22:05: totaam changed owner from totaam to smo

@totaam
Copy link
Collaborator Author

totaam commented Aug 29, 2013

2013-08-29 17:22:05: totaam commented


Please re-run with patch v10 which fixes some important bugs.

I am afraid that I cannot commit it as-is because the OpenCL shared libraries we end up loading cause some serious problems:

Traceback (most recent call last):
  File "/usr/bin/xpra", line 6, in <module>
    sys.exit(xpra.scripts.main.main(__file__, sys.argv))
  File "/usr/lib64/python2.7/site-packages/xpra/scripts/main.py", line 432, in main
    return run_server(parser, options, mode, script_file, args)
  File "/usr/lib64/python2.7/site-packages/xpra/scripts/server.py", line 454, in run_server
    import gtk.gdk          #@Reimport
  File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 40, in <module>
    from gtk import _gtk
ImportError: dlopen: cannot load any more object with static TLS

@totaam
Copy link
Collaborator Author

totaam commented Aug 30, 2013

2013-08-30 15:04:00: antoine uploaded file add-csc-opencl-v13.patch (35.6 KiB)

updated patch with support for RGB to YUV444P (and more to come)

@totaam
Copy link
Collaborator Author

totaam commented Aug 31, 2013

2013-08-31 06:17:48: antoine changed status from new to assigned

@totaam
Copy link
Collaborator Author

totaam commented Aug 31, 2013

2013-08-31 06:17:48: antoine changed owner from smo to antoine

@totaam
Copy link
Collaborator Author

totaam commented Aug 31, 2013

2013-08-31 06:17:48: antoine commented


Added support in r4247

According to "Recommended 8-Bit YUV Formats for Video Rendering" (section on "YUV Sampling"), MPEG2's subsampling code (BT.601) is lazier than MPEG1's. But since OpenCL is so cheap to run (it is the memory transfers that cost us), I went for the more exhaustive MPEG1-like calculations instead (using an average of all source pixel values).
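
The difference between the two approaches can be illustrated in plain Python. This is not the OpenCL kernel from r4247, just a sketch of 4:2:0 chroma subsampling on a plane stored as a list of rows; both function names are made up here.

```python
def subsample_420_average(plane):
    """Average each 2x2 block of a full-resolution chroma plane."""
    height, width = len(plane), len(plane[0])
    out = []
    for y in range(0, height, 2):
        y1 = min(y + 1, height - 1)  # clamp on odd heights
        row = []
        for x in range(0, width, 2):
            x1 = min(x + 1, width - 1)  # clamp on odd widths
            total = plane[y][x] + plane[y][x1] + plane[y1][x] + plane[y1][x1]
            row.append((total + 2) // 4)  # rounded integer mean
        out.append(row)
    return out

def subsample_420_nearest(plane):
    """Cheaper alternative: keep only the top-left sample of each block."""
    return [row[::2] for row in plane[::2]]

if __name__ == "__main__":
    chroma = [[10, 20, 30, 40],
              [30, 40, 50, 60]]
    print(subsample_420_average(chroma))
    print(subsample_420_nearest(chroma))
```

The averaging version reads four times as many source values per output sample, which costs little on a GPU compared with the memory transfers.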

Still have to figure out the TLS issue before this can be of any use..

@totaam
Copy link
Collaborator Author

totaam commented Sep 4, 2013

2013-09-04 12:51:04: antoine commented


Testing on a dual Xeon E5-2670 with dual NVidia K1s (more results on the wiki: /wiki/CSC), I found that the individual K1 GPU cores are actually slower than my GTS 450, so using OpenCL with x264 makes it run slower (and I believe the CPU savings are not worth much either):

  • without OpenCL:
encoded 3347 frames, 148.74 fps, 1853.13 kb/s

real	0m22.759s
user	6m40.754s
sys	0m7.133s
  • with OpenCL:
encoded 3347 frames, 89.80 fps, 1866.38 kb/s

real	0m46.335s
user	4m42.685s
sys	0m26.054s
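
The trade-off in those two runs can be quantified with a little arithmetic. The sketch below parses the `time` fields shown above and compares throughput and total CPU time (user + sys); the helper names are illustrative.

```python
import re

def parse_time(value):
    """Convert a `time` field such as '6m40.754s' into seconds."""
    match = re.match(r"^(\d+)m([\d.]+)s$", value.strip())
    if match is None:
        raise ValueError("unexpected time format: %r" % value)
    return int(match.group(1)) * 60 + float(match.group(2))

def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means lower)."""
    return (after - before) / before

if __name__ == "__main__":
    # figures copied from the two runs above
    cpu_without = parse_time("6m40.754s") + parse_time("0m7.133s")
    cpu_with = parse_time("4m42.685s") + parse_time("0m26.054s")
    print("fps change: %+.1f%%" % (100 * relative_change(148.74, 89.80)))
    print("cpu change: %+.1f%%" % (100 * relative_change(cpu_without, cpu_with)))
```

On these numbers, OpenCL cuts total CPU time by roughly a quarter but reduces the frame rate by about 40%.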

@totaam
Copy link
Collaborator Author

totaam commented Sep 6, 2013

2013-09-06 13:53:09: antoine changed status from assigned to closed

@totaam
Copy link
Collaborator Author

totaam commented Sep 6, 2013

2013-09-06 13:53:09: antoine changed resolution from ** to fixed

@totaam
Copy link
Collaborator Author

totaam commented Sep 6, 2013

2013-09-06 13:53:09: antoine commented


The TLS issue has been solved in r4282 by only properly initializing csc_opencl (getting a context) after we have loaded GTK... which works around the problem rather than solving it properly.
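
The general shape of that workaround is lazy initialization: nothing touches OpenCL until the first time a context is actually needed, by which point GTK has been imported. The class below is a hypothetical sketch of the pattern, not the code from r4282.

```python
class LazyContext(object):
    """Defer creating an expensive resource until it is first requested."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds the real context
        self._context = None

    def get(self):
        # the factory only runs on first use, after other imports are done
        if self._context is None:
            self._context = self._factory()
        return self._context

    def reset(self):
        """Drop the cached context so the next get() creates a fresh one."""
        self._context = None
```

A factory for real use might be something like `lambda: pyopencl.create_some_context(interactive=False)`, with the `import pyopencl` itself also placed inside the factory so that the shared libraries are not loaded early.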

OpenCL is now enabled (r4298) and working well so closing this ticket.

Note: we may still want some enhancements:

  • handle more modes with generated kernel byteswapping for channel modes not handled by the runtime library (easy)
  • handle scaling (big!)
  • debug kernel build errors with FreeOCL and pocl

@totaam
Copy link
Collaborator Author

totaam commented Oct 7, 2013

2013-10-07 09:45:59: totaam commented


  • scaling was added in r4310
  • generating missing rgb modes was added in r4303

See also #437

@totaam
Copy link
Collaborator Author

totaam commented Oct 15, 2013

2013-10-15 13:19:02: totaam commented


There were many more changes and tweaks (too many to list).


Note: the TLS issue is discussed here on the PyOpenCL mailing list.
Looks like a PyOpenCL build issue - may need to revisit when testing with the Nvidia SDK which only supports OpenCL 1.1 ...

@totaam
Copy link
Collaborator Author

totaam commented Oct 18, 2013

2013-10-18 04:45:45: totaam changed status from closed to reopened

@totaam
Copy link
Collaborator Author

totaam commented Oct 18, 2013

2013-10-18 04:45:45: totaam changed resolution from fixed to **

@totaam
Copy link
Collaborator Author

totaam commented Oct 18, 2013

2013-10-18 04:45:45: totaam commented


Just found that the AMD icd causes the client to get into a spin and waste CPU on a spinlock.
Simply having the AMD icd in /etc/OpenCL/vendors is enough to trigger the problem, so OpenCL should probably be disabled by default to prevent this. What is really odd is that this only affects the client: the server will happily run with the AMD icd (you can force it to be used with: XPRA_FORCE_CSC_MODE=YUV420P XPRA_CSC_TYPE=opencl xpra start ...).
We cannot do a runtime check, as calling any OpenCL API will cause the loader to dlopen the problematic library, and then we're toast.
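
One thing that can be done without calling any OpenCL API is to read the ICD registry directory directly, since each .icd file is a small text file naming the vendor library to load. This is only a sketch of that idea and is not something xpra is known to do; deciding which library names identify the problematic AMD ICD is left open.

```python
import glob
import os

def list_icd_libraries(vendors_dir="/etc/OpenCL/vendors"):
    """Map each .icd file name to the library it points at, using file reads only."""
    libraries = {}
    for path in sorted(glob.glob(os.path.join(vendors_dir, "*.icd"))):
        with open(path) as f:
            libraries[os.path.basename(path)] = f.read().strip()
    return libraries

if __name__ == "__main__":
    for icd, library in sorted(list_icd_libraries().items()):
        print("%s -> %s" % (icd, library))
```

Because this never loads the OpenCL loader, it cannot trigger the dlopen of a vendor library.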

Beware: one cannot safely strace the xpra client (the machine locks up, and you need to ssh in and kill the strace process).

Here's what strace has to say:

open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
read(10, "0-7\n", 8192)                 = 4
close(10)                               = 0
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f78e9007000
mprotect(0x7f78e9007000, 4096, PROT_NONE) = 0
clone(Process 2797 attached
 <unfinished ...>
[pid  2797] set_robust_list(0x7f78e98079e0, 24 <unfinished ...>
[pid  2655] <... clone resumed> child_stack=0x7f78e9806fb0, \
    flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, \
    parent_tidptr=0x7f78e98079d0, tls=0x7f78e9807700, child_tidptr=0x7f78e98079d0) = 2797
[pid  2797] <... set_robust_list resumed> ) = 0
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff <unfinished ...>
[pid  2655] ioctl(9, 0x4008642a <unfinished ...>
[pid  2797] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2655] <... ioctl resumed> , 0x7fff7aabbb08) = 0
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff <unfinished ...>
[pid  2655] ioctl(9, 0xc03064a6 <unfinished ...>
[pid  2797] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

The futex call repeats forever and the xpra client process consumes >70% CPU doing absolutely nothing.

@totaam
Copy link
Collaborator Author

totaam commented Nov 11, 2013

2013-11-11 09:59:47: totaam edited the issue description

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2013

2013-12-05 16:15:41: totaam commented


And another one for good measure, Intel this time, is doing an illegal memory access, caught with valgrind:

==27195== Invalid read of size 8
==27195==    at 0x118DDA1C: __intel_sse2_strrchr (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x118C8531: tbb::internal::init_dl_data() (dynamic_link.cpp:290)
==27195==    by 0x118C8466: __sti__$E (dynamic_link.cpp:449)
==27195==    by 0x118E8001: ??? (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x118C367A: ??? (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x7FF000276: ???
==27195==    by 0x6E6F687479702E: ???
==27195==    by 0x6E69622F7273752E: ???
==27195==    by 0x746100617270782E: ???
==27195==    by 0x652D2D0068636173: ???
==27195==    by 0x3D676E69646F636D: ???
==27195==    by 0x6E2D2D0034363267: ???
==27195==  Address 0xec4c5d8 is 56 bytes inside a block of size 58 alloc'd
==27195==    at 0x4A06409: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==27195==    by 0x3452405C95: open_path (dl-load.c:2036)
==27195==    by 0x34524086DC: _dl_map_object (dl-load.c:2223)
==27195==    by 0x345240CAD1: openaux (dl-deps.c:63)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x345240D1D1: _dl_map_object_deps (dl-deps.c:256)
==27195==    by 0x34524138BB: dl_open_worker (dl-open.c:265)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x34524131EA: _dl_open (dl-open.c:656)
==27195==    by 0x3452C0102A: dlopen_doit (dlopen.c:66)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x3452C0162C: _dlerror_run (dlerror.c:163)

@totaam
Copy link
Collaborator Author

totaam commented Dec 10, 2013

2013-12-10 09:01:27: totaam changed status from reopened to new

@totaam
Copy link
Collaborator Author

totaam commented Dec 10, 2013

2013-12-10 09:01:27: totaam changed owner from antoine to SmO

@totaam
Copy link
Collaborator Author

totaam commented Dec 10, 2013

2013-12-10 09:01:27: totaam commented


I have added the most important setup and configuration information here: CSC, and the performance data now lives here: CSC


There are new SDKs available:

  • Intel SDK XE 2013 R2, which I am unable to test on my AMD CPU. Can you please check that it still runs OK and maybe add or update the performance data (/wiki/CSC/Performance)? Hopefully they will have fixed the invalid 64-bit memory access from comment:15. If you have time, run the minimal opencl tests under valgrind.
  • AMD APP SDK v2.9 - and I can no longer reproduce the client problems.

Maybe this can be enabled by default server side?

I don't think we will ever bother using OpenCL or nvcuda (#384) for CSC on the client side, since we're better off using OpenGL for CSC, scaling and rendering (it is now stable enough to use).

@totaam
Copy link
Collaborator Author

totaam commented Dec 20, 2013

2013-12-20 00:46:54: smo commented


I've tested the Intel, AMD and Nvidia OpenCL ICDs with no problem. However, there is an issue with the AMD ICD which prevents Xorg from receiving a kill signal. Even just having this ICD available seems to be enough to trigger it.

I'm going to work from a clean install and try to find a set of instructions that includes all the above info to install the Intel + Nvidia ICDs on Fedora 20 to work with xpra.

@totaam
Copy link
Collaborator Author

totaam commented Jan 4, 2014

2014-01-04 05:35:17: totaam commented


I've just hit this error:

clFinish failed: invalid command queue

After a computer suspend-resume, it seems that the context becomes invalid (must have been cleared from the GPU during suspend). r5110 fixes that.
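
A fix of this kind usually amounts to discarding the stale context and retrying once with a fresh one. The sketch below shows that pattern in isolation; it is not the change made in r5110, and detecting the failure by matching the error message text is an assumption made for the example.

```python
def looks_like_stale_context(error):
    """Assumption: recognise the failure by the message seen in this ticket."""
    return "invalid command queue" in str(error).lower()

def run_with_context_retry(get_context, invalidate_context, operation,
                           is_stale=looks_like_stale_context):
    """Run operation(context); on a stale-context error, rebuild and retry once."""
    try:
        return operation(get_context())
    except Exception as error:
        if not is_stale(error):
            raise  # unrelated failure: do not hide it
    invalidate_context()  # force get_context() to build a new context
    return operation(get_context())
```

Here `get_context` and `invalidate_context` are callables supplied by the caller, for example the `get` and `reset` methods of whatever object caches the OpenCL context and queue.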

Quite likely to affect nvenc (added to #466) and csc_nvcuda (added to #384)

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 00:29:31: smo commented


Trying to test with AMD OpenCL using HD 6870 GPU

I'm getting some strange output. Is this normal?

using new OpenCL context
YUV420P to BGRX    at  1920x1080        : 90 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV420P to RGBX    at  1920x1080        : 128 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV422P to BGRX    at  1920x1080        : 113 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV422P to RGBX    at  1920x1080        : 131 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV444P to BGRX    at  1920x1080        : 141 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV444P to RGBX    at  1920x1080        : 112 MPixels/s

Seems to be starting many new contexts.

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 00:59:47: smo commented


Tested a few suspend/resume with r5153 with an ATI HD6870 and no issue.

2014-01-08 17:55:44,912 PyOpenCL loaded, header version: 1.2, GL support: False
2014-01-08 17:55:44,913  using platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.)
2014-01-08 17:55:44,913  using device: GPU: Barts (OpenCL 1.2 AMD-APP (1348.4) / OpenCL C 1.2 )

For more info

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 02:08:07: totaam commented


From comment:20: that's odd, are you not seeing any "using new OpenCL context" after suspend/resume as I was? (I will try an intel chipset too)
The patch opencl-forcewait.patch (/attachment/ticket/422/opencl-forcewait.patch) makes it easier to hit the context problems: it adds a 10 second delay in the encoding so that we can more easily suspend a PC whilst the GPU context is active.

Also, the log from comment:19 is worrying: the context should not have changed during the same run and I don't see how it could.
r5154 will tell us what has changed (the context or "program"). If you still get multiple occurrences of "using new OpenCL context" during the test run, please run the test with XPRA_OPENGL_DEBUG=1 and post the lines preceding these ones. They should read something like: old program=(..), new program=(..) or old context=(..), new context=(..).

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 02:21:04: totaam uploaded file opencl-forcewait.patch (0.5 KiB)

introduces a 10 second delay in the encoding to make it easier to suspend with a live context

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 05:08:11: smo commented


For comment:20

init_context(..) channel order=RGBA, filter mode=NEAREST
init_context(..) kernel_function RGB_to_YUV422P: <pyopencl._cl.Kernel object at 0x3300628>
old program=<pyopencl.Program object at 0x2e21510>, new program=<pyopencl.Program object at 0x2e21510>
using new OpenCL context (program changed)
init_context(..) kernel source=

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 06:12:38: totaam uploaded file opencl-programcompare.patch (0.9 KiB)

try to use the underlying int_ptr to compare opencl program instances

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 06:16:46: totaam commented


What the? the programs are clearly the same... yet fail the comparison test.

Looks like the docs are wrong: pyopencl.Program: "Instances of this class are hashable, and two instances of this class may be compared using “==” and “!=”. (Hashability was added in version 2011.2.)" (unless you are using an outdated version of PyOpenCL?)

Can you please try once more with opencl-programcompare.patch (/attachment/ticket/422/opencl-programcompare.patch) to see if the spurious "using new OpenCL context" messages still occur? (and post your version of the PyOpenCL package)
The easy alternative would be to remove the program test altogether. I have manually verified that we always re-initialize the programs when we re-initialize the device, so this would be safe for now, but it would make the code much more brittle.
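
The idea behind the patch, comparing the underlying handles rather than relying on the wrapper's "==", can be sketched as below. `int_ptr` is the attribute PyOpenCL exposes for the raw handle; the helper `same_cl_object` is illustrative and not the code from opencl-programcompare.patch.

```python
def same_cl_object(a, b):
    """Compare two PyOpenCL wrappers by their underlying handle when possible."""
    if a is None or b is None:
        return a is b
    ptr_a = getattr(a, "int_ptr", None)
    ptr_b = getattr(b, "int_ptr", None)
    if ptr_a is not None and ptr_b is not None:
        return ptr_a == ptr_b  # same underlying OpenCL object
    return a == b  # no handle available: fall back to normal equality
```

The context check would then read `if not same_cl_object(old_program, new_program):` instead of using "!=" directly.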

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 15:34:35: smo commented


Odd, pyopencl seems to be installed as 32-bit??

Using /usr/lib/python2.7/site-packages/pyopencl-2013.2-py2.7-linux-x86_64.egg
I installed this with easy_install -Z pyopencl. I may have to do it by hand, we'll see.

I applied your patch and they seem to be all gone now.

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 15:42:46: totaam commented


OK, I'll try to produce a test case to report the bug to PyOpenCL, which I will have to ask you to test for me since I can't reproduce this weirdness.
In the meantime, r5157 merges the workaround with a long comment explaining its purpose.

FYI: /usr/lib/python2.7/site-packages/ can contain both 32-bit and 64-bit extensions..
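
To see which kind a given extension actually is, the ELF header can be inspected directly: after the 4-byte magic, the fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. A small sketch, with a hypothetical function name:

```python
def elf_class(path):
    """Return 32 or 64 for an ELF file, or None if it is not ELF."""
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64
    return {1: 32, 2: 64}.get(bytearray(header)[4])
```

Running it on the `_cl.so` file inside the pyopencl egg shows which architecture the extension was built for, regardless of the directory it was installed into.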

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 15:46:16: smo commented


Thanks for the clarification. I'll update the performance chart with my numbers from this machine and a quick instruction set for being able to run it.

AMD drivers require some extra stuff like exporting COMPUTE=:0, so I assume you actually have to have an X server running?

That said, I think we've tried out opencl_csc on several platforms now and with several opencl ICDs.

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 23:42:57: smo commented


Install AMD OpenCL on Fedora 20

I did this from a fresh install with LXDE

From a root terminal

yum group install "Development Tools"; yum install kernel-devel opencl-headers gcc-c++

cd /tmp
wget http://www2.ati.com/drivers/beta/amd-catalyst-13.11-betaV9.95-linux-x86.x86_64.zip

unzip amd-catalyst-13.11-betaV9.95-linux-x86.x86_64.zip
chmod +x Install-AMD-APP.sh; ./Install-AMD-APP.sh

I chose to do an express install. It may ask you to reboot; I chose to do this after I installed the AMD APP SDK.

Download AMD-APP-SDK-v2.9-lnx64.tgz from http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/

tar xfvz ../AMD-APP-SDK-v2.9-lnx64.tgz
./Install-AMD-App.sh

I rebooted after this install and proceeded to install pyopencl with easy_install:

easy_install -Z pyopencl

Started and tested xpra with this command line

COMPUTE=:0 XPRA_OPENCL_DEVICE_TYPE=GPU xpra --no-daemon --bind-tcp=0.0.0.0:1300 --start-child="xterm -fg white -bg black" start :13

@totaam
Copy link
Collaborator Author

totaam commented Feb 12, 2014

2014-02-12 19:18:47: smo changed status from new to closed

@totaam
Copy link
Collaborator Author

totaam commented Feb 12, 2014

2014-02-12 19:18:47: smo changed resolution from ** to fixed

@totaam
Copy link
Collaborator Author

totaam commented Feb 12, 2014

2014-02-12 19:18:47: smo commented


Works well with both AMD and Nvidia
