Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash in xor #1749

Closed
totaam opened this issue Jan 22, 2018 · 18 comments
Closed

crash in xor #1749

totaam opened this issue Jan 22, 2018 · 18 comments

Comments

@totaam
Copy link
Collaborator

totaam commented Jan 22, 2018

Issue migrated from trac ticket # 1749

component: encodings | priority: major | resolution: fixed | keywords: crash xor 64bit

2018-01-22 07:22:53: mptei created the issue


I have two 64bit Linux installations of Xpra. Both crash in xor after r17513 which changes xoring from 8bit to 64bit.

That means I took 2.2.x (r18057) and it crashes some seconds after attaching a client.

Thread 3 "xpra" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f14cb52a700 (LWP 19210)]
0x00007f14ddecb625 in __pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str () from /usr/lib64/python2.7/site-packages/xpra/codecs/xor/cyxor.so

When I reverse patch r17560,r17518 and r17513 it isn't crashing.

When I change the new xor routine from 64bit to 32bit it's also not crashing. (see 32bit.patch)

Is it possible that not all buffers are aligned on 8Byte boundaries?

By reading the new xor implementation I also find:

for 0 <= i < steps:
    obuf[i] = abuf[i] ^ bbuf[i]

as an error. If the buffer is smaller than 8byte then 'steps' will be 0, but the 64bit xor is accessing 8bytes of the buffer (which has for instance only 2 bytes).

A debug build also isn't crashing!

python -c 'import struct;print( 8 * struct.calcsize("P"))'

answers '64' (=> python is 64 bit).

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 07:23:39: mptei uploaded file 32bit.patch (1.9 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 07:29:09: antoine changed owner from antoine to Mike

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 07:29:09: antoine commented


Is it possible that not all buffers are aligned on 8Byte boundaries?
It is possible, though unlikely. In any case, all modern CPUs can handle unaligned accesses.

If the buffer is smaller than 8byte then 'steps' will be 0
Are you sure that this loops runs at all when steps=0?
I thought that this Cython for loop syntax was the equivalent to:

for i in range(steps):

Which definitely does not run if steps=0.

Does r18096 fix things for you?

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 07:40:23: mptei commented


No r18096 doesn't change it. I tested that already before opening the ticket.

Only reversing to the old implementation, switching to 32bit xor or using a debug build avoids the crash.

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 07:47:48: antoine commented


Only reversing to the old implementation, switching to 32bit xor or using a debug build avoids the crash.
What sort of CPU are you using?

cat /proc/cpuinfo

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 07:50:32: mptei uploaded file machine1 (7.4 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 07:50:46: mptei uploaded file machine2 (3.6 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 07:51:50: mptei commented


Added machine1, machine2 with output of

cat /proc/cpuinfo

of both machines.

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 07:54:42: antoine commented


The code definitely looks correct and I don't have time to figure out why this crashes on some CPUs. You may want to run it in gdb to see what the real problem is.
In the meantime, does r18097 fix things?

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 08:02:43: mptei commented


r18097 isn't compiling (misses uint64_t used four out buffer).

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 08:04:20: antoine commented


Oops, sorry. Compile tested this time: r18098.

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 08:08:27: mptei commented


Now it compiles and runs without crash.

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 08:13:54: antoine changed status from new to closed

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 08:13:54: antoine set resolution to fixed

@totaam
Copy link
Collaborator Author

totaam commented Jan 22, 2018

2018-01-22 08:13:54: antoine commented


Backported to v2.2.x in 18099.

I'd love to know why those CPUs struggle with a simple 64-bit XOR but I don't have the time to look into it. Feel free to post the GDB backtrace and more details though.

@totaam totaam closed this as completed Jan 22, 2018
@totaam
Copy link
Collaborator Author

totaam commented Jan 23, 2018

2018-01-23 06:28:07: mptei commented


I had a look at assembly level.
The compiler (gcc) generates code which uses the SSE instruction pxor for 128 bit xor.
The SSE instruction needs data aligned on 16 byte boundaries. There is no "unaligned" version of this instruction. Getting the data from mem and storing is done via 16 byte transfers, but the "unaligned" version of the load/store instruction is used (therefore there no crash happens).

When casting to uint64_t* the compiler seems to assume at least an 8 byte alignmend, which would make sense. Therefore in this case there is no pre-check on the alignment and the pxor instruction gives an exception when the data is unaligned.

Now the good news: When casting to uint32_t* the compiler also uses 128 bit xor, but now assumes a 4 byte alignment of the buffers. So it generates code which deals with data not aligned on 16 byte boundaries so that the 128 bit pxor still gets data on a 16 byte boundary.
At the end there will be no difference in performance between the uint32_t* and the uint64_t* cast and the current solution is absolutely ok.

I guess that the buffers allocated by python/cython will still be 4 byte aligned. Bad: this is still a guessing.
Best would be to add assertions to check the 4 byte alignmend.

This is the 128 Bit main xor loop:

   ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
   │0x7f813205f61a <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+362>        movdqu 0x0(%[r13](../commit/cebe4aa94413d8a8ff9ff031edb7eec998103d5b),%rsi,1),%xmm0                                                          │
   │0x7f813205f621 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+369>        add    $0x1,%r10d                                                                      │
  >│0x7f813205f625 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+373>        pxor   (%[r14](../commit/4f52574dc1620c9283b71b1aae937cfb9cddd3aa),%rsi,1),%xmm0                                                             │
   │0x7f813205f62b <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+379>        movups %xmm0,(%rcx,%rsi,1)                                                             │
   │0x7f813205f62f <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+383>        add    $0x10,%rsi                                                                      │
   │0x7f813205f633 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+387>        cmp    %r12d,%r10d                                                                     │
   │0x7f813205f636 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+390>        jb     0x7f813205f61a <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+362>                │
   └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

At the end, not the CPU has problems to perform a simple 64 bit xor, but the cast to uint64_t* is simply invalid.

@totaam
Copy link
Collaborator Author

totaam commented Jan 23, 2018

2018-01-23 07:12:32: antoine commented


Excellent investigation. I'm pretty sure I've seen somewhere that newer CPU models can deal with 64-bit unaligned access, this would explain why I have not seen any problems at all on any of my test systems.

I guess that the buffers allocated by python/cython will still be 4 byte aligned. Bad: this is still a guessing.
The buffers that we allocate explicitly are all memaligned on Linux (to 64 by default), aligned to 16-bit on macos and not aligned at all on win32... (for details see: [/browser/xpra/trunk/src/xpra/buffers/memalign.c])
In any case, the input to the xor function isn't always directly under our control and this is meant as an opaque call anyway. FWIW: the python string buffers always seem to be 4-byte aligned, but we accept any kind of buffer as input, including memoryviews and those can be sliced to any value...

So r18124 adds a slow byte-at-a-time path for unaligned addresses - the improved unit test exercises it.

@mike: I think this covers everything?

@totaam
Copy link
Collaborator Author

totaam commented Jan 23, 2018

2018-01-23 07:32:39: mptei commented


I've tested it and had no crash anymore.

Many thanks for your quick fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant