Server hangs in XSync() #3503

Closed
mdavidsaver opened this issue Mar 25, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@mdavidsaver
Contributor

Describe the bug

I'm seeing instances of a server hanging in XSync(). I don't yet have a useful stack trace with debug symbols. What I have is attached below, and looks somewhat similar to #475.

To Reproduce

tbd. I'm not yet sure how to trigger this issue. I'm going to rebuild with debug information and wait for another occurrence. Other suggestions for troubleshooting are very welcome.

System Information (please complete the following information):

  • Server OS: Debian 11 / amd64
  • Client OS: html5
  • Xpra repo: 6cda08d
  • Xpra html5 repo: 3d65c16439904b9fc6f80068226cb83ceb92b9bc

Additional context
Add any other context about the problem here.
Please see "reporting bugs" in the wiki section.

instance1.txt
instance2.txt

mdavidsaver added the bug label Mar 25, 2022
@totaam
Collaborator

totaam commented Mar 26, 2022

Please specify more details about your environment, versions, etc. As per:
https://github.com/Xpra-org/xpra/wiki/Reporting-Bugs

I see this in your backtraces which tells me that this is not a standard setup: /opt/xpra/usr/b

It would help to have debug symbols and to know where in the python event loop code it is failing (not the cython .c generated file):

__pyx_f_4xpra_3x11_4gtk3_12gdk_bindings_parse_xevent (__pyx_v_e_gdk=0x7ffe20c75d70) at xpra/x11/gtk3/gdk_bindings.c:17627

I doubt this is the same problem as #475, but you could always try: XPRA_XSHM=0 xpra start ...

@mdavidsaver
Contributor Author

mdavidsaver commented Mar 26, 2022

Please specify more details ...

xpra showconfig

/opt/xpra/usr/bin/xpra start --daemon=no \
 --chdir=/home/mdavidsaver --start=/usr/local/bin/perpetual-xterm \
 --terminate-children=yes --mdns=no \
 --bind-tcp=0.0.0.0:14500 --tcp-auth=sys \
 :10

I'm seeing this issue in conjunction with the html5 client. I have a group of ~20 users, and only 4 report hangs. Though each of these has had multiple occurrences. These users run a variety of browsers (Safari, Firefox, Chrome), and I don't yet see any commonality.

The first symptom is that the xpra server process "freezes". eg. I then see that new http connections are not accept()ed. This, and seeing other threads in PyThread_acquire_lock_timed() suggests to me that the call to gdk_flush() is being made while the GIL is locked.

My searches for "xsync hang" and "gdk_flush hang" have not been helpful. Reading the man page for XSync() and the source makes it clear that this function will block without timeout until the X server replies (apparently to a GetInputFocus message). The fact that the thread is making a poll() as opposed to futex() suggests to me that this is not a deadlock in xpra, and that the X server is involved somehow.

I guess I can get stack traces from the Xvfb process next time. Maybe I'll get lucky and it will be obviously stalled.

Could this be triggered by a misbehaving X client application?

I'm working with a java/openjfx application, which I know to be troublesome wrt. gtk usage. I'm using xpra in part because the combination seems to have the fewest glitches.

So I'm not sure if a stack trace would show if eg. some client application has grabbed the server.

I see this in your backtraces which tells me that this is not a standard setup

I'm running a local build of the git revisions mentioned above against debian packaged dependencies. The only local change is to xpra/platform/xposix/menu_helper.py. I'm having problems figuring out xdg menu files, so I changed load_xdg_menu_data() to return a static dict. (I still plan to get back to #3471)

It would help to have debug symbols ...

I'm planning to rebuild xpra, passing --with-debug. It looks like Debian 11 no longer packages debug symbols for X related things (cf. dbgsym section and find-dbgsym-packages), or debuginfod (which can be really slow!).

I doubt this is the same problem as #475

I concur. I linked that issue because it is the only other mention of XSync().

@totaam
Collaborator

totaam commented Mar 27, 2022

  • try setting XPRA_X_SYNC=1 when starting the server.
    This will enable XSynchronize.
  • xtrace / xtruss are very chatty but may be useful to show the last few exchanges before the hang.
  • long shot: swap Xvfb for Xdummy
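
Since xtrace / xtruss output is very chatty and only the last few exchanges before the hang matter, one option is to keep just the tail of the trace. The following is a minimal sketch, not part of xpra: `last_lines` is a hypothetical helper name, and it makes no assumption about the trace format beyond it being line-oriented text.

```python
from collections import deque


def last_lines(lines, keep=200):
    """Return the final `keep` lines of an iterable of text lines.

    A bounded deque discards older lines as new ones arrive, so memory
    use stays constant however long the trace runs.
    """
    return list(deque((line.rstrip("\n") for line in lines), maxlen=keep))
```

It can be fed any iterable of lines, for example an open log file object, and the result inspected once the server has stopped responding.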

I'm working with a java/openjfx application ..

Ah. Those are notoriously flaky.
Sometimes, simply updating the JDK solves the problem!

So I'm not sure if a stack trace would show if eg. some client application has grabbed the server.

It would not - it would look exactly the same as what we have here.
You would need to trace that specific application to see it.

@mdavidsaver
Contributor Author

I had one more occurrence, from which I am able to collect a little more information. I am able to leave things running in the hung state for the time being, so I could perform additional postmortem tests if any come to mind.

I was able to capture stack traces of all processes associated (by systemd) with this xpra instance. Unfortunately, while I did install some Debian debug info packages, it looks like I didn't point to a debug build of xpra (oops...).

This may be moot, as the Xvfb process appears to be idling normally. I also don't see anything abnormal in the 4 (of 71) threads in the java/jfx application making glib/gtk calls. (I'll continue looking at the java process as there is a reasonable chance I'm missing something)

I also checked (with netstat and ss) the state of the various socket buffers. The TX/RX queues for all of the unix domain connections are empty, including the X related ones. This is consistent with Xvfb idling normally. (maybe it could be inspected by some X client?)

# ss -xn
Netid State Recv-Q Send-Q                                          Local Address:Port    Peer Address:Port   Process
...
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 5265897            * 5265896       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 5264770            * 5264769       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 5264774            * 5264773       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 4812931            * 4812930       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 2267950            * 2268372       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 5264776            * 5264775

The TCP connection queues are not, which is as expected with the GIL being locked for the XSync().

# netstat -tpn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
...
tcp      870      0 10.136.0.22:14500       training:39066          CLOSE_WAIT 
...
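
The queue check above can also be scripted. This is a rough sketch, assuming the `netstat -tn` column layout shown in the output above (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); `nonempty_queues` is a hypothetical helper name, not an existing tool.

```python
def nonempty_queues(netstat_output):
    """Return TCP rows whose Recv-Q or Send-Q is non-zero.

    Each result is (local, foreign, state, recv_q, send_q).
    Header lines and anything that is not a tcp row are skipped.
    """
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # A data row looks like:
        # tcp 870 0 10.136.0.22:14500 training:39066 CLOSE_WAIT
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        if recv_q or send_q:
            rows.append((fields[3], fields[4], fields[5], recv_q, send_q))
    return rows
```

A non-zero Recv-Q on the xpra listening port, as in the CLOSE_WAIT row above, means bytes have arrived that the server process has not read.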

Also, it looks like we're running Sun JDK 11.0.2 atm. Which of course has no debug symbols... openjdk 17.0.2 is also installed, and I thought this was being used. sigh... maybe next time.

Finally, it is unlikely I'll be able to trigger this hang again in the near term. I haven't been able to do so myself, and the event which provided additional users (a training class) has ended. My suspicion atm. is that the xpra hang is somehow a side effect of misbehavior by OpenJFX. As you say, gtk support in jfx is notoriously buggy. (I've looked at the gtk2/3 binding code for both openjfx and SWT, and both are nightmarish rats nests!) So this ticket could be closed if, as I expect, nothing further can be learned from the information I have provided.

@totaam
Collaborator

totaam commented Apr 4, 2022

I was wrong when I said:

It would not - it would look exactly the same as what we have here.
You would need to trace that specific application to see it.

You would not be able to connect to the X11 server until the lock is released.

As per my previous comment: #3503 (comment)
It could be useful to know which line corresponds to xpra/x11/gtk3/gdk_bindings.c:17627

Without that, I can only suggest running with:

XPRA_X11_DEBUG_EVENTS=all xpra start ...

Which is going to generate a huge amount of debug logging but may show us the event that's triggering the bug.
(or it could just turn it into a Heisenbug and make it disappear)

@mdavidsaver
Contributor Author

wrt. X server locking. Is there some way I can probe this without restarting the Xvfb process? How complete would this lockout be? eg. could something like xset be expected to succeed?

It could be useful to know which line corresponds to xpra/x11/gtk3/gdk_bindings.c:17627

Sorry, I didn't pick up on this. The full gdk_bindings.c. The first comment above gdk_bindings.c:17627 is:

        /* "xpra/x11/gtk3/gdk_bindings.pyx":1035
 *         elif etype == PropertyNotify:
 *             pyev.window = _gw(d, e.xany.window)
 *             pyev.atom = trap.call_synced(_get_pyatom, d, e.xproperty.atom)             # <<<<<<<<<<<<<<
 *             pyev.time = e.xproperty.time
 *         elif etype == ConfigureNotify:
 */

totaam added a commit that referenced this issue Apr 4, 2022
also make things consistent and always use an X11 trap sync context so that X11 BadAtom errors will be caught here
@totaam
Collaborator

totaam commented Apr 4, 2022

could something like xset be expected to succeed?

Yes.
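
A probe like the xset one can be run under a timeout, so that the probe itself cannot end up stuck alongside the server. The following is only a sketch: `probe` and `probe_display` are hypothetical names, the `:10` default is the display from the start command earlier in this thread, and `xset q` is assumed to be installed on the host.

```python
import os
import subprocess


def probe(argv, timeout=5.0, env=None):
    """Run a command and classify the outcome.

    Returns "ok" (exit status 0), "failed" (non-zero exit),
    "timeout" (no exit within `timeout` seconds; the child is killed),
    or "missing" (executable not found).
    """
    try:
        proc = subprocess.run(
            argv,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        return "missing"
    return "ok" if proc.returncode == 0 else "failed"


def probe_display(display=":10", timeout=5.0):
    """Query the given X display with `xset q` under a timeout."""
    env = dict(os.environ, DISPLAY=display)
    return probe(["xset", "q"], timeout=timeout, env=env)
```

A "timeout" result only says the client got no answer in time; it does not by itself identify which application is responsible.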

trap.call_synced(_get_pyatom, d, e.xproperty.atom)

Ah, now that is interesting!
IIRC, we did have a problem like this one before with Java applications and atoms that don't exist.
I am hoping that the commit above will fix that. It has been a while since I've had to touch this sensitive X11 / GDK glue, but the commit does look correct.

The PropertyNotify was one of a few places that was already using a trap.call_synced context, but perhaps this was still confusing GTK when the atom doesn't / no longer exists.
If not, --debug x11 may have what we're looking for.

@totaam
Collaborator

totaam commented Apr 26, 2022

Feel free to re-open if you can still reproduce with 1e56be6 or later.

@totaam totaam closed this as completed Apr 26, 2022