catch limited bandwidth issues sooner #999
afarr forwarded some links:
2015-10-25 05:29:13: antoine commented
I have not found a simple solution to this problem, at least not one that can be merged this late in the release cycle. Re-scheduling (hopefully some of the changes can be backported). But I did find a huge bug in the process: r11376 (backported in r11380).
2016-05-13 00:21:31: afarr commented
The low-level network code is a bit messy, in large part because of win32 and the way it (doesn't) handle blocking sockets...
At the moment, we detect the network bottleneck because the network write call takes longer to return. Related reading:
Far too late to make intrusive changes to the network layer. Good read: https://github.com/TigerVNC/tigervnc/wiki/Latency
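For illustration, the write-timing detection described above boils down to something like this (a minimal sketch: the function name and the 50ms threshold are assumptions, not xpra's actual code, and on win32 the blocking behaviour differs as noted above):

```python
import time
import socket

# Assumed threshold: a blocking send that takes this long suggests the
# socket buffer is full, i.e. we are writing faster than the link drains.
SLOW_WRITE_THRESHOLD = 0.050    # 50ms

def timed_send(sock: socket.socket, data: bytes) -> float:
    """Send data and return how long the write blocked."""
    start = time.monotonic()
    sock.sendall(data)
    elapsed = time.monotonic() - start
    if elapsed > SLOW_WRITE_THRESHOLD:
        # the write stalled: the network layer is pushing back on us
        print(f"slow send: {len(data)} bytes took {elapsed * 1000:.0f}ms")
    return elapsed
```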
The big hurdle for fixing this is that we have a number of queues and threads sitting in between the window damage events and the network sockets. Things to figure out:
Things to do:
Testing with r15691 Fedora 26 server and a win7 64-bit client using the default encoding on a 4k screen, connecting over 100Mbps LAN and using glxspheres to generate constant high fps. Here are the settings we change during the tests (showing the default value here):
Example of tc changes we can make:
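For instance, shaping of this kind can be driven from a small Python wrapper (a sketch: assumes Linux, root privileges, and the interface name; the rate and delay values are arbitrary examples, not the ones used in these tests):

```python
import subprocess

def tc(*args: str) -> None:
    # run a tc command, echoing it for the test log
    cmd = ["tc"] + list(args)
    print("$", " ".join(cmd))
    subprocess.check_call(cmd)

def constrain(iface: str = "eth0", rate: str = "100mbit", delay: str = "10ms") -> None:
    # replace the root qdisc with netem to emulate a slower, laggier link
    tc("qdisc", "replace", "dev", iface, "root", "netem",
       "rate", rate, "delay", delay)

def restore(iface: str = "eth0") -> None:
    # remove the shaping qdisc entirely
    tc("qdisc", "del", "dev", iface, "root")
```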
For collecting statistics:
Initial notes:
2017-11-16 14:31:08: antoine uploaded file
Simple way to reproduce the problems:
Launch glxgears and then enlarge the window; the sudden increase in bandwidth consumption causes a spike in send latency:
The heuristics quickly adapt to hitting the ceiling in this case: it takes under a second to go back to no delay at all (probably thanks to the batch delay adjusting dynamically with the backlog). In other cases, things take longer to settle:
And in the meantime, packets (and that's any packet, not just picture updates...) can take a second to send, so the user experience suffers.
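The batch-delay feedback loop mentioned above might look roughly like this (a simplified sketch; the real heuristics weigh many more factors, and all names and constants here are assumptions):

```python
MIN_DELAY = 0.005    # 5ms between damage batches when the link is idle
MAX_DELAY = 1.0      # cap the batch delay at 1s under heavy backlog

def adjust_batch_delay(current_delay: float, backlog_packets: int) -> float:
    """Grow the delay quickly when packets queue up, shrink it as they drain."""
    if backlog_packets > 0:
        # back off: scale up with the size of the backlog
        new_delay = current_delay * (1.5 + 0.1 * backlog_packets)
    else:
        # no backlog: decay back towards the minimum
        new_delay = current_delay * 0.8
    return min(MAX_DELAY, max(MIN_DELAY, new_delay))
```

The asymmetry is the point: backing off fast when the backlog appears keeps latency bounded, while decaying slowly avoids oscillating straight back into congestion.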
2017-11-17 12:55:02: antoine uploaded file
Some preparatory / related work:
The patch above applies to r17449. Still TODO:
2017-11-18 18:00:34: antoine uploaded file
Much improved patch attached:
Most work items from comment:16 remain, and also some new ones:
Available bandwidth detection added in r17452.
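In outline, this kind of detection can build on the write timings from earlier: only count bytes sent while the link was saturated, since an idle link says nothing about its ceiling. A sketch with assumed names (the actual r17452 code is more elaborate):

```python
import time

class BandwidthEstimator:
    """Estimate link capacity from bytes sent while the socket was busy."""

    def __init__(self):
        self.window_start = time.monotonic()
        self.bytes_sent = 0
        self.estimate = 0.0     # bits per second, 0 = unknown

    def record_send(self, nbytes: int, write_was_slow: bool) -> None:
        self.bytes_sent += nbytes
        elapsed = time.monotonic() - self.window_start
        # only trust measurements taken while the link was saturated
        if write_was_slow and elapsed >= 1.0:
            self.estimate = self.bytes_sent * 8 / elapsed
            self.window_start = time.monotonic()
            self.bytes_sent = 0
```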
2017-11-19 08:08:15: antoine uploaded file
More reliable and accurate bandwidth limit auto-detection in r17455. Main todo items for this release:
Updates:
The big remaining problem: a single png frame can kill the bandwidth...
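A quick back-of-the-envelope shows why (the 2MB frame size is an assumed example; a lossless png of a busy 4k region can easily reach that):

```python
frame_bytes = 2 * 1024 * 1024        # assume a 2MB png frame
link_bps = 1_000_000                 # a 1Mbps link
seconds = frame_bytes * 8 / link_bps
print(f"{seconds:.1f}s to transmit one frame")   # ~16.8s
```

One such frame monopolizes the link for many seconds, delaying every other packet queued behind it.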
2017-11-20 16:09:26: antoine uploaded file
Updates:
2017-11-23 07:58:28: antoine uploaded file
More updates:
I can now run glxspheres at 4 to 8fps on a 1Mbps connection without getting any stuttering! If anything, we use less bandwidth than we should (~250Kbps). Being under the limit is good, as it allows the odd large frame to go through without causing too many problems, but it means that the bandwidth-limit detection code never fires; maybe we could raise the quality or speed (framerate) a bit, as sketched below.
New related ticket for followup: #1700 "faster damage processing".
For very limited bandwidth situations, it would be nice to have a more progressive picture refresh: either flif (#992), or just raising the jpeg / webp quality, only doing true lossless if we have time and bandwidth.
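The quality-raising idea could be as simple as this (the function name, thresholds and step sizes are illustrative assumptions, not xpra's actual tuning):

```python
def tune_quality(quality: int, used_bps: float, limit_bps: float) -> int:
    """Nudge the picture quality up when we have bandwidth headroom."""
    if limit_bps <= 0:
        return quality          # no known limit: leave quality alone
    usage = used_bps / limit_bps
    if usage < 0.5:
        quality = min(100, quality + 5)   # plenty of headroom: raise quality
    elif usage > 0.9:
        quality = max(10, quality - 10)   # close to the ceiling: back off
    return quality
```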
With all the latest improvements and fixes (r17491, r17493, r17515, r17516, etc.) and running glxspheres for testing, the server ends up turning on b-frames (#800), video scaling and YUV420P subsampling. It then reaches 25fps on a 1Mbps connection! I think this is as good as it is going to get for this release; we can improve it some more in #1700.
@maxmylyn: bear in mind that a lossless refresh will use up a few seconds' worth of bandwidth; this is unavoidable. We try to delay the refresh as much as possible, especially when we know the bandwidth is limited, and this works better if we know in advance that there are limits (#417).
We need results from the automated tests to ensure that performance has not regressed in the non-bandwidth-constrained case. When debugging, we need "-d refresh,compress,regionrefresh" and the usual "xpra info", plus a reproducible test case that does not involve user interaction, as I've launched glxspheres literally thousands of times to get here.
Some links for testing bandwidth constraints:
To try to mitigate the problems with the auto-refresh killing the bandwidth, r17540 switches to "almost-lossless" (using high quality lossy) when bandwidth is scarce.
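In outline, the switch amounts to something like this (a minimal sketch with assumed names; r17540's actual encoder selection is more involved):

```python
def refresh_encoding(bandwidth_scarce: bool) -> tuple:
    """Pick the auto-refresh encoding: near-lossless when bandwidth is tight."""
    if bandwidth_scarce:
        # high-quality lossy: visually close to lossless, far fewer bytes
        return "jpeg", 95
    # plenty of bandwidth: do a true lossless refresh
    return "png", 100
```

The point is that a quality-95 lossy refresh is visually near-lossless at a fraction of the size, so the refresh no longer monopolizes a scarce link.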
2017-12-11 23:53:24: maxmylyn commented
Aggressively lowering the quality is one thing, but this sounds like the auto-refresh is kicking in too slowly.
Well, this ticket has been set to "critical" and was assigned to you for the 2.2 release; that would be a more obvious way.
Any revision later than r17539? There are ~68 revisions since then.
No.
This would manifest itself as an auto-refresh that is lossy but near-lossless; it does not affect how quickly it kicks in. It does nothing to the "quality" setting, and AFAICR the only changes to the quality setting calculations apply when there are bandwidth constraints (r17456). You didn't include any debug information (ie: "xpra info" would be a start, also "-d refresh,regionrefresh" since this is a refresh issue), so we can't say for sure if this is at play here. The changesets that do change how we handle auto-refresh are: r17491, r17481, r17480, r17458, r17456.
@maxmylyn commented:
Okay, I had a feeling that I was mistaken but wanted to make sure, thanks for confirming that. So I did what I should have done yesterday and turned on the OpenGL paint boxes to investigate what's actually going on rather than blindly guessing. What I'm seeing is that as of r17607 it's painting an entire Firefox window with h264 (that is still the blue one, right?) under certain situations. In some cases that makes sense, like when I switch from tab to tab rather quickly or click through a series of Wikipedia links quickly. But there are other cases where it doesn't make sense, most notably these two:
After some further investigation, it's much easier to repro what I'm seeing by opening an Xterm and running ... So before I walk through the revisions you listed, try a few other things, and then get some logs / xpra info: do you want to create a new ticket, or is that related enough to this one to continue here? (In case you're still awake, but I think that's highly unlikely.)
2018-04-06 00:25:27: maxmylyn commented
2018-04-06 00:26:23: maxmylyn uploaded file
2018-04-06 17:28:49: antoine commented
2018-04-19 19:11:44: maxmylyn commented
2018-04-22 09:16:05: antoine commented
2018-05-03 21:57:40: maxmylyn commented
2018-05-03 21:58:21: maxmylyn uploaded file
2018-05-06 18:31:06: antoine commented
2018-05-10 21:31:29: maxmylyn commented
Found another wrapper tool which may be useful for simulating various network conditions: comcast
Got some logs which show:
Which means that it takes around 44ms to compress and send the packet out to the network layer, often less.
Except that in some cases it can take 1391ms!!
There is another one, which isn't quite as bad:
At that point the UI became sluggish, about 0.5s behind the actual actions.
Not entirely sure what we should be doing here: by the time the OS is pushing back to us, it is too late already and things will be slow because there isn't enough bandwidth to service us.
Maybe we can watch the "damage out latency" more carefully and immediately increase the batching delay to prevent further degradation?
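Something along these lines, perhaps (a sketch with assumed names; the 50ms target and multipliers are arbitrary, not measured values):

```python
def on_damage_out_latency(latency: float, batch_delay: float) -> float:
    """React immediately to a latency spike instead of waiting for averages."""
    TARGET = 0.050          # assumed: 50ms is a healthy damage-out latency
    if latency > 4 * TARGET:
        # severe spike (like the 1391ms case above): back off hard right away
        return min(1.0, batch_delay * 4)
    if latency > TARGET:
        # mild pressure: grow the delay proportionally to the overshoot
        return min(1.0, batch_delay * (latency / TARGET))
    return batch_delay
```

Reacting to a single bad sample rather than a smoothed average would trade some false positives for catching the degradation before the queues fill up.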