Performance - Data batching #3336
Conversation
Additionally, I'm removing a call to disable pinch-zoom; it's disabled by default since Electron 2 (https://electronjs.org/releases#2.0.0).
```js
// Max size of a session data batch. Note that this value can be exceeded by ~4k
// (chunk sizes seem to be 4k at the most)
const BATCH_MAX_SIZE = 200 * 1024;
```
I tried different values but this one (200k) had the best throughput. Using, e.g., 100k can lower the throughput by ~25% 🔥, but increasing it further (say 300k) lowers the framerate to below 30 FPS.
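For context on the "can be exceeded by ~4k" note above, here is a rough, illustrative sketch (not the PR's actual code) of a size-triggered flush; the overshoot comes from checking the size after appending a chunk rather than before:

```js
// Illustrative only: size-triggered flush. Checking the size *after* appending
// means the flushed batch can overshoot BATCH_MAX_SIZE by up to one pty chunk
// (~4k); checking *before* appending keeps it strictly under the cap.
const BATCH_MAX_SIZE = 200 * 1024;

let batch = '';

function onData(chunk, flush) {
  batch += chunk;                  // append first...
  if (batch.length >= BATCH_MAX_SIZE) {
    flush(batch);                  // ...so this batch may exceed the cap
    batch = '';
  }
}
```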
A future improvement could determine this value dynamically
@juancampa what strategies would you use for dynamic computation of this value? Maybe we can create an RFC issue?
A few notes from my side:
- Keep in mind that the 16 ms stacks on top of the input roundtrip latency, and people might start complaining about lousy terminal response while typing. In the xterm.js demo app I found 5 ms helped a lot while not yet being noticeable; 15 ms already introduced a tiny lag when typing really fast.
- 4k is the default pty buffer size on many Linux distros, but you cannot rely on that (it can be changed per pty and system-wide). BSDs (and macOS) use an adaptive buffer size, typically ranging from 16k to 64k depending on the current data pressure.
Btw 200k seems quite big to me; you might want to test this with `yes`, which triggers a very expensive 'lineFeed' event in the terminal for every second byte. Additionally the event can be hooked by an external handler, which might introduce even more workload (not sure if you allow plugins to consume it).
> Keep in mind that the 16 ms stacks on top of the input roundtrip latency, and people might start complaining about lousy terminal response while typing. In the xterm.js demo app I found 5 ms helped a lot while not yet being noticeable; 15 ms already introduced a tiny lag when typing really fast.

I had the same concern (main reason I use kitty) but I haven't felt any lag so far. We'll have to keep an eye on feedback.

> 4k is the default pty buffer size on many Linux distros, but you cannot rely on that (it can be changed per pty and system-wide). BSDs (and macOS) use an adaptive buffer size, typically ranging from 16k to 64k depending on the current data pressure.

Good point. I actually have to remove the "Note..." part: it was happening in my original implementation, but it didn't feel right to go over a "max", so I fixed it to never exceed BATCH_MAX_SIZE.
> what strategies would you use for dynamic computation of this value? Maybe we can create an RFC issue?

I was thinking the parser could generate "back pressure". Basically: "Hey, the chunks you are sending are too big for me to parse/render in one frame, please send smaller ones". This could be less important if we run the parser in a WebWorker, which I think should be the next step (the limiting factor is on the renderer process).

UPDATE: It might be better to implement this in xterm.js since it overlaps a lot with xtermjs/xterm.js#1818 (comment)
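For illustration only, one hypothetical shape of that back pressure (names and thresholds are made up, not part of this PR or xterm.js): the consumer reports how long the last batch took, and the producer adjusts its batch size to stay within a frame budget.

```js
// Hypothetical adaptive batch sizing driven by parse/render feedback.
const FRAME_BUDGET_MS = 12;       // leave headroom inside a 16 ms frame

let batchSize = 200 * 1024;

function onBatchProcessed(elapsedMs) {
  if (elapsedMs > FRAME_BUDGET_MS) {
    // Too slow: ask for smaller batches next time.
    batchSize = Math.max(16 * 1024, Math.floor(batchSize / 2));
  } else if (elapsedMs < FRAME_BUDGET_MS / 2) {
    // Plenty of headroom: allow bigger batches again.
    batchSize = Math.min(512 * 1024, batchSize * 2);
  }
}
```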
> Btw 200k seems quite big to me; you might want to test this with `yes`, which triggers a very expensive 'lineFeed' event in the terminal for every second byte.

I just tested `yes` and the buffering definitely doesn't help (drops FPS from ~50 to ~30). Basically, 200k is a good value for `find ~`, which is arguably a more common kind of output than `yes`. This is all more evidence for implementing @jerch's proposal here: xtermjs/xterm.js#1818 (comment)
Yeah, I think it's better suited in `Terminal.write`. We already do xon/xoff flow control there; imho it's better to keep it in one place. Otherwise every buffering step involved would have to implement flow control on its own to get a working back pressure chain, which gets quite cumbersome with node's async nature.
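As a generic sketch of the idea (not xterm.js's actual implementation): flow control usually means pausing the data source above a high watermark of pending bytes and resuming below a low watermark. The `onParsed` hook below is hypothetical; pause()/resume() are assumed to exist on the source, as they do on Node readable streams.

```js
// Watermark-based flow control sketch (illustrative, not xterm.js code).
const HIGH_WATERMARK = 1024 * 1024; // pause the source above this many pending bytes
const LOW_WATERMARK = 256 * 1024;   // resume once the terminal has caught up

let pending = 0;

function forward(source, term, data) {
  pending += data.length;
  if (pending > HIGH_WATERMARK) source.pause(); // back pressure toward the pty
  term.write(data);
}

// Hypothetical hook: called once the terminal has parsed a batch of `byteCount` bytes.
function onParsed(source, byteCount) {
  pending -= byteCount;
  if (pending < LOW_WATERMARK) source.resume();
}
```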
Be careful, this PR contains the commits from #3329 (which should be merged first).

Yeah, I didn't want to test this on canary since it might change the benchmark results. I just made the warning more prominent :)
```js
constructor(uid) {
  super();
  this.uid = uid;
  this.decoder = new StringDecoder('utf8');
```
UTF8? hyper is already using binary transport for the pty? We might add this to xterm.js eventually; I had some good results with it, lifting some burden from the input parsing (currently there are some nonsense back-and-forth conversions going on).
Not sure I follow this one. IIUC electron can't send `Buffer` objects to the renderer process, so we turn the binary input into a String. If we move parsing to a worker, we might be able to use `SharedArrayBuffer` to minimize conversions.
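A small aside on why `StringDecoder` is used for that Buffer-to-String step rather than a plain `toString('utf8')`: a multibyte character can be split across two pty chunks, and `StringDecoder` buffers the partial bytes until the rest arrives. A self-contained Node example:

```js
const { StringDecoder } = require('string_decoder');

// '€' is three bytes in UTF-8 (0xE2 0x82 0xAC); split it across two chunks.
const decoder = new StringDecoder('utf8');
const chunk1 = Buffer.from([0xe2, 0x82]);
const chunk2 = Buffer.from([0xac]);

// Decoding each chunk independently produces replacement characters; the '€' is lost.
console.log(chunk1.toString('utf8') + chunk2.toString('utf8'));

// StringDecoder keeps the incomplete sequence and emits '€' once it is complete.
console.log(decoder.write(chunk1) + decoder.write(chunk2));
```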
But then why are you dealing with UTF8 at all? node-pty is supposed to return JS strings already if the encoding is omitted (it defaults to UTF8, decoded by the underlying `Socket`).

So how do you send data over to the window process? It is not able to handle binary data? Not a websocket? (I'm not used to electron internals.)
Bummer - electron's IPC mechanism supports ArrayBuffers since v3, but they get BASE64 encoded and a JSON envelope (electron/electron#13055).

From my own tests electron performs worse than our demo app based on websockets (about half as fast); maybe the IPC mechanism is the culprit? Even strings get the JSON envelope, which is not any good from the perf perspective. Maybe you want to try out our demo app and compare yourself; not sure if directly invoking node-pty in the renderer or falling back to websockets would solve this.

Ok, enough thread hijacking 😸
> but gets BASE64 encoded and a JSON envelope

Exactly :(

> falling back to websockets

IPC doesn't seem to cost that much though; V8 strings are pretty fast. In this profile of one frame, IPC only takes a tiny sliver (<0.5 ms), so there's not a lot to gain there. It might become more prominent when we do WebWorkers, with the extra hop, but then again we'll probably use `SharedArrayBuffer` for that.
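For illustration, a minimal sketch of what that worker hand-off could look like (the worker script name is hypothetical, and `SharedArrayBuffer` availability depends on the Chromium version/flags in use):

```js
// Illustrative sketch of handing pty bytes to a parser worker without copying.
const worker = new Worker('parser-worker.js'); // hypothetical worker script

// Option A: transfer an ArrayBuffer (zero-copy, but the sender loses access to it).
const chunk = new Uint8Array(4096);
worker.postMessage({ type: 'data', buffer: chunk.buffer }, [chunk.buffer]);

// Option B: a shared buffer both sides can see (needs Atomics for synchronization).
const shared = new SharedArrayBuffer(1024 * 1024);
worker.postMessage({ type: 'init', shared });
```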
I just realized that the real gains are hidden in the string-parsing part (i.e. your fix xtermjs/xterm.js#1796), so ignore that last part 😅
Yeah, that's the basic idea behind my optimization plan (xtermjs/xterm.js#1811) - by getting rid of any string handling in the input chain and working directly on arraybuffers, we can lower the runtime needed for input handling by >50%.
This gets even better if we change to direct UTF8 binary input, as it would save the nonsense conversions. If you are interested, I have a heavily optimized playground branch where you can test it in the demo app (https://github.com/jerch/xterm.js/tree/tttt, also needs https://github.com/jerch/node-textdecode).
If you cannot get it to run, here are my results: xtermjs/xterm.js#791 (comment) 😄
Problem

When the pty produces a lot of output in a very short period of time (e.g. `cat very-long-file`), many IPC `session data` messages are passed to the renderer. This causes a lot of thrashing (see profiler screenshot) due to `ipcListener` emitting a ton of chunks. Each chunk is typically 4k, but the vt can parse/render a lot more per frame (especially with the coming WebGL renderer). Each message is prefixed with the window's `uid`, which is then removed on the renderer process before parsing, adding GC pressure (probably) and CPU cost on both processes (we're CPU-bound on the renderer process due to VT parsing being very costly).

Solution

Batch data in a buffer before sending it to the renderer.
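A minimal sketch of the batching idea, assuming a node-pty-style 'data' event and access to the window's `webContents` (the channel name and `uid` prefixing follow the description above, but the helper name, flush interval, and wiring are illustrative, not the PR's exact code):

```js
// Accumulate pty output and send it to the renderer at most once per frame,
// or earlier if the batch reaches BATCH_MAX_SIZE.
const BATCH_MAX_SIZE = 200 * 1024;
const FLUSH_INTERVAL_MS = 16;

function batchSessionData(ptyProcess, webContents, uid) {
  let batch = '';
  let timer = null;

  const flush = () => {
    if (timer) {
      clearTimeout(timer);
      timer = null;
    }
    if (batch.length === 0) return;
    webContents.send('session data', uid + batch); // one IPC message per batch
    batch = '';
  };

  ptyProcess.on('data', (data) => {
    batch += data;
    if (batch.length >= BATCH_MAX_SIZE) {
      flush();                                      // flush early on big bursts
    } else if (!timer) {
      timer = setTimeout(flush, FLUSH_INTERVAL_MS); // otherwise flush once per frame
    }
  });
}
```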
Benchmarks
(left: no batching, i.e. canary; right: batching, i.e. this branch)

`find ~`

This change makes a huge difference with the WebGL renderer enabled (compare 1️⃣ vs. 2️⃣) because the renderer is focused on processing one big chunk instead of a ton of tiny ones plus canvas rendering. Without WebGL, it makes the framerate less stuttery (compare 3️⃣ vs. 4️⃣).

[*]: while the benchmark process is still generating output

Process: Throughput is measured as the time from running the command until the next prompt is rendered. FPS values are taken from the DevTools renderer FPS widget.