Performance - Data batching #3336

Merged: 19 commits into vercel:canary on Dec 28, 2018
Conversation

@juancampa (Contributor) commented on Dec 16, 2018

Problem

When the pty produces a lot of output in a very short period of time (e.g. cat very-long-file), many IPC session data messages are passed to the renderer. This causes a lot of thrashing (see profiler screenshot) because ipcListener emits a ton of chunks.

Each chunk is typically 4k, but the vt can parse/render a lot more per frame (especially with the coming WebGL renderer). Each message is also prefixed with the window's uid, which is then removed on the renderer process before parsing, adding GC pressure (probably) and CPU cost on both processes (we're CPU-bound on the renderer process because VT parsing is very costly).

Solution

Batch data in a buffer before sending it to the renderer.
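Roughly, the idea looks like the sketch below. Names, the flush interval, and the send callback are illustrative assumptions, not necessarily the exact code in this PR:

// Illustrative sketch of the batching idea — not necessarily the code in this PR.
const BATCH_MAX_SIZE = 200 * 1024; // flush once the buffer reaches ~200 KB
const BATCH_DURATION_MS = 16;      // otherwise flush at most roughly once per frame (assumed value)

class DataBatcher {
  constructor(send) {
    this.send = send;   // e.g. data => win.rpc.emit('session data', data)
    this.data = '';
    this.timeout = null;
  }

  write(chunk) {
    if (this.data.length + chunk.length >= BATCH_MAX_SIZE) {
      // flush early so the buffered batch stays under BATCH_MAX_SIZE
      this.flush();
    }
    this.data += chunk;
    if (!this.timeout) {
      this.timeout = setTimeout(() => this.flush(), BATCH_DURATION_MS);
    }
  }

  flush() {
    if (this.timeout) {
      clearTimeout(this.timeout);
      this.timeout = null;
    }
    if (this.data.length) {
      this.send(this.data);
      this.data = '';
    }
  }
}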

Benchmarks

(left: no batching, i.e. canary; right: batching, i.e. this branch)
[screenshot: side-by-side profiler comparison]

find ~

            canary          canary + webgl   this branch     this branch + webgl ⭐
Throughput  ~35s            ~36s 1️⃣          ~35s            ~15s 2️⃣
FPS         3* then 17 3️⃣   40+              12-16 4️⃣        40+

This change makes a huge difference with the WebGL renderer enabled (compare 1️⃣ vs. 2️⃣) because the renderer focuses on processing one big chunk instead of a ton of tiny ones plus canvas rendering. Without WebGL, it makes the framerate less stuttery (compare 3️⃣ vs. 4️⃣).

[*]: while the benchmark process is still generating output

Process: throughput is measured as the time from running the command until the next prompt is rendered. FPS is taken from the DevTools renderer FPS widget.


// Max size of a session data batch. Note that this value can be exceeded by ~4k
// (chunk sizes seem to be 4k at the most)
const BATCH_MAX_SIZE = 200 * 1024;
@juancampa (Author) — Dec 16, 2018

I tried different values but this one (200k) had the best throughput. Using e.g. 100k can lower the throughput by ~25% 🔥, while increasing it further (say, 300k) lowers the framerate to <30 FPS.

@juancampa (Author)

A future improvement could determine this value dynamically.

Member:

@juancampa what strategies would you use for dynamic computation of this value? Maybe we can create an RFC issue?

@jerch
Few notes from my side:

Keep in mind that the 16 ms stacks on top of the input roundtrip latency, and people might start complaining about sluggish terminal response while typing. In the xterm.js demo app I found 5 ms helped a lot while not yet being noticeable; 15 ms already introduced a tiny lag when typing really fast.

4k is the default pty buffer size on many Linux distros, but you cannot rely on that (it can be changed per pty and system-wide). BSDs (and macOS) use an adaptive buffer size, typically ranging from 16k to 64k depending on the current data pressure.

@jerch
Btw, 200k seems quite big to me; you might want to test this with yes, which triggers a very expensive 'lineFeed' event in the terminal for every second byte. Additionally, the event can be hooked by an external handler, which might introduce even more workload (not sure if you allow plugins to consume it).

@juancampa (Author)

> Keep in mind that the 16 ms stacks on top of the input roundtrip latency, and people might start complaining about sluggish terminal response while typing. In the xterm.js demo app I found 5 ms helped a lot while not yet being noticeable; 15 ms already introduced a tiny lag when typing really fast.

I had the same concern (main reason I use kitty) but I haven't felt any lag so far. We'll have to keep an eye on feedback.

> 4k is the default pty buffer size on many Linux distros, but you cannot rely on that (it can be changed per pty and system-wide). BSDs (and macOS) use an adaptive buffer size, typically ranging from 16k to 64k depending on the current data pressure.

Good point. I actually have to remove the "Note..." part; it was happening in my original implementation, but it didn't feel right to go over a "max", so I fixed it to never exceed BATCH_MAX_SIZE.

@juancampa (Author) — Dec 17, 2018

> what strategies would you use for dynamic computation of this value? Maybe we can create an RFC issue?

I was thinking the parser could generate "back pressure": basically, "Hey, the chunks you are sending are too big for me to parse/render in one frame, please send smaller ones" (rough sketch below). This could matter less if we run the parser in a WebWorker, which I think should be the next step (the limiting factor is on the renderer process).

UPDATE:

Might be better to implement this in xterm.js since it overlaps a lot with xtermjs/xterm.js#1818 (comment)
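For illustration, here is a rough sketch of such a feedback signal. The IPC channel names, the frame budget, and the assumption that the renderer has an xterm.js Terminal instance named term whose write it can time synchronously are all hypothetical; this is not Hyper's or xterm.js's actual API:

// Hypothetical back-pressure sketch — channel names and budget are illustrative.
// Renderer process: time how long each batch takes to parse/render and report it.
const {ipcRenderer} = require('electron');

ipcRenderer.on('session data batch', (event, batch) => {
  const start = performance.now();
  term.write(batch); // assumes the write is processed synchronously for simplicity
  ipcRenderer.send('session data feedback', {elapsed: performance.now() - start});
});

// Main process: grow or shrink the batch size toward a per-frame parse budget.
const {ipcMain} = require('electron');
const FRAME_BUDGET_MS = 12;
let batchMaxSize = 200 * 1024;

ipcMain.on('session data feedback', (event, {elapsed}) => {
  batchMaxSize = elapsed > FRAME_BUDGET_MS
    ? Math.max(32 * 1024, batchMaxSize / 2)     // too slow: send smaller batches
    : Math.min(400 * 1024, batchMaxSize * 1.1); // headroom: allow bigger batches
});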

@juancampa (Author)

> Btw, 200k seems quite big to me; you might want to test this with yes, which triggers a very expensive 'lineFeed' event in the terminal for every second byte

I just tested yes and the buffering definitely doesn't help there (it drops FPS from ~50 to ~30). Basically, 200k is a good value for find ~, which is arguably a more common kind of output than yes. This is all more evidence for implementing @jerch's proposal here: xtermjs/xterm.js#1818 (comment)

@jerch
Yeah, I think it's better suited to Terminal.write. We already do xon/xoff flow control there; imho it's better to keep it in one place. Otherwise every buffering step involved would have to implement flow control on its own to get a working back-pressure chain, which gets quite cumbersome with node's async nature.
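For reference, the xon/xoff idea in miniature: pause the producer with DC3 when too many bytes are pending and resume it with DC1 once the terminal catches up. The watermarks, the pty handle, and the parse-progress hook below are all illustrative assumptions, not xterm.js's actual implementation:

// Illustrative only — not xterm.js's actual flow-control code.
const XOFF = '\x13'; // DC3: ask the producer to pause
const XON = '\x11';  // DC1: ask the producer to resume

const HIGH_WATERMARK = 128 * 1024; // pause above this many pending bytes
const LOW_WATERMARK = 16 * 1024;   // resume once the backlog drains below this

let pendingBytes = 0;
let paused = false;

function onData(chunk) {           // data arriving from the pty (node-pty instance assumed)
  pendingBytes += chunk.length;
  if (!paused && pendingBytes > HIGH_WATERMARK) {
    pty.write(XOFF);               // back-pressure the shell
    paused = true;
  }
  term.write(chunk);
}

function onChunkParsed(chunk) {    // hypothetical hook called as the terminal consumes data
  pendingBytes -= chunk.length;
  if (paused && pendingBytes < LOW_WATERMARK) {
    pty.write(XON);
    paused = false;
  }
}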

@juancampa changed the title from "Perf data batching" to "Performance - Data batching" on Dec 16, 2018
@juancampa requested review from chabou and rauchg on December 16, 2018 22:10
@chabou (Contributor) commented on Dec 16, 2018

Be careful: this PR contains the commits from #3329 (which should be merged first).

@chabou added the 📊 Type: Performance label on Dec 16, 2018
@juancampa (Author) commented on Dec 16, 2018

> this PR contains the commits from #3329 (which should be merged first)

Yeah, I didn't want to test this on canary since it might change the benchmark results. I just made the warning more prominent :)

constructor(uid) {
super();
this.uid = uid;
this.decoder = new StringDecoder('utf8');
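For context, a small standalone example of what the StringDecoder buys here: a multi-byte UTF-8 character split across two pty chunks is held back until it completes, so chunk boundaries never produce mangled output. The buffers below are made up for illustration and are not from Hyper's code path:

// Illustrative only: StringDecoder buffers a partial multi-byte UTF-8 sequence
// until the rest of it arrives.
const {StringDecoder} = require('string_decoder');
const decoder = new StringDecoder('utf8');

// '€' is 0xE2 0x82 0xAC in UTF-8; split it across two chunks.
const part1 = Buffer.from([0xe2, 0x82]);
const part2 = Buffer.from([0xac]);

console.log(decoder.write(part1)); // '' — incomplete sequence is held back
console.log(decoder.write(part2)); // '€' — completed on the next chunk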
@jerch — Dec 17, 2018
UTF8? Is hyper already using binary transport for the pty? We might add this to xterm.js eventually; I had some good results with it, lifting some burden from the input parsing (currently there are some nonsense back-and-forth conversions going on).

@juancampa (Author) — Dec 17, 2018

Not sure I follow this one. IIUC, electron can't send Buffer objects to the renderer process, so we turn the binary input into a String. If we move parsing to a worker, we might be able to use a SharedArrayBuffer to minimize conversions.

@jerch — Dec 17, 2018

But then why are you dealing with UTF8 at all? node-pty is supposed to return JS strings already if the encoding is omitted (it defaults to UTF8, decoded by the underlying Socket).

So how do you send data over to the window process? Is it not able to handle binary data? Not a websocket? (I'm not used to electron internals.)

@jerch
Bummer: electron's IPC mechanism supports ArrayBuffers since v3, but they get BASE64-encoded and wrapped in a JSON envelope (electron/electron#13055).

From my own tests electron performs worse than our websocket-based demo app (about half as fast); maybe the ipc mechanism is the culprit? Even strings get the JSON envelope, which is no good from a perf perspective. Maybe you want to try out our demo app and compare yourself; I'm not sure whether directly invoking node-pty in the renderer or falling back to websockets would solve this.

Ok, enough thread hijacking 😸

@juancampa (Author)

> but they get BASE64-encoded and wrapped in a JSON envelope

Exactly :(

> falling back to websockets

IPC doesn't seem to cost that much though; V8 strings are pretty fast. In this profile of one frame, IPC only takes a tiny sliver (<0.5ms), so there isn't a lot to gain there. It might become more prominent once we move to WebWorkers, with the extra hop, but then again we'll probably use a SharedArrayBuffer for that.

[screenshot: DevTools profile of a single frame]

@juancampa (Author)

I just realized that the real gains are hidden in the string-parsing part (i.e. your fix, xtermjs/xterm.js#1796), so ignore that last part 😅

@jerch — Dec 17, 2018

Yeah, that's the basic idea behind my optimization plan (xtermjs/xterm.js#1811): by getting rid of any string handling in the input chain and working directly on an ArrayBuffer, we can lower the runtime needed for input handling by >50%.
The gains get even bigger if we change to direct UTF-8 binary input, since it would save the nonsense conversions. If you are interested, I have a heavily optimized playground branch where you can test it in the demo app (https://github.com/jerch/xterm.js/tree/tttt; it also needs https://github.com/jerch/node-textdecode).

@jerch
If you cannot get it to run, here are my results: xtermjs/xterm.js#791 (comment) 😄

@juancampa (Author)
@chabou @rauchg now that electron 4 is merged, this is ready to go if you guys approve.

@rauchg merged commit c07700a into vercel:canary on Dec 28, 2018