Improve performance of stderr and stdout stream buffer #888

Merged: 1 commit merged into ipython:main on Mar 28, 2022

Conversation

@MrBago commented Mar 24, 2022

During testing I discovered that writing to stdout and stderr in ipykernel is bottlenecked on scheduling write events on the IOPubThread, specifically on this line, which sends a message to notify the thread. Using a lock so that we can avoid scheduling every write on the IOPubThread improves performance by more than 20x.

Here is the notebook I used to test performance. I start the notebook server with these arguments to prevent it from dropping output: --NotebookApp.iopub_data_rate_limit=1000000000.0 --NotebookApp.iopub_msg_rate_limit=1000000
[Screenshot of the benchmark notebook and its timings]

Also in ipynb format: https://gist.github.com/MrBago/06a8e79be3dd2ccbdd1e954a27dd5f86
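
For readers without the gist handy, a micro-benchmark in the same spirit might look like the following (a hypothetical sketch, not the contents of the linked notebook; the loop size and clear_output cadence are illustrative):

```python
# Hypothetical micro-benchmark: stress stdout through ipykernel's stream
# machinery by printing many short lines from a notebook cell.
import time
from IPython.display import clear_output

start = time.perf_counter()
for i in range(100_000):
    print(f"line {i}")
    if i % 10_000 == 0:
        clear_output(wait=True)  # keep the frontend from accumulating output
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.2f} s")
```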

@blink1073 (Contributor) commented

Thanks! It looks like the test_subprocess_print failures are relevant, but since they happen on only some of the builds it looks like a race condition.

@MrBago (Author) commented Mar 25, 2022

Thanks @blink1073. I looked into it, and on some platforms the subprocesses send their output to the parent process over a zmq pipe that my PR removes :p. I was developing and testing this on OS X, but I've verified that I broke it on Linux.

I think another approach to get a similar speedup is to keep the zmq IOLoop and pipes, but protect the buffer with a simple lock so that we only need to call schedule for the flush and not on every write.

@MrBago (Author) commented Mar 25, 2022

Now that I think about it, I like the lock approach much better. The other approach has the problem that the main thread can schedule tasks much faster than the pub thread can handle them. Without any back pressure, the flushes can become very infrequent and large. Doing the writes in the main thread means there will be a natural back pressure.
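
For illustration, the lock approach described above might look roughly like this (a minimal sketch, not the actual ipykernel OutStream code; the class name, parameter names, and the schedule/publish callbacks are placeholders):

```python
# Sketch of a write buffer protected by a lock, with flush-only scheduling.
import io
import threading


class LockedStream:
    def __init__(self, schedule_on_iopub_thread, publish):
        # schedule_on_iopub_thread: hands a callback to the IOPub thread's
        # event loop (e.g. something like tornado's IOLoop.add_callback).
        # publish: sends the buffered text out as a single stream message.
        self._buffer = io.StringIO()
        self._lock = threading.Lock()
        self._schedule = schedule_on_iopub_thread
        self._publish = publish
        self._flush_pending = False

    def write(self, text):
        # Runs in the calling (main) thread: append under the lock instead of
        # scheduling an event on the IOPub thread for every write.
        with self._lock:
            self._buffer.write(text)
            if not self._flush_pending:
                self._flush_pending = True
                self._schedule(self._flush)
        return len(text)

    def _flush(self):
        # Runs on the IOPub thread: swap the buffer contents out under the
        # lock, then publish outside the lock.
        with self._lock:
            data = self._buffer.getvalue()
            self._buffer = io.StringIO()
            self._flush_pending = False
        if data:
            self._publish(data)
```

The key property is that write() never has to round-trip through the IOPub thread: only one flush callback is outstanding at a time, and each flush drains the buffer into a single message.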

@blink1073 (Contributor) left a comment

Very nice! I'll leave this open for a bit in case anyone else has comments.

@jasongrout (Member) left a comment

This looks good to me, and I get similar timings for the test @MrBago posted:

Before:

CPU times: user 4.85 s, sys: 2.1 s, total: 6.95 s
Wall time: 14.6 s

After:

CPU times: user 390 ms, sys: 68.5 ms, total: 458 ms
Wall time: 500 ms

Seems like magic, befitting a PR with the number 888: the magic number, thrice!

@blink1073 - if/when this is merged, any chance we could release this soon?

Disclosure: @MrBago and I work together at Databricks

@blink1073 (Contributor) commented

> @blink1073 - if/when this is merged, any chance we could release this soon?

Yep, I'll release tomorrow.

@davidbrochart (Collaborator) commented

That's a great performance improvement, to say the least!
I'm wondering if you used the new kernel protocol over websocket to get these results, and how performance compares between the old protocol and the new one.
Note that there is a new --NotebookApp.limit_rate flag that you can set to False to get the equivalent of --NotebookApp.iopub_data_rate_limit=1000000000.0 --NotebookApp.iopub_msg_rate_limit=1000000 (you could also have used a value of -1 in those instead of an arbitrarily high value).

@jasongrout (Member) commented

I also ran the test above without changing the iopub throttle limits, and there is no throttling (I was afraid we might be too fast now, and trigger the rate throttling). In fact, if I don't clear the output periodically, I get timings that are about twice as good (250ms instead of 500ms), still without triggering the default rate throttling.

@davidbrochart - the timings for the test above with this PR applied do not seem to change much for me with:

  • jlab 2
  • latest jlab 3 and jupyter-server 1.13.0 (prior to your PR being merged)
  • latest jlab 3 and jupyter-server 1.15 (after your PR was merged)
  • latest jlab 4 prerelease and jupyter-server 1.15 (after your PR was merged)

@blink1073 merged commit 3059fd9 into ipython:main on Mar 28, 2022