Multiprocess exposition speed boost #421
Conversation
force-pushed from 43cc95c to fa172b2
Thanks; from a quick peek this looks okay. I'm surprised that mmap is slower. Would all the files have been in the page cache?
I'm guessing the slower perf is because of small reads going through whatever wrappers Python has for mmaps, but who knows.
I don't think so. The cache is decorated here to the lifetime of the
force-pushed from 416860c to d8910d3
Another small optimization that occurred to me on the bus to work: the bulk of the mmap files is likely to be pretty empty (I mean, those 2.8 GiB of files gzip-compressed down to 9.2 megs), so we can only read
It should only be hitting memory.
At most half should be empty, once it has expanded.
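A minimal sketch of reading only the used prefix of each file, assuming a layout where the first four bytes hold the number of bytes in use (the exact header handling here is an assumption, not the library's actual code):

```python
import struct

def read_used_portion(path):
    # Sketch only: assumes the file starts with a 4-byte integer recording how
    # much of the (mostly zero-padded) ~1 MB file is actually in use.
    with open(path, 'rb') as f:
        header = f.read(4)
        if len(header) != 4:
            return b''
        used = struct.unpack('i', header)[0]
        f.seek(0)
        return f.read(used)
```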
for s in metric.samples:
-    name, labels, value = s.name, s.labels, s.value
+    name, labels, value, timestamp, exemplar = s
timestamp and exemplar aren't used
True. This is another micro-optimization: unpacking a slice of `s` would be slower than unpacking everything.
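A quick standalone illustration of that trade-off (the `Sample` stand-in and the timing harness below are just for demonstration, not library code):

```python
import timeit
from collections import namedtuple

# Stand-in for the Sample namedtuple exposed by the client library.
Sample = namedtuple('Sample', ['name', 'labels', 'value', 'timestamp', 'exemplar'])
s = Sample('http_requests_total', {'method': 'get'}, 1.0, None, None)

def full_unpack():
    # One unpack of the whole tuple, even though two fields go unused.
    name, labels, value, timestamp, exemplar = s
    return name, labels, value

def slice_unpack():
    # Slicing builds an intermediate tuple object on every call.
    name, labels, value = s[:3]
    return name, labels, value

def attr_access():
    # Roughly the pre-change form from the diff above: three attribute lookups.
    return s.name, s.labels, s.value

for fn in (full_unpack, slice_unpack, attr_access):
    print(fn.__name__, timeit.timeit(fn, number=1_000_000))
```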
If I'm reading https://github.com/python/cpython/blob/142566c028720934325f0b7fe28680afd046e00f/Modules/mmapmodule.c#L837-L867 and
When some of the worker processes don't live for very long at all, I was seeing files with
It'd have to involve some interesting locking, since any process may be writing into its metrics file while we're collating them. Of course, if compaction is an opt-in step and you're willing to risk dropping a measurement or few while it's happening, it'd be an interesting next step for this. :)
Compaction can be done for all dead/inactive pids. It would avoid recalculating the same sums etc. over and over again, without any locking. The existing mechanism could then still be used for the active pids. No?
Sure, if you can know when the PID becomes inactive. With uWSGI, it's hard to know which pids are still active; at least I don't know about a hook for that...
force-pushed from 716d7dc to 5d0f877
A PID may also become active again when a new process starts.
Which begs the question: does the
force-pushed from 5d0f877 to 0f544eb
That isn't going to help with churn ;)
Thanks!
Yeah, I figured gunicorn has hooks for dead/inactive workers, but not every implementation may have or use those. Anyway, awesome fix, thanks a lot! @brian-brazil, when do you think this can make it into a release?
Tomorrow is the plan.
Awesome plan, thanks guys!
Hey there!
We at @valohai are still (#367, #368) bumping into situations where a long-lived multiproc (uWSGI) app's metric exposition steadily gets slower and slower, until it starts clogging up all workers and bumping into Prometheus's deadlines, and things break in a decidedly ungood way.
I figured I could take a little look at what exactly is taking time there, and I'm delighted to say I managed to eke out a roughly 5.8-fold speed increase in my test case (2857 files in the multiproc dir, totaling 2.8 GiB).
By far the largest boost here came from actually not using `mmap()` at all (f0319fa) when we're only reading the file; instead, simply reading the file fully into memory and parsing things from the memory buffer is much, much faster. Given each file (in my case anyway) is about 1 meg a pop, it shouldn't cause too much momentary memory pressure either.
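A minimal sketch of that read-everything approach (not the actual patch; the `*.db` glob and the directory argument are assumptions about how the multiprocess files are laid out):

```python
import glob
import os

def read_metric_files(multiproc_dir):
    # Sketch: slurp each metrics file into an ordinary bytes object and let the
    # parser work on that in-memory buffer, rather than mmap()ing the file and
    # funnelling lots of small reads through Python's mmap wrapper.
    contents = {}
    for path in glob.glob(os.path.join(multiproc_dir, '*.db')):
        with open(path, 'rb') as f:
            contents[path] = f.read()
    return contents
```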
Another decent optimization (0d8b870) came from looking at `vmprof`'s output (and remembering python-babel/babel#571); it said a lot of time was spent in small, numerous memory allocations within Python, and the trail led to `json.loads()`. Since the JSON blobs in the files are written with `sort_keys=True`, we can be fairly certain that there's going to be plenty of string duplication. Simply adding an unbounded `lru_cache()` to where the JSON strings are being parsed into (nicely immutable!) objects gave a nice speed boost, and it probably also reduced memory churn since the same objects get reused. Calling `cache_info()` on the lru_cache validated the guess about duplication: `CacheInfo(hits=573057, misses=3229, maxsize=None, currsize=3229)`.
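A sketch of that caching idea (the key layout and the `_parse_key` name below are illustrative assumptions, not necessarily the exact shape of the patch):

```python
import json
from functools import lru_cache

@lru_cache(maxsize=None)
def _parse_key(key):
    # Each key is a JSON string written with sort_keys=True, so identical keys
    # recur many times across samples and files; caching the parse skips the
    # repeated json.loads() calls and reuses the already-built objects.
    metric_name, sample_name, label_names, label_values = json.loads(key)
    # Keep the result immutable so cached values can be shared safely.
    labels = tuple(zip(label_names, label_values))
    return metric_name, sample_name, labels
```

Going by the hit/miss counts quoted above, each distinct key is parsed once and then served from the cache about 177 times on average.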
A handful of other micro-optimizations brought the speed up a little more still.
My benchmark code was essentially 5 iterations of
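(The exact snippet isn't preserved in this transcript; a rough reconstruction, assuming it simply timed multiprocess exposition via the public API against a placeholder directory, would look something like this:)

```python
import time
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client import multiprocess

registry = CollectorRegistry()
# '/tmp/multiproc' stands in for the directory holding the ~2857 .db files.
multiprocess.MultiProcessCollector(registry, path='/tmp/multiproc')

start = time.time()
for _ in range(5):
    generate_latest(registry)
print('total: %.3f s' % (time.time() - start))
```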
and on my machine it took 22.478 seconds on 5132fd2 and 3.885 seconds on 43cc95c91. 🎉