Multiprocess exposition speed boost #421
Conversation
force-pushed from 43cc95c to fa172b2
Thanks; from a quick peek this looks okay. I'm surprised that mmap is slower. Would all the files have been in the page cache?
I'm guessing the slower perf is because of small reads going through whatever wrappers Python has for mmaps, but who knows.
I don't think so. The cache is decorated here to the lifetime of the
force-pushed from 416860c to d8910d3
Another small optimization that occurred to me on the bus to work: the bulk of the mmap files is likely to be pretty empty (I mean, those 2.8 GiB of files gzip-compressed down to 9.2 megs), so we can only read
It should only be hitting memory.
At most half should be empty, once it has expanded.
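A minimal sketch of reading only the used prefix of each file, assuming a layout where the first four bytes hold the number of bytes in use (the exact header handling here is an assumption, not the library's actual code):

```python
import struct

def read_used_portion(path):
    # Sketch only: assumes the file starts with a 4-byte integer recording how
    # much of the (mostly zero-padded) ~1 MB file is actually in use.
    with open(path, 'rb') as f:
        header = f.read(4)
        if len(header) != 4:
            return b''
        used = struct.unpack('i', header)[0]
        f.seek(0)
        return f.read(used)
```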
for s in metric.samples:
-    name, labels, value = s.name, s.labels, s.value
+    name, labels, value, timestamp, exemplar = s
timestamp and exemplar aren't used
True. This is another micro-optimization: unpacking a slice of `s` would be slower than unpacking everything.
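A quick standalone illustration of that trade-off (the `Sample` stand-in and the timing harness below are just for demonstration, not library code):

```python
import timeit
from collections import namedtuple

# Stand-in for the Sample namedtuple exposed by the client library.
Sample = namedtuple('Sample', ['name', 'labels', 'value', 'timestamp', 'exemplar'])
s = Sample('http_requests_total', {'method': 'get'}, 1.0, None, None)

def full_unpack():
    # One unpack of the whole tuple, even though two fields go unused.
    name, labels, value, timestamp, exemplar = s
    return name, labels, value

def slice_unpack():
    # Slicing builds an intermediate tuple object on every call.
    name, labels, value = s[:3]
    return name, labels, value

def attr_access():
    # Roughly the pre-change form from the diff above: three attribute lookups.
    return s.name, s.labels, s.value

for fn in (full_unpack, slice_unpack, attr_access):
    print(fn.__name__, timeit.timeit(fn, number=1_000_000))
```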
If I'm reading https://github.com/python/cpython/blob/142566c028720934325f0b7fe28680afd046e00f/Modules/mmapmodule.c#L837-L867 and
When some of the worker processes don't live for very long at all, I was seeing files with
It'd have to involve some interesting locking, since any process may be writing into its metrics file while we're collating them. Of course, if compaction is an opt-in step and you're willing to risk dropping a measurement or few while it's happening, it'd be an interesting next step for this. :)
Compaction can be done for all dead/inactive pids. It would avoid recalculating the same sums etc. over and over again, without any locking. The existing mechanism could then still be used for the active pids. No?
Sure, if you can know when the PID becomes inactive. With uWSGI, it's hard to know which pids are still active; at least I don't know about a hook for that...
force-pushed from 716d7dc to 5d0f877
A PID may also become active again when a new process starts.
Which begs the question: does the
force-pushed from 5d0f877 to 0f544eb
That isn't going to help with churn ;)
Thanks!
Yeah, I figured gunicorn has hooks for dead/inactive workers, but not every implementation may have or use those. Anyway, awesome fix, thanks a lot! @brian-brazil, when do you think this can make it into a release?
Tomorrow is the plan.
Awesome plan, thanks guys!
Hey there!
We at @valohai are still (#367, #368) bumping into situations where a long-lived multiproc (uWSGI) app's metric exposition steadily gets slower and slower, until it starts clogging up all workers and bumping into Prometheus's deadlines, and things break in a decidedly ungood way.
I figured I could take a little look at what exactly is taking time there, and I'm delighted to say I managed to eke out a roughly 5.8-fold speed increase in my test case (2857 files in the multiproc dir, totaling 2.8 GiB).
By far the largest boost here came from actually not using `mmap()` at all (f0319fa) when we're only reading the file; instead, simply reading the file fully into memory and parsing things from the memory buffer is much, much faster. Given each file (in my case anyway) is about 1 meg a pop, it shouldn't cause too much momentary memory pressure either.
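A minimal sketch of that read-everything approach (not the actual patch; the `*.db` glob and the directory argument are assumptions about how the multiprocess files are laid out):

```python
import glob
import os

def read_metric_files(multiproc_dir):
    # Sketch: slurp each metrics file into an ordinary bytes object and let the
    # parser work on that in-memory buffer, rather than mmap()ing the file and
    # funnelling lots of small reads through Python's mmap wrapper.
    contents = {}
    for path in glob.glob(os.path.join(multiproc_dir, '*.db')):
        with open(path, 'rb') as f:
            contents[path] = f.read()
    return contents
```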
Another decent optimization (0d8b870) came from looking at `vmprof`'s output (and remembering python-babel/babel#571); it said a lot of time was spent in small, numerous memory allocations within Python, and the trail led to `json.loads()`. Since the JSON blobs in the files are written with `sort_keys=True`, we can be fairly certain that there's going to be plenty of string duplication. Simply adding an unbounded `lru_cache()` to where the JSON strings are being parsed into (nicely immutable!) objects gave a nice speed boost, and it probably also reduced memory churn since the same objects get reused. Calling `cache_info()` on the lru_cache validated the guess about duplication: `CacheInfo(hits=573057, misses=3229, maxsize=None, currsize=3229)`.
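A sketch of that caching idea (the key layout and the `_parse_key` name below are illustrative assumptions, not necessarily the exact shape of the patch):

```python
import json
from functools import lru_cache

@lru_cache(maxsize=None)
def _parse_key(key):
    # Each key is a JSON string written with sort_keys=True, so identical keys
    # recur many times across samples and files; caching the parse skips the
    # repeated json.loads() calls and reuses the already-built objects.
    metric_name, sample_name, label_names, label_values = json.loads(key)
    # Keep the result immutable so cached values can be shared safely.
    labels = tuple(zip(label_names, label_values))
    return metric_name, sample_name, labels
```

Going by the hit/miss counts quoted above, each distinct key is parsed once and then served from the cache about 177 times on average.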
A handful of other micro-optimizations brought the speed up a little more still.
My benchmark code was essentially 5 iterations of
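(The exact snippet isn't preserved in this transcript; a rough reconstruction, assuming it simply timed multiprocess exposition via the public API against a placeholder directory, would look something like this:)

```python
import time
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client import multiprocess

registry = CollectorRegistry()
# '/tmp/multiproc' stands in for the directory holding the ~2857 .db files.
multiprocess.MultiProcessCollector(registry, path='/tmp/multiproc')

start = time.time()
for _ in range(5):
    generate_latest(registry)
print('total: %.3f s' % (time.time() - start))
```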
and on my machine it took 22.478 seconds on 5132fd2 and 3.885 seconds on 43cc95c91. 🎉