Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiprocess exposition speed boost #421

Merged
merged 8 commits into from
Jun 6, 2019

Conversation

akx
Copy link
Contributor

@akx akx commented Jun 4, 2019

Hey there!

We at @valohai are still (#367, #368) bumping into situations where a long-lived multiproc (uWSGI) app's metric exposition steadily gets slower and slower, until it starts clogging up all workers and bumping into Prometheus's deadlines, and things break in a decidedly ungood way.

I figured I could take a little look at what exactly is taking time there, and I'm delighted to say I managed to eke out a roughly 5.8-fold speed increase in my test case (2857 files in the multiproc dir, totaling 2.8 GiB).

By far the largest boost here came from actually not using mmap() at all (f0319fa) when we're only reading the file; instead, simply reading the file fully into memory and parsing things from the memory buffer is much, much faster. Given each file (in my case anyway) is about 1 meg a pop, it shouldn't cause too much momentary memory pressure either.

Another decent optimization (0d8b870) came from looking at vmprof's output (and remembering python-babel/babel#571); it said a lot of time was spent in small, numerous memory allocations within Python, and the trail led to json.loads(). Since the JSON blobs in the files are written with sort_keys=True, we can be fairly certain that there's going to be plenty of string duplication. Simply adding an unbounded lru_cache() to where the JSON strings are being parsed into (nicely immutable!) objects gave a nice speed boost and probably also reduced memory churn since the same objects get reused. Calling cache_info() on the lru_cache validated the guess of duplication: CacheInfo(hits=573057, misses=3229, maxsize=None, currsize=3229)

A handful of other micro-optimizations brought the speed up a little more still.

My benchmark code was essentially 5 iterations of

registry = prometheus_client.CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
metrics_page = prometheus_client.generate_latest(registry)
assert metrics_page

and on my machine it took 22.478 seconds on 5132fd2 and 3.885 on 43cc95c91. 🎉

akx added 3 commits June 4, 2019 17:39
Signed-off-by: Aarni Koskela <akx@iki.fi>
Signed-off-by: Aarni Koskela <akx@iki.fi>
Signed-off-by: Aarni Koskela <akx@iki.fi>
@akx akx force-pushed the multiproc-expose-speed branch from 43cc95c to fa172b2 Compare June 4, 2019 14:56
@brian-brazil
Copy link
Contributor

Thanks, taking a quick peek this looks okay.

I'm surprised that mmap is slower, would all the files have been in page cache?
Do we have to worry about a memory leak in the LRU cache?

@akx
Copy link
Contributor Author

akx commented Jun 4, 2019

I'm surprised that mmap is slower, would all the files have been in page cache?

I'm guessing the slower perf is because of small reads going through whatever wrappers Python has for mmaps, but who knows. (vmprof output was a little unreliable there.)

Do we have to worry about a memory leak in the LRU cache?

I don't think so. The cache is decorated here to the lifetime of the _parse_key() function, which is an inner function of _read_metrics. That is to say, it's only live in memory for the duration of MultiprocessCollector reading the files off disk.

@akx akx force-pushed the multiproc-expose-speed branch 2 times, most recently from 416860c to d8910d3 Compare June 5, 2019 06:38
@akx
Copy link
Contributor Author

akx commented Jun 5, 2019

Another small optimization that occurred to me on the bus to work: The bulk of the mmap files is likely to be pretty empty (I mean, those 2.8 GiB of files gzcompressed down to 9.2 megs), so we can only read used bytes from them instead of the full megabyte (which is the initial mmap size), if we peek at the value first. (d8910d3)

@brian-brazil
Copy link
Contributor

I'm guessing the slower perf is because of small reads going through whatever wrappers Python has for mmaps, but who knows. (vmprof output was a little unreliable there.)

It should only be hitting memory.

The bulk of the mmap files is likely to be pretty empty (I mean, those 2.8 GiB of files gzcompressed down to 9.2 megs),

At most half should be empty, once it has expanded.

for s in metric.samples:
name, labels, value = s.name, s.labels, s.value
name, labels, value, timestamp, exemplar = s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timestamp and exemplar aren't used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. This is another micro-optimization: unpacking a slice of s would be slower than unpacking everything.

prometheus_client/multiprocess.py Outdated Show resolved Hide resolved
@svanscho
Copy link

svanscho commented Jun 5, 2019

Nice work! I honestly at first couldn't believe the current implementation cannot handle collecting and aggregating the metrics in a stable way over time.

So two questions:

  1. this seems like a lot faster, which is very nice and something we can definitely use asap, can we merge?
  2. wouldn't a compaction of the data be better - do we need to keep all data separately when the pid are no longer active?

This is the current scraping behaviour we get:
image

Where the resets are due to server restarts.

@akx
Copy link
Contributor Author

akx commented Jun 5, 2019

It should only be hitting memory.

If I'm reading https://github.com/python/cpython/blob/142566c028720934325f0b7fe28680afd046e00f/Modules/mmapmodule.c#L837-L867 and
https://github.com/python/cpython/blob/142566c028720934325f0b7fe28680afd046e00f/Objects/bytesobject.c#L126 right, slicing an mmap object involves allocation and memcpy. 😞 That'd do it...

At most half should be empty, once it has expanded.

When some of the worker processes don't live for very long at all, I was seeing files with used ranging in some tens of kilobytes. (Perhaps I should tune the initial size variable for this use case...)

@akx
Copy link
Contributor Author

akx commented Jun 5, 2019

  1. wouldn't a compaction of the data be better - do we need to keep all data separately when the pid are no longer active?

It'd have to involve some interesting locking, since any process may be writing into their metrics file while we're collating them. Of course, if compaction is an opt-in step and you're willing to risk dropping a measurement or few while it's happening, it'd be an interesting next step for this. :)

@svanscho
Copy link

svanscho commented Jun 5, 2019

Compaction can be done for all dead/inactive pids. It will avoid calculating the same sums etc over and over again without locking. The existing mechanism could then still be used for the active pids. No?

@akx
Copy link
Contributor Author

akx commented Jun 6, 2019

Compaction can be done for all dead/inactive pids. It will avoid calculating the same sums etc over and over again without locking. The existing mechanism could then still be used for the active pids. No?

Sure, if you can know when the PID becomes inactive. With uWSGI, it's hard to know which pids are still active; at least I don't know about a hook for that...

@akx akx force-pushed the multiproc-expose-speed branch from 716d7dc to 5d0f877 Compare June 6, 2019 08:02
@brian-brazil
Copy link
Contributor

A PID may also become active again when a new process starts.

@akx
Copy link
Contributor Author

akx commented Jun 6, 2019

A PID may also become active again when a new process starts.

Which begs the question: does the pid have to necessarily be the real PID of the process? Could it just be an UUID that gets generated at process start?

akx added 5 commits June 6, 2019 14:08
Signed-off-by: Aarni Koskela <akx@iki.fi>
Signed-off-by: Aarni Koskela <akx@iki.fi>
Signed-off-by: Aarni Koskela <akx@iki.fi>
Signed-off-by: Aarni Koskela <akx@iki.fi>
@akx akx force-pushed the multiproc-expose-speed branch from 5d0f877 to 0f544eb Compare June 6, 2019 11:09
@brian-brazil
Copy link
Contributor

That isn't going to help with churn ;)

@brian-brazil brian-brazil merged commit df024e0 into prometheus:master Jun 6, 2019
@brian-brazil
Copy link
Contributor

Thanks!

@svanscho
Copy link

svanscho commented Jun 6, 2019

Yeah, I figured gunicorn has hooks on dead/inactive workers but not all implementations might have/use this. Anyway, awesome fix, thanks a lot!

@brian-brazil when do you think we can make this into a release?

@brian-brazil
Copy link
Contributor

Tomorrow is the plan.

@svanscho
Copy link

svanscho commented Jun 6, 2019

Awesome plan, thanks guys!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants