Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to purge largest worker #3251

Open
TyeMcQueen opened this issue Jul 30, 2024 · 3 comments
Open

Add option to purge largest worker #3251

TyeMcQueen opened this issue Jul 30, 2024 · 3 comments
Assignees

Comments

@TyeMcQueen
Copy link

TyeMcQueen commented Jul 30, 2024

In getting Gunicorn running in Kubernetes and tuning resource specifications, I found that the memory consumption of each pod would just grow over time.

In the pull request I'll attach to this issue, I implemented 3 new config options that allowed me to make the memory consumption quickly flatten out and stay stable (and lower) over long periods with Production load. Example configuration:

import psutil

def proc_vmsize(pid):
    # Return how many MB of virtual memory is being used by a worker process
    try:
        p = psutil.Process(pid)
        mb = p.memory_info().vms/1024/1024
        return mb
    except psutil.NoSuchProcessError: # In case worker dies on its own
        return 0

prune_seconds = 5*60            # Prune largest worker every 4.25-5.25m
prune_function = proc_vmsize    # Measure worker size in MB of VM
prune_floor = 300               # Don't kill workers using <= 300 MB of VM

The crux of the problem is that relatively rare requests are much larger (or generate much larger responses or load a lot more resources when computing the response) than typical requests and cause Python to allocate quite a bit more memory and hold on to that memory until the process is killed. Over time, the number of oversized workers would grow and the size of the oversized workers would also creep up, causing pod memory consumption to just keep growing.

This is somewhat counteracted by max_requests, but not very effectively. Setting max_requests low enough to make a dent in this growth can make process restarts so frequent during busy times that you are likely to get overlapping restarts reducing your capacity (and spending more CPU doing all that restarting). And then it is still not effective when traffic volume drops during slow times of day.

Having each worker kill itself when it gets too big risks unbounded and uncoordinated restarting which could be a disaster.

So I added a time-based controlled restart of just the worker that is currently the largest, using whatever measure you prefer or that best fits how your code behaves.

The time-based approach works well in Kubernetes because when traffic ramps up, the number of pods increases and the rate of workers being restarted also goes up proportionally. And yet you never have two workers in the same pod restarting even close to the same time, so no significant dips in capacity.

And the prune_floor allows you to stop doing any restarts during periods when they aren't really needed.

TyeMcQueen added a commit to TyeMcQueen/gunicorn that referenced this issue Jul 30, 2024
TyeMcQueen added a commit to TyeMcQueen/gunicorn that referenced this issue Jul 30, 2024
TyeMcQueen added a commit to TyeMcQueen/gunicorn that referenced this issue Jul 30, 2024
TyeMcQueen added a commit to TyeMcQueen/gunicorn that referenced this issue Jul 30, 2024
TyeMcQueen added a commit to TyeMcQueen/gunicorn that referenced this issue Jul 30, 2024
TyeMcQueen added a commit to TyeMcQueen/gunicorn that referenced this issue Jul 30, 2024
@pajod
Copy link
Contributor

pajod commented Jul 30, 2024

Objections:

  • The documentation sounds a lot like leaking memory was in any way normal.
  • Doing mildly expensive things on a timer even when nothing happened is just waste.
  • Missing documentation on (lack of) consideration for threads, keepalive, graceful timeout, ..
  • The max_requests feature already provides an easily-abused escape hatch for keeping garbage code running. And that one already has known defects. Adding another does not fit well with the goal of a "simply implemented" WSGI server.

Pro:

  • Given this feature is simpler in nature (just SIGTERM and demand the worker can deal with this gracefully) it might serve both as reasonable replacement to some max_requests use cases, and a good way to exercise edge cases at this time not well unit tested. This would be a stronger argument if all use cases were covered.

I do not see a clean way to accomplish that with existing hooks, but both this and the existing max_requests may be better suited for a sample config in the examples/ folder, rather than a tightly integrated feature with (broken) stability promises.


Linking related issue for easier discovery: #2422

@TyeMcQueen
Copy link
Author

The documentation sounds a lot like leaking memory was in any way normal.

The documentation and this issue description have no mention of memory leaks and memory leaks are not what is being addressed by this feature. Granted that the "the size of the oversized workers would also creep up" (only from the issue description) would fit a memory leak, but it was actually talking about something quite different.

I will look at the documentation again to try to make it less likely for people to think that memory leaks are being discussed.

And these causes of process memory growth are indeed quite common in a huge variety of systems I've work on over the decades. Although here there is work done to pre-load lots of the components before fork()ing, a system as deep as Django has a huge body of potential code that might be loaded and not all of it will get fully pre-loaded (and there are good reasons for that). And even if you managed to do that, when the need for a particularly large set of data pops up, memory usage will increase and (on many systems) it is unlikely to go all the way back down even if free() is called for most of that memory. Memory arenas get fragmented (and a lot of systems reasonably don't add the layer of indirection and locks to allow aggressively trying to defragment arenas and return pages to the OS). And there can be good reasons to not even free() every buffer when it is likely just going to be needed again.

And there will be tons of places where little caches get populated to make the next iteration faster and those are not going to be free()d even if that particular path doesn't get exercised again over the next hour.

So, if you have a system that has a few huge customers and a ton of not-huge customers (and similar patterns for a lot of other types of objects), and you take the time to pay close attention to how memory is being used, in my experience, it is indeed pretty common to see the rare expensive request cause one worker to grow in memory consumption and stay there. Though, in my experience, it isn't so common for people to notice. So I was not particularly surprised to see that pattern in the Python system I was brought in to work on.

And I would think we would want gunicorn to be an appropriate tool even for services that deal with a wide variety of types of requests and thus have code paths that only get executed infrequently.

And I have a live example of a system where memory usage is decreased and made more flat by restarting at most 1 worker every 5 minutes (per pod). That's cheap and the impact on scalability in our case was huge. The right choice of timing and floor is going to be different for other systems. If you are running gunicorn in VMs instead of in pods, this is unlikely to be useful to you. And I'm sure many systems aren't going to have enough variability in request complexity/size and optional code paths to make something like this of much value.

But I also think this much exposition is not appropriate in the documentation for a relatively simple group of options.

Doing mildly expensive things on a timer even when nothing happened is just waste.

You can't accurately use the word "waste" unless the mildly expensive thing has no value. The periodic behavior is fully optional and so has no appreciable cost if you don't go to the work to enable it (I doubt you could even measure the impact in aggregate of the added checking if a variable is zero). And you would only enable it if it had value in your case.

Missing documentation on (lack of) consideration for threads, keepalive, graceful timeout, ..

Those features are documented other places. And we already use threads and keep-alives and timeouts. If you have more specific recommendations of particularly promising alternatives for this situation that should be called out (now that you understand that we aren't talking about memory leaks), please elaborate.

The max_requests feature already provides an easily-abused escape hatch for keeping garbage code running.

And I covered why it is not useful here.

Perhaps we should update the CONTRIBUTING documentation to add to "We are always thrilled to receive pull requests" to mention that we also love to tell you that your code is probably garbage? :)

I can see how max_requests could be useful for limping along until you get a memory leak fixed. We use max_requests but for a different reason. Leaving a complex system running for a very long time increases the risk of running into one of those complex bugs that haven't been fixed because they are nearly impossible to trigger but the millions of monkeys with typewriters producing requests in Production can eventually manage it. So I've several times seen services get into states where stuff is just weird and stays that way until the misbehaving component is restarted. And I've actually done the work to diagnose several of those and the problems always ended up not being a bug in a application code but broken behavior in a base component (like Python itself). So triggering some kind of eventual restart is a good idea for a resilient system, even though the potential problem is pretty rare (though perhaps not as rare as many people think).

it might serve both as reasonable replacement to some max_requests use cases

One idea would be to have the WorkerTmp.notify() write self.nr to the temporary file (which would also update the file timestamp) and make that number available to the configured prune function. Then provide an example config with a function that returns that number and that sets the prune floor to what max_requests used to be set to. Then simultaneous restarts would no longer be a risk.

That feels to me like it would handle most of the use cases of max_requests.

For a transition period, max_requests could be deprecated but implemented using the new mechanics (max_requests would not be allowed if you configured your own prune function).

And people can combine max_requests-type restarts with "largest worker" restarts using some simple math to produce a combined score. For example, adding the prune floor to the process size only if the number of requests is larger than your desired max requests would prune the largest workers and those exceeding max requests, just one at a time.

And having the approximate number of requests handled by each worker available to the Arbiter could also have interesting other applications. But I've gone on long enough.

@benoitc benoitc self-assigned this Jul 30, 2024
@pajod
Copy link
Contributor

pajod commented Jul 30, 2024

Thanks for the detailed explanation, much appreciated.

memory leaks are not what is being addressed by this feature. [..] a ton of not-huge customers

This is the sort of explanation I was missing from the docstring. Sacrificing cache-powered performance in one application to the benefit of another is a perfectly valid reason, one that does not need to involve any sort of generally undesirable memory consumption.

Those features are documented other places.

The interaction between the various ways to trigger a worker restart and long-running requests is badly documented across the board, so I am requesting that newly introduced behavior tries really hard to set the expectations right. Such as promising that long-running requests will be interrupted after waiting for graceful_timeout - or that the mechanism is not employed again until a successful worker restart was seen.

always ended up not being a bug in a application code but broken behavior in a base component (like Python itself)

That is the case I am most afraid of, actually. Business incentives already do not align well with getting bugs in Python (or Gunicorn!) thoroughly diagnosed and resolved. It would be wise if both the existing feature, and your newly suggested extensions, limit their amplifying effect on that situation with very carefully worded usage instructions.

Memory fragmented

By what metric? Are you using some simple tool/command, simple enough so other users could run it on their systems to determine to what degree they are affected / to what degree their applications memory overhead might improve by simply applying your PR right away?

max_requests could be deprecated

In that sense people have been limited to one predefined metric (# of requests), and you are offering a versatile way to add more. Other than the existing jitter mechanism not netting you sufficiently predictable behavior w.r.t simultaneous restarts.. do you see a blocker on why your work cannot be introduced as a strict superset of the existing mechanism? That would mostly nullify my "adding another" objection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants