
[Serve] memory leak in Ray Serve 2.2.0 #31688

Closed

Mitan opened this issue Jan 16, 2023 · 8 comments

Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), serve (Ray Serve Related Issue)

Comments

Mitan commented Jan 16, 2023

What happened + What you expected to happen

What happens: the memory of a simple Ray Serve app with a single deployment keeps increasing while it receives requests. Over 18 hours of continuously sending requests, the app's memory more than doubles (see the memory consumption from Prometheus below). As a result, my app eventually runs out of memory and crashes. I created a simplified version of the Ray Serve app to reproduce the issue.

What you expected to happen: the memory should not increase.

[Screenshot ray__memory_issue: Prometheus graph showing the app's memory consumption steadily increasing]

Versions / Dependencies

Ray 2.2.0
Python 3.7
OS: Linux

Reproduction script

A simplified version of the Ray Serve app that reproduces the issue.

Code to start the cluster (start.sh)

#!/bin/bash
# Start a local Ray head node, then deploy the Serve app.
ray start --head --num-cpus 4 --num-gpus 0 --metrics-export-port=8103 --include-dashboard=false
python3 app.py

Code for app.py

import ray
from ray import serve


@serve.deployment(route_prefix="/test_deployment")
class Test_deployment:

    def __init__(self):
        pass

    async def __call__(self, request):
        return {
            'code': 200,
            'response': "Hello"
        }


if __name__ == '__main__':
    # Connect to the cluster started by start.sh; run Serve detached so it
    # outlives this driver script.
    ray.init(address="auto", include_dashboard=False)
    serve.start(http_options={'port': 8102}, detached=True)

    Test_deployment.deploy()

The polling script continuously and sequentially sends requests in a synchronous manner (there is no queueing of requests). I can provide the code if needed.

Issue Severity

High: It blocks me from completing my task.

@Mitan Mitan added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 16, 2023
mihajenko:

@Mitan may I ask how the script behaves on ray==2.1.0? This is very relevant to us right now.

Mitan (Author) commented Jan 16, 2023

@mihajenko thanks for your reply; let me check and get back to you (it should be ready by tomorrow).

sihanwang41 (Contributor):
Hi @Mitan, I am not able to reproduce the issue on my dev box; I am not seeing any memory increase. Are you able to narrow down which process is growing on your side?

BTW, this is my send-request script:

import requests

# Hammer the deployment with back-to-back synchronous requests.
while True:
    print(requests.get("http://127.0.0.1:8102/test_deployment").text)

@zhe-thoughts zhe-thoughts added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 20, 2023
Mitan (Author) commented Jan 27, 2023

Hi @mihajenko @sihanwang41,

We were able to identify the root cause as a combination of multiple factors:

  1. HttpProxyActor uses the Serve logger with the default level set to logging.INFO (see https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/logging_utils.py#L17), so it logs a line on every request:

INFO 2023-01-27 01:19:07,461 http_proxy 10.245.21.150 http_proxy.py:315 - POST /admin 200 20237.0ms

This cannot be silenced by following the existing guidelines in the documentation, since I don't have access to the HttpProxyActor constructor (a sketch of why is included after this list).

  2. Log rotation for Ray Serve is not currently implemented, so the log files keep growing. There is a PR fixing this: Enable Log Rotation on Serve #31844.

  3. The default log folder for Ray is under /tmp, which is mounted as tmpfs on our server, so all the logs are effectively stored in RAM.

So adding log rotation for Serve (and potentially allowing the default logging level to be changed for system actors such as HttpProxyActor) should resolve the issue.
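To make the first point concrete, here is a minimal sketch of the documented way to quiet Serve's logging from inside your own code, assuming the logger name "ray.serve" used in logging_utils.py (the Quiet deployment name is hypothetical). Because the level change only applies in the process where it runs, it can never reach the HttpProxyActor's process:

import logging

from ray import serve


@serve.deployment
class Quiet:
    def __init__(self):
        # Raising the level here only affects this replica's process.
        # The HttpProxyActor runs in a separate process and sets up its
        # own "ray.serve" logger in its constructor, which user code
        # cannot reach, so its per-request INFO lines keep accumulating.
        logging.getLogger("ray.serve").setLevel(logging.WARNING)

    async def __call__(self, request):
        return "ok"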

sihanwang41 (Contributor) commented Jan 27, 2023

Hi @Mitan, we have a PR enabling log rotation on Serve (as you pointed out). The team will help land the PR as soon as possible.
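(For readers unfamiliar with the mechanism: this is not the PR's code, just a sketch of size-based rotation using Python's standard logging.handlers.RotatingFileHandler; the file path below is a stand-in. With maxBytes and backupCount set, total disk use is capped at roughly maxBytes * (backupCount + 1).)

import logging
from logging.handlers import RotatingFileHandler

# Illustration only: rotate once the file reaches ~100 MiB and keep 5
# backups, so this handler never occupies more than ~600 MiB on disk.
handler = RotatingFileHandler(
    "example.log",  # stand-in for a Serve log file path
    maxBytes=100 * 1024 * 1024,
    backupCount=5,
)
logger = logging.getLogger("example")
logger.addHandler(handler)
logger.warning("this line goes to the rotating file")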

Mitan (Author) commented Jan 27, 2023

Thanks @sihanwang41!

@akshay-anyscale akshay-anyscale added the serve Ray Serve Related Issue label Mar 7, 2023
rkooo567 (Contributor):

I assume we can close this issue now? Log rotation has been enabled.

Mitan (Author) commented Jul 22, 2023

Hi @rkooo567, yes, the issue can be closed. Thank you!
