
[Serve] memory leak in Ray Serve 2.2.0 #31688

Closed

Mitan opened this issue Jan 16, 2023 · 8 comments

Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), serve (Ray Serve Related Issue)

Comments

Mitan commented Jan 16, 2023

What happened + What you expected to happen

What happens: the memory of a simple Ray Serve app with a single deployment keeps increasing while it receives requests. Over 18 hours of continuously sending requests, the app's memory more than doubles (see the memory consumption from Prometheus below). As a result, my app eventually runs out of memory and crashes. I created a simplified version of the Ray Serve app to reproduce the issue.

What you expected to happen: the memory should not increase.

[Screenshot ray__memory_issue: Prometheus graph showing the app's memory consumption steadily increasing]

Versions / Dependencies

Ray 2.2.0
Python 3.7
OS: Linux

Reproduction script

A simplified version of the Ray Serve app that reproduces the issue.

Code to start the cluster (start.sh)

#!/bin/bash
# Start a local Ray head node, then deploy the Serve app.
ray start --head --num-cpus 4 --num-gpus 0 --metrics-export-port=8103 --include-dashboard=false
python3 app.py

Code for app.py

import ray
from ray import serve


@serve.deployment(route_prefix="/test_deployment")
class Test_deployment:

    def __init__(self):
        pass

    async def __call__(self, request):
        return {
            'code': 200,
            'response': "Hello"
        }


if __name__ == '__main__':
    # Connect to the cluster started by start.sh; run Serve detached so it
    # outlives this driver script.
    ray.init(address="auto", include_dashboard=False)
    serve.start(http_options={'port': 8102}, detached=True)

    Test_deployment.deploy()

The polling script continuously and sequentially sends requests in a synchronous manner (there is no queueing of requests). I can provide the code if needed.

Issue Severity

High: It blocks me from completing my task.

@Mitan Mitan added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 16, 2023
mihajenko:

@Mitan may I ask how the script behaves on ray==2.1.0? This is very relevant to us right now.

Mitan (Author) commented Jan 16, 2023

@mihajenko thanks for your reply; let me check and get back to you (it should be ready by tomorrow).

sihanwang41 (Contributor):
Hi @Mitan, I am not able to reproduce the issue on my dev box; I am not seeing any memory increase. Are you able to narrow down which process is growing on your side?

BTW, this is my send-request script:

import requests

# Hammer the deployment with back-to-back synchronous requests.
while True:
    print(requests.get("http://127.0.0.1:8102/test_deployment").text)

@zhe-thoughts zhe-thoughts added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 20, 2023
Mitan (Author) commented Jan 27, 2023

Hi @mihajenko @sihanwang41,

We were able to identify the root cause as a combination of multiple factors:

  1. HttpProxyActor uses the Serve logger with the default level set to logging.INFO (see https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/logging_utils.py#L17), so it logs a line on every request:

INFO 2023-01-27 01:19:07,461 http_proxy 10.245.21.150 http_proxy.py:315 - POST /admin 200 20237.0ms

This cannot be silenced by following the existing guidelines in the documentation, since I don't have access to the HttpProxyActor constructor (a sketch of why is included after this list).

  2. Log rotation for Ray Serve is not currently implemented, so the log files keep growing. There is a PR fixing this: Enable Log Rotation on Serve #31844.

  3. The default log folder for Ray is under /tmp, which is mounted as tmpfs on our server, so all the logs are effectively stored in RAM.

So adding log rotation for Serve (and potentially allowing the default logging level to be changed for system actors such as HttpProxyActor) should resolve the issue.
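To make the first point concrete, here is a minimal sketch of the documented way to quiet Serve's logging from inside your own code, assuming the logger name "ray.serve" used in logging_utils.py (the Quiet deployment name is hypothetical). Because the level change only applies in the process where it runs, it can never reach the HttpProxyActor's process:

import logging

from ray import serve


@serve.deployment
class Quiet:
    def __init__(self):
        # Raising the level here only affects this replica's process.
        # The HttpProxyActor runs in a separate process and sets up its
        # own "ray.serve" logger in its constructor, which user code
        # cannot reach, so its per-request INFO lines keep accumulating.
        logging.getLogger("ray.serve").setLevel(logging.WARNING)

    async def __call__(self, request):
        return "ok"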

sihanwang41 (Contributor) commented Jan 27, 2023

Hi @Mitan, we have a PR enabling log rotation on Serve (as you pointed out). The team will help land the PR as soon as possible.
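(For readers unfamiliar with the mechanism: this is not the PR's code, just a sketch of size-based rotation using Python's standard logging.handlers.RotatingFileHandler; the file path below is a stand-in. With maxBytes and backupCount set, total disk use is capped at roughly maxBytes * (backupCount + 1).)

import logging
from logging.handlers import RotatingFileHandler

# Illustration only: rotate once the file reaches ~100 MiB and keep 5
# backups, so this handler never occupies more than ~600 MiB on disk.
handler = RotatingFileHandler(
    "example.log",  # stand-in for a Serve log file path
    maxBytes=100 * 1024 * 1024,
    backupCount=5,
)
logger = logging.getLogger("example")
logger.addHandler(handler)
logger.warning("this line goes to the rotating file")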

Mitan (Author) commented Jan 27, 2023

Thanks @sihanwang41!

@akshay-anyscale akshay-anyscale added the serve Ray Serve Related Issue label Mar 7, 2023
rkooo567 (Contributor):

I assume we can close this issue now? Log rotation has been enabled.

Mitan (Author) commented Jul 22, 2023

Hi @rkooo567, yes, the issue can be closed. Thank you!
