
HTTP server leaks memory in python3.7.1 #340

Closed
josecv opened this issue Nov 10, 2018 · 16 comments · Fixed by #356
@josecv

josecv commented Nov 10, 2018

I'm using version 0.4.2, and it seems like the default HTTP server, started via start_http_server, leaks memory on every request.

Repro script:

#!/usr/bin/env python3
import gc
import time
import tracemalloc
from random import randint

from prometheus_client import Summary, start_http_server

# Create a metric to track time spent and requests made.
REQUEST_TIME = Summary('request_processing_seconds',
                       'Time spent processing request')


# Decorate function with metric.
@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)


if __name__ == '__main__':
    # Start up the server to expose the metrics.
    start_http_server(8000)
    tracemalloc.start()
    snapshot = tracemalloc.take_snapshot()
    # Generate some requests.
    while True:
        process_request(randint(1, 5))
        gc.collect()
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot, 'lineno')
        print("[ Top 5 differences ]")
        for stat in top_stats[:5]:
            print(stat)

If you kick off the above script and then do:

while true; do curl localhost:8000; done

you can observe its memory footprint growing on every iteration:

[ Top 5 differences ]
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:799: size=946 KiB (+946 KiB), count=1682 (+1682), average=576 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:365: size=868 KiB (+868 KiB), count=1683 (+1683), average=528 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:500: size=422 KiB (+422 KiB), count=6753 (+6753), average=64 B
/Users/jose/code/prometheus-mem-leak/.venv/lib/python3.7/site-packages/prometheus_client/core.py:802: size=354 KiB (+354 KiB), count=5027 (+5027), average=72 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:899: size=263 KiB (+263 KiB), count=3366 (+3366), average=80 B
[ Top 5 differences ]
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:799: size=992 KiB (+992 KiB), count=1763 (+1763), average=576 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:365: size=910 KiB (+910 KiB), count=1764 (+1764), average=528 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:500: size=443 KiB (+443 KiB), count=7078 (+7078), average=64 B
/Users/jose/code/prometheus-mem-leak/.venv/lib/python3.7/site-packages/prometheus_client/core.py:802: size=370 KiB (+370 KiB), count=5266 (+5266), average=72 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:899: size=276 KiB (+276 KiB), count=3528 (+3528), average=80 B
[ Top 5 differences ]
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:799: size=1063 KiB (+1063 KiB), count=1889 (+1889), average=576 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:365: size=974 KiB (+974 KiB), count=1889 (+1889), average=528 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:500: size=474 KiB (+474 KiB), count=7584 (+7584), average=64 B
/Users/jose/code/prometheus-mem-leak/.venv/lib/python3.7/site-packages/prometheus_client/core.py:802: size=397 KiB (+397 KiB), count=5639 (+5639), average=72 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:899: size=295 KiB (+295 KiB), count=3778 (+3778), average=80 B
[ Top 5 differences ]
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:799: size=1115 KiB (+1115 KiB), count=1982 (+1982), average=576 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:365: size=1022 KiB (+1022 KiB), count=1983 (+1983), average=528 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:500: size=497 KiB (+497 KiB), count=7956 (+7956), average=64 B
/Users/jose/code/prometheus-mem-leak/.venv/lib/python3.7/site-packages/prometheus_client/core.py:802: size=416 KiB (+416 KiB), count=5922 (+5922), average=72 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:899: size=310 KiB (+310 KiB), count=3966 (+3966), average=80 B
[ Top 5 differences ]
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:799: size=1151 KiB (+1151 KiB), count=2046 (+2046), average=576 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:365: size=1055 KiB (+1055 KiB), count=2047 (+2047), average=528 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:500: size=514 KiB (+514 KiB), count=8213 (+8213), average=64 B
/Users/jose/code/prometheus-mem-leak/.venv/lib/python3.7/site-packages/prometheus_client/core.py:802: size=430 KiB (+430 KiB), count=6108 (+6108), average=72 B
/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:899: size=320 KiB (+320 KiB), count=4094 (+4094), average=80 B

It's clearly coming from the server, because as soon as you stop the curl loop, memory usage stops growing.

@brian-brazil
Contributor

That code shouldn't even be calling the line it's pointing to, as there are no HTTP requests being made.

@josecv
Author

josecv commented Nov 10, 2018

there's no http requests being made

I'm assuming a continuous stream of HTTP requests, external to the script (the while true curl loop).

@hipyhop

hipyhop commented Nov 19, 2018

I'm seeing this behaviour too, though it's slow enough that it's not really a problem and our orchestrator is restarting the service when it hits 100Mi of memory.

Scraping 125 metrics every 30 seconds (~10-20 custom metrics, the rest default metrics)

[screenshot, 2018-11-19: memory usage graph growing over time]

@sbrandtb
Contributor

I found the issue and created a merge request. I hope it will get merged soon.

sbrandtb added a commit to FRI-DAY/client_python that referenced this issue Dec 10, 2018
The leak is caused by the fact that in Python 3.7, the default behaviour
of `ThreadingMixIn` is to use non-daemon threads and to block on those
threads when the server is closed. Because of that, it collects references
to all created threads, creating the "leak":
https://github.com/python/cpython/blob/v3.7.0/Lib/socketserver.py#L661

* Python 3.7: `block_on_close` is `True`: https://github.com/python/cpython/blob/v3.7.0/Lib/socketserver.py#L635
* Python 3.6: `_block_on_close` is `False`: https://github.com/python/cpython/blob/v3.6.7/Lib/socketserver.py#L639
* Python 2.7: There is no `block_on_close`, and thus no logic for collecting
  references: https://github.com/python/cpython/blob/v2.7.15/Lib/SocketServer.py#L582

Fix by setting `daemon_threads` to `True`, which in our case should be a
reasonable setting for all Python versions. The `ThreadingHTTPServer` stdlib
class, new in Python 3.7, also sets it by default:
https://github.com/python/cpython/blob/v3.7.0/Lib/http/server.py#L144

Signed-off-by: Sebastian Brandt <sebastian.brandt@friday.de>
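For illustration, a minimal sketch of the kind of change described above (not the literal diff that was merged, just the stdlib pieces involved):

from http.server import HTTPServer
from socketserver import ThreadingMixIn

class _ThreadingSimpleServer(ThreadingMixIn, HTTPServer):
    """Thread-per-request HTTP server for serving metrics."""
    # Handler threads become daemon threads, so on Python >= 3.7 ThreadingMixIn
    # no longer appends every spawned thread to its internal list (kept for
    # joining on server_close()), and finished threads can be garbage collected.
    daemon_threads = True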
@sbrandtb
Contributor

Until then, there is a quick fix:

from prometheus_client.exposition import _ThreadingSimpleServer, start_http_server

# Make handler threads daemon threads so the server stops keeping a reference
# to every thread it spawns (the Python >= 3.7 ThreadingMixIn behaviour).
_ThreadingSimpleServer.daemon_threads = True

start_http_server(9099)

sbrandtb added a commit to FRI-DAY/client_python that referenced this issue Dec 11, 2018
Signed-off-by: Sebastian Brandt <sebastian.brandt@friday.de>
brian-brazil pushed a commit that referenced this issue Dec 11, 2018
* Fix thread leak in Python 3.7 #340

@sbrandtb
Contributor

@brian-brazil since this is a rather bad thing in microservice environments, would you mind releasing this soon?

@tusharmakkar08
Contributor

@brian-brazil @sbrandtb I am facing this issue with prometheus-client==0.5.0 and Python 3.5.2, and it is not fixed by applying the workaround mentioned by @sbrandtb.

I am facing this issue not only when using start_http_server but also when exposing the Prometheus endpoint through Flask's server. Can you help me out?

To be precise, CLOSE_WAIT connections keep increasing until my server stops serving requests.

x@ip-10-0-0-0:~$ lsof -i | grep 8002
python 10601 x 3u IPv4 78413488 0t0 TCP *:8002 (LISTEN)
python 10601 x 55u IPv4 82060563 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45114 (CLOSE_WAIT)
python 10601 x 56u IPv4 82068773 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45142 (CLOSE_WAIT)
python 10601 x 57u IPv4 82069308 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45170 (CLOSE_WAIT)
python 10601 x 58u IPv4 82070804 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45198 (CLOSE_WAIT)
python 10601 x 59u IPv4 82071589 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45298 (CLOSE_WAIT)
python 10601 x 60u IPv4 82060912 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45328 (CLOSE_WAIT)
python 10601 x 61u IPv4 82073179 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45356 (CLOSE_WAIT)
python 10601 x 62u IPv4 82073397 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45386 (CLOSE_WAIT)
python 10601 x 63u IPv4 82075572 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45414 (CLOSE_WAIT)
python 10601 x 64u IPv4 82077191 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45442 (CLOSE_WAIT)
python 10601 x 65u IPv4 82076370 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45476 (CLOSE_WAIT)
python 10601 x 66u IPv4 82079364 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45504 (CLOSE_WAIT)
python 10601 x 67u IPv4 82077842 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45532 (CLOSE_WAIT)
python 10601 x 68u IPv4 82079997 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45560 (CLOSE_WAIT)
python 10601 x 69u IPv4 82081405 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45588 (CLOSE_WAIT)
python 10601 x 70u IPv4 82083436 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45616 (CLOSE_WAIT)
python 10601 x 71u IPv4 82084046 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45644 (CLOSE_WAIT)
python 10601 x 72u IPv4 82080465 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45672 (CLOSE_WAIT)
python 10601 x 73u IPv4 82084792 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45700 (CLOSE_WAIT)
python 10601 x 74u IPv4 82085875 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45728 (CLOSE_WAIT)
python 10601 x 75u IPv4 82088002 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45758 (CLOSE_WAIT)
python 10601 x 76u IPv4 82089044 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45788 (CLOSE_WAIT)
python 10601 x 77u IPv4 82086871 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45816 (CLOSE_WAIT)
python 10601 x 78u IPv4 82088321 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45844 (CLOSE_WAIT)
python 10601 x 79u IPv4 82090850 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45944 (CLOSE_WAIT)
python 10601 x 80u IPv4 82089551 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:45976 (CLOSE_WAIT)
python 10601 x 81u IPv4 82088475 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:46006 (CLOSE_WAIT)
python 10601 x 82u IPv4 82089875 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:46034 (CLOSE_WAIT)
python 10601 x 83u IPv4 82092200 0t0 TCP ip-10-0-0-0.ap-south-1.compute.internal:8002->ip-172-00-00-00.ap-south-1.compute.internal:46064 (CLOSE_WAIT)

[screenshot: Grafana memory usage graph]
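(For anyone debugging a similar symptom: a rough way to watch those sockets from inside the process is sketched below; it assumes the third-party psutil package, which is not part of prometheus_client.)

import psutil

# Count this process's TCP connections stuck in CLOSE_WAIT.
proc = psutil.Process()
close_wait = [c for c in proc.connections(kind="tcp")
              if c.status == psutil.CONN_CLOSE_WAIT]
print("CLOSE_WAIT sockets:", len(close_wait))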

@sbrandtb
Contributor

sbrandtb commented Feb 1, 2019

@tusharmakkar08 Please check out the comment in my merge request: this issue is very specific to Python >= 3.7.

Since you are seeing the issue on 3.5, I doubt it has the same cause. Also, the shape of the curve in Grafana does not match the behaviour of this bug at all: this bug does not make memory suddenly start growing (at 7:something) or suddenly drop again. Provided there are no other effects and your metrics endpoint is scraped regularly, this bug makes the graph grow very linearly, while yours is anything but.

However, without more information, such as code, we can't say much more.

@tusharmakkar08
Contributor

I think my issue is more similar to prometheus/jmx_exporter#327 and prometheus/jmx_exporter#352.

@sbrandtb
Contributor

@tusharmakkar08 I think you should open a new issue with code showing how to reproduce it; you won't get much of an audience or help here without reproducible code.

@yaverhussain

I'm seeing this behaviour too, though it's slow enough that it's not really a problem and our orchestrator is restarting the service when it hits 100Mi of memory.

Scraping 125 metrics every 30 seconds (~10-20 custom metrics, the rest default metrics)

Has anyone resolved this?

@sbrandtb
Contributor

@yaverhussain I fixed this two years ago and it was merged in #356.

@yaverhussain

yaverhussain commented Feb 10, 2021 via email

@sbrandtb
Contributor

@yaverhussain >= 0.6.0, see badges here

@yaverhussain

yaverhussain commented Feb 12, 2021 via email

@csmarchbanks
Member

Hello, rather than continuing to comment on this issue, would you make a new issue if you believe there is a problem with this client? If you have a general usage question, I recommend reaching out to the prometheus-users mailing list. Thanks in advance!

That said, if you are adding distinct URLs as label values, such as putting IDs/UUIDs in the label value, then you will use a lot of memory, as metrics must be stored for each URL. Usually people template URLs, e.g. "/my/path/:id", to avoid this issue.
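A rough sketch of that advice (the metric and route names below are made up for illustration; only Counter and labels() come from this client):

from prometheus_client import Counter

REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['path'])

def record_request(raw_path):
    # Labelling with raw_path directly ('/users/42', '/users/43', ...) would
    # create a new time series per distinct URL and memory would keep growing.
    # Record the route template instead, so the set of label values stays bounded.
    template = '/users/:id' if raw_path.startswith('/users/') else raw_path
    REQUESTS.labels(path=template).inc()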
