Python freeze/hang on exit #411

johnmarcou · 2017-12-07T11:53:47Z

Context

Hello,

In a batch manager project, we are using this python client to submit jobs to the Kubernetes API.
In one word, the python application loads the library, submit the job, watch the related events, clean/delete the job, then return the succeeded or failed status. But sometimes, the application hang at the exit.

After investigation, it seems the ThreadPool used in the ApiClient class is not properly clean on the python process exit.

Reproduce

The easiest way to reproduce is to run this snippet:

python_version=3.6
kubernetes_version=4.0.0

docker run --name testing --rm -it --entrypoint "" python:$python_version /bin/bash -c "
pip install 'kubernetes==$kubernetes_version'
while true; do echo ===;
  for i in {0..50}; do python -c '

from kubernetes import client
coreapi = client.CoreV1Api()
print(0)' &

  done
wait
done"

This will run Python in a Docker container, install the Kubernetes python module, then run the test indefinitely. The test starts a simple application 50 times in order to increase the probability. This application loads the Kubernetes python module, create a CoreV1Api, which creates its ApiClient (with Async enabled using ThreadPool), then print 0 showing the freeze occured during the python exit sequence.

To stop the test:

docker rm -f testing

Expected:

This code should run indefinitely.

Result:

The loop hang on list of 0 after some time.

Workaround:

To avoid this issue, we override the ApiClient class to disable Async / ThreadPool feature. It seems to work without any issues so far. Downside is we are loosing the Async mode.

Thank you.

The text was updated successfully, but these errors were encountered:

RobbieClarken · 2017-12-07T13:15:43Z

This seems to be related to the __del__ method on ApiClient cleaning up its ThreadPool: https://github.com/kubernetes-incubator/client-python/blob/30b8ee44f4d14546e651dead91306719d53f8c37/kubernetes/client/api_client.py#L76-L78

This can cause a deadlock when the api clients are garbage collected as Python exits. I can reproduce with the following:

# deadlock.py
from multiprocessing.pool import ThreadPool

class Deadlocker:
    def __init__(self):
        self.pool = ThreadPool()

    def __del__(self):
        self.pool.close()
        self.pool.join()

d1 = Deadlocker()
d2 = Deadlocker()
print('exiting...')

and running:

while true; do python3 deadlock.py; done

On macOS 10.12.6 with Python 3.6.3, after 1-50 executions, it will print "exiting..." and stall. The python process won't ever terminate until you hit ctrl-c. I can also reproduce on Linux but it seems to be less frequent that on macOS.

This means a simple script like this:

from kubernetes import client
coreapi = client.CoreV1Api()
batchapi = client.BatchV1Api()

may never terminate because CoreV1Api and BatchV1Api will both instantiate ApiClients which have the problematic __del__ method. We can reduce the likelihood of a deadlock by creating a single ApiClient and passing it into CoreV1Api and BatchV1Api but the problem doesn't go away entirely. There are also some classes like Watch that always instantiate their own ApiClient.

I wonder whether the multiprocessing.pool.ThreadPool class is suitable for production use cases. According to a stackoverflow comment I came across:

The multiprocessing.pool.ThreadPool is not documented as its implementation has never been completed. It lacks tests and documentation.

jphx · 2017-12-07T21:24:00Z

I'm seeing what I suspect is another symptom of this problem. When I use the kubernetes Python module, I often see this when the program exits:

Exception ignored in: <bound method ApiClient.__del__ of <kubernetes.client.api_client.ApiClient object at 0x2ae3248da470>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/kubernetes/client/api_client.py", line 78, in __del__
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 501, in join
  File "/usr/lib/python3.4/threading.py", line 1056, in join
TypeError: 'NoneType' object is not callable

It doesn't seem to cause any harm, since as the message says, the exception is ignored since it's from the __del__ method, but I still get these ugly messages to stderr.

I didn't see this symptom when using version 3.0.0.

tuxbotix · 2018-02-08T14:24:19Z

I think the hang happens with the join() method, as it waits until the process is done, can someone verify this? If so, it has to be investigated why!

I introduced the __del__() as the ThreadPool isn't cleared automatically all the time when apiClient is destructed/ collected by GC. On rapid creation of many clients (bad way to do things) and stop using them, you can quickly run out of threads as the destruction of these apiClient instances doesn't necessarily close and join the ThreadPool. At least in C/ C++ sense, threads should be closed and joined appropriately, so in that sense, forcing the ThreadPool to close and join is correct. Lack of documentation of ThreadPool made it impossible to figure out the right way (without looking into its implementation)

@RobbieClarken You are correct on the ThreadPool's bad documentation and all. I googled a lot to figure out how to "correctly" deal with a threadpool instance.

@jphx a check can be done to see if the object is not none. Ugly but should work.

kubernetes-client/python#411

…#813) * Add option to use kubectl to work around kubernetes-client/python#411 * Use image built from previous commit in cronjob yaml * Optimize by reducing api calls and other minor tweaks * Use image built from previous commit in block-aws-cron.yaml

scottmbaker · 2018-05-23T15:50:55Z

Is there a recommended workaround for this issue? Currently I'm pursuing the approach of deploying a modified api_client.py with the pool code disabled.

Perhaps lazily creating the thread pool only when some call occurs with async=True would be a better approach than always creating the pool? While doing so doesn't necessarily resolve the bug, it would at least confine the bug to those using async=True, and prevent unnecessary churn of thread pool objects for those not using async=True.

Sturgelose · 2018-07-16T08:20:36Z

@johnmarcou @RobbieClarken I'm having exactly the same situation as @sbconsulting and I'm trying to find out a workaround to this issue. Would you mind sharing some code that you are using to temporally patch this issue? I can see that John reported that they override the APIClient class and disabled the threadpool feature, but would be good to know how you did it in case other people have the same issue as all us.

Anyways, thanks a lot for the discussion in this issue, I've been a week hunting this ghost bug!

johnmarcou · 2018-07-16T09:11:29Z

Hi @Sturgelose,

The client.CoreV1Api() object use dependencies injection to get its ApiClient.
We can use this to create our own ApiClient, then inject it.

To create our own K8sApiClient, we are using inheritance of the ApiClient, then we override some functions.

We are talking about this Class:
https://github.com/kubernetes-client/python/blob/v4.0.0/kubernetes/client/api_client.py#L32

Here comes the patched K8sApiClient:

class K8sApiClient(client.ApiClient):
    def call_api(self, *args, async=None, **kwargs):
        return super().call_api(*args, async=False, **kwargs)

    def __init__(self, configuration=None, header_name=None, header_value=None, cookie=None):
        if configuration is None:
            configuration = Configuration()
        self.configuration = configuration

#        self.pool = ThreadPool()
        self.rest_client = RESTClientObject(configuration)
        self.default_headers = {}
        if header_name is not None:
            self.default_headers[header_name] = header_value
        self.cookie = cookie
        # Set default User-Agent.
        self.user_agent = 'Swagger-Codegen/4.0.0/python'

    def __del__(self):
        pass

Then how to use it:

        config = client.Configuration()
        api_client = K8sApiClient(configuration=config)
        coreapi = client.CoreV1Api(api_client)
        batchapi = client.BatchV1Api(api_client)

It's just some copy/paste from a project, it's not tested code. But you should have everything to disable this ThreadPool issue.

007 · 2018-11-05T18:14:41Z

This is broken for me as well.

Seems like swagger-api/swagger-codegen#8061 would fix it, since it only instantiates the problematic ThreadPool on first-request. If no requests in a session are async it should never create the pool.

The kubernetes python client has a bug [1] which results in frequent deadlocks while being cleaned up, which causes armada to hang at the end of execution. This patchset works around that issue by mocking out the associated thread pools, since they are only needed for async kubernetes api calls, which armada does not use. [1]: kubernetes-client/python#411 Change-Id: I71fbfbe355347ae2ddd02ffd26d881368320246b

spacez320 · 2019-01-29T15:59:18Z

I'm wondering what the state of this is. Given that there's a series of MRs, some open and some merged, and that this issue is itself open, would anyone be willing to provide a general summary of the path forward, and what versions of the client will have a fix?

tomplus · 2019-01-29T21:11:26Z

@spacez320 The workaround was added and it's available in the latest version 9.0.0a1.

fejta-bot · 2019-04-30T04:56:45Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

007 · 2019-04-30T12:53:17Z

/remove-lifecycle stale

fejta-bot · 2019-07-29T13:16:06Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

yliaog · 2019-08-27T17:44:04Z

@spacez320 could you please verify if the workaround works?

fejta-bot · 2019-09-26T18:39:53Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2019-10-26T19:24:34Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2019-10-26T19:24:42Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

furkanmustafa · 2020-02-29T16:44:36Z

/remove-lifecycle rotten

furkanmustafa · 2020-02-29T16:44:42Z

/reopen

k8s-ci-robot · 2020-02-29T16:44:49Z

@furkanmustafa: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

furkanmustafa · 2020-02-29T16:45:07Z

related #422

rousku mentioned this issue Feb 8, 2018

[python] Thread pool not closed or handled on client. swagger-api/swagger-codegen#6392

Closed

tomplus mentioned this issue Apr 13, 2018

Exception RuntimeError: RuntimeError('cannot join current thread',) in ..... ignored #512

Closed

bookshelfdave added a commit to mozmeao/infra that referenced this issue Apr 25, 2018

blockaws cron hangs due to bug in K8s Python client

7a317c6

kubernetes-client/python#411

jgmize added a commit to mozmeao/infra that referenced this issue Apr 27, 2018

Add option to use kubectl to work around kubernetes-client/python#411

71cabb0

jgmize added a commit to mozmeao/infra that referenced this issue Apr 27, 2018

Add option to use kubectl to work around kubernetes-client/python#411

c9c810a

This was referenced Jul 23, 2018

Unit tests leaking thread pools #566

Closed

Memory leak #576

Closed

007 mentioned this issue Nov 5, 2018

[python] Avoid creating unused ThreadPools swagger-api/swagger-codegen#8061

Open

4 tasks

wing328 mentioned this issue Nov 6, 2018

[python] Avoid creating unused ThreadPools OpenAPITools/openapi-generator#1387

Merged

4 tasks

seaneagan mentioned this issue Nov 26, 2018

Uplift to swagger-codegen version 3.3.3 #692

Closed

tomplus mentioned this issue Dec 7, 2018

feat: avoid creating unused ThreadPools kubernetes-client/gen#91

Merged

jmontleon mentioned this issue Dec 12, 2018

automation-broker-apb canary image hangs running tests sometimes openshift/ansible-service-broker#1144

Closed

007 mentioned this issue Apr 30, 2019

migrate from swagger-codegen to openapi-generator kubernetes-client/gen#93

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2019

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2019

joshmeek mentioned this issue Jun 24, 2019

Bump bounds of kubernetes requirement PrefectHQ/prefect#1175

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2019

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 26, 2019

k8s-ci-robot closed this as completed Oct 26, 2019

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python freeze/hang on exit #411

Python freeze/hang on exit #411

johnmarcou commented Dec 7, 2017

RobbieClarken commented Dec 7, 2017

jphx commented Dec 7, 2017 •

edited

Loading

tuxbotix commented Feb 8, 2018 •

edited

Loading

scottmbaker commented May 23, 2018

Sturgelose commented Jul 16, 2018 •

edited

Loading

johnmarcou commented Jul 16, 2018 •

edited

Loading

007 commented Nov 5, 2018

spacez320 commented Jan 29, 2019

tomplus commented Jan 29, 2019

fejta-bot commented Apr 30, 2019

007 commented Apr 30, 2019

fejta-bot commented Jul 29, 2019

yliaog commented Aug 27, 2019

fejta-bot commented Sep 26, 2019

fejta-bot commented Oct 26, 2019

k8s-ci-robot commented Oct 26, 2019

furkanmustafa commented Feb 29, 2020

furkanmustafa commented Feb 29, 2020

k8s-ci-robot commented Feb 29, 2020

furkanmustafa commented Feb 29, 2020

Python freeze/hang on exit #411

Python freeze/hang on exit #411

Comments

johnmarcou commented Dec 7, 2017

Context

Reproduce

Expected:

Result:

Workaround:

RobbieClarken commented Dec 7, 2017

jphx commented Dec 7, 2017 • edited Loading

tuxbotix commented Feb 8, 2018 • edited Loading

scottmbaker commented May 23, 2018

Sturgelose commented Jul 16, 2018 • edited Loading

johnmarcou commented Jul 16, 2018 • edited Loading

007 commented Nov 5, 2018

spacez320 commented Jan 29, 2019

tomplus commented Jan 29, 2019

fejta-bot commented Apr 30, 2019

007 commented Apr 30, 2019

fejta-bot commented Jul 29, 2019

yliaog commented Aug 27, 2019

fejta-bot commented Sep 26, 2019

fejta-bot commented Oct 26, 2019

k8s-ci-robot commented Oct 26, 2019

furkanmustafa commented Feb 29, 2020

furkanmustafa commented Feb 29, 2020

k8s-ci-robot commented Feb 29, 2020

furkanmustafa commented Feb 29, 2020

jphx commented Dec 7, 2017 •

edited

Loading

tuxbotix commented Feb 8, 2018 •

edited

Loading

Sturgelose commented Jul 16, 2018 •

edited

Loading

johnmarcou commented Jul 16, 2018 •

edited

Loading