Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python freeze/hang on exit #411

Closed
johnmarcou opened this issue Dec 7, 2017 · 20 comments
Closed

Python freeze/hang on exit #411

johnmarcou opened this issue Dec 7, 2017 · 20 comments

Comments

@johnmarcou
Copy link

Context

Hello,

In a batch manager project, we are using this python client to submit jobs to the Kubernetes API.
In one word, the python application loads the library, submit the job, watch the related events, clean/delete the job, then return the succeeded or failed status. But sometimes, the application hang at the exit.

After investigation, it seems the ThreadPool used in the ApiClient class is not properly clean on the python process exit.

Reproduce

The easiest way to reproduce is to run this snippet:

python_version=3.6
kubernetes_version=4.0.0

docker run --name testing --rm -it --entrypoint "" python:$python_version /bin/bash -c "
pip install 'kubernetes==$kubernetes_version'
while true; do echo ===;
  for i in {0..50}; do python -c '

from kubernetes import client
coreapi = client.CoreV1Api()
print(0)' &

  done
wait
done"

This will run Python in a Docker container, install the Kubernetes python module, then run the test indefinitely. The test starts a simple application 50 times in order to increase the probability. This application loads the Kubernetes python module, create a CoreV1Api, which creates its ApiClient (with Async enabled using ThreadPool), then print 0 showing the freeze occured during the python exit sequence.

To stop the test:

docker rm -f testing

Expected:

This code should run indefinitely.

Result:

The loop hang on list of 0 after some time.

Workaround:

To avoid this issue, we override the ApiClient class to disable Async / ThreadPool feature. It seems to work without any issues so far. Downside is we are loosing the Async mode.

Thank you.

@RobbieClarken
Copy link

This seems to be related to the __del__ method on ApiClient cleaning up its ThreadPool: https://github.com/kubernetes-incubator/client-python/blob/30b8ee44f4d14546e651dead91306719d53f8c37/kubernetes/client/api_client.py#L76-L78

This can cause a deadlock when the api clients are garbage collected as Python exits. I can reproduce with the following:

# deadlock.py
from multiprocessing.pool import ThreadPool

class Deadlocker:
    def __init__(self):
        self.pool = ThreadPool()

    def __del__(self):
        self.pool.close()
        self.pool.join()

d1 = Deadlocker()
d2 = Deadlocker()
print('exiting...')

and running:

while true; do python3 deadlock.py; done

On macOS 10.12.6 with Python 3.6.3, after 1-50 executions, it will print "exiting..." and stall. The python process won't ever terminate until you hit ctrl-c. I can also reproduce on Linux but it seems to be less frequent that on macOS.

This means a simple script like this:

from kubernetes import client
coreapi = client.CoreV1Api()
batchapi = client.BatchV1Api()

may never terminate because CoreV1Api and BatchV1Api will both instantiate ApiClients which have the problematic __del__ method. We can reduce the likelihood of a deadlock by creating a single ApiClient and passing it into CoreV1Api and BatchV1Api but the problem doesn't go away entirely. There are also some classes like Watch that always instantiate their own ApiClient.

I wonder whether the multiprocessing.pool.ThreadPool class is suitable for production use cases. According to a stackoverflow comment I came across:

The multiprocessing.pool.ThreadPool is not documented as its implementation has never been completed. It lacks tests and documentation.

@jphx
Copy link

jphx commented Dec 7, 2017

I'm seeing what I suspect is another symptom of this problem. When I use the kubernetes Python module, I often see this when the program exits:

Exception ignored in: <bound method ApiClient.__del__ of <kubernetes.client.api_client.ApiClient object at 0x2ae3248da470>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/kubernetes/client/api_client.py", line 78, in __del__
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 501, in join
  File "/usr/lib/python3.4/threading.py", line 1056, in join
TypeError: 'NoneType' object is not callable

It doesn't seem to cause any harm, since as the message says, the exception is ignored since it's from the __del__ method, but I still get these ugly messages to stderr.

I didn't see this symptom when using version 3.0.0.

@tuxbotix
Copy link

tuxbotix commented Feb 8, 2018

I think the hang happens with the join() method, as it waits until the process is done, can someone verify this? If so, it has to be investigated why!

I introduced the __del__() as the ThreadPool isn't cleared automatically all the time when apiClient is destructed/ collected by GC. On rapid creation of many clients (bad way to do things) and stop using them, you can quickly run out of threads as the destruction of these apiClient instances doesn't necessarily close and join the ThreadPool. At least in C/ C++ sense, threads should be closed and joined appropriately, so in that sense, forcing the ThreadPool to close and join is correct. Lack of documentation of ThreadPool made it impossible to figure out the right way (without looking into its implementation)

@RobbieClarken You are correct on the ThreadPool's bad documentation and all. I googled a lot to figure out how to "correctly" deal with a threadpool instance.

@jphx a check can be done to see if the object is not none. Ugly but should work.

bookshelfdave pushed a commit to mozmeao/infra that referenced this issue May 2, 2018
…#813)

* Add option to use kubectl to work around kubernetes-client/python#411

* Use image built from previous commit in cronjob yaml

* Optimize by reducing api calls and other minor tweaks

* Use image built from previous commit in block-aws-cron.yaml
@scottmbaker
Copy link

Is there a recommended workaround for this issue? Currently I'm pursuing the approach of deploying a modified api_client.py with the pool code disabled.

Perhaps lazily creating the thread pool only when some call occurs with async=True would be a better approach than always creating the pool? While doing so doesn't necessarily resolve the bug, it would at least confine the bug to those using async=True, and prevent unnecessary churn of thread pool objects for those not using async=True.

@Sturgelose
Copy link

Sturgelose commented Jul 16, 2018

@johnmarcou @RobbieClarken I'm having exactly the same situation as @sbconsulting and I'm trying to find out a workaround to this issue. Would you mind sharing some code that you are using to temporally patch this issue? I can see that John reported that they override the APIClient class and disabled the threadpool feature, but would be good to know how you did it in case other people have the same issue as all us.

Anyways, thanks a lot for the discussion in this issue, I've been a week hunting this ghost bug!

@johnmarcou
Copy link
Author

johnmarcou commented Jul 16, 2018

Hi @Sturgelose,

The client.CoreV1Api() object use dependencies injection to get its ApiClient.
We can use this to create our own ApiClient, then inject it.

To create our own K8sApiClient, we are using inheritance of the ApiClient, then we override some functions.

We are talking about this Class:
https://github.com/kubernetes-client/python/blob/v4.0.0/kubernetes/client/api_client.py#L32

Here comes the patched K8sApiClient:

class K8sApiClient(client.ApiClient):
    def call_api(self, *args, async=None, **kwargs):
        return super().call_api(*args, async=False, **kwargs)

    def __init__(self, configuration=None, header_name=None, header_value=None, cookie=None):
        if configuration is None:
            configuration = Configuration()
        self.configuration = configuration

#        self.pool = ThreadPool()
        self.rest_client = RESTClientObject(configuration)
        self.default_headers = {}
        if header_name is not None:
            self.default_headers[header_name] = header_value
        self.cookie = cookie
        # Set default User-Agent.
        self.user_agent = 'Swagger-Codegen/4.0.0/python'

    def __del__(self):
        pass

Then how to use it:

        config = client.Configuration()
        api_client = K8sApiClient(configuration=config)
        coreapi = client.CoreV1Api(api_client)
        batchapi = client.BatchV1Api(api_client)

It's just some copy/paste from a project, it's not tested code. But you should have everything to disable this ThreadPool issue.

This was referenced Jul 23, 2018
@007
Copy link

007 commented Nov 5, 2018

This is broken for me as well.

Seems like swagger-api/swagger-codegen#8061 would fix it, since it only instantiates the problematic ThreadPool on first-request. If no requests in a session are async it should never create the pool.

openstack-gerrit pushed a commit to airshipit/armada that referenced this issue Nov 29, 2018
The kubernetes python client has a bug [1] which results in frequent
deadlocks while being cleaned up, which causes armada to hang at the
end of execution.

This patchset works around that issue by mocking out the associated
thread pools, since they are only needed for async kubernetes api
calls, which armada does not use.

[1]: kubernetes-client/python#411

Change-Id: I71fbfbe355347ae2ddd02ffd26d881368320246b
@spacez320
Copy link

I'm wondering what the state of this is. Given that there's a series of MRs, some open and some merged, and that this issue is itself open, would anyone be willing to provide a general summary of the path forward, and what versions of the client will have a fix?

@tomplus
Copy link
Member

tomplus commented Jan 29, 2019

@spacez320 The workaround was added and it's available in the latest version 9.0.0a1.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2019
@007
Copy link

007 commented Apr 30, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2019
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2019
@yliaog
Copy link
Contributor

yliaog commented Aug 27, 2019

@spacez320 could you please verify if the workaround works?

@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 26, 2019
@fejta-bot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@furkanmustafa
Copy link

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 29, 2020
@furkanmustafa
Copy link

/reopen

@k8s-ci-robot
Copy link
Contributor

@furkanmustafa: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@furkanmustafa
Copy link

related #422

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests