
Fixed hanged KubernetesPodOperator #28336

Merged
1 commit merged into apache:main from the fix_hang_kybernetes_pod_operator branch on Mar 4, 2023

Conversation

moiseenkov
Contributor

@moiseenkov moiseenkov commented Dec 13, 2022

Fixes: #23497

  • Inserted a container status check before pulling logs from the stream, because KubernetesPodOperator hangs when attempting to read the logs of a terminated container. With this fix we make sure, before reading logs, that the container is either alive or terminated no longer ago than the specified timeout (a minimal sketch of the idea follows this list).
  • Updated unit tests
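
For context, here is a minimal sketch of the kind of check described in the first bullet, assuming the kubernetes client's V1ContainerStatus objects; the function name and the post_termination_timeout parameter are illustrative, not the exact identifiers used in this PR:

from datetime import datetime, timedelta, timezone

def should_read_logs(container_status, post_termination_timeout=timedelta(seconds=120)):
    # Illustrative only: read logs if the container is still running,
    # or if it terminated no longer ago than the timeout.
    state = container_status.state
    if state.running is not None:
        return True
    if state.terminated is None or state.terminated.finished_at is None:
        return False
    return datetime.now(timezone.utc) - state.terminated.finished_at <= post_termination_timeout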

@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:providers labels Dec 13, 2022
@moiseenkov moiseenkov force-pushed the fix_hang_kybernetes_pod_operator branch 8 times, most recently from 09c496a to da0d40a Compare December 15, 2022 08:59
@moiseenkov moiseenkov force-pushed the fix_hang_kybernetes_pod_operator branch from da0d40a to 6c19712 Compare December 19, 2022 08:32
Contributor

@Taragolis Taragolis left a comment

nit: datetime

@moiseenkov moiseenkov force-pushed the fix_hang_kybernetes_pod_operator branch 3 times, most recently from 62c82b4 to b368ce9 Compare December 21, 2022 10:28
Contributor

@Taragolis Taragolis left a comment

Looks good to me. However, it would be better if someone else also took a look at these changes.

Member

@potiuk potiuk left a comment

LGTM too. @dstandish @jedcunningham - maybe you can take a look too?

@moiseenkov moiseenkov force-pushed the fix_hang_kybernetes_pod_operator branch 2 times, most recently from 45093c9 to b56e4a8 Compare January 19, 2023 14:32
@moiseenkov moiseenkov force-pushed the fix_hang_kybernetes_pod_operator branch from b56e4a8 to 9370669 Compare January 31, 2023 14:34
@eladkal eladkal requested a review from dstandish February 1, 2023 07:45
@moiseenkov moiseenkov force-pushed the fix_hang_kybernetes_pod_operator branch 2 times, most recently from 64b97ef to 31ffba3 Compare February 6, 2023 15:12
@moiseenkov moiseenkov force-pushed the fix_hang_kybernetes_pod_operator branch 2 times, most recently from f4b28d7 to f1b17dd Compare February 10, 2023 10:26
@moiseenkov moiseenkov force-pushed the fix_hang_kybernetes_pod_operator branch 2 times, most recently from a81264f to 51114aa Compare February 13, 2023 17:28
@moiseenkov
Contributor Author

@dimberman, @dstandish, @jedcunningham, hi.
Could you please review my fixes?

@potiuk
Member

potiuk commented Feb 25, 2023

Needs re-review @dimberman @dstandish I guess.

@dstandish
Contributor

Hi, I dismissed my old review, so it's not blocking.

I do have a suggestion, though; sorry if it's a bit late in the game, and maybe it doesn't have to be done in this PR.

The thing that stuck out to me when looking at this is that we make a Kubernetes API call (in logs_available) for every chunk in the log stream. That could result in a lot of calls and, depending on how many such processes are running on the cluster, could cause problems. Just a hunch, I guess. To avoid this, perhaps you could run the logs_available check in a thread and have it run periodically, say once every 30 seconds, and when it returns false, set a stop boolean on the consumer so that it knows to exit the loop. This decouples the checking from the log stream, so that the rate of checking does not increase with log volume.
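
A rough sketch of that thread-based idea, just to make it concrete (the names is_running and consumer.stop are hypothetical stand-ins, not existing APIs):

import threading
import time

def start_pod_watcher(is_running, consumer, interval=30):
    # Poll the cluster in a background thread; once the pod is no longer
    # running, set the stop flag so the log-reading loop can exit.
    def _loop():
        while not consumer.stop:
            if not is_running():
                consumer.stop = True
                return
            time.sleep(interval)

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread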

@dstandish
Contributor

dstandish commented Feb 27, 2023

OK I just experimented with our own "event scheduler" helper and it seems that we could use it to limit calls without managing threads. Here's a code sample:

import time

from airflow.utils.event_scheduler import EventScheduler

class Tracker:
    # Shared state between the scheduled callback and the main loop.
    stop = False
    counter = 0

def hello(tracker: Tracker):
    # Scheduled callback: runs at most once per interval and sets the
    # stop flag after ten calls.
    tracker.counter += 1
    print("hi! %s" % tracker.counter)
    if tracker.counter > 10:
        tracker.stop = True

e = EventScheduler()
tracker = Tracker()
# Schedule `hello` to run at most once every 2 seconds.
e.call_regular_interval(2, hello, (tracker,))

while True:
    # Non-blocking: `hello` only fires if its interval has elapsed.
    e.run(blocking=False)
    time.sleep(0.5)
    if tracker.stop is True:
        break
I believe what this does is keep track of how long it has been since hello was last called; then, on every pass through the loop, it checks whether it needs to run and runs it if it does. So you could replace hello with the call that checks whether the pod is still running, and it won't make those network calls more often than it has to.
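
To illustrate, a hypothetical adaptation of the sample above (container_is_running here is any zero-argument callable standing in for the real status check; it is not an existing Airflow API):

from airflow.utils.event_scheduler import EventScheduler

class LogConsumerState:
    stop = False

def stop_when_finished(state: LogConsumerState, container_is_running):
    # Scheduled check: flip the stop flag once the container has finished.
    if not container_is_running():
        state.stop = True

state = LogConsumerState()
scheduler = EventScheduler()
# Check at most once every 30 seconds, independent of log volume.
scheduler.call_regular_interval(30, stop_when_finished, (state, lambda: True))

# Inside the log-reading loop one would then call:
#     scheduler.run(blocking=False)
#     if state.stop:
#         break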

@dimberman
Contributor

LGTM. Let's merge when the tests pass.

@potiuk potiuk force-pushed the fix_hang_kybernetes_pod_operator branch from fe14347 to 6007df5 Compare February 27, 2023 20:41
@moiseenkov
Contributor Author

@dstandish , thank you for proposing a nicer approach.
You mentioned earlier that we do k8s API calls every chunk. It is true for earlier versions of this PR, but currently the API response is being cached for 120 seconds. Please, take a look at the method PodLogsConsumer.read_pod() - it invokes PodManager.read_pod() only if the cache is empty or last call was more than self.read_pod_cache_timeout seconds ago (by default 120s). Thus there's no such problem now, however the EventScheduler approach would look nicer, and the code can always be refactored if needed.

@dstandish
Contributor

Ah ok thanks 👍

@potiuk potiuk merged commit 6d2face into apache:main Mar 4, 2023
Labels
area:providers provider:cncf-kubernetes Kubernetes provider related issues
Development

Successfully merging this pull request may close these issues.

Tasks stuck indefinitely when following container logs
6 participants