K8S: auth and pod/job race #704
Conversation
return KubernetesJob(self, **kwargs)

def get(self):
    if time.time() - self._client_refresh_timestamp < CLIENT_REFRESH_INTERVAL_SECONDS:
Shouldn't the comparison be ">" ?
yep 🤦♂️
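For context, a minimal sketch of what the corrected check presumably looks like once the comparison is flipped to ">" (the class name, the make_client factory, and the interval value are illustrative assumptions, not the actual implementation):

```python
import time

CLIENT_REFRESH_INTERVAL_SECONDS = 300  # illustrative value


class CachingClientProvider:
    """Hypothetical stand-in for the client wrapper discussed above."""

    def __init__(self, make_client):
        self._make_client = make_client
        self._client = None
        self._client_refresh_timestamp = 0

    def get(self):
        # Rebuild the client when the cached one is older than the refresh
        # interval (e.g. because auth tokens expire); otherwise reuse it.
        if time.time() - self._client_refresh_timestamp > CLIENT_REFRESH_INTERVAL_SECONDS:
            self._client = self._make_client()
            self._client_refresh_timestamp = time.time()
        return self._client
```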
Some quick comments. I haven't been able to play with this PR live.
# We shouldn't really get here unless the K8S control plane is
# really unhealthy.
echo(
    "Pod is not running but the job is not done or failed, last job state: %s" % self._job._job,
In this PR, the implementation logic from kubernetes_client.py is bleeding into this layer (pod awareness, for example). If you look at the original Kubernetes PR, job.status encapsulates the reason, which can simply be printed out here.
@@ -317,7 +317,7 @@ def _print_available(tail, stream, should_persist=False):
    now = time.time()
    log_update_delay = update_delay(now - start_time)
    next_log_update = now + log_update_delay
-   is_running = self._job.is_running
+   is_running = self._job.check_pod_is_running()
What is the reasoning behind changing job.is_running to job.check_pod_is_running?
To make it clear that it checks that the pod is running, not the k8s job. The entire issue was that is_running was checking the pod status while is_done was checking the job status; as a result, RunningJob could be neither "running" nor "done" at the same time.
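Roughly, the two checks were looking at two different API objects. A minimal sketch of the pre-fix shape (methods shown outside their class for brevity; assuming self._pod and self._job hold the raw V1Pod and V1Job dicts, with the exact field access being illustrative):

```python
@property
def is_running(self):
    # "running" was derived from the *pod* phase...
    return self._pod.get("status", {}).get("phase") == "Running"

@property
def is_done(self):
    # ...while "done" was derived from the *V1Job* status, which can lag a few
    # seconds behind the pod, so both properties could be False at once.
    status = self._job.get("status", {})
    return bool(status.get("succeeded") or status.get("failed"))
```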
Actually, now reading your comments on this PR, I realize that the "Job" in RunningJob wasn't intended to mean "Kubernetes Job" but a separate logical concept that kind of abstracts away both k8s jobs and pods. It makes sense now, but I wonder if there's a way to make it clearer for the reader; when I encounter a variable named _job in kubernetes.py it makes me think of a K8S job object.
@@ -472,8 +496,7 @@ def id(self):
        # TODO (savin): Should we use pod id instead?
        return self._id

-   @property
-   def is_done(self):
+   def check_is_done(self):
If it's not too much trouble, can we move back to @property to maintain parity with the @batch implementation?
It is definitely subjective, but when I was debugging this issue I found that @property made this code considerably harder to understand, at least for me. When I see self.is_done I assume it is either a simple variable read or maybe a thin wrapper over one, whereas in reality it does pretty complex stuff: it calls the K8S API and updates the internal state.
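To illustrate the trade-off being discussed (a sketch only; this RunningJob is simplified and _refresh_status/_done are made-up names, not the real attributes):

```python
class RunningJob:
    def __init__(self):
        self._done = False

    def _refresh_status(self):
        # In the real code this would call the K8S API and update self._done.
        ...

    @property
    def is_done(self):
        # Reads like a cheap attribute access at the call site...
        self._refresh_status()  # ...but actually does a remote call.
        return self._done

    def check_is_done(self):
        # An explicit method call makes it obvious that real work happens here.
        self._refresh_status()
        return self._done
```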
I agree with the sentiment - this is just to keep the codebase a bit consistent.
    else:
        time.sleep(POD_FETCH_BACKOFF_SECONDS)
else:
    raise Exception(
        "Could not fetch pod status in %s seconds"
        % (POD_FETCH_RETRIES * POD_FETCH_BACKOFF_SECONDS)
    )
When we throw an exception here, does it automatically terminate the k8s job?
    # done but the job is still "active". It is more likely to happen
    # when the control plane is overloaded, e.g. on minikube/kind.
    self._job.wait_done(timeout_seconds=20)
except TimeoutError:
Is it feasible to encapsulate this logic in kubernetes_client.py? Currently, that module takes care of all the complexity of ascertaining the correct job status.
Let me see. I now get your intent that RunningJob was supposed to abstract away the k8s job and pod together.
@savingoyal please take another look, I've added back the properties as an external interface for RunningJob.
elif not self._job.is_done:
    # Kill the job if it is still running by throwing an exception.
    raise KubernetesKilledException("Task failed!")
Removed, since it is no longer possible for RunningJob to be not "running" and not "done" at the same time.
A comment, but LGTM otherwise!
pod_phase = self._pod.get("status", {}).get("phase")
if pod_phase == "Running":
    return True
elif pod_phase in ("Succeeded", "Failed"):
Maybe I am missing something - why do we need to wait for 40 seconds before marking the job as not running, rather than doing so immediately?
It is about issue 3 in the PR description, and about trying to make the RunningJob state machine simpler.
The core issue here was that we'd rely on the pod status to determine "running" and on the V1Job status to determine "done", and it is technically possible for the pod to be "Succeeded" while the V1Job still has "active = 1" for a brief period, before the job controller updates the V1Job metadata. That caused a successful RunningJob to go through these states:
- not-running, not-done (haven't started yet)
- running, not-done
- not-running, not-done (again!)
- not-running, done
I wanted to get rid of this 3rd state because it is confusing for the caller, sometimes doesn't even happen, and in this case this complexity practically caused this bug. For the caller, it is also indistinguishable from the very first state, which feels like a trap for future us. So in this PR, I've changed it so that if the pod is done but the job isn't, is_running just kind of waits it out, so that the next call to is_done is guaranteed to return True. In other words, after the job has started, it is impossible for it to be not-running and not-done at the same time.
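A rough sketch of the "wait it out" behavior described above (the helper _wait_for_job_done and the exact timeout are illustrative, not the code in this PR):

```python
def check_pod_is_running(self):
    pod_phase = self._pod.get("status", {}).get("phase")
    if pod_phase == "Running":
        return True
    if pod_phase in ("Succeeded", "Failed"):
        # The pod has finished, but the V1Job may still report active=1 for a
        # short while. Block until the job controller catches up so that the
        # next check_is_done() call is guaranteed to return True.
        self._wait_for_job_done(timeout_seconds=40)  # the ~40s window discussed above
    return False
```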
I agree, but we can return False from _check_is_running immediately when pod_phase in ("Succeeded", "Failed"), and in _check_is_job_done we can add a check for _check_is_running if the V1Job status is active. That should remove the need for a 40-second poll.
So something like
def _check_is_job_done(self):
    if self._job["active"] and not self._check_is_running():
        return True
    ...
Effectively "folding" that intermediate state into the "done" state instead?
That's an option; I think what stopped me is that it would complicate is_succeeded/is_failed, since they wouldn't be able to rely on the V1Job status, as it may not be up to date yet.
We can change the logic of is_succeeded/is_failed to also inspect the status of "pod_phase". We do have this issue where the "pod" itself may disappear depending on how the job was canceled.
Another point that we need to think through, if we are introducing a check for succeeded and failed pod phases, is what happens when a new pod phase is introduced. We can consider inspecting the PodConditions to have a greater degree of certainty, although it makes reasoning about the code a bit more convoluted. That was one of the reasons why I was only looking at the Running pod phase in the original implementation.
So at the end of the day:
RunningJob.running := V1PodStatus.phase is "Running"
RunningJob.done := V1JobStatus.succeeded > 0 OR V1JobStatus.failed > 0 OR V1PodStatus.phase in ("Succeeded", "Failed") OR V1JobSpec.parallelism is 0
RunningJob.succeeded := done AND (V1PodStatus.phase is "Succeeded" OR V1JobStatus.succeeded > 0)
RunningJob.failed := done AND (V1PodStatus.phase is "Failed" OR V1JobSpec.parallelism is 0 OR V1JobStatus.failed > 0)
RunningJob.waiting := NOT (running OR done)
(If the pod doesn't exist, V1PodStatus.phase is assumed to be "null" above.)
This should ensure that the job goes through waiting -> running -> done without weird intermediate states, and that a "done" job is always either "succeeded" or "failed".
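A minimal sketch of those predicates in Python, assuming self._job and self._pod hold the V1Job and V1Pod dicts (with self._pod set to None when the pod doesn't exist); method names and field access are illustrative, not the PR's exact code:

```python
def _pod_phase(self):
    return (self._pod or {}).get("status", {}).get("phase")

def is_running(self):
    return self._pod_phase() == "Running"

def is_done(self):
    status = self._job.get("status", {})
    return bool(
        status.get("succeeded")
        or status.get("failed")
        or self._pod_phase() in ("Succeeded", "Failed")
        or self._job.get("spec", {}).get("parallelism") == 0
    )

def is_succeeded(self):
    return self.is_done() and (
        self._pod_phase() == "Succeeded"
        or bool(self._job.get("status", {}).get("succeeded"))
    )

def is_failed(self):
    return self.is_done() and (
        self._pod_phase() == "Failed"
        or self._job.get("spec", {}).get("parallelism") == 0
        or bool(self._job.get("status", {}).get("failed"))
    )

def is_waiting(self):
    return not (self.is_running() or self.is_done())
```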
Seems reasonable to me (I need to dust off my notes to find edge cases). This state diagram is pretty much the original implementation with an accommodation for the eventual consistency of the job status? This still doesn't technically ensure that the job goes through waiting -> running -> done, since pod phases can flip unpredictably to unknown.
Also, as an aside, the implementation currently also tries to optimize the number of pod and job status API calls, so the order in which these conditions are evaluated is important.
Yep, unknown is another thing we may need to revisit. I've updated the implementation to match this logic. I may also add some unit tests in a separate PR, as this seems like a perfect use case for unit testing.
Btw, I have also removed another wait loop, where I'd wait for the pod object to appear after the job was created; I think it is not necessary as long as we carefully handle pod-does-not-exist in the state machine logic.
* Refactor @resources decorator

  @resources decorator is shared by all compute related decorators - @Batch, @lambda, @K8s, @titus. This patch moves it out of batch_decorator.py so that other decorators can cleanly reference it.
* Update __init__.py
* Refactor @Batch decorator
* more change
* more changes
* more changes
* @kubernetes
* Kubernetes
* More changes
* More changes
* more changes
* some more changes
* more changes
* add disk space
* Add todos
* some fixes
* add k8s testing context
* more changes
* some more changes
* minor fixups
* better error handling for evicted pods (#711)
* fixes for pod/job metadata race conditions (#704)
* K8S: label value sanitizer (#719)
* rename name_space to namespace for k8s plugin (#750)
* fix k8s attribute handling bug (#753)
* tweak k8s test resources (to run on kind) (#754)
* add k8s api retries (#756)
* update done marker
* Use linux binaries in @conda when run in k8s (#758)

  Conda environment should pack linux python binary when run on MacOS to avoid an error: metaflow_PlayListFlow_osx-64_179c56284704ca8e53622f848a3df27cdd1f4327/bin/python: cannot execute binary file: Exec format error
* fix comment
* fix merge conflict
* update char

Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: Roman Kindruk <36699371+sappier@users.noreply.github.com>
This PR has a couple fixes as a result of running our internal test suite:
I have also slightly refactored kubernetes_client.py for readability (obviously this is very subjective).