Richer logging and better error handling in driver pod watch #154
Conversation
LGTM. Thanks for the quick fix!
}

private def hasCompleted(): Boolean = {
  if (phase == "Succeeded" || phase == "Failed") {
The function body can be simplified to return phase == "Succeeded" || phase == "Failed"?
I had originally planned on having more logic there to detect other failed states :) Fixed. Thanks!
Sorry to jump on this again after that.
Nit: remove the return. Plus I think it's okay to inline the check if it's only used once.
Done
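For reference, this is roughly what the simplified helper discussed in this thread looks like, assuming phase holds the driver pod's status phase string:

// Sketch of the simplification suggested above: the body collapses to one expression.
private def hasCompleted(): Boolean = phase == "Succeeded" || phase == "Failed"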
override def eventReceived(action: Action, pod: Pod): Unit = {
  this.pod = Option(pod)

  logShortStatus()
  if (prevPhase != phase) {
does this still log on null -> pending and pending -> running changes in the pod?
Yes, it should, because each of them is reported as a MODIFIED event by the watch.
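As an illustration only (not the PR's exact code), here is a sketch of how comparing the remembered phase against the phase carried by each MODIFIED event surfaces the null -> Pending and Pending -> Running transitions; the class, field, and method names are hypothetical:

import io.fabric8.kubernetes.api.model.Pod

// Hypothetical holder class, just to give the sketch somewhere to keep state.
class PhaseChangeLogger {
  private var previousPhase: Option[String] = None

  def onPodModified(pod: Pod): Unit = {
    // Every MODIFIED event carries the full Pod, so the phase can be re-read each time.
    val newPhase = Option(pod.getStatus).map(_.getPhase).getOrElse("unknown")
    if (!previousPhase.contains(newPhase)) {
      println(s"Driver pod phase changed: ${previousPhase.getOrElse("<none>")} -> $newPhase")
    }
    previousPhase = Some(newPhase)
  }
}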
      closeWatch()

    case _ =>
      logLongStatus()
does this log long status more often?
Yes, it does, especially when the container goes into failed states while retaining the same phase (ImagePullBackoff, ImageErr, ContainerCannotRun, etc.), which it wouldn't log before.
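For context, a small sketch (not from this PR) of pulling container-level waiting reasons out of the pod object, since states like ImagePullBackoff live on the container statuses rather than on the pod phase; the object and helper names are made up:

import scala.collection.JavaConverters._
import io.fabric8.kubernetes.api.model.Pod

object PodDiagnostics {  // hypothetical name
  // Collects waiting reasons (e.g. ImagePullBackOff, ErrImagePull) from the
  // container statuses; the pod phase itself may still read Pending.
  def containerWaitingReasons(pod: Pod): Seq[String] = {
    Option(pod.getStatus)
      .map(_.getContainerStatuses.asScala.toSeq)
      .getOrElse(Seq.empty)
      .flatMap(cs => Option(cs.getState).flatMap(s => Option(s.getWaiting)))
      .map(_.getReason)
  }
}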
  }
}

override def onClose(e: KubernetesClientException): Unit = {
  scheduler.shutdown()
  logDebug(s"Stopped watching application $appId with last-observed phase $phase")
Now that this message is before the change, please change the language to Stopping ... instead of Stopped ...
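A tiny sketch of how the line might read after the suggested wording change, keeping the structure from the diff above:

override def onClose(e: KubernetesClientException): Unit = {
  scheduler.shutdown()
  // "Stopping" rather than "Stopped", per the review comment.
  logDebug(s"Stopping watching application $appId with last-observed phase $phase")
}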
}

private def logShortStatus() = {
  logInfo(s"Application status for $appId (phase: $phase)")
}

private def logLongStatus() = {
-  logInfo("Phase changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))
+  logInfo("State changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))
is it the phase or the state change that triggers this message?
It is the state change that triggers the message; we don't look at the phase anymore.
LGTM -- any opinions @mccheah ? I think we're ready to merge
    case Action.DELETED =>
      closeWatch()

    case Action.ERROR =>
is this the case we'd hit for ImagePullBackoff / ImageErr / ContainerCannotRun type states?
That's the case where the watch itself fails. Any error state of a Kubernetes pod is still captured under "modified". I was thinking more about whether we want to delete the driver on ImageErr etc., but each of those states has a retry loop within the pod lifecycle itself; that's why the pod isn't marked as failed and continues to be in Pending/Waiting. For example, after ImageErr, it will enter ImagePullBackoff and retry with exponential delays. We shouldn't delete in those cases IMO, but expose them to the user.
If we think we should delete it, however, I can do that in a subsequent PR. This one specifically handles driver pod deletion.
Ah ok yes, I was thinking you were putting that ImageErr handling in this PR through that Action.ERROR state, but putting it in separately works too.
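To make the distinction concrete, a rough sketch (not the PR's code) of a fabric8 Watcher where DELETED means the driver pod itself is gone, ERROR means the watch stream failed, and everything else (ADDED/MODIFIED, including pods stuck in ImagePullBackoff) is just logged; the class and helper names are hypothetical:

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

// Hypothetical watcher, sketching the action handling discussed above.
class DriverPodWatcherSketch extends Watcher[Pod] {

  override def eventReceived(action: Action, pod: Pod): Unit = action match {
    case Action.DELETED =>
      // The driver pod was deleted out from under us: clean up and stop watching.
      closeWatch()
    case Action.ERROR =>
      // The watch stream itself failed; this is not a pod error state.
      closeWatch()
    case _ =>
      // ADDED / MODIFIED: pod error states such as ImagePullBackoff arrive here.
      logState(pod)
  }

  // Older fabric8 client signature, matching the diff in this PR.
  override def onClose(e: KubernetesClientException): Unit = ()

  private def closeWatch(): Unit = ()        // placeholder
  private def logState(pod: Pod): Unit = ()  // placeholder
}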
The tangible benefits we get here are:
- richer logging for phase changes
- better handling for driver pod failure
- better handling for watch failure
* pod-watch progress around watch events * Simplify return * comments
…spark-on-k8s#154) * pod-watch progress around watch events * Simplify return * comments (cherry picked from commit d81c084)
Fixes #143
We detect if the watch says the driver pod was DELETED and clean up if that is the case.
Also added the ContainerStatus to the long status. It should be prettier though. cc @kimoonkim
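For illustration, a hedged sketch of folding the ContainerStatus details into a long status string using the fabric8 model; the formatting and the object name are made up, not what the PR actually emits:

import scala.collection.JavaConverters._
import io.fabric8.kubernetes.api.model.Pod

object PodStateFormatterSketch {  // hypothetical name
  def formatPodState(pod: Pod): String = {
    val phase = Option(pod.getStatus).map(_.getPhase).getOrElse("unknown")
    val containers = Option(pod.getStatus)
      .map(_.getContainerStatuses.asScala.toSeq)
      .getOrElse(Seq.empty)
      .map { cs =>
        // Report the most specific container state available.
        val state = Option(cs.getState) match {
          case Some(s) if s.getWaiting != null    => s"waiting (${s.getWaiting.getReason})"
          case Some(s) if s.getRunning != null    => "running"
          case Some(s) if s.getTerminated != null => s"terminated (exit ${s.getTerminated.getExitCode})"
          case _                                  => "unknown"
        }
        s"${cs.getName}: $state"
      }
    s"pod name: ${pod.getMetadata.getName}, phase: $phase, containers: [${containers.mkString(", ")}]"
  }
}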