Richer logging and better error handling in driver pod watch #154
Conversation
LGTM. Thanks for the quick fix!
}

private def hasCompleted(): Boolean = {
  if (phase == "Succeeded" || phase == "Failed") {
The function body can be simplified to return phase == "Succeeded" || phase == "Failed"?
I had originally planned on having more logic there to detect other failed states :) Fixed. Thanks!
Sorry to jump on this again after that.
Nit: remove the return. Plus I think it's okay to inline the check if it's only used once.
Done
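For reference, this is roughly what the simplified helper discussed in this thread looks like, assuming phase holds the driver pod's status phase string:

// Sketch of the simplification suggested above: the body collapses to one expression.
private def hasCompleted(): Boolean = phase == "Succeeded" || phase == "Failed"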
override def eventReceived(action: Action, pod: Pod): Unit = {
  this.pod = Option(pod)

  logShortStatus()
  if (prevPhase != phase) {
does this still log on null -> pending and pending -> running changes in the pod?
Yes, it should, because each of them is reported as a MODIFIED event by the watch.
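As an illustration only (not the PR's exact code), here is a sketch of how comparing the remembered phase against the phase carried by each MODIFIED event surfaces the null -> Pending and Pending -> Running transitions; the class, field, and method names are hypothetical:

import io.fabric8.kubernetes.api.model.Pod

// Hypothetical holder class, just to give the sketch somewhere to keep state.
class PhaseChangeLogger {
  private var previousPhase: Option[String] = None

  def onPodModified(pod: Pod): Unit = {
    // Every MODIFIED event carries the full Pod, so the phase can be re-read each time.
    val newPhase = Option(pod.getStatus).map(_.getPhase).getOrElse("unknown")
    if (!previousPhase.contains(newPhase)) {
      println(s"Driver pod phase changed: ${previousPhase.getOrElse("<none>")} -> $newPhase")
    }
    previousPhase = Some(newPhase)
  }
}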
      closeWatch()

    case _ =>
      logLongStatus()
does this log long status more often?
Yes, it does, especially when the container goes into failed states while retaining the same phase (ImagePullBackoff, ImageErr, ContainerCannotRun, etc.), which it wouldn't log before.
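For context, a small sketch (not from this PR) of pulling container-level waiting reasons out of the pod object, since states like ImagePullBackoff live on the container statuses rather than on the pod phase; the object and helper names are made up:

import scala.collection.JavaConverters._
import io.fabric8.kubernetes.api.model.Pod

object PodDiagnostics {  // hypothetical name
  // Collects waiting reasons (e.g. ImagePullBackOff, ErrImagePull) from the
  // container statuses; the pod phase itself may still read Pending.
  def containerWaitingReasons(pod: Pod): Seq[String] = {
    Option(pod.getStatus)
      .map(_.getContainerStatuses.asScala.toSeq)
      .getOrElse(Seq.empty)
      .flatMap(cs => Option(cs.getState).flatMap(s => Option(s.getWaiting)))
      .map(_.getReason)
  }
}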
  }
}

override def onClose(e: KubernetesClientException): Unit = {
  scheduler.shutdown()
  logDebug(s"Stopped watching application $appId with last-observed phase $phase")
Now that this message is before the change, please change the language to Stopping ... instead of Stopped ...
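A tiny sketch of how the line might read after the suggested wording change, keeping the structure from the diff above:

override def onClose(e: KubernetesClientException): Unit = {
  scheduler.shutdown()
  // "Stopping" rather than "Stopped", per the review comment.
  logDebug(s"Stopping watching application $appId with last-observed phase $phase")
}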
}

private def logShortStatus() = {
  logInfo(s"Application status for $appId (phase: $phase)")
}

private def logLongStatus() = {
-  logInfo("Phase changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))
+  logInfo("State changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))
is it the phase or the state change that triggers this message?
It is the state change that triggers the message; we don't look at the phase anymore.
LGTM -- any opinions @mccheah ? I think we're ready to merge
    case Action.DELETED =>
      closeWatch()

    case Action.ERROR =>
is this the case we'd hit for ImagePullBackoff / ImageErr / ContainerCannotRun type states?
That's the case where the watch itself fails. Any error state of a Kubernetes pod is still captured under "modified". I was thinking more about whether we want to delete the driver on ImageErr etc., but each of those states has a retry loop within the pod lifecycle itself; that's why the pod isn't marked as failed and continues to be in Pending/Waiting. For example, after ImageErr, it will enter ImagePullBackoff and retry with exponential delays. We shouldn't delete in those cases IMO, but expose them to the user.
If we think we should delete it, however, I can do that in a subsequent PR. This one specifically handles driver pod deletion.
Ah ok yes, I was thinking you were putting that ImageErr handling in this PR through that Action.ERROR state, but putting it in separately works too.
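To make the distinction concrete, a rough sketch (not the PR's code) of a fabric8 Watcher where DELETED means the driver pod itself is gone, ERROR means the watch stream failed, and everything else (ADDED/MODIFIED, including pods stuck in ImagePullBackoff) is just logged; the class and helper names are hypothetical:

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

// Hypothetical watcher, sketching the action handling discussed above.
class DriverPodWatcherSketch extends Watcher[Pod] {

  override def eventReceived(action: Action, pod: Pod): Unit = action match {
    case Action.DELETED =>
      // The driver pod was deleted out from under us: clean up and stop watching.
      closeWatch()
    case Action.ERROR =>
      // The watch stream itself failed; this is not a pod error state.
      closeWatch()
    case _ =>
      // ADDED / MODIFIED: pod error states such as ImagePullBackoff arrive here.
      logState(pod)
  }

  // Older fabric8 client signature, matching the diff in this PR.
  override def onClose(e: KubernetesClientException): Unit = ()

  private def closeWatch(): Unit = ()        // placeholder
  private def logState(pod: Pod): Unit = ()  // placeholder
}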
The tangible benefits we get here are:
- richer logging for phase changes
- better handling for driver pod failure
- better handling for watch failure
* pod-watch progress around watch events * Simplify return * comments
…spark-on-k8s#154) * pod-watch progress around watch events * Simplify return * comments (cherry picked from commit d81c084)
Fixes #143
We detect if the watch says the driver pod was DELETED and clean up if that is the case.
Also added the ContainerStatus to the long status. It should be prettier though. cc @kimoonkim
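For illustration, a hedged sketch of folding the ContainerStatus details into a long status string using the fabric8 model; the formatting and the object name are made up, not what the PR actually emits:

import scala.collection.JavaConverters._
import io.fabric8.kubernetes.api.model.Pod

object PodStateFormatterSketch {  // hypothetical name
  def formatPodState(pod: Pod): String = {
    val phase = Option(pod.getStatus).map(_.getPhase).getOrElse("unknown")
    val containers = Option(pod.getStatus)
      .map(_.getContainerStatuses.asScala.toSeq)
      .getOrElse(Seq.empty)
      .map { cs =>
        // Report the most specific container state available.
        val state = Option(cs.getState) match {
          case Some(s) if s.getWaiting != null    => s"waiting (${s.getWaiting.getReason})"
          case Some(s) if s.getRunning != null    => "running"
          case Some(s) if s.getTerminated != null => s"terminated (exit ${s.getTerminated.getExitCode})"
          case _                                  => "unknown"
        }
        s"${cs.getName}: $state"
      }
    s"pod name: ${pod.getMetadata.getName}, phase: $phase, containers: [${containers.mkString(", ")}]"
  }
}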