Richer logging and better error handling in driver pod watch #154
```diff
@@ -50,35 +50,47 @@ private[kubernetes] class LoggingPodStatusWatcher(podCompletedFuture: CountDownL
   }
 
   private var pod: Option[Pod] = Option.empty
-  private var prevPhase: String = null
   private def phase: String = pod.map(_.getStatus().getPhase()).getOrElse("unknown")
+  private def status: String = pod.map(_.getStatus().getContainerStatuses().toString())
+    .getOrElse("unknown")
 
   override def eventReceived(action: Action, pod: Pod): Unit = {
     this.pod = Option(pod)
 
-    logShortStatus()
-    if (prevPhase != phase) {
-      logLongStatus()
-    }
-    prevPhase = phase
-
-    if (phase == "Succeeded" || phase == "Failed") {
-      podCompletedFuture.countDown()
-      scheduler.shutdown()
+    action match {
+      case Action.DELETED =>
+        closeWatch()
+
+      case Action.ERROR =>
```
> **Review comment:** Is this the case we'd hit for …?
>
> **Reply:** That's the case where the watch itself fails. Any error state of a Kubernetes pod is still captured under "modified". I was thinking more about whether we want to delete the driver on ImageErr etc., but each of those states has a retry loop within the pod lifecycle itself. That's why the pod isn't marked as failed and continues to be in Pending/Waiting. For example, after ImageErr, it will enter ImagePullBackOff and retry with exponential delays. We shouldn't delete in those cases IMO, but expose it to the user. If we think we should delete it, however, I can do that in a subsequent PR. This one specifically handles driver pod deletion.
>
> **Reply:** Ah ok, yes, I was thinking you were putting that … The tangible benefits we get here are: …
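The reply above notes that image-pull failures keep the pod in Pending while the kubelet retries, so they surface through container statuses rather than the pod phase. As a hedged illustration (not code from this PR), here is how those waiting reasons can be read off a pod with the fabric8 client model this file already uses; `waitingReasons` is a hypothetical helper name:

```scala
import io.fabric8.kubernetes.api.model.Pod
import scala.collection.JavaConverters._

// Hypothetical helper: collect waiting-state reasons (e.g. ErrImagePull,
// ImagePullBackOff) from a pod's container statuses. These show up while
// the pod phase is still "Pending", which is why the watch reports them
// as MODIFIED events rather than a Failed phase.
def waitingReasons(pod: Pod): Seq[String] =
  pod.getStatus.getContainerStatuses.asScala
    .flatMap(cs => Option(cs.getState).flatMap(s => Option(s.getWaiting)))
    .map(w => s"${w.getReason}: ${w.getMessage}")
    .toSeq
```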
```diff
+        closeWatch()
+
+      case _ =>
+        logLongStatus()
```
> **Review comment:** Does this log long status more often?
>
> **Reply:** Yes, it does, especially when the container goes into failing states while retaining the same phase (ImagePullBackOff, ImageErr, ContainerCannotRun, etc.), which it wouldn't log before.
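To make "same phase, different state" concrete, here is a sketch built with fabric8's generated builders (an illustration under assumed names, not part of the diff): both snapshots below report phase "Pending", so the old `prevPhase` check would log only the first event, while the new `case _` branch logs both.

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

// Two snapshots of the same pod: the phase never changes, but the
// container waiting reason does.
def pendingPodWith(reason: String): Pod =
  new PodBuilder()
    .withNewMetadata().withName("spark-driver").endMetadata()
    .withNewStatus()
      .withPhase("Pending")
      .addNewContainerStatus()
        .withNewState().withNewWaiting().withReason(reason).endWaiting().endState()
      .endContainerStatus()
    .endStatus()
    .build()

val first = pendingPodWith("ErrImagePull")
val second = pendingPodWith("ImagePullBackOff")
// watcher.eventReceived(Action.MODIFIED, first)   // logged by old and new code
// watcher.eventReceived(Action.MODIFIED, second)  // logged only by the new code
```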
```diff
+        if (hasCompleted()) {
+          closeWatch()
+        }
     }
   }
 
   override def onClose(e: KubernetesClientException): Unit = {
-    scheduler.shutdown()
-    logDebug(s"Stopped watching application $appId with last-observed phase $phase")
+    logDebug(s"Stopping watching application $appId with last-observed phase $phase")
+    closeWatch()
   }
 
   private def logShortStatus() = {
     logInfo(s"Application status for $appId (phase: $phase)")
   }
 
   private def logLongStatus() = {
-    logInfo("Phase changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))
+    logInfo("State changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))
```
> **Review comment:** Is it the phase or the state change that triggers this message?
>
> **Reply:** It is the state change that triggers the message; we don't look at the phase anymore.
```diff
   }
 
+  private def hasCompleted(): Boolean = {
+    phase == "Succeeded" || phase == "Failed"
+  }
+
+  private def closeWatch(): Unit = {
+    podCompletedFuture.countDown()
+    scheduler.shutdown()
+  }
+
   private def formatPodState(pod: Pod): String = {
```

```diff
@@ -103,7 +115,8 @@ private[kubernetes] class LoggingPodStatusWatcher(podCompletedFuture: CountDownL
         .asScala
         .map(_.getImage)
         .mkString(", ")),
-      ("phase", pod.getStatus.getPhase())
+      ("phase", pod.getStatus.getPhase()),
+      ("status", pod.getStatus.getContainerStatuses().toString)
     )
 
     // Use more loggable format if value is null or empty
```
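The context line above hints at how the (key, value) pairs are rendered. A minimal sketch of that step, assuming the surrounding code builds a `Seq[(String, String)]` named `details` and using "N/A" as an assumed placeholder:

```scala
// Hypothetical rendering step: one indented line per pair, substituting a
// placeholder when the value is null or empty so the log stays readable.
def formatPairs(details: Seq[(String, String)]): String =
  details.map { case (k, v) =>
    val readable = Option(v).filter(_.nonEmpty).getOrElse("N/A")
    s"\n\t $k: $readable"
  }.mkString("")
```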
> **Review comment:** Does this still log on null -> Pending and Pending -> Running changes in the pod?
>
> **Reply:** Yes, it should, because each of those transitions is reported as a MODIFIED event by the watch.
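For context on where those events come from, here is a minimal sketch of how a watcher like this one is typically attached with the fabric8 client and awaited through the latch. The namespace, pod name, appId, and the constructor shape `(latch, appId, interval)` are assumptions inferred from the hunk header, not code from this PR:

```scala
import java.util.concurrent.CountDownLatch
import io.fabric8.kubernetes.client.DefaultKubernetesClient

val podCompletedFuture = new CountDownLatch(1)
val client = new DefaultKubernetesClient()

// Assumed constructor: (completion latch, application id, logging interval ms).
val watcher = new LoggingPodStatusWatcher(podCompletedFuture, "spark-app-1", 1000L)

// Each status transition (null -> Pending -> Running -> Succeeded/Failed)
// arrives as an ADDED or MODIFIED event; DELETED and ERROR close the watch.
val watch = client.pods()
  .inNamespace("default")
  .withName("spark-driver")
  .watch(watcher)

try {
  // closeWatch() counts the latch down on completion, deletion, or watch error.
  podCompletedFuture.await()
} finally {
  watch.close()
  client.close()
}
```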