This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Richer logging and better error handling in driver pod watch #154

Merged
merged 3 commits on Feb 24, 2017

@@ -50,35 +50,47 @@ private[kubernetes] class LoggingPodStatusWatcher(podCompletedFuture: CountDownL
}

private var pod: Option[Pod] = Option.empty
private var prevPhase: String = null
private def phase: String = pod.map(_.getStatus().getPhase()).getOrElse("unknown")
private def status: String = pod.map(_.getStatus().getContainerStatuses().toString())
.getOrElse("unknown")

override def eventReceived(action: Action, pod: Pod): Unit = {
this.pod = Option(pod)

logShortStatus()
if (prevPhase != phase) {
logLongStatus()
}
prevPhase = phase

if (phase == "Succeeded" || phase == "Failed") {
podCompletedFuture.countDown()
scheduler.shutdown()
action match {
case Action.DELETED =>

does this still log on null -> pending and pending -> running changes in the pod?

Member Author

Yes, it should, because each of them is reported as a MODIFIED event by the watch.
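
For context, a minimal standalone sketch (not from this PR; the namespace, pod name, and object name below are placeholders) of how the fabric8 client delivers pod updates to a Watcher — phase transitions such as Pending -> Running arrive as MODIFIED events carrying the full updated Pod:

```scala
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

object PodPhaseWatchSketch extends App {
  val client = new DefaultKubernetesClient()
  client.pods()
    .inNamespace("default")
    .withName("spark-driver-example")
    .watch(new Watcher[Pod] {
      override def eventReceived(action: Action, pod: Pod): Unit = {
        // Every update to the pod, including a bare phase transition,
        // is delivered as Action.MODIFIED with the new Pod object attached.
        println(s"$action: phase=${pod.getStatus.getPhase}")
      }
      override def onClose(cause: KubernetesClientException): Unit = {
        println(s"Watch closed: $cause")
      }
    })
  // A real program would keep the JVM alive and close the watch when done.
}
```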

closeWatch()

case Action.ERROR =>

is this the case we'd hit for ImagePullBackoff / ImageErr / ContainerCannotRun type states?

Member Author

That's the case where the watch itself fails. Any error state of a Kubernetes pod is still captured under MODIFIED events. I was thinking more about whether we want to delete the driver on ImageErr etc., but each of those states has a retry loop within the pod lifecycle itself. That's why the pod isn't marked as failed and continues to be in Pending/Waiting. For example, after ImageErr it will enter ImagePullBackOff and retry with exponential delays. IMO we shouldn't delete the driver in those cases but expose the state to the user.

If we decide we should delete it, however, I can do that in a subsequent PR. This one specifically handles driver pod deletion.
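
To illustrate surfacing those states to the user rather than deleting the driver, here is a hypothetical helper (the object and method names are my own, not part of this change) that pulls the container-level waiting reasons — e.g. ErrImagePull or ImagePullBackOff — out of a pod's status:

```scala
import scala.collection.JavaConverters._
import io.fabric8.kubernetes.api.model.Pod

object PodStatusInspection {
  // Hypothetical helper, not part of this PR: waiting reasons such as
  // ErrImagePull or ImagePullBackOff keep the pod in the Pending phase, so
  // they only show up at the container-status level and can be logged there.
  def containerWaitingReasons(pod: Pod): Seq[String] = {
    pod.getStatus.getContainerStatuses.asScala
      .flatMap(status => Option(status.getState).flatMap(state => Option(state.getWaiting)))
      .map(waiting => s"${waiting.getReason}: ${waiting.getMessage}")
  }
}
```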


Ah ok yes, I was thinking you were putting that ImageErr handling in this PR through that Action.ERROR state, but putting it in separately works too.

The tangible benefits we get here (see the consolidated sketch after this list) are:

  • richer logging for phase changes
  • better handling for driver pod failure
  • better handling for watch failure
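
Putting those pieces together, a consolidated sketch of the action handling visible in this diff (using the helper names defined further down; the actual method may log more detail in each branch):

```scala
override def eventReceived(action: Action, pod: Pod): Unit = {
  this.pod = Option(pod)
  action match {
    case Action.DELETED =>
      // Better handling for driver pod failure: the pod went away entirely.
      closeWatch()
    case Action.ERROR =>
      // Better handling for watch failure: the watch itself broke.
      closeWatch()
    case _ =>
      // Richer logging: every other update logs the full formatted pod state.
      logLongStatus()
      if (hasCompleted()) {
        closeWatch()
      }
  }
}
```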

closeWatch()

case _ =>
logLongStatus()

does this log long status more often?

Member Author

Yes, it does, especially when the container goes into failed states while keeping the same phase (ImagePullBackOff, ImageErr, ContainerCannotRun, etc.), which it wouldn't log before.

if (hasCompleted()) {
closeWatch()
}
}
}

override def onClose(e: KubernetesClientException): Unit = {
scheduler.shutdown()
logDebug(s"Stopped watching application $appId with last-observed phase $phase")
logDebug(s"Stopping watching application $appId with last-observed phase $phase")
closeWatch()
}

private def logShortStatus() = {
logInfo(s"Application status for $appId (phase: $phase)")
}

private def logLongStatus() = {
logInfo("Phase changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))
logInfo("State changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))

is it the phase or the state change that triggers this message?

Member Author

It is the state change that triggers the message; we don't look at the phase anymore.

}

private def hasCompleted(): Boolean = {
phase == "Succeeded" || phase == "Failed"
}

private def closeWatch(): Unit = {
podCompletedFuture.countDown()
scheduler.shutdown()
}

private def formatPodState(pod: Pod): String = {
@@ -103,7 +115,8 @@ private[kubernetes] class LoggingPodStatusWatcher(podCompletedFuture: CountDownL
.asScala
.map(_.getImage)
.mkString(", ")),
("phase", pod.getStatus.getPhase())
("phase", pod.getStatus.getPhase()),
("status", pod.getStatus.getContainerStatuses().toString)
)

// Use more loggable format if value is null or empty