This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Richer logging and better error handling in driver pod watch #154

Merged
merged 3 commits into k8s-support-alternate-incremental from pod-watch on Feb 24, 2017

Conversation

@foxish foxish (Member) commented Feb 24, 2017

Fixes #143

We detect if the watch says the driver pod was DELETED and clean up if that is the case.
Also added the ContainerStatus to the long status. It should be prettier though.

cc/ @kimoonkim
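
For context, a minimal sketch of the approach, assuming fabric8's Watcher API; the helper names closeWatch and logLongStatus come from the diff snippets below, everything else is illustrative rather than the exact merged code:

    import io.fabric8.kubernetes.api.model.Pod
    import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
    import io.fabric8.kubernetes.client.Watcher.Action

    class DriverPodWatcher extends Watcher[Pod] {
      private var pod: Option[Pod] = None

      override def eventReceived(action: Action, pod: Pod): Unit = {
        this.pod = Option(pod)
        action match {
          case Action.DELETED | Action.ERROR =>
            // The driver pod was deleted or the watch itself failed: clean up.
            closeWatch()
          case _ =>
            // ADDED / MODIFIED: log the latest pod state, including container statuses.
            logLongStatus()
        }
      }

      override def onClose(e: KubernetesClientException): Unit = ()

      // Placeholders standing in for the real cleanup and logging helpers.
      private def closeWatch(): Unit = ()
      private def logLongStatus(): Unit = ()
    }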

@kimoonkim kimoonkim mentioned this pull request Feb 24, 2017
@kimoonkim kimoonkim (Member) left a comment

LGTM. Thanks for the quick fix!

  }

  private def hasCompleted(): Boolean = {
    if (phase == "Succeeded" || phase == "Failed") {
Member

The function body can be simplified to return phase == "Succeeded" || phase == "Failed"?

Member Author

I had originally planned on having more logic there to detect other failed states :) Fixed. Thanks!


Sorry to jump on this again after that.

nit: remove return. Plus I think it's okay to inline the check if it's only used once.

Member Author

Done
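
For illustration, the form this thread converged on would look roughly like the following (a sketch; the merged code may differ slightly):

    // Simplified helper with the redundant if/return removed...
    private def hasCompleted(): Boolean =
      phase == "Succeeded" || phase == "Failed"

    // ...or, per the nit, the check inlined at its single call site:
    if (phase == "Succeeded" || phase == "Failed") {
      // handle driver completion here
    }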


  override def eventReceived(action: Action, pod: Pod): Unit = {
    this.pod = Option(pod)

    logShortStatus()
    if (prevPhase != phase) {

does this still log on null -> pending and pending -> running changes in the pod?

Member Author

Yes, it should, because each of those transitions is reported as a MODIFIED event by the watch.

        closeWatch()

      case _ =>
        logLongStatus()

does this log long status more often?

Member Author

Yes, it does, especially when the container goes into failed states while retaining the same phase (ImagePullBackoff, ImageErr, ContainerCannotRun, etc.), which it wouldn't log before.
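
As a sketch of why this matters: those failure reasons live in the per-container ContainerStatus rather than in the pod phase, so they can be pulled out along these lines (illustrative only; in the PR the formatting is done by formatPodState):

    import scala.collection.JavaConverters._
    import io.fabric8.kubernetes.api.model.Pod

    // Collect per-container waiting reasons (e.g. ImagePullBackOff); the pod phase
    // can stay "Pending" while these change, which is why every MODIFIED event now
    // triggers the long status log.
    def containerWaitingReasons(pod: Pod): Seq[String] =
      pod.getStatus.getContainerStatuses.asScala
        .flatMap(s => Option(s.getState).flatMap(st => Option(st.getWaiting)))
        .flatMap(w => Option(w.getReason))
        .toList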

    }
  }

  override def onClose(e: KubernetesClientException): Unit = {
    scheduler.shutdown()
    logDebug(s"Stopped watching application $appId with last-observed phase $phase")

now that this message is before the change, please change language to Stopping ... instead of Stopped ...
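
Concretely, the requested wording would be something like this (sketch):

    logDebug(s"Stopping watching application $appId with last-observed phase $phase")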

  }

  private def logShortStatus() = {
    logInfo(s"Application status for $appId (phase: $phase)")
  }

  private def logLongStatus() = {
-   logInfo("Phase changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))
+   logInfo("State changed, new state: " + pod.map(formatPodState(_)).getOrElse("unknown"))

is it the phase or the state change that triggers this message?

Member Author

It is the state change that triggers the message; we don't look at the phase anymore.

@ash211 ash211 left a comment

LGTM -- any opinions @mccheah ? I think we're ready to merge

      case Action.DELETED =>
        closeWatch()

      case Action.ERROR =>

is this the case we'd hit for ImagePullBackoff / ImageErr / ContainerCannotRun type states?

Member Author

That's the case where the watch itself fails. Any error state of a Kubernetes pod is still captured under MODIFIED. I was thinking more about whether we want to delete the driver on ImageErr etc., but each of those states has a retry loop within the pod lifecycle itself. That's why the pod isn't marked as failed and continues to be in Pending/Waiting. For example, after ImageErr, it will enter ImagePullBackoff and retry with exponential delays. We shouldn't delete in those cases IMO but expose it to the user.

If we think we should delete it however, I can do that in a subsequent PR. This one specifically handles driver pod deletion.


Ah ok yes, I was thinking you were putting that ImageErr handling in this PR through that Action.ERROR state, but putting it in separately works too.

The tangible benefits we get here are:

  • richer logging for phase changes
  • better handling for driver pod failure
  • better handling for watch failure

@ash211 ash211 changed the title pod-watch progress around watch events Richer logging and better error handling in driver pod watch Feb 24, 2017
@ash211 ash211 merged commit d81c084 into k8s-support-alternate-incremental Feb 24, 2017
@ash211 ash211 deleted the pod-watch branch February 24, 2017 21:10
ash211 pushed a commit that referenced this pull request Mar 8, 2017
* pod-watch progress around watch events

* Simplify return

* comments
foxish added a commit that referenced this pull request Jul 24, 2017
* pod-watch progress around watch events

* Simplify return

* comments
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
…spark-on-k8s#154)

* pod-watch progress around watch events

* Simplify return

* comments

(cherry picked from commit d81c084)
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
Mostly want to get the same metrics name with the change to camel case
`shuffleService` as that would be a functional break for any current
deployments.
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
…spark-on-k8s#154)

* pod-watch progress around watch events

* Simplify return

* comments