Support setting the driver pod launching timeout. #36
Conversation
Driver launch timeout seems fine; I wouldn't want to make the config name too verbose. If we could come up with a config name that's concise and that captures the fact that we're waiting to upload data and start running the application, that's fine, but if anything we come up with is too verbose then this will suffice.
I guess it isn't too bad a name. :) My concern was that, except when uploading local jars, we should support a "fire and forget" type of launch mode, which would obviate the need for this timeout and behave like the other schedulers in cluster mode. This is okay for now, though.
Actually, launch timeout is the correct term, since the future's completion is contingent both on the pod watch indicating that the pod is in "ready" status and on the application having been submitted.
Sounds good.
Hm, how is this different from the current world? Do you mean having two futures and blocking on each in turn: block on the first future to get the pod into ready status, and block on the second future for finishing the submission to the driver server?
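For readers following the exchange above, here is a minimal sketch of the two-stage wait being discussed, assuming hypothetical names (DriverLaunchAwaiter, onPodReady, onApplicationSubmitted) rather than the actual Client internals: one stage waits for the pod watch to report the driver pod as ready, the other waits for the application to be submitted to the driver server.

import java.util.concurrent.{CountDownLatch, TimeUnit}

// Illustrative sketch only; the names and structure are assumptions, not the real Client code.
class DriverLaunchAwaiter(launchTimeoutSeconds: Long) {
  private val podReadyLatch = new CountDownLatch(1)
  private val submissionLatch = new CountDownLatch(1)

  // Invoked by the pod watch callback when the driver pod reports "ready" status.
  def onPodReady(): Unit = podReadyLatch.countDown()

  // Invoked once the application has been submitted to the driver server.
  def onApplicationSubmitted(): Unit = submissionLatch.countDown()

  // Blocks on each stage in turn and fails fast if either exceeds the timeout.
  def awaitLaunch(): Unit = {
    if (!podReadyLatch.await(launchTimeoutSeconds, TimeUnit.SECONDS)) {
      throw new IllegalStateException(
        s"Driver pod did not reach ready status within $launchTimeoutSeconds seconds.")
    }
    if (!submissionLatch.await(launchTimeoutSeconds, TimeUnit.SECONDS)) {
      throw new IllegalStateException(
        s"Application was not submitted within $launchTimeoutSeconds seconds.")
    }
  }
}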
And increase the default value from 30s to 60s. The current value of 30s is rather short for pulling the image from a public Docker registry plus the container/JVM start time.
Force-pushed from 96c87cb to 351db7c
For me the practical problem is that it takes quite some time to pull the driver image from the public Docker registry. But it does sound like a good idea to output the detailed reason or phase of the pod when aborting the submit to the k8s cluster, e.g. "the pod is running but the submit failed with error xyz", or "the pod has been pending for too long". Such details would help the user understand the reason for the failure and help locate/fix the underlying problem. What about merging this PR (maybe with another better name than
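As a rough illustration of the error-reporting idea suggested above (not part of this PR), a timeout error could carry the last pod phase observed by the watch; the object and method names below are assumptions made for the sketch.

object LaunchFailureMessages {
  // Illustrative only: surface the last observed pod phase in the timeout error so the user
  // can tell an image-pull delay ("Pending") apart from a failed submission ("Running").
  def launchTimeoutMessage(timeoutSeconds: Long, lastObservedPhase: Option[String]): String = {
    val phaseDetail = lastObservedPhase
      .map(phase => s"last observed driver pod phase was '$phase'")
      .getOrElse("the driver pod was never observed by the watch")
    s"Driver launch did not complete within $timeoutSeconds seconds; $phaseDetail."
  }
}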
LGTM. I'm ok with merging this for now. @mccheah, do you concur?
@@ -424,7 +426,7 @@ private[spark] object Client extends Logging {
   private val DRIVER_LAUNCHER_CONTAINER_NAME = "spark-kubernetes-driver-launcher"
   private val SECURE_RANDOM = new SecureRandom()
   private val SPARK_SUBMISSION_SECRET_BASE_DIR = "/var/run/secrets/spark-submission"
-  private val LAUNCH_TIMEOUT_SECONDS = 30
+  private val LAUNCH_TIMEOUT_SECONDS = 60
We should rename this to DEFAULT_LAUNCH_TIMEOUT_SECONDS now that it's configurable.
Makes sense. I've updated it.
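A small sketch of how the renamed default might back the new setting. The config key spark.kubernetes.driverLaunchTimeout is the one named in the PR description, but whether the patch reads the value exactly this way (via SparkConf.getTimeAsSeconds) is an assumption.

import org.apache.spark.SparkConf

object LaunchTimeoutConfig {
  // Renamed default, used only when spark.kubernetes.driverLaunchTimeout is unset.
  private val DEFAULT_LAUNCH_TIMEOUT_SECONDS = 60

  // Resolves the effective launch timeout in seconds from the submission-side SparkConf.
  def resolveLaunchTimeoutSeconds(sparkConf: SparkConf): Long =
    sparkConf.getTimeAsSeconds(
      "spark.kubernetes.driverLaunchTimeout",
      s"${DEFAULT_LAUNCH_TIMEOUT_SECONDS}s")
}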
Looks good to me
Merged c480574 into apache-spark-on-k8s:k8s-support-alternate-incremental
What changes were proposed in this pull request?
Support setting the driver pod launching timeout through "spark.kubernetes.driverLaunchTimeout", and increase the default value from 30s to 60s. The current value of 30s is rather short for pulling the image from a public Docker registry plus the container/JVM start time.

How was this patch tested?
Manual test.
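For reference, a user-side sketch of overriding the new default described above; the config key comes from the PR description, but the exact value format the patch accepts (bare seconds vs. a time string like "120s") is an assumption.

import org.apache.spark.SparkConf

object LaunchTimeoutExample {
  // Hypothetical usage: give a slow registry pull more headroom than the 60s default.
  val conf: SparkConf = new SparkConf()
    .set("spark.kubernetes.driverLaunchTimeout", "120")
}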