This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

lins05

@lins05 lins05 commented Jan 22, 2017

What changes were proposed in this pull request?

Support setting the driver pod launch timeout through "spark.kubernetes.driverLaunchTimeout", and increase the default value from 30s to 60s. The current value of 30s is rather short once you account for pulling the image from a public Docker registry plus the container/JVM start time.

How was this patch tested?

Manual test.
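As a rough sketch of what the new config does (the object and method names here are hypothetical, the lookup is simplified to a plain map, and the value is assumed to be an integer number of seconds rather than Spark's usual time-unit strings):

```scala
// Hypothetical sketch, not the PR's actual code: resolve the driver launch
// timeout from submitted Spark properties, falling back to a 60s default.
object LaunchTimeoutConfig {
  val DefaultLaunchTimeoutSeconds = 60

  def resolve(sparkConf: Map[String, String]): Int =
    sparkConf
      .get("spark.kubernetes.driverLaunchTimeout") // the config key this PR adds
      .map(_.toInt)                                // simplified: plain seconds
      .getOrElse(DefaultLaunchTimeoutSeconds)      // the new 60s default
}
```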

@foxish
Member

foxish commented Jan 24, 2017

I don't think driverLaunchTimeout adequately captures what the timeout really signifies. It is "local file upload timeout", and we should call it something like that IMO. WDYT @ash211 @mccheah

@mccheah

mccheah commented Jan 24, 2017

Driver launch timeout seems fine; I wouldn't want to make the config name too verbose. If we can come up with a name that's concise and captures the fact that we're waiting both to upload data and to start running the application, that's fine, but if anything we come up with is too verbose then this will suffice.

@foxish
Member

foxish commented Jan 24, 2017

I guess it isn't too bad a name. :) My concern was that, except when uploading local jars, we should support a "fire and forget" launch mode, which would obviate the need for this timeout and behave like the other schedulers in cluster mode. This is okay for now, though.

@mccheah

mccheah commented Jan 24, 2017

Actually, launch timeout is the correct term, since the future's completion is contingent both on the pod watch indicating that the pod is in "ready" status and on the application having been submitted.

@foxish
Member

foxish commented Jan 24, 2017

Sounds good.
Maybe the future should be contingent on the pod entering the Pending/Ready/Running state in case we're not submitting anything from the local machine?

@mccheah

mccheah commented Jan 24, 2017

Hm, how is this different from the current world? Do you mean having two futures and blocking on each in turn - block on the first future to get the pod into ready status and block on the second future for finishing the submission to the driver server?
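The "two futures, blocked on in turn" idea being discussed could look roughly like this (an assumed shape, not the actual Client implementation; the helper name and the use of a single shared timeout are illustrative):

```scala
// Sketch: wait first for the pod watch to report the driver pod ready,
// then for the application submission to complete, each bounded by the
// launch timeout.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.FiniteDuration

object DriverLaunchWaiter {
  def waitForLaunch(
      podReady: Future[Unit],        // completed by the pod watch
      submissionDone: Future[Unit],  // completed once the app is submitted
      timeout: FiniteDuration): Unit = {
    Await.result(podReady, timeout)       // throws TimeoutException if the pod never readies
    Await.result(submissionDone, timeout) // then wait for the submission itself
  }
}
```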

@erikerlandson
Member

fileStagingTimeout ?

@lins05 lins05 force-pushed the allow-driver-launch-timeout branch from 96c87cb to 351db7c on January 25, 2017
@lins05
Author

lins05 commented Jan 25, 2017

Maybe the future should be contingent on the pod entering the Pending/Ready/Running state in case we're not submitting anything from the local machine?

For me the practical problem is that it takes quite some time to pull the driver image from the public Docker registry. But it does sound like a good idea to output the detailed reason or phase of the pod when aborting the submit to the k8s cluster, e.g. "the pod is running but the submit failed with error xyz", or "the pod has been pending for too long". Such details would help the user understand the reason for the failure and help locate and fix the underlying problem.

What about merging this PR (maybe with a better name than spark.kubernetes.driverLaunchTimeout), and leaving the more detailed error messages described above for a future PR?
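The detailed abort messages suggested above could be derived from the pod's state at timeout; a hypothetical sketch (the helper name and exact wording are assumptions, though the phase strings follow the standard Kubernetes pod phases):

```scala
// Hypothetical sketch: build a human-readable launch-failure message from
// the pod's phase and any submission error observed at timeout.
object LaunchFailureMessage {
  def describe(podPhase: String, submitError: Option[String]): String =
    (podPhase, submitError) match {
      case ("Running", Some(err)) =>
        s"the pod is running but the submit failed with error $err"
      case ("Pending", _) =>
        "the pod has been pending for too long"
      case (phase, _) =>
        s"driver launch timed out while the pod was in phase $phase"
    }
}
```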

@foxish
Member

foxish commented Jan 25, 2017

LGTM. I'm ok with merging this for now. @mccheah, do you concur?

@@ -424,7 +426,7 @@ private[spark] object Client extends Logging {
   private val DRIVER_LAUNCHER_CONTAINER_NAME = "spark-kubernetes-driver-launcher"
   private val SECURE_RANDOM = new SecureRandom()
   private val SPARK_SUBMISSION_SECRET_BASE_DIR = "/var/run/secrets/spark-submission"
-  private val LAUNCH_TIMEOUT_SECONDS = 30
+  private val LAUNCH_TIMEOUT_SECONDS = 60

we should rename this to DEFAULT_LAUNCH_TIMEOUT_SECONDS now that it's configurable

Author


Makes sense. I've updated it.

@mccheah

mccheah commented Jan 25, 2017

Looks good to me

@mccheah mccheah merged commit c480574 into apache-spark-on-k8s:k8s-support-alternate-incremental Jan 25, 2017
@lins05 lins05 deleted the allow-driver-launch-timeout branch January 26, 2017 14:32
ash211 pushed a commit that referenced this pull request Feb 8, 2017
* Support setting the driver pod launching timeout.

And increase the default value from 30s to 60s. The current value of
30s is kind of short for pulling the image from public docker registry
plus the container/JVM start time.

* Use a better name for the default timeout.
ash211 pushed a commit that referenced this pull request Mar 8, 2017
foxish pushed a commit that referenced this pull request Jul 24, 2017
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
5 participants