Use readiness probe instead of client-side ping. #75
Conversation
Keep one ping() just as a sanity check, but otherwise set up the readiness probe to report the container as ready only when the ping endpoint can be reached. Also add a liveness probe for convenience and symmetry.
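For orientation, here is a minimal sketch of wiring such probes with the fabric8 builder API used by this client; the path and the `driverSubmitPort` value are placeholders, not the values this PR actually uses:

```scala
import io.fabric8.kubernetes.api.model.{HTTPGetActionBuilder, IntOrString}

// Placeholder path and port -- illustrative only.
val probePingHttpGet = new HTTPGetActionBuilder()
  .withScheme(if (driverSubmitSslOptions.enabled) "HTTPS" else "HTTP")
  .withPath("/v1/submissions/ping")
  .withPort(new IntOrString(driverSubmitPort))
  .build()

// Attached to the driver container spec, as in the diff below:
//   .withNewReadinessProbe().withHttpGet(probePingHttpGet).endReadinessProbe()
//   .withNewLivenessProbe().withHttpGet(probePingHttpGet).endLivenessProbe()
```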
Definitely prefer this over the client-side ping. Main question is around changing the watch to the service vs. the driver pod.
@@ -127,6 +121,11 @@ private[spark] class Client(
      .pods()
      .withLabels(driverKubernetesSelectors)
      .watch(podWatcher)) { _ =>
      val probePingHttpGet = new HTTPGetActionBuilder()
        .withScheme(if (driverSubmitSslOptions.enabled) "HTTPS" else "HTTP")
these need to be all-caps? looks kinda weird to me
Uppercasing is done within the library anyway.
It does need to be capitalized; the Kubernetes API will throw an error otherwise.
@@ -162,6 +161,8 @@ private[spark] class Client(
        .endEnv()
        .addToEnv(sslEnvs: _*)
        .withPorts(containerPorts.asJava)
        .withNewReadinessProbe().withHttpGet(probePingHttpGet).endReadinessProbe()
        .withNewLivenessProbe().withHttpGet(probePingHttpGet).endLivenessProbe()
do we need initialDelaySeconds / timeoutSeconds / periodSeconds set here?
https://kubernetes.io/docs/api-reference/v1/definitions/#_v1_probe
I'm also not sure that having both is necessary -- we only rely on the readiness probe in this PR?
The liveness probe could be useful to restart our server if it fails for some reason, but we would need an initialDelaySeconds, since otherwise it will fail early on, before the server starts listening.
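A sketch of what that might look like, continuing the container builder chain shown in the diff above; the timing values are illustrative, not what the PR settled on:

```scala
// Illustrative values only: give the submission server time to start
// listening before the first liveness check, then probe periodically.
.withNewLivenessProbe()
  .withHttpGet(probePingHttpGet)
  .withInitialDelaySeconds(10)
  .withPeriodSeconds(10)
  .withTimeoutSeconds(5)
  .endLivenessProbe()
```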
@@ -127,6 +121,11 @@ private[spark] class Client(
      .pods()
      .withLabels(driverKubernetesSelectors)
Currently our PodWatcher watches the driver pod to determine its liveness before uploading files to it. What we more directly depend on, though, is the service -- when the readiness probe succeeds and the pod gets added to the service, that seems like the more specific trigger.
Proposal: what if we change the DriverPodWatcher to watch for the service gaining a backing pod, instead of the driver pod being ready? That more directly monitors the condition we need before submitting the job.
We would need an extra layer of nesting to do that, since we want to delay the service creation for as long as possible: we only want to make the service after we make the pod, so we'd need a Watch for the pod being created to trigger creating the service, and another watch for the Service gaining the backing pod. It's doable, I'm just not sure how to write it in a way that makes the nesting easy to reason about.
It's worth discussing the ramifications of creating the service up front and perhaps having the node ports open for the duration of the launch - in the worst case, if the launch times out. If we created the service outside of the watch it would be easier to design this without two layers of nesting and futures to listen for, but there is the aforementioned tradeoff to consider.
In the meantime - @ash211 @foxish how about we merge this PR as is and follow up on that point moving forward?
I'm good with making this PR's change be to watch on the readiness probe of the driver pod, and potentially in a subsequent PR we change that to be a watch on the driver service directly if that makes more sense.
Filed #76 as a followup for discussion
I also agree to let this go in for now; perhaps we could table this fine-grained approach at the next meeting.
        driverSubmitSslOptions)
      val ping = Retry.retry(5, 5.seconds) {
        driverSubmitter.ping()
Comment that this ping is a final check of the service liveness before submitting the job (in addition to the k8s checks).
Might also be worth try/catching it and logging on failure that even though the k8s service is active, we are unable to connect to the driver's REST service from the submitter.
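A rough sketch of that extra logging around the existing ping, reusing Retry.retry and driverSubmitter.ping() from the diff above; the log message wording is illustrative:

```scala
import scala.concurrent.duration._
import scala.util.control.NonFatal

// Final sanity check that the submission server is reachable from the
// submitter, in addition to the readiness check Kubernetes already performed.
val ping = Retry.retry(5, 5.seconds) {
  try {
    driverSubmitter.ping()
  } catch {
    case NonFatal(e) =>
      logWarning("Kubernetes reports the driver service as active, but the" +
        " submission client could not reach the driver's REST endpoint.", e)
      throw e
  }
}
```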
(meant to click request changes)
@mccheah you've got merge conflicts
…te-incremental' into readiness-probe
I found that this patch fails integration tests as the Client reports that it can't reach the Minikube host at the NodePort. I had thought that using the readiness probe would mean we would not have to retry the submission, since when the Pod is marked as "ready", any Services that proxy to the Pod should be available immediately over their NodePorts. This isn't necessarily the case though, and watching the Service doesn't give any indication as to whether a Service is "ready" or not. Thus I can only conclude that we still need retry logic client-side here. @foxish do you have any thoughts?
…te-incremental' into readiness-probe
I was thinking more about this. The pod enters …
Hm, I still prefer trying to get a reactive approach to work. The main reason is that pinging has an inherent latency: if the first attempt doesn't work, there's a delay before the subsequent attempt, and if the components become ready in between the two attempts, we waste a bit of time. Of course we could reduce the time between each ping, but only up to a certain frequency; we wouldn't want to flood the proxy. We could make a … What about this:
An alternate version of this, which delays the endpoint creation only until after the pod is successfully launched, has the … @foxish, thoughts?
@mccheah I think the double watch with two CountDownLatches would work well, and could even be added as a single … Given how much the retry logic is costing in my testing, I'd rather do the work to make this fully event-driven with no polling.
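A rough sketch of the double watch with two CountDownLatches using the fabric8 Watcher API; the condition checks, names, and timeouts are illustrative rather than the code that eventually landed:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.{Endpoints, Pod}
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

// Released once the driver pod reports the Ready condition.
val podReadyLatch = new CountDownLatch(1)
// Released once the driver service's Endpoints gain at least one address.
val endpointsReadyLatch = new CountDownLatch(1)

val podReadyWatcher = new Watcher[Pod] {
  override def eventReceived(action: Action, pod: Pod): Unit = {
    val ready = pod.getStatus.getConditions.asScala
      .exists(c => c.getType == "Ready" && c.getStatus == "True")
    if (ready) podReadyLatch.countDown()
  }
  override def onClose(cause: KubernetesClientException): Unit = {}
}

val endpointsReadyWatcher = new Watcher[Endpoints] {
  override def eventReceived(action: Action, endpoints: Endpoints): Unit = {
    if (endpoints.getSubsets.asScala.exists(_.getAddresses.asScala.nonEmpty)) {
      endpointsReadyLatch.countDown()
    }
  }
  override def onClose(cause: KubernetesClientException): Unit = {}
}

// After registering both watches and creating the pod and service, block
// until both conditions have fired (timeouts illustrative).
require(podReadyLatch.await(60, TimeUnit.SECONDS), "Driver pod never became ready.")
require(endpointsReadyLatch.await(60, TimeUnit.SECONDS), "Driver service never gained endpoints.")
```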
Let's pick this up on Monday? I will perform a couple of experiments and see how we can shorten the time taken in the retry loop and fail faster. If the event-based approach is the only way we can achieve decent latency, then we can go with the two latches as Matt suggested.
I'm attempting the two-latch approach and unfortunately even if the Endpoint-creation event comes in to the EndpointsWatcher, indicating the endpoint is ready, the client still can't reach the driver pod through the service immediately after. If I set a breakpoint and wait a while, the endpoint becomes reachable. I'm not sure what event I should be looking for, or perhaps the Watch's listener is being triggered prematurely?
@mccheah when watching for the service to be ready, what method in the fabric8 API are you checking?
I'll post a patch with what I have so far and we can work from there.
…te-incremental' into readiness-probe
This works on my machine. This has the caveat of creating the service and the NodePort endpoint a little earlier than we actually need it, but in practice I found that we needed to create the service earlier on so that it would be available by the time we try to query it. Thus the pod launch time serves as a "buffer" of sorts to allow the service to become ready... but this seems severely brittle and I would like to investigate why the endpoint is being reported as ADDED when it's not open yet.
Utils.tryWithResource(kubernetesClient
  .endpoints()
  .withName(kubernetesAppId)
  .watch(endpointsReadyWatcher)) { _ =>
nit: Can we simplify the three calls to Utils.tryWithResource with a helper function?
@lins05 the latest patch rearranges the logic a little bit - how does it look now? I don't think extracting the three tryWithResource calls is the real problem, so much as having this method do both the Kubernetes component creation and the adjustment for owner references seems like a bit of an overload.
I think the idea of having a tryWithResource that takes multiple Closeable factories is intriguing - the tricky part would be having the closure accept a variable-length argument list, if that makes sense.
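One way to sidestep the variable-arity issue is a fixed-arity helper for the two-resource case; a sketch (not existing code in this repo), assuming both resources are java.io.Closeable, as fabric8's Watch is:

```scala
import java.io.Closeable

// Hypothetical helper: acquire two resources, run the body, then close them
// in reverse order; the outer finally still closes the first resource even
// if the body or the inner close throws.
def tryWithResources[A <: Closeable, B <: Closeable, T](
    createA: => A, createB: => B)(body: (A, B) => T): T = {
  val a = createA
  try {
    val b = createB
    try {
      body(a, b)
    } finally {
      b.close()
    }
  } finally {
    a.close()
  }
}
```

This just nests the single-resource pattern of Utils.tryWithResource once, avoiding a varargs closure entirely.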
override def eventReceived(action: Action, endpoints: Endpoints): Unit = {
  if ((action == Action.ADDED) || (action == Action.MODIFIED)
    && endpoints.getSubsets.asScala.nonEmpty
    && endpoints.getSubsets.asScala.exists(_.getAddresses.asScala.nonEmpty)
If the service endpoints have an address, it means the pod must have been ready. This makes the pod status watcher unnecessary. Should we remove it?
On second thought, we could still keep the pod watcher, such that we can tell the user the exact stage at which the submission fails (failed when creating the driver pod vs. failed after the driver pod has been created).
I like more granular logging in failures (makes debugging much easier) so agreed on keeping the pod watcher. It's not just for better logging though, I think Matt has also seen instances where a Minikube cluster would send events to Watches that showed the service as having ready endpoints even when the endpoints themselves weren't actually ready. (please correct if this telling is off)
My only remaining concern is about liveness vs readiness probes, and the liveness probe timeout being too short.
}
Utils.tryLogNonFatalError {
  kubernetesClient.pods().delete(driverPod)
}
the 6 lines ending here are almost the same as lines 140-146 -- worth extracting out to a method with service and pod parameters?
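A sketch of the kind of helper being suggested; the name is hypothetical, while Utils.tryLogNonFatalError is the existing Spark utility already used in the diff:

```scala
import io.fabric8.kubernetes.api.model.{Pod, Service}
import io.fabric8.kubernetes.client.KubernetesClient
import org.apache.spark.util.Utils

// Hypothetical helper: best-effort cleanup of the driver service and pod,
// logging rather than rethrowing any failure from either delete call.
private def deleteDriverResources(
    kubernetesClient: KubernetesClient,
    driverService: Service,
    driverPod: Pod): Unit = {
  Utils.tryLogNonFatalError {
    kubernetesClient.services().delete(driverService)
  }
  Utils.tryLogNonFatalError {
    kubernetesClient.pods().delete(driverPod)
  }
}
```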
  .endReadinessProbe()
  .withNewLivenessProbe()
  .withHttpGet(probePingHttpGet)
  .withInitialDelaySeconds(DRIVER_INITIAL_LIVELINESS_CHECK_DELAY_SECONDS)
what happens when readinessProbe and livenessProbe are both failing? does k8s wait for the pod to become ready, or immediately restart?
If this check delay is too short (e.g. a kubelet can't pull the docker image in the 10s window) then I'm worried the driver pod would never get spun up anywhere.
We might need to bump this value to the longest we can imagine a docker image pull taking, or otherwise make it configurable.
I'd probably prefer putting in just the readiness probe and no liveness probe until we get this bit figured out
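If it were made configurable, a sketch could look like the following; the config key is hypothetical and not something this PR introduces:

```scala
// Hypothetical configuration key -- shown only to illustrate making the
// delay tunable instead of hard-coding it; falls back to the constant.
private val livenessInitialDelaySeconds = sparkConf.getInt(
  "spark.kubernetes.driver.livenessProbe.initialDelaySeconds",
  DRIVER_INITIAL_LIVELINESS_CHECK_DELAY_SECONDS)
```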
@@ -67,4 +67,5 @@ package object constants {
  // Miscellaneous
  private[spark] val DRIVER_CONTAINER_NAME = "spark-kubernetes-driver"
  private[spark] val KUBERNETES_SUBMIT_SSL_NAMESPACE = "kubernetes.submit"
  private[spark] val DRIVER_INITIAL_LIVELINESS_CHECK_DELAY_SECONDS = 10
don't use this anymore?
@@ -33,6 +33,7 @@ private[spark] object HttpClientUtil {

  def createClient[T: ClassTag](
      uris: Array[String],
      maxRetriesPerServer: Int = 1,
does this param need to get passed through to the MultiServerFeignTarget constructor?
Done
@@ -33,6 +33,7 @@ private[spark] object HttpClientUtil {

  def createClient[T: ClassTag](
      uris: Array[String],
      maxRetriesPerServer: Int = 1,
      sslSocketFactory: SSLSocketFactory = SSLContext.getDefault.getSocketFactory,
      trustContext: X509TrustManager = null,
      readTimeoutMillis: Int = 20000,
I saw a cute trick recently for status checking we could use below. Instead of:
if (response.status() >= 200 && response.status() < 300) {
use
if (response.status() / 100 == 2) {
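The two forms are equivalent for HTTP status codes because integer division truncates; a quick check:

```scala
// Every 2xx code (200-299) divides to 2; everything else does not.
(100 to 599).foreach { status =>
  assert((status / 100 == 2) == (status >= 200 && status < 300))
}
```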
if (threadLocalCurrentAttempt.get < maxRetriesPerServer) {
  logWarning(s"Attempt $currentAttempt of $maxRetriesPerServer failed for" +
    s" server ${url()}. Retrying request...", e)
  Thread.sleep(delayBetweenRetriesMillis)
does the feign API expect Retryers to sleep in this method?
Yep
Or rather - everything in Feign is synchronous, so there would be no other realistic way to add the delay except here (unless of course we wanted the retry loop in the caller - but we're trying to abstract that away with this in the first place).
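For reference, a minimal sketch of a Feign Retryer that sleeps between attempts, assuming Feign's standard Retryer contract (continueOrPropagate returns to allow another attempt or rethrows to give up); this is not the PR's MultiServerFeignTarget, just the shape of the interface:

```scala
import feign.{RetryableException, Retryer}

// Minimal sketch: retry up to maxAttempts with a fixed delay between tries.
// Feign invokes continueOrPropagate after each failed attempt; returning
// normally means "retry", throwing means "give up".
class FixedDelayRetryer(maxAttempts: Int, delayMillis: Long) extends Retryer {
  private var attempt = 1

  override def continueOrPropagate(e: RetryableException): Unit = {
    if (attempt >= maxAttempts) {
      throw e
    }
    attempt += 1
    Thread.sleep(delayMillis)
  }

  // Feign clones the retryer per request, so attempt counts don't leak
  // across requests.
  override def clone(): Retryer = new FixedDelayRetryer(maxAttempts, delayMillis)
}
```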
LGTM -- will merge when build is green
Hold off a second -- debugging something.
Need to dedupe the list that we shuffle through
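A quick sketch of deduping plus shuffling the candidate URIs, assuming they arrive as a plain Array[String]:

```scala
import scala.util.Random

// Deduplicate first so a node listed twice isn't favored, then randomize
// the order so no single server is consistently tried first.
def shuffledUniqueUris(uris: Array[String]): Seq[String] =
  Random.shuffle(uris.distinct.toSeq)
```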
- need to dedupe the list
- need to shuffle the list better as my master got picked first (it's first alphabetically) which would be uncommon
- need to reduce initial connect timeout to more like 2sec from 20sec
We probably need #90 to select only the external IPs - can follow up separately there.
Much better! I think we should drop the retries down from 10 to 3 though. At the point where we're retrying, we've already gotten a Watch event back from k8s that things are all good to go (including service readiness), so I think 10 attempts is overkill and takes quite a while.
This is performing well in tests now -- failing over is working properly, and it happens reasonably quickly. Still need to do more proactive filtering of the candidate host list for failover, but I think we can do that in a followup PR.
* Use readiness probe instead of client-side ping. Keep one ping() just as a sanity check, but otherwise set up the readiness probe to report the container as ready only when the ping endpoint can be reached. Also add a liveliness probe for convenience and symmetry.
* Extract common HTTP get action
* Remove some code
* Add delay to liveliness check
* Fix merge conflicts.
* Fix more merge conflicts
* Fix more merge conflicts
* Revamp readiness check logic
* Add addresses ready condition to endpoints watch
* Rearrange the logic some more.
* Remove liveness probe, retry against servers
* Fix compiler error
* Fix another compiler error
* Delay between retries. Remove unintended test modification
* Fix another compiler error
* Extract method
* Address comments
* Deduplicate node addresses, use lower initial connect timeout
* Drop maxRetriesPerServer from 10 to 3
Keep one ping just as a sanity check, but otherwise set up the readiness probe to report the container as ready only when the ping endpoint can be reached.
Also add a liveness probe for convenience and symmetry.
Closes #72