-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add implementation details to sidecar kep #841
add implementation details to sidecar kep #841
Conversation
Hi @Joseph-Irving. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/sig node cc @dchen1107 @derekwaynecarr, let me know if you any further questions, parts you would like expanded etc. |
keps/sig-apps/sidecarcontainers.md
Outdated
@@ -67,7 +67,7 @@ Solve issues so that they don't require application modification: | |||
|
|||
## Non-Goals | |||
|
|||
Allowing multiple containers to run at once during the init phase. //TODO See if we can solve this problem with this proposal | |||
Allowing multiple containers to run at once during the init phase - this could be solved using the same principal but can be implemented separately. //TODO should this be a non-goal or an extension? is there enough support for this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not quite what you were asking: Do we have any use-cases or even hypotheticals that speak to the intersection of this design and init containers? E.g. Do init containers need service-mesh functionality? I would assume they do.
In other words, should sidecars start before init containers? And then, do we have sequencing within init containers? Concretely, consider something like Istio which has an init container with privilegs to write iptables rules, then a sidecar which processes net traffic. It seems to me that the sidecar would be needed for any user-provided init containers. I don't know if this has come up in real life yet.
How do we think that holds up in general?
@louiscryan @geeknoid @costinm for some perspective
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thockin So this was brought up by istio specifically, see #753 and https://github.com/istio/cni/issues/77. I think init containers needing service mesh access seems like a fairly valid use case. I think more generally the idea of an init container needing a sidecar isn't unheard of, e.g I have an init container that performs a db schema migration but it should go through mysql proxy.
I think you can also assume that some sidecars may need the init containers to run before they start such as the example of istio with init doing iptables and sidecar then starting for traffic.
You could potentially achieve this using what we've already suggested but adapting it for init e.g
initContainers:
- name: iptables
command: ['set up some iptables]
- name: proxy
command: ['proxy-stuff']
sidecar: true
- name: myinitapp
command: ['do some init stuff that requires the proxy']
The difference being it would follow order like init does already, but when it gets to the sidecar it could wait for it to be ready instead of complete, you then clean up the init sidecars at the end of the init phase.
This has the advantage that it keeps the UX relatively simple, the init phase is still logically separated, while hopefully enabling the use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the need for the functionality, but something feels wonky UX-wise about boot order jumping between the containers and initContainers sections.
Would it make sense to have a more explicit DAG of boot operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vllry, what do you mean by boot order jumping? Perhaps I wasn't that clear but I was suggesting we still have all the init containers finishing before the containers start. You could just have init sidecars that run alongside the init containers for that phase only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was that having a sidecar toggle for initContainers seems clunky in the current config (the idea overall idea makes sense).
EG I might have a boot order like this:
initContainers:
- name: 1
sidecar: true
command: ...
- name: 3
command: ...
containers:
- name: 2
sidecar: true
- name: 4
It definitely works, but is awkward to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was suggesting that Init container sidecars would start and finish in the init phase, container sidecars would start after the init phase is over, so there would be no way to have a boot order like you described. I agree it would be confusing if this was the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My vote would be to explore this just enough to be able to write a short section in the doc explaining how this could be extended to init containers in the future (at a similar level of implementation detail as the current section for implementing non-init sidecars).
We have clear, concrete use cases for non-init sidecars (Jobs, log/result shippers, sql proxy, Istio) and I don't think we should block addressing those. Non-init sidecar is already a pattern that's in use in the real world, and we just want to replace ugly workarounds (wrapper scripts, etc) with true support. Init sidecars are not in use because it's not possible even as a workaround.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think init containers should continue to be executed to completion strictly before sidecars are launched. The purpose of init containers is to perform actions that pertain to the Pod that need to happen prior to the main containers starting (e.g. setting up configuration files, preparing the file system, pulling data down). Launching run to completion containers subsequent to "main" containers might be problematic. Perhaps in the future we could add init-sidecars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good first pass to me. I added some suggestions for things to consider in the second pass.
/lgtm
keps/sig-apps/sidecarcontainers.md
Outdated
###### Shutdown triggering: | ||
Package `kuberuntime` | ||
|
||
Modify `kuberuntime_manager.go`, function `computePodActions`. Have a check in this function that will see if all the non-sidecars had permanently exited, if true: return all the running sidecars in `ContainersToKill`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we trigger shutdown of sidecars through ContainersToKill
, what exit status do they get? Do we send them SIGTERM and give them a chance to terminate gracefully? We need to verify that the overall Pod phase (Succeeded/Failed) will reflect the right semantics in the presence of sidecars, so that Job will work as intended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when you return ContainersToKill
it iterates over them and does KillContainer
on each container. This is the function that sends prestop, sigterms/kills etc, grace period is obeyed. So as long as the sidecar exits 0
it should count towards succeeded.
###### Sidecars terminated last | ||
Package `kuberuntime` | ||
|
||
Modify `kuberuntime_container.go`, function `killContainersWithSyncResult`. Break up the looping over containers so that it goes through killing the non-sidecars before terminating the sidecars. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inner loop iterates over kubecontainer.Container (as opposed to v1.Container
), which doesn't seem to have a way to tell whether the container is a sidecar. You might be able to cross-reference from the v1.Pod
that also gets passed in, but we need to be careful to validate that that argument contains what we think it does. For example, I saw some comments that seemed to imply pod
might be nil
sometimes; we need to know what those cases are, and whether we can handle them properly.
It might end up being necessary to add data to kubecontainer.Container
. I'm not sure if that represents a change to CRI? I'm also new to this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah I missed that. Looking around it seems to indicate that Pod may be nil, running pod must not be.
If the pod is dead we're not as bothered about term ordering, so cross referencing may work with a if pod != nil
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can verify that pod
is only ever nil
when the Pod
object on the API server has been force-deleted, then I agree it's reasonable to say, "We don't respect sidecar termination order if you force-delete the Pod." Have you found any evidence to say exactly when pod
could be nil
? For example, is it possible as a transient condition during a Node restart, before it fetches Pods from the API server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not confident from looking at the code that I could tell you all the scenarios where pod
could be nil
. If @feiskyer could weigh in it would be appreciated. If there are scenarios where it can be nil and we still care about ordering then we'll need to look at a different way to enable this behaviour. Changing kubecontainer.Container
is a possibility but it means changing the container runtime.
keps/sig-apps/sidecarcontainers.md
Outdated
###### PreStop hooks sent to Sidecars first | ||
Package `kuberuntime` | ||
|
||
Modify `kuberuntime_container.go`, function `killContainersWithSyncResult`. Loop over sidecars and execute `executePreStopHook` on them before moving on to terminating the containers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code groups pre-stop and stop in one function (killContainer
). Do you plan to break that up and fix all call sites?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite unsure about this part, killContainer does not appear in too many places so we could potentially change every call to split the pre-stop from the term, this definitely complicates the code a little bit as going forward everyone would have to be aware they need to call both pre-stop and kill. Alternatively you could execute prestop on the sidecars twice, once before the containers are terminated and again when the sidecars are terminated, this would of course rely on the assumption that pre-stop hooks on sidecars are idempotent.
keps/sig-apps/sidecarcontainers.md
Outdated
###### Sidecars started first | ||
Package `kuberuntime` | ||
|
||
Modify `kuberuntime_manager.go`, function `computePodActions`. If pods has sidecars it will return these first in `ContainersToStart`, until they are all ready it will not return the non-sidecars. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After we return the sidecars, is it guaranteed that we'll sync again at the right time to get a chance to check readiness and start the non-sidecars? For example, do we sync every time any container status (as in, status on the machine, not k8s API object .status) changes? It might be obvious to people familiar with this code, but I just want to make sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding, sync pods is frequently ran with the kubelet option of SyncFrequency
indicating the max gap between syncs, this defaults to 1minute
keps/sig-apps/sidecarcontainers.md
Outdated
@@ -67,7 +67,7 @@ Solve issues so that they don't require application modification: | |||
|
|||
## Non-Goals | |||
|
|||
Allowing multiple containers to run at once during the init phase. //TODO See if we can solve this problem with this proposal | |||
Allowing multiple containers to run at once during the init phase - this could be solved using the same principal but can be implemented separately. //TODO should this be a non-goal or an extension? is there enough support for this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My vote would be to explore this just enough to be able to write a short section in the doc explaining how this could be extended to init containers in the future (at a similar level of implementation detail as the current section for implementing non-init sidecars).
We have clear, concrete use cases for non-init sidecars (Jobs, log/result shippers, sql proxy, Istio) and I don't think we should block addressing those. Non-init sidecar is already a pattern that's in use in the real world, and we just want to replace ugly workarounds (wrapper scripts, etc) with true support. Init sidecars are not in use because it's not possible even as a workaround.
###### Sidecars terminated last | ||
Package `kuberuntime` | ||
|
||
Modify `kuberuntime_container.go`, function `killContainersWithSyncResult`. Break up the looping over containers so that it goes through killing the non-sidecars before terminating the sidecars. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can verify that pod
is only ever nil
when the Pod
object on the API server has been force-deleted, then I agree it's reasonable to say, "We don't respect sidecar termination order if you force-delete the Pod." Have you found any evidence to say exactly when pod
could be nil
? For example, is it possible as a transient condition during a Node restart, before it fetches Pods from the API server?
keps/sig-apps/sidecarcontainers.md
Outdated
###### Sidecars started first | ||
Package `kuberuntime` | ||
|
||
Modify `kuberuntime_manager.go`, function `computePodActions`. If pods has sidecars it will return these first in `ContainersToStart`, until they are all ready it will not return the non-sidecars. `computePodActions` is ran frequently, with the max time between runs being determined by the Kubelet's `SyncFrequency` option (default 1 minute). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minute is actually a large enough delay to be annoying. For example, if we always waited 1 minute after starting sidecars before realizing we can now start regular containers, we'd be adding that minute to the startup of any Pod using sidecars.
I would hope that there is also an event-based sync trigger in addition to the periodic one. Did you see any evidence of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that any changes to the Pod status should trigger a sync via an update channel. config/config.go should send a kubetypes.RECONCILE
when pod.Status
has changed. This is given to the kubelet's syncLoop
update channel which should trigger handler.HandlePodReconcile(u.Pods)
which will cause the pod to be synced. So when the sidecar becomes ready that should trigger a sync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this was not quite correct, it does indeed trigger handler.HandlePodReconcile(u.Pods)
but this only triggers a pod sync if NeedToReconcilePodReadiness
returns true, which currently only does anything if the pod is using readiness gates. Potentially we could put another check here looking to see if all sidecars are ready and non sidecars haven't been started, if so trigger a sync.
The kubelet doesn't currently have an update channel for readiness changes. only liveness so that isn't an option without making a fair amount of changes.
656198c
to
6186c93
Compare
/ok-to-test |
I have created a PoC for the sidecars and raised a WIP pr for people to view kubernetes/kubernetes#75099, very much a first draft but should hopefully help illustrate what these implementation details might look like. |
6186c93
to
ba6357e
Compare
ba6357e
to
29c22ba
Compare
Unless anyone has some big objections, it would be good to get this merged soon and revisit some of the unanswered questions in another PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enisoc, Joseph-Irving, kow3ns The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This adds some clarification around a few points and adds some initial implementation details.
I didn't want to go too specific with code example as I'm not sure how valuable that would be at this stage and instead just call out the places where I think we should implement the logic so we can discuss whether this seems like a sensible approach. Once some POCing has been done we can link to the POC branch in the implementation details.
/assign @enisoc @kow3ns