Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Insulate users from kubeadm API version changes #2769

Closed
randomvariable opened this issue Mar 24, 2020 · 38 comments
Closed

Insulate users from kubeadm API version changes #2769

randomvariable opened this issue Mar 24, 2020 · 38 comments
Assignees
Labels
area/dependency Issues or PRs related to dependency changes kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/release-blocking Issues or PRs that need to be closed before the next CAPI release lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone

Comments

@randomvariable
Copy link
Member

randomvariable commented Mar 24, 2020

⚠️ Cluster API maintainers can ask to turn an issue-proposal into a CAEP when necessary, this is to be expected for large changes that impact multiple components, breaking changes, or new large features.

Goals

  1. Improve kubeadm to support Cluster API
  2. Remove duplication of effort across kubeadm and Cluster API

Non-Goals/Future Work

  1. Fully replace kubeadm

User Story

As an operator, I want kubeadm to have better support for Cluster API's use cases to reduce the number of failed machines in my infrastructure.

Detailed Description

In a number of environments, machines can intermittently fail to bootstrap. The most common of these are control plane joins, which lead to temporary changes in etcd and API server availability, mediated by the speed of the underlying infrastructure and the particulars of infrastructure load balancers.

Some ugly hacks have been introduced, notably #2763 to retry kubeadm operations. As a long term solution, Cluster API should be a good kubeadm citizen and make changes to kubeadm to do the appropriate retries to cover the variety of infrastructure providers supported by Cluster API. In addition, the KCP controller re-implements some of the

Contract changes [optional]

Data model changes [optional]

  • Migrate to kubeadm v1beta2

[Describe contract changes between Cluster API controllers, if applicable.]

/kind proposal

@k8s-ci-robot k8s-ci-robot added the kind/proposal Issues or PRs related to proposals. label Mar 24, 2020
@ncdc ncdc added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 25, 2020
@ncdc ncdc added this to the Next milestone Mar 25, 2020
@ncdc ncdc added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 25, 2020
@ncdc
Copy link
Contributor

ncdc commented Mar 25, 2020

cc @neolit123 @fabriziopandini

@fabriziopandini
Copy link
Member

I would like to add:

  • Kubeadm operation should be idempotent -> so it is easier to re-execute in case of problems
  • Investigate usage of etcd learner mode
  • investigate changes to the join workflow in order to prevent problems when the API server gets available before the local etcd instance

Final thought.
Despite all the improvements we can add to kubeadm, a CLI cannot prove the same guarantees a reconciliation loop does. So, it is necessary that also ClusterAPI implements/improve the capability to detect failures in the CLIs and replace failed nodes

@neolit123
Copy link
Member

agreed to all of @fabriziopandini 's points.

kubeadm follows the philosophy of a CLI tool (like ssh, ftp, etc) and it cannot anticipate all of the infrastructure related failures. but having a sane / best-effort amount of retries in the CLI tool makes sense.

Support the effort to move kubeadm out-of-tree

hopefully scheduled for 1.19. depends a lot on sig-release and partly on sig-arch!

Make kubeadm retry operations based on data gathered from Cluster API users

this can be useful, no doubt. like i've mentioned today, interestingly we have not seen major complains about the failures CAPI is seeing. users are applying custom amount of timeout around their cluster creation on custom infrastructure (e.g. "i know what my GCE running cluster needs").

Consider implementing machine-readable for kubeadm to support #2554

@randomvariable can you expand on this point?
we have a tracking issue to support machine readable output. but not sure how does this relate to the failures. to my understanding one of the major issues we have in CAPI is that we cannot get signal if kubeadm join returned > 0.

Re-factor relevant parts of kubeadm into a library consumable by the bootstrap and kubeadm control plane controllers

there is a tracking issue for that as well. it will be a long process and the timeline is unclear.
after the move, we can start working on that but for a period of time the exposed library will be unstable.

@randomvariable
Copy link
Member Author

randomvariable commented Mar 25, 2020

For #2254 we will likely have some component on the machine call back to an infrastructure API notification service (or back to the management cluster) to provide information about the failure. Providing users with access to the log is one case, but providing a readable output, which may actually be something along the lines of expanding the range of error codes, could update a specific condition on the machine related to the exact kubeadm failure. A controller could then take appropriate remediative action. I agree this is long-term however.

@vincepri
Copy link
Member

/area dependency

@k8s-ci-robot k8s-ci-robot added the area/dependency Issues or PRs related to dependency changes label Mar 31, 2020
@vincepri vincepri removed the kind/proposal Issues or PRs related to proposals. label Apr 27, 2020
@vincepri
Copy link
Member

Removing this as a proposal, rather seems like a future cleanup

/kind cleanup

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Apr 27, 2020
@neolit123
Copy link
Member

neolit123 commented Jun 16, 2020

WRT:

Some ugly hacks have been introduced, notably #2763 to retry kubeadm operations.

in 1.19 kubeadm merged a number of fixes and backported them to 1.17, 1.18:

kubernetes/kubeadm#2091
kubernetes/kubeadm#2092
kubernetes/kubeadm#2093
kubernetes/kubeadm#2094

/assign @fabriziopandini
for evaluation of this part.

adding up-to-date comments to the rest of the tasks:

Support the effort to move kubeadm out-of-tree

[1] timeline is unclear, we are blocked on the lack of policy for component extractions out of k/k.
we have stakeholders such as sig-arch and sig-release who see this as low-prio.

Make kubeadm retry operations based on data gathered from Cluster API users

fixes above should conform this task.

Consider implementing machine-readable for kubeadm to support #2554

we did not merge any PRs in 1.19 for MRO as the contributor was busy with other tasks, but the boilerplate is in place.

Re-factor relevant parts of kubeadm into a library consumable by the bootstrap and kubeadm control plane controllers

this is very long term, potentially after [1]

Migrate to kubeadm v1beta2

v1beta1 is scheduled for removal in kubeadm 1.20 and my proposal would be to keep us on track for this effort.

@vincepri
Copy link
Member

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot modified the milestones: Next, v0.4.0 Jun 16, 2020
@fabriziopandini
Copy link
Member

@neolit123 thanks for the update!

@ncdc
Copy link
Contributor

ncdc commented Jul 10, 2020

xref my comment from #3323 (comment):

We should stop exposing the kubeadm v1betax types in our KubeadmConfig/KubeadmControlPlane specs, and instead use our own types. This would allow us to separate what users fill in from which kubeadm API version we end up using in our bootstrap data. As @detiber pointed out, we still have to know which version of the kubeadm types to use when generating our kubeadm yaml file and when interacting with the kubeadm-config ConfigMap.

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 12, 2021
@randomvariable
Copy link
Member Author

I know we don't have a label for it, but just for tracking

/area node-agent

@k8s-ci-robot
Copy link
Contributor

@randomvariable: The label(s) area/node-agent cannot be applied, because the repository doesn't have them

In response to this:

I know we don't have a label for it, but just for tracking

/area node-agent

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vincepri vincepri removed kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. labels Feb 8, 2021
@vincepri
Copy link
Member

vincepri commented Feb 8, 2021

/assign @fabriziopandini
/priority critical-urgent
/milestone v0.4.0

This task is going to be tackled as release blocker for CABPK first.

As part of it, we need a plan of action for v1alpha3 as well, where we translate the v1beta1 types currently used to v1beta2 or later when creating newer Kubernetes clusters.

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 8, 2021
@fabriziopandini
Copy link
Member

/lifecycle active
/kind api-change

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Mar 10, 2021
@CecileRobertMichon
Copy link
Contributor

/kind release-blocking

@k8s-ci-robot k8s-ci-robot added the kind/release-blocking Issues or PRs that need to be closed before the next CAPI release label Apr 5, 2021
@fabriziopandini
Copy link
Member

fabriziopandini commented May 24, 2021

@vincepri @CecileRobertMichon I think we can close this one

Instead IMO we should bump up the priority for following issues/PRs which are mandatory to get v1alpha4 to work with Kubernetes v1.22:

@sbueringer
Copy link
Member

@fabriziopandini @CecileRobertMichon @vincepri Will 1.22 support be mandatory for the CAPI v0.4.0 release and does bump up then mean release blocking?

(I don't really know when we wanted to release v0.4.0, Kubernetes 1.22 release seems to be mid-August according to https://github.com/justaugustus/enhancements/blob/2887453eac5cbc5fbd31112fd3d0be2be17b456c/keps/sig-release/2572-release-cadence/README.md)

@fabriziopandini
Copy link
Member

fabriziopandini commented May 24, 2021

Yes, with bump up I intended to make the issue listed above as release blocking.
The rational behind this is that IMO we should cut v1alpha4 only if it working with v1.22 - at the day we plan to do the cut (things can still change in v1.22 after the v0.4 cut date, but at least all problems known as of today are out of the table).

@CecileRobertMichon
Copy link
Contributor

+1 not blocking releasing v0.4 on the release timeline of k8s 1.22

@fabriziopandini, you are saying that we can cut v1alpha4 even if 1.22 is not out yet, given that we have fixed the known compatibility issues defined above, correct?

@fabriziopandini
Copy link
Member

yes, let's do our best to get ready for v1.22 within v0.4, but we should not wait for it (I edited my comment above to make it clear, I hope)

@vincepri
Copy link
Member

The rational behind this is that IMO we should cut v1alpha4 only if it working with v1.22

Just to clarify, if we're not ready for 1.22 we shouldn't block the release, but rather add the support in a patch release

@vincepri
Copy link
Member

Also, given that 1.22 isn't ready yet, we're relying on an alpha/beta/rc version which until release could still bring in additional required changes

@fabriziopandini
Copy link
Member

I think we are in agreement.
Let try to get ready as much as possible.

@fabriziopandini
Copy link
Member

What about closing this issue?

@vincepri
Copy link
Member

/close

@k8s-ci-robot
Copy link
Contributor

@vincepri: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/dependency Issues or PRs related to dependency changes kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/release-blocking Issues or PRs that need to be closed before the next CAPI release lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests