
kustomize re-orders config, causing failures when CRDs or admission webhooks are used #821

Closed
donbowman opened this issue Feb 25, 2019 · 34 comments

@donbowman
Contributor

If we look at cert-manager, we see that its output contains, in order:

  • CRDs
  • objects
  • webhooks
  • objects using webhooks + CRDs

Thus the order must be preserved.

So if I apply kustomize across their input and add

---
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  namespace: kube-system
  name: selfsigning-issuer
spec:
  selfSigned: {}

to the end, I get a failure:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterissuers.admission.certmanager.k8s.io": the server could not find the requested resource

The attached output file shows the problem: the selfsigning-issuer ClusterIssuer is emitted in the middle of the stream, before its webhook is defined.

cert-manager-to-be-applied.yaml.gz
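
Schematically - an abbreviated, illustrative sketch rather than the literal contents of the attachment (names other than the ClusterIssuer are illustrative) - the sorted stream that kubectl receives has this shape:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: clusterissuers.certmanager.k8s.io
# ... other CRDs ...
---
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer                      # emitted here, before the webhook below is defined
metadata:
  namespace: kube-system
  name: selfsigning-issuer
spec:
  selfSigned: {}
---
# ... Deployments, Services, etc. ...
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration     # the webhook that must exist (and be serving)
metadata:                                # before the ClusterIssuer above can be admitted
  name: cert-manager-webhook             # name illustrative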

@donbowman
Contributor Author

donbowman commented Feb 25, 2019

Possible solutions?

  • kubectl gets a 'sort' or 'order' concept
  • kubectl gets a 'begin/commit' phase like SQL
  • kustomize maintains input order
  • kubectl understands CRDs + webhooks, and orders CRD creation first, then webhook creation
  • kustomize understands ordering and emits CRDs first, then webhooks
  • kubectl gets a 'filter' argument, so I can run it multiple times, e.g. kubectl apply --filter=crd ; kubectl apply --filter=webhook first

Others?

@donbowman
Contributor Author

In the short term I have a workaround:
https://github.com/Agilicus/yaml_filter

kustomize build . | yaml_filter.py -i CustomResourceDefinition,ValidatingWebhookConfiguration |kubectl apply -f -
kustomize build . | yaml_filter.py -o CustomResourceDefinition,ValidatingWebhookConfiguration |kubectl apply -f -
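
(Assuming -i keeps only the listed kinds and -o excludes them, this is the same two-pass idea that plain kubectl supports once the manifests are split into separate files - the file names below are hypothetical:)

kubectl apply -f crds-and-webhooks.yaml   # CRDs and webhook configurations first
kubectl apply -f everything-else.yaml     # then the resources that depend on them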

@pwittrock
Contributor

kustomize already sorts things like namespaces for this reason. CRDs and Webhooks should also be sorted IMHO.

@pwittrock
Contributor

I think we just need to add MutatingWebhooks here:

"Namespace",

@donbowman
Contributor Author

This issue is not resolved by that commit; the commit doesn't even add the test case above.

@monopole
Contributor

monopole commented Mar 5, 2019

Re-opened.

kubernetes/enhancements#865 has been proposed as a general solution to ordering.

Open to other ideas.

@donbowman You mention that the PR that attempted to close this issue did not include the data you provided, and it did not.

But the test case you mention was a large tarball of resources - which can be a lot to digest.

The README suggests a good approach for reporting undesired output, but I realize it's buried in a bunch of commentary.

Basically, if you convert your data to a unit test in the pkg/target package (e.g. this config map generator test), then we all get multiple wins.

  1. The test unambiguously records the bad behavior currently exhibited by the code. The test must pass (so we can merge it), but it should be commented to show why the actual output is undesired, and what would be desired instead (e.g. resource Foo must be declared in the output before resource Bar). These tests are formulated to be very easy to build and read.

  2. When a purported fix comes along, it must modify this test, encoding the desired behavior, or the PR cannot be submitted. So proper modification of the test serves as proof that the bug is closeable.

  3. The test remains as part of the regression suite forever in either form - either the bad behavior stands as a proven thing, or the fixed behavior stands to protect itself.

Please consider converting your data to such a test. It's not that I don't understand the problem; it's that I want help with regression coverage.

@anguslees

anguslees commented Mar 5, 2019

FWIW, I suggest replacing #822 with something that orders webhook declarations last.

In a single invocation (and without more human-driven direction), we have to assume that the webhook cannot be called until after the Deployments/etc. backing it have started, and we also have to assume that other resources in the same invocation don't require the (mutating) webhook to act on them (that would be a bootstrapping cycle).

(Edit: or better yet for kustomize: preserve original document order)

mgoltzsche added a commit to mgoltzsche/kustomize that referenced this issue May 22, 2019
Fixes the cert-manager example of kubernetes-sigs#821.
@mgoltzsche
Contributor

mgoltzsche commented May 23, 2019

I changed the order so that ValidatingWebhookConfiguration now comes last.
This makes a cert-manager.yaml installation work with kustomize as well.

However, ordering problems may still appear in theory when using custom resources where one object has a hard dependency on another.
This could be avoided if the original input order were respected, as kubectl apply -f does. For this reason I agree with @anguslees that the original order should be preserved.

The question is: why was a kustomize-specific order introduced in the first place?

Edit: I also didn't touch the MutatingWebhookConfiguration order, since I had no example at hand to test, but I think it must come last as well.

@liggitt
Contributor

liggitt commented May 23, 2019

This could be avoided if the original input order were respected, as kubectl apply -f does. For this reason I agree with @anguslees that the original order should be preserved.

I strongly agree with this. Baking in dependency re-ordering requires understanding all types.

@brichins

What was the downside to leaving ordering up to the user, based on the order in which the manifests appear in the kustomization.yaml resources array? What motivated explicit ordering in gvk.go?

@mgoltzsche
Contributor

mgoltzsche commented May 28, 2019

I guess one of the reasons was to make it more declarative/deterministic (but I don't know - one of the original authors should answer this - @monopole?).

After thinking about it a little longer, I believe this ordering decision may not be as bad as I first thought, because even with kubectl apply -f you quickly hit the limits of what can be deployed with a single command: usually, when you want to apply a custom resource, the corresponding CRD, API service and deployment(s) need to be available first. Hence, if you don't want to apply everything with --validate=false, you need to apply your CRD, API service and deployment first and wait for them to become available before you apply resources that rely on them. This looks like a dependency/deployment-management requirement, and one could argue that it is a matter for higher-level tooling. Still, such a feature could be added to kubectl - especially since kubectl is aware of all resource types, while kustomize cannot be.

In the absence of such a feature in kubectl, kustomize forces the user to split their artifacts into smaller, separately applied packages or to write additional order configuration unnecessarily.
(However, I haven't yet seen a hard dependency between custom resources that would require breaking with the current kustomize behaviour, and the latter can always be worked around.)

Is there already tooling that supports the kind of deployment described above (waiting for an API service/deployment to become available before applying another yaml/package)? (I don't mean helm + shell scripts.)
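
A hedged sketch of that kind of sequencing with stock kubectl (no extra tooling; the file names and the resource names passed to kubectl wait and kubectl rollout status are hypothetical):

# 1. Apply the CRDs, webhook configuration and the controller/webhook workloads first.
kubectl apply -f crds.yaml -f webhook.yaml -f controller.yaml

# 2. Wait for the CRDs to be established and the webhook backend to become ready.
kubectl wait --for=condition=Established --timeout=60s crd/clusterissuers.certmanager.k8s.io
kubectl rollout status deployment/cert-manager-webhook -n cert-manager --timeout=120s

# 3. Only then apply the custom resources that depend on them.
kubectl apply -f issuers.yaml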

@jerrinss5

To carry forward what @brichins had suggested: can MutatingWebhookConfiguration be added to orderLast?
This is to support scenarios where a MutatingWebhookConfiguration is set with failurePolicy: Fail and would prevent Deployments from being created, because one of those Deployments is the endpoint the MutatingWebhookConfiguration refers to.

@liggitt
Contributor

liggitt commented May 30, 2019

This is to support scenarios where a MutatingWebhookConfiguration is set with failurePolicy: Fail and would prevent Deployments from being created, because one of those Deployments is the endpoint the MutatingWebhookConfiguration refers to.

That is not a good configuration, and creation order is not a reasonable way to mitigate it. See the namespaceSelector option in webhooks for the ability to exclude the namespace containing the webhook's deployment from depending on the webhook.
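
For illustration, a minimal sketch of that pattern (the names, the label, and the webhook Service here are hypothetical; the point is the namespaceSelector that skips the namespace running the webhook's own Deployment):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-mutating-webhook          # hypothetical
webhooks:
  - name: mutate.example.com              # hypothetical
    failurePolicy: Fail
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: example-webhook             # Service fronting the webhook Deployment
        namespace: webhook-system         # namespace running the webhook itself
        path: /mutate
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
    namespaceSelector:        # skip namespaces labeled webhooks.example.com/exclude=true;
      matchExpressions:       # label webhook-system that way so the webhook's own Deployment
        - key: webhooks.example.com/exclude   # can come up despite failurePolicy: Fail
          operator: NotIn
          values: ["true"]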

@jerrinss5

@liggitt thanks for the info.

monopole added a commit to monopole/kustomize that referenced this issue Jun 17, 2019
Modify all test coverage to compare actual build
output against a non-sorted version of expected
output.

Expected output is specified in the same order as
the inputs are specified in the kustomization
file, using a depth-first ordering when overlays
are involved.

Fixes kubernetes-sigs#756
Related kubernetes-sigs#821
monopole added a commit to monopole/kustomize that referenced this issue Jun 17, 2019
Modify all `build` tests to use the raw,
non-sorted output of build.  This makes every test
provide coverage for how kustomize re-orders (or
doesn't reorder) resources during processing.

Going forward, the ordering of resources in
_expected_ output should match the depth-first
ordering specified in the `resources:` field used
in the test's kustomization file.

The only exception to this rule would be tests
that actually confirmed some other output
ordering, e.g. the test of the
`LegacyOrderTransformer` plugin.

Fixes kubernetes-sigs#756
Related kubernetes-sigs#821
k8s-ci-robot added the lifecycle/stale label Jan 2, 2020
@invidian
Member

invidian commented Jan 2, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Jan 2, 2020

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Apr 1, 2020
@invidian
Member

invidian commented Apr 1, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Apr 1, 2020
@monopole
Contributor

monopole commented Apr 13, 2021

What, if anything, needs to be done here?

  • a user can define input order to match their desired output order (see the sketch after the background notes below) and enter:
    kustomize build --reorder none {target} | kubectl apply -f -
  • Regardless, ordering backend deployments before frontends in one
    apply isn't sufficient, or even necessary, to bring up a healthy system.

background

kustomize build has a --reorder flag that accepts only two options, legacy and none.

  • The default value is legacy.

    • This is a heuristic sort (e.g. emit Namespaces first, do WebHooks last).
    • It keys off kind, and is controlled by the Gvk.IsLessThan function.
    • This sort dates back to the early days when kustomize internals were
      map based, and didn't respect input order. To get deterministic output,
      a sort was added.
    • At some point (mid 2019?) kustomize changed to retain and respect input order.
      The sort just before output had to be retained as the default since
      people expected it.
    • Perhaps we can change this default via popular demand?
  • Under none, the output order then matches the input order - first in, first out (FIFO).

    • if resource R1 is specified before resource R2 in a kustomization file resources field,
      R1 will come out before R2. The kinds are irrelevant.

    • Objects from the resources field will come out before generated objects (e.g. from configMapGenerator).

    • The resources field can refer to other kustomizations, and these are recursively
      generated in the order specified by resources. So in the larger picture of a directed
      graph of overlay-bases, output order is a depth-first-traversal.

    • All of the tests rely on this FIFO (non-legacy) behavior.
      To prove this, pick a test, e.g. namereference_test,
      change the order in a resources field, run the tests, and see what happens.
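
To make the --reorder none option above concrete, here is a minimal sketch (file names are hypothetical) of a kustomization that relies on this FIFO behavior to emit CRDs and the webhook configuration before the custom resources that need them:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - crds.yaml          # emitted first
  - webhook.yaml       # then the webhook configuration
  - deployments.yaml
  - issuers.yaml       # custom resources come out last

# built with: kustomize build --reorder none . | kubectl apply -f -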

That said, kubernetes is built on the notion of controllers being told to
achieve some state, and then trying to get to that state in a non-centrally
controlled fashion. Controllers can be expected to be interrupted in mid-change
to go to some other state instead.

So, the API server, given an apply request with N resources, doesn't perform a serial
update - waiting for item 1 to finish updating, then kicking off 2, etc.

The cases where it appears to do so, e.g. namespace creation, are outliers.

@monopole
Contributor

More comments in #3794
