[e2e failure] [sig-instrumentation] Cluster level logging implemented by Stackdriver should ingest ... #57047
Copying latest findings here #56426 (comment):
@crassirostris It seems that both DaemonSets are deployed? If so, the DaemonSet controller is not at fault; it's more likely to be something with the addon manager, which should only deploy one fluentd DaemonSet. |
@janetkuo Agree, seems like a problem with addon manager /cc @mikedanese @roberthbailey @k8s-mirror-cluster-lifecycle-bugs |
At this point it's hard to tell which exact component is causing the issue. Probably worth solving the fundamental downgrade issue (#57013) first. |
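(For background on why suspicion falls on the addon manager rather than the DaemonSet controller: cluster addons are declared as manifests carrying an addonmanager.kubernetes.io/mode label, and the addon manager is the component that keeps Reconcile-mode addons in sync with what ships in the release. Below is a minimal sketch of what such a DaemonSet header looks like; the name, version suffix, and image are assumptions for illustration, not copied from the real manifest under cluster/addons/.)

```yaml
# Minimal sketch of an addon DaemonSet header; names and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-gcp-v2.0                        # version suffix changes between releases
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile  # addon manager owns and reconciles this object
spec:
  selector:
    matchLabels:
      k8s-app: fluentd-gcp
  template:
    metadata:
      labels:
        k8s-app: fluentd-gcp
    spec:
      containers:
      - name: fluentd-gcp
        image: gcr.io/google-containers/fluentd-gcp:2.0   # placeholder image for the sketch
```

If the addon manager owns the fluentd DaemonSet, two of them coexisting after a downgrade suggests the addon sync did not complete, rather than a DaemonSet controller bug.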
Digging in a bit ... Looks like the test installs a Log Provider and fails here as it finds 2 agents instead of just the one that it installs. Probably because, I am guessing, fluentd was switched on in this environment? This seems to have happened around 11/30; does that ring any bells for anyone? |
@dims The test doesn't install anything; there's a valid error in addon manager, which results in two addon DaemonSets running after the downgrade. Fluentd has been there forever, as a DaemonSet for 3-4 releases already, and nothing has changed in it except for the version, which should be handled adequately by addon manager, but it's not. |
Ack @crassirostris thanks! |
Another clue, a better one I hope: grepping through the UUID(s) for the fluentd containers - http://paste.openstack.org/raw/628792/ - I spotted the following
and, trying to see why it failed, saw the following
These logs are from: |
/sig network |
This log line usually shows up when kubelet restarts. It should recover after a while. |
Can someone from sig-node investigate why kubelet was restarted and if it's normal? @kubernetes/sig-node-bugs |
[MILESTONENOTIFIER] Milestone Issue Needs Attention @crassirostris @spiffxp @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-instrumentation-misc @kubernetes/sig-network-misc @kubernetes/sig-node-misc Action required: During code freeze, issues in the milestone should be in progress. |
@dchen1107 will take a look at kubelet restarts to see if that's normal |
I think I got the root cause. Starting from addon-manager's log of the current run (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-master-gci-new-downgrade-cluster/563):
So clearly something changed for ClusterRoleBinding "kubelet-cluster-admin" between 1.8 and 1.9; upon the downgrade, the addon manager tried to apply the change to an immutable field and failed. Due to some implementation details in @cjcullen might know about the ClusterRoleBinding "kubelet-cluster-admin" change? |
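To make the failure mode concrete, here is a hedged reconstruction of the conflicting object (the 1.8 role and the kubelet subject are taken from the discussion further down; the 1.9 roleRef is not spelled out in this thread, so it is only referenced in a comment): both releases ship a ClusterRoleBinding with the same name but a different roleRef, and roleRef is immutable, so the 1.8 addon manager's apply against the 1.9-created object is rejected.

```yaml
# Hedged reconstruction; not copied from cluster/addons/rbac.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-cluster-admin
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node   # what the 1.8 manifest binds to; 1.9 shipped a different
                      # roleRef under the same binding name
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: kubelet
```

Since an apply cannot change roleRef in place, the addon sync fails on this manifest, which would explain the partially reconciled addons (such as the duplicate fluentd DaemonSets) observed after the downgrade.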
cc @mikedanese who last touched these files: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/rbac |
Seems like ClusterRoleBinding kubelet-cluster-admin does change a bit in the legacy-kubelet-user-disable case: |
It seems related to #53144.
In this case, we create a new cluster, which gets the new binding. Then we try to downgrade, which ends up attempting to switch back to the old binding. I wonder if we can cherry-pick something to 1.8 to make it keep the new binding upon downgrade? |
cc @liggitt and @tallclair who reviewed the above PR. |
From the addon manager's perspective, one way to achieve this is by changing the Reconcile mode to EnsureExists mode. |
1.8 created the kubelet-cluster-admin binding to the system:node role in Reconcile mode. If we want 1.9 to tolerate that and not remove it, we should leave the kubelet-cluster-admin binding to the system:node role with no subjects and EnsureExists mode. Separately, in 1.9 we can create a new kubelet-bootstrapper binding to system:node-bootstrapper for the kubelet subject with Reconcile mode. |
(Which is what it looks like https://github.com/kubernetes/kubernetes/pull/53144/files#diff-b701a165afa6442d4c7d06bf087d88fd did… the later change of what role it bound to was the breaking change) |
962e1e2 should have left the existing binding alone and created a new binding for the new role |
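Translating the plan above into manifest form, a rough sketch (the mode labels and the split into two bindings follow the comments above; the field layout and exact subject spelling are assumptions, and the authoritative change is in the actual fix PR):

```yaml
# Legacy binding: keep the 1.8 name and role, but stop reconciling it, so an
# existing 1.8-created object is tolerated and its immutable roleRef is never
# touched. No subjects, so a fresh cluster grants nothing through it.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-cluster-admin
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
---
# New binding under a new name, so Reconcile mode never needs to mutate an
# existing object's roleRef.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-bootstrapper
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-bootstrapper
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: kubelet                 # "the kubelet subject" per the comment above
```

Because each binding name only ever maps to one roleRef across releases, neither the upgrade nor the downgrade path has to modify an immutable field.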
@liggitt Just to make sure we're on the same page: We didn't see any problem in 1.8->1.9 upgrade tests. This only shows up in the downgrade test where we start with a fresh 1.9 cluster and downgrade to 1.8. It sounds like your plan will help with both directions, but just want to be clear. Can you work on a PR or do I need to find someone? |
Nice work @MrHohn. So we don't really capture the addon manager logs, do we? Looks like you had to pick them up from the running system, is that right? |
@dims Yeah, unfortunately addon manager logs on GKE CIs are not publicly visible. There may be ways to retrieve the same logs for historical runs; getting them from a running cluster just seemed easiest to me. |
Quickly looked at Kubelet, and the "netconfig is not ready" error message should be a red herring. Right after that, all nodes are ready, with the following logging:
Will look more. |
Ahh, after refreshing the issue I found that @MrHohn already identified the root cause. Nice work and thanks! |
The add-on manager would still have hit issues trying to reapply the changed rolebinding, but the 1.8 binding granted a superset of permissions, so we likely didn't notice.
Correct
opened #57172 |
Thanks @dchen1107, at least my investigation triggered a deeper look, which seems to have helped the cause. Yay! |
Automatic merge from submit-queue (batch tested with PRs 57172, 55382, 56147, 56146, 56158). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md. gce: split legacy kubelet node role binding and bootstrapper role binding. Fixes issue upgrading 1.8->1.9 or downgrading 1.9->1.8. Fixes #57047 ```release-note NONE ```
Reopening until we confirm the fix worked: https://k8s-testgrid.appspot.com/sig-release-1.9-all#gke-1.9-1.8-downgrade-parallel&width=80 |
/priority critical-urgent
/priority failing-test
/kind bug
/status approved-for-milestone
/area platform/gke
@kubernetes/sig-instrumentation-test-failures owns the test
This test has been failing since at least 2017-11-30 for the following job:
This job is on the sig-release-master-upgrade dashboard, and prevents us from cutting v1.9.0 (kubernetes/sig-release#40). Is there work ongoing to bring this test back to green?
/assign @crassirostris @janetkuo
Pulling this out of #56426 (comment) into its own issue. It might be GKE specific? But the GCE downgrade jobs are failing right now, so I don't have enough data to say for sure