OCPBUGS-45924: Fix "live get after write" in static pod installer controller. #1929

Merged

benluddy
Contributor

The static pod installer controller builds an apply configuration for the status of the nodes it manages based on potentially stale state from an informer cache. The controller is written to assume that it has observed the effect of its own previous writes. If the cache is stale, this assumption can be violated, which results in unpredictable installation decisions.

A mechanism was recently introduced requiring the installer controller to wait for its lister to catch up to the latest version after performing a write. This mechanism did not work because it depended on keeping state across calls to the Sync method, and because Sync has a value receiver, field writes were not visible on subsequent calls.
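
To illustrate the receiver problem described above, here is a minimal, self-contained Go sketch. It is not the controller's actual code: the type names, the `lastWritten` field, and the plain integer standing in for a resource version are all hypothetical, chosen only to show why state assigned inside a method with a value receiver is lost between calls, while a pointer receiver keeps it.

```go
package main

import "fmt"

// valueController is a hypothetical controller whose Sync has a value receiver.
type valueController struct {
	// lastWritten is an illustrative stand-in for "the version produced by our last write".
	lastWritten int
}

// Sync runs on a copy of the controller, so the assignment to lastWritten
// is discarded when the method returns.
func (c valueController) Sync(cachedVersion int) bool {
	if cachedVersion < c.lastWritten {
		return false // cache has not caught up with our own write; skip this sync
	}
	c.lastWritten = cachedVersion + 1 // pretend a write produced the next version
	return true
}

// pointerController is the same hypothetical controller with a pointer receiver.
type pointerController struct {
	lastWritten int
}

// Sync mutates the controller itself, so lastWritten survives across calls.
func (c *pointerController) Sync(cachedVersion int) bool {
	if cachedVersion < c.lastWritten {
		return false // cache has not caught up with our own write; skip this sync
	}
	c.lastWritten = cachedVersion + 1
	return true
}

func main() {
	v := valueController{}
	p := &pointerController{}

	// Both controllers are handed the same stale cached version twice in a row.
	v1 := v.Sync(5)
	v2 := v.Sync(5)
	p1 := p.Sync(5)
	p2 := p.Sync(5)

	fmt.Println("value receiver:  ", v1, v2) // the second call acts on the stale cache
	fmt.Println("pointer receiver:", p1, p2) // the second call is skipped
}
```

In this sketch the value-receiver variant proceeds on both calls because it never remembers its own write, whereas the pointer-receiver variant declines the second call until the cached version reaches the one it wrote.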

@openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels on Jan 29, 2025
@openshift-ci-robot

@benluddy: This pull request references Jira Issue OCPBUGS-45924, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@benluddy
Contributor Author

/hold

Getting some E2E signal while working on the tests...

@openshift-ci bot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label on Jan 29, 2025
@benluddy force-pushed the installer-liveget-sync-receiver branch from 314ad28 to 0fc0059 on January 29, 2025 18:23
@benluddy
Contributor Author

/cc @dgrisonnet @tkashem

@openshift-ci openshift-ci bot requested a review from tkashem January 29, 2025 18:24
@tkashem
Contributor

tkashem commented Jan 29, 2025

/lgtm
/approve

@openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) and approved (Indicates a PR has been approved by an approver from all required OWNERS files.) labels on Jan 29, 2025
@benluddy
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) label on Jan 29, 2025
@openshift-ci-robot

@benluddy: This pull request references Jira Issue OCPBUGS-45924, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

@openshift-ci-robot removed the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) label on Jan 29, 2025
@openshift-ci openshift-ci bot requested a review from wangke19 January 29, 2025 18:46
Contributor

openshift-ci bot commented Jan 29, 2025

@benluddy: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Member

@dgrisonnet dgrisonnet left a comment

/lgtm

Contributor

openshift-ci bot commented Jan 29, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, dgrisonnet, tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@benluddy
Contributor Author

/hold cancel

I see this bug occur in practically every 4.19 E2E job, and I don't see it in the presubmits of openshift/cluster-etcd-operator#1392.

@openshift-ci bot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label on Jan 29, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit fe56c2c into openshift:master Jan 29, 2025
4 checks passed
@openshift-ci-robot

@benluddy: Jira Issue OCPBUGS-45924: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-45924 has not been moved to the MODIFIED state.

@benluddy
Contributor Author

/cherry-pick release-4.18

@openshift-cherrypick-robot

@benluddy: new pull request created: #1932
