Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch to use state file for IP allocation pool management #2110

Merged
merged 7 commits into from
Oct 24, 2022
Merged

Conversation

M00nF1sh
Copy link
Contributor

@M00nF1sh M00nF1sh commented Oct 19, 2022

What type of PR is this?
feature

Which issue does this PR fix:
N/A

What does this PR do / Why do we need it:

  1. Switch to use state file instead of CRI socket when restoring the IP allocation pool upon CNI restart.
    We need it to remove the CRI socket dependency.
  2. Adds a soak test suit to keep spin up and teardown pods on a chosen node, which can be executed for given time or forever via ginkgo(--until-it-fails).
  3. Did a few improvements to our test framework, mainly:
    • switch to use the REAL k8s client instead of cache one. which is one of the main source of flakes.
      • we have a lot cases where we create a deployment and wait for it to become ready, and this often flakes due to the "cache" deployment was ready but the real one wasn't.
      • I don't see a need to use a cached client for testing purpose, uses real one simplifies our test logic and reduces flakes.
    • added context based wait utils into k8s resource generators, which allows external code control timeouts easily.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

  • Spin up/down pods on a given node for several hours, and monitor the state file is consistent with pods on node.

Automation added to e2e:

Added a new soak test suit that keep spin up and teardown pods and compare the state file with pods on nodes.
The suit can be executed via:

MAX_POD_PER_NODE=<max-pods-for-your-instance-type> ginkgo -v --until-it-fails -- --cluster-kubeconfig=/Users/yyyng/.kube/config --cluster-name=<clusterName> --aws-region=us-west-2 --aws-vpc-id=<vpc-id> --ng-name-label-key="kubernetes.io/os" --ng-name-label-val="linux"

Will this PR introduce any new dependencies?:

No new dependencies as the file is already written.

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
This will break upgrades when upgrade from CNI < 1.7.0, as no state file were wrotten.
I have tested upgrade from CNI 1.10.0, and it works fine

Does this change require updates to the CNI daemonset config files to work?:

Does this PR introduce any user-facing change?:

Yes, the CRI socket mount is no longer needed.

We have switched to use state file instead of CRI socket for IP allocation pool restore.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@M00nF1sh M00nF1sh requested a review from a team as a code owner October 19, 2022 21:06
@jayanthvn jayanthvn added this to the v1.12.0 milestone Oct 19, 2022
@M00nF1sh M00nF1sh force-pushed the state-file branch 3 times, most recently from 402a21f to 5a959aa Compare October 20, 2022 02:32
Copy link
Contributor

@jdn5126 jdn5126 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, just one performance concern and I will give others a chance to review

pkg/ipamd/datastore/data_store.go Show resolved Hide resolved
Copy link
Contributor

@jayanthvn jayanthvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just few nits. Can we also run the soak test on bottlerocket? Rest lgtm.

// Assume that no file == no containers are currently in use, e.g. a fresh reboot just cleared everything out.
// This is ok, and no-op.
if os.IsNotExist(err) {
ds.log.Debugf("backing store don't exists, assuming bootstrap on a new node")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't -> doesn't

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks :D fixed

@@ -1538,3 +1546,39 @@ func (ds *DataStore) CheckFreeableENIexists() bool {
}
return false
}

// NormalizeCheckpointDataByPodVethExistence will normalize checkpoint data by remove allocations that don't have a corresponding pod veth.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove -> removing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks :D fixed

jayanthvn
jayanthvn previously approved these changes Oct 21, 2022
Copy link
Contributor

@jayanthvn jayanthvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@M00nF1sh M00nF1sh merged commit 7eeb2a9 into master Oct 24, 2022
jdn5126 pushed a commit to jdn5126/amazon-vpc-cni-k8s that referenced this pull request Oct 25, 2022
* switch to use state file for IP allocation pool management

* add soak test suit for state file

* address comments

* add missing go.sum entries
@jdn5126 jdn5126 deleted the state-file branch December 12, 2022 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants