feat: support local service in multiple standard load balancer mode #4450
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: nilo19. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
Force-pushed from ef79b2e to 68fa2cd
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
Force-pushed from 68fa2cd to ccb8c6e
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
if strings.HasPrefix(strings.ToLower(pointer.StringDeref(bp.Name, "")), strings.ToLower(bpName)) {
	found = true
}
if len(*bp.LoadBalancerBackendAddresses) != expectedCount {
Do you mean all bps will have the same number of nodes?
Yeah, only one bp should be created.
_, err = cs.AppsV1().Deployments(ns.Name).Update(context.Background(), deployment, metav1.UpdateOptions{})
Expect(err).NotTo(HaveOccurred())
err = wait.PollUntilContextTimeout(context.Background(), 5*time.Second, 5*time.Minute, false, func(ctx context.Context) (bool, error) {
	if err := checkNodeCountInBackendPoolByServiceIPs(tc, clusterName, expectedBPName, ips, len(nodes)); err != nil {
so len(nodes) <= 5?
It can be any number. I have a wrong log in L192.
Force-pushed from ccb8c6e to 8513bdf
@@ -264,6 +264,11 @@ type Config struct {

	// DisableAPICallCache disables the cache for Azure API calls. It is for ARG support and not all resources will be disabled.
	DisableAPICallCache bool `json:"disableAPICallCache,omitempty" yaml:"disableAPICallCache,omitempty"`

	// RouteUpdateIntervalInSeconds is the interval for updating routes. Default is 30 seconds.
Thanks for making those intervals configurable.
@@ -247,10 +238,77 @@ func TestEnsureHostsInPoolNodeIP(t *testing.T) {
			},
		},
	},
	{
		desc: "local service",
Could you make the desc for those tests clearer? e.g. something like "local service should xxxx".
}

func (op *loadBalancerBackendPoolUpdateOperation) wait() batchOperationResult {
	return <-op.result
	return batchOperationResult{}
Is this still used somewhere? Seems like it should be deleted.
Yes, it is used in the route updater.
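For context, the channel-backed wait pattern under discussion looks roughly like the sketch below; the type and field names are illustrative stand-ins, not the PR's actual definitions.

```go
package localservicesketch

// batchOperationResult and backendPoolUpdateOp are illustrative stand-ins for the
// PR's types; only the shape of the pattern matters here.
type batchOperationResult struct {
	success bool
	err     error
}

type backendPoolUpdateOp struct {
	// result is written by the batch updater once the merged update completes.
	result chan batchOperationResult
}

// wait blocks until the batch updater reports the outcome of this operation.
func (op *backendPoolUpdateOp) wait() batchOperationResult {
	return <-op.result
}
```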
pkg/provider/azure_local_services.go (outdated)
"service name", op.serviceName, | ||
"load balancer name", op.loadBalancerName, | ||
"backend pool name", op.backendPoolName, | ||
"related IPs", strings.Join(op.nodeIPs, ",")) |
s/related IPs/node IPs/
pkg/provider/azure_local_services.go (outdated)
"service name", op.serviceName, | ||
"load balancer name", op.loadBalancerName, | ||
"backend pool name", op.backendPoolName, | ||
"related IPs", strings.Join(op.nodeIPs, ",")) |
node IPs
pkg/provider/azure_local_services.go (outdated)

// getLocalServiceEndpointsNodeNames gets the node names that host all endpoints of the local service.
func (az *Cloud) getLocalServiceEndpointsNodeNames(service *v1.Service) (sets.Set[string], error) {
	eps, err := az.KubeClient.DiscoveryV1().EndpointSlices(service.Namespace).List(context.Background(), metav1.ListOptions{})
Could we get this from a local watch cache instead of a list API?
Sure, will add a cache.
Added a cache, please check the latest commit.
BTW, there is a theoretical problem: if the EndpointSlice hasn't finished updating when we ensure hosts in the pool, there could be a discrepancy between the number of node IPs in the pool and the actual state, because the EndpointSlice update may be slower than the pod updates. After testing multiple times I haven't observed this behavior, but I think it could happen.
Yes, that's a valid issue for large-scale workloads. Could we also trigger the backend pool updates on EndpointSlice add/delete events?
Synced offline. There is no need to support backend pool updates for add/delete events. The latency issue will be fixed in a patch version of v1.28.
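As background for the getLocalServiceEndpointsNodeNames discussion in this thread, here is a minimal sketch of deriving the hosting node names from a service's EndpointSlices. The real function fetches the slices itself (via the list API, later a cache); here they are simply passed in, and the function name is illustrative.

```go
package localservicesketch

import (
	discoveryv1 "k8s.io/api/discovery/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// endpointNodeNames collects the names of the nodes that host the endpoints in the
// given EndpointSlices. Endpoints without a NodeName are skipped.
func endpointNodeNames(slices []*discoveryv1.EndpointSlice) sets.Set[string] {
	nodes := sets.New[string]()
	for _, es := range slices {
		for _, ep := range es.Endpoints {
			if ep.NodeName != nil && *ep.NodeName != "" {
				nodes.Insert(*ep.NodeName)
			}
		}
	}
	return nodes
}
```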
Force-pushed from 745c706 to 4e3536d
Force-pushed from 4e3536d to 9f9d85c
// setUpEndpointSlicesInformer creates an informer for EndpointSlices of local services.
// It watches the update events and sends backend pool update operations to the batch updater.
// TODO (niqi): the update of endpointslice may be slower than the update of endpoint pods. Need to fix this.
Added a TODO.
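For readers following the thread, a minimal sketch of the idea: an EndpointSlice informer whose update handler forwards work to a batch updater. This assumes a plain client-go shared informer and a simple channel standing in for the PR's batch updater; it is not the PR's actual wiring.

```go
package localservicesketch

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// setUpEndpointSlicesInformerSketch wires an informer that reacts to EndpointSlice
// updates and forwards a key identifying the owning service to a channel.
func setUpEndpointSlicesInformerSketch(client kubernetes.Interface, ops chan<- string) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Discovery().V1().EndpointSlices().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			newES, ok := newObj.(*discoveryv1.EndpointSlice)
			if !ok {
				return
			}
			// The owning service is recorded in the well-known label on the slice.
			svcName := newES.Labels[discoveryv1.LabelServiceName]
			if svcName == "" {
				return
			}
			// Enqueue an update keyed by namespace/service; the real batch updater
			// merges operations for the same backend pool before flushing them.
			ops <- fmt.Sprintf("%s/%s", newES.Namespace, svcName)
		},
	})
	return informer
}
```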
pkg/provider/azure_local_services.go (outdated)

var ep *discovery_v1.EndpointSlice
az.endpointSlicesCache.Range(func(key, value interface{}) bool {
	endpointSlice := value.(*discovery_v1.EndpointSlice)
	if strings.EqualFold(getServiceNameOfEndpointSlice(endpointSlice), service.Name) {
Need to check the service namespace also.
Thanks for catching this, please check the new commit.
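For illustration, a sketch of the lookup with the namespace check included. The sync.Map cache mirrors the endpointSlicesCache shown above, and matching on the standard kubernetes.io/service-name label stands in for the getServiceNameOfEndpointSlice helper; this is not the PR's actual code.

```go
package localservicesketch

import (
	"strings"
	"sync"

	v1 "k8s.io/api/core/v1"
	discoveryv1 "k8s.io/api/discovery/v1"
)

// findEndpointSliceForService ranges over a sync.Map of *discoveryv1.EndpointSlice
// values and returns the slice owned by the given service, matching both namespace
// and name as suggested in the review.
func findEndpointSliceForService(cache *sync.Map, service *v1.Service) *discoveryv1.EndpointSlice {
	var found *discoveryv1.EndpointSlice
	cache.Range(func(_, value interface{}) bool {
		es, ok := value.(*discoveryv1.EndpointSlice)
		if !ok {
			return true
		}
		if strings.EqualFold(es.Namespace, service.Namespace) &&
			strings.EqualFold(es.Labels[discoveryv1.LabelServiceName], service.Name) {
			found = es
			return false // stop iterating once a match is found
		}
		return true
	})
	return found
}
```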
/test pull-cloud-provider-azure-e2e-ccm-vmss-capz
} else {
	key := strings.ToLower(getServiceName(service))
	si, found := bi.getLocalServiceInfo(key)
	if found && !strings.EqualFold(si.lbName, lbName) {
What would happen if the cluster was originally using a single LB and a local service was created (so this map does not contain the service), and then the cluster is updated to multiple LBs (and the service may need to be moved to another LB)? Want to make sure it still works.
We reconcile all managed LBs' backend pools in the outer caller of ensureHostsInPool.
It's OK if the map doesn't contain the service here; we only block the case where the LB name in the map is incorrect.
az.endpointSlicesCache.Store(strings.ToLower(fmt.Sprintf("%s/%s", newES.Namespace, newES.Name)), newES)

key := strings.ToLower(fmt.Sprintf("%s/%s", newES.Namespace, svcName))
si, found := az.getLocalServiceInfo(key)
When a user migrates from a single LB to multiple LBs, a local svc may not be found, right? Will that cause any problem?
No, it won't; in this case the main loop will take over the process.
/lgtm
/test pull-cloud-provider-azure-e2e-ccm-vmssflex-capz
What type of PR is this?
/kind feature
What this PR does / why we need it:
In multiple SLB mode, each local service owns a backend pool named after the service. The backend pool is created in the service reconciliation loop when the service is created or is updated from externalTrafficPolicy=Cluster. It is deleted in the service reconciliation loop when: (1) the service is deleted; (2) the service is changed to eTP=Cluster; (3) the cluster is migrated from multi-SLB to single-SLB; or (4) the service is moved to another load balancer.
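As a rough illustration of the ownership rule described above, a sketch of deciding whether a service gets its own backend pool in multi-SLB mode. The exact naming and checks in the provider may differ; the "namespace-name" format and function name are assumptions for illustration only.

```go
package localservicesketch

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// dedicatedBackendPoolName returns the per-service backend pool name when the
// service is eTP=Local and the cluster runs in multiple-SLB mode; otherwise the
// service shares the cluster-wide pool.
func dedicatedBackendPoolName(svc *v1.Service, multiSLBEnabled bool) (string, bool) {
	if !multiSLBEnabled || svc.Spec.ExternalTrafficPolicy != v1.ServiceExternalTrafficPolicyTypeLocal {
		return "", false
	}
	return strings.ToLower(fmt.Sprintf("%s-%s", svc.Namespace, svc.Name)), true
}
```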
Besides the service reconciliation loop, this PR adds an EndpointSlice informer. It watches all EndpointSlices of local services, monitors update events, and updates the corresponding backend pools. Since local services may churn quickly, the informer sends backend pool update operations to a buffer queue. The queue merges operations targeting the same backend pool and applies them every 30s.
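A minimal sketch of the buffer-queue behavior described above: operations for the same backend pool are merged (the latest wins) and flushed on a fixed interval. All names and the flush callback are illustrative, not the PR's actual types.

```go
package localservicesketch

import (
	"sync"
	"time"
)

// poolUpdateBatcher merges pending backend pool updates keyed by pool name and
// flushes them periodically.
type poolUpdateBatcher struct {
	mu      sync.Mutex
	pending map[string][]string // backend pool name -> node IPs from the latest operation
	flush   func(pool string, nodeIPs []string)
}

func newPoolUpdateBatcher(flush func(string, []string)) *poolUpdateBatcher {
	return &poolUpdateBatcher{pending: make(map[string][]string), flush: flush}
}

// enqueue records an update; a later operation for the same pool replaces the earlier one.
func (b *poolUpdateBatcher) enqueue(pool string, nodeIPs []string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[pool] = nodeIPs
}

// run flushes the merged operations on every tick (30s in the PR description)
// until stop is closed.
func (b *poolUpdateBatcher) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			b.mu.Lock()
			batch := b.pending
			b.pending = make(map[string][]string)
			b.mu.Unlock()
			for pool, ips := range batch {
				b.flush(pool, ips)
			}
		}
	}
}
```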
Which issue(s) this PR fixes:
Fixes #
Related: #4013
Special notes for your reviewer:
Doc: #4451
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: