
Increase reload performances #547

Closed
3 tasks done
prometherion opened this issue Apr 2, 2020 · 14 comments

@prometherion
Contributor

prometherion commented Apr 2, 2020

This issue aims to track the improvements required to decrease the reload time of the HAProxy instance when managing thousands of Ingress resources, as mentioned in #541 when dealing with ~17k items.

The overall time required to perform a configuration reload is divided into the following spans:

  • parse_ingress
  • parse_tcp_svc
  • write_tmpl
  • reload_haproxy

All these actions, except reload_haproxy, are handled by the Ingress Controller and can be optimized further through refactoring, enhancements, or new functionality.

Based on this breakdown, we can plan and discuss the improvements required to tackle the situation at different levels.

  • Work on the parse_ingress section using Listers instead of cache.Store to parse ingresses (addressed by Converting to cache.Listers #545)
  • Speed up the write_tmpl section
  • TBC: parser evolution towards dynamic in-memory updates rather than creating a new configuration and comparing it

@jcmoraisjr can you share further details regarding your proof of concept for the dynamic in-memory update?

If you don't mind, I would like to try optimizing the configuration templating further, maybe switching to a third-party package like quicktemplate that looks blazing fast compared to the template engine provided by the standard library.
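For what it's worth, here is a minimal sketch of how the write_tmpl cost could be benchmarked before deciding on an engine switch, using only the standard library; the template body and model below are hypothetical placeholders, not the real haproxy.tmpl:

package tmpl

// Save as e.g. write_tmpl_bench_test.go and run with: go test -bench=.

import (
    "io"
    "testing"
    "text/template"
)

// Hypothetical stand-ins for the real model and template.
type endpoint struct {
    Name   string
    IP     string
    Port   int
    Weight int
}

type backend struct {
    Name      string
    Endpoints []endpoint
}

var backendTmpl = template.Must(template.New("backend").Parse(
    "backend {{.Name}}\n{{range .Endpoints}}    server {{.Name}} {{.IP}}:{{.Port}} weight {{.Weight}}\n{{end}}"))

func BenchmarkWriteTmpl(b *testing.B) {
    backends := make([]backend, 1000)
    for i := range backends {
        backends[i] = backend{
            Name:      "default_app_8080",
            Endpoints: []endpoint{{Name: "srv001", IP: "10.0.0.1", Port: 8080, Weight: 100}},
        }
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for _, be := range backends {
            // io.Discard isolates template execution time from disk I/O
            if err := backendTmpl.Execute(io.Discard, be); err != nil {
                b.Fatal(err)
            }
        }
    }
}

Running the same benchmark against a quicktemplate-generated writer would then give an apples-to-apples number for the ~17k-item case.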

@jcmoraisjr
Owner

can you share further details regarding your proof of concept for the dynamic in-memory update?

The controller currently reads and parses every single ingress, adding them to a clean model. The first time this happens, haproxy is started using that model and the current model is copied to the old one. Once an update happens, a new clean model is created, all the current ingress are added, and the new current model is compared to the old one in order to identify whether a dynamic update can be done.

This works very well on clusters that have up to about 5k ingress and 5k services but apparently doesn't scale much beyond that. The proposal here is to use a single model (or use the old one as a starting point) where ingress isn't just added, but also updated and deleted. I think removing a whole domain and re-including only changed ingress is doable, and should run magnitudes faster than the current implementation on clusters where the number of domains is high and the number of paths per domain is not. I'll start this approach and let's see how it scales. New ideas usually arise after starting to code.
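To make the compare step concrete, here is a very rough sketch with illustrative types only (the real model tracks hosts, backends, paths, annotations, and much more):

package reload

// model is an illustrative placeholder for the parsed configuration.
type model struct {
    backends map[string][]string // backend name -> endpoint addresses
}

// differsOnlyDynamically reports whether the new state could be applied via
// the haproxy runtime API (only endpoint lists changed) instead of writing
// the config and reloading. The real comparison is far more detailed.
func (m *model) differsOnlyDynamically(old *model) bool {
    if old == nil || len(m.backends) != len(old.backends) {
        return false
    }
    for name := range m.backends {
        if _, found := old.backends[name]; !found {
            return false
        }
    }
    return true
}

Today the whole model is rebuilt from scratch before a comparison like this runs; the proposal above keeps a single model and mutates it in place, so the rebuild and compare would only happen for the touched hostnames.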

If you don't mind

No I don't, be my guest! Btw, I do follow some of valyala's components out there; in general they have pretty good performance.

@jcmoraisjr
Owner

Template time. My guess is that we can split the haproxy config into a couple of files and render+save only the files whose configuration has changed. More than 95% of the config on big clusters is backend declarations, and they can be split into a few tens of files depending on the number of services.
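As an illustration of that idea, a rough sketch of writing a rendered chunk only when its content changed; file layout and names are hypothetical, not the current controller code:

package rewrite

import (
    "bytes"
    "crypto/sha256"
    "os"
)

// lastHash keeps the digest of the content most recently written per file.
var lastHash = map[string][32]byte{}

// writeIfChanged saves a rendered config chunk only when it differs from the
// previous write, so splitting backends across a few tens of files means a
// reload only re-renders and rewrites the files that actually changed.
func writeIfChanged(filename string, rendered *bytes.Buffer) (changed bool, err error) {
    sum := sha256.Sum256(rendered.Bytes())
    if prev, found := lastHash[filename]; found && prev == sum {
        return false, nil // identical content, skip the write
    }
    if err := os.WriteFile(filename, rendered.Bytes(), 0644); err != nil {
        return false, err
    }
    lastHash[filename] = sum
    return true, nil
}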

@jcmoraisjr
Owner

Dynamic model helps to reduce CPU usage and processing time on clusters with 2k+ ingress and 2k+ services. Clusters with 10k+ ingress and 10k+ services with high cardinality on hostnames should see a big improvement on processing time of ingress objects when using dynamic model.

Currently haproxy ingress parses all ingress objects on top of a clean model. The proposal is to copy the previous state and update the model state based on the changes reported by the k8s client (a rough sketch follows the list below). This is not as simple as editing an attribute: there is a conflict analyzer and precedence of annotations in place, and some options might change a configuration value in more than one place. Fortunately everything happens below the hostname, with a few exceptions detailed below.


  1. Update or remove Ingress or Services
  • list all related hostnames
    • from ingress itself (when ingress)
    • whenever some of its paths reference the service (when service)
    • whenever some of its paths reference some of the services of all listed hostnames (when ingress or service)
  • remove hostnames and backends based on hostnames and services list
  • recreate the hostnames
    • current implementation
    • iterate only the listed hostnames
    • parse only added hostnames and backends

Annotations and their conflicts need to be reanalyzed from a clean configuration; the scope of such configuration is hostname, path, and service.


  2. Update or remove Secrets
  • create a new entity to store secret data
  • find and update its entity
  • follow the ingress/service route if an annotation references the secret

Secret data does not impact other configurations except if an annotation references it.


  3. Update or remove endpoints
  • remove and reinclude the backend endpoints

Endpoints impact only its backend.


  4. Update or remove configmaps
  • compare old and new configmap data
  • need a full parsing if changing frontend or backend scoped config keys
  • need to update a single attribute if changing a global scoped config key - this needs a double check

Frontend and backend config keys are used as default values for all ingress and services, so all of them should be parsed again. Globals are used to update global haproxy configs and do not (should not) impact other configs (double check).


  5. Update or remove pods
  • only if drain support is true: find endpoints that represent that pod and follow the update/remove endpoints route

The pod state is used only in the endpoint list building, and only when drain support is true.


  6. Update or remove node
  • does not have an event trigger
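A rough sketch of the ingress/service route in item 1, using illustrative placeholders rather than the real converter types:

package converter

import (
    extensions "k8s.io/api/extensions/v1beta1"
)

// model is an illustrative placeholder; the real one tracks hosts, paths,
// backends and annotation scopes.
type model struct {
    hosts map[string]bool
}

type converter struct {
    haproxy *model
}

// syncIngress and ingressTouching stand in for the real parsing and lookup.
func (c *converter) syncIngress(ing *extensions.Ingress) {}
func (c *converter) ingressTouching(hosts map[string]bool) []*extensions.Ingress { return nil }

// partialSync re-parses only the hostnames touched by changed ingress
// objects instead of rebuilding the whole model from a clean state.
func (c *converter) partialSync(changed []*extensions.Ingress) {
    dirty := map[string]bool{}
    for _, ing := range changed {
        for _, rule := range ing.Spec.Rules {
            dirty[rule.Host] = true
        }
    }
    // cascade delete: drop the touched hostnames (and, in the real model,
    // the backends only they reference)
    for hostname := range dirty {
        delete(c.haproxy.hosts, hostname)
    }
    // re-parse only the ingress objects that reference a dirty hostname;
    // annotation conflicts are re-analyzed from a clean per-host config
    for _, ing := range c.ingressTouching(dirty) {
        c.syncIngress(ing)
    }
}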

@prometherion
Contributor Author

  5. Update or remove pods

Honestly, I'm missing this point: why should we rely on Pod rather than Endpoint resources? We're already listening for Endpoint events and AFAIK Kubernetes reflects all Pod changes into the endpoint list as well, according to their status (Terminating, Healthy, Unhealthy, and so on).

  6. Update or remove node

I guess the current lister can be removed to reduce memory consumption on large clusters running more than 1k nodes: maybe we should raise an issue and track it.


I'm a bit concerned about the design of the reconciliation loop, though:

il, err := hc.storeLister.Ingress.Lister.List(labels.Everything())

func (c *converter) Sync(ingress []*extensions.Ingress) {
    for _, ing := range ingress {
        c.syncIngress(ing)
    }
    c.syncAnnotations()
}

tl;dr: we're not processing just the single enqueued entry but iterating over all the ingresses to render them again. @jcmoraisjr, is your in-memory processing feature going to process items this way? We also have to keep in mind that serialization of objects is time and resource consuming, so we should avoid it as much as possible, also from a memory management perspective.

@jcmoraisjr
Owner

why should we rely on Pod rather than Endpoint resources

Because of the way drain support works. Terminating pods are also added to the backend slots with weight zero so they continue to receive requests from sticky sessions. This could however be changed to a conditional configuration, changed to query the api, etc.

I guess the current lister can be removed (node)

Probably true, my guess is that it should be used in the status updater but it isn't. These (current listers and status updater) are part of the old controller inheritance that I haven't had a chance to look at yet.

we're not processing just the single enqueued entry but iterating over all the ingresses to render them again

Yes, that's exactly how it works now. This is a pretty simple design and works very well on small, not-so-small, and also not-so-soooo-big clusters. We do this step in about 1s with ~5k ingress and ~3k services.

is your in-memory processing feature going to process items this way?

No. The proposed redesign will continue to use such iteration, but filtering out hostnames that weren't touched via ingress or service objects. Eg if you change or remove an ingress or service object that touches foo.local, the controller will perform a cascade delete of that domain and parse it again. The internal design of this new approach won't be that simple anymore, but it should run a lot faster compared with the current approach - provided that you have lots of hostnames and few paths/services per hostname.

@prometherion
Contributor Author

prometherion commented Apr 10, 2020

Because of the way drain support works

That's incorrect. I guess the Ingress Controller just needs to behave as the user expects and needs: let's dig into it.

In the past, Kubernetes provided the Service annotation service.alpha.kubernetes.io/tolerate-unready-endpoints. This annotation was used by the Endpoint controller and has since been translated into Service.spec.publishNotReadyAddresses with a default value of false.

When set to true, the controller will add the Pod IP to the subsets even if the Pod has been marked for deletion.

If the user does not set this spec, the Endpoint controller will remove the terminating Pod IP from the subset, updating the resource we're already watching via the EndpointLister.

I started looking at the history of the source file and this logic seems to have been in place since the early 1.0 version of Kubernetes: if we remove Pod IPs because they're running on a node being drained even though the user has set Service.spec.publishNotReadyAddresses, we're overriding the Kubernetes behavior, and that is bad.

Given that, I'm pretty confident that we have to drop the listening on Pods, since all the business logic is already handled correctly by Kubernetes itself; also considering that iterating over Pods is a huge waste of CPU cycles, and memory too.

@jcmoraisjr
Owner

I think there is a misunderstanding here. The publishNotReadyAddresses svc spec is used to tell DNS (eg coredns), ipvs/iptables (eg kube-proxy) or a load balancer (ingress) whether they should include notReady endpoints or not. Drain support, on the other hand, configures whether notReady and terminating pods should be added with weight zero, which only makes sense if sticky sessions are being used.

Some ideas that come to my mind:

  1. I definitely ignored publishNotReadyAddresses. I should read this attribute and, if true, add notReady endpoints regardless of the drain support configuration. I'll have a closer look into this;
  2. Drain support also adds terminating pods to the list of zero-weight available endpoints.

Item 2 above sounds to me like new functionality currently not supported by k8s. Because of that it sounds like an override of the default behavior, and because of that the pod lister is being used.

I'm completely in favor of optimizations, so I completely agree with you that a pod lister shouldn't be configured if such a lister isn't being used, but currently I cannot agree that we should remove it altogether, because we'd break how drain support is supposed to work.

@prometherion
Contributor Author

The publishNotReadyAddresses svc spec is used to tell DNS (eg coredns), ipvs/iptables (eg kube-proxy) or a load balancer (ingress) whether they should include notReady endpoints or not

Nope, please take a look at the code here: this is the func executing the reconciliation for the given Endpoint object and it's performed every time, not just for external load balancers.

So if a Pod is going to be deleted, the Endpoint Controller will start reconciliation and the Endpoint will be updated, dropping the Pod IP: what I'm trying to say is that's already done by Kubernetes.

I guess we're using Kubernetes Endpoint resources as HAProxy Backend Server objects: given that, if I get an endpoint update because the Pod has been removed from the subset due to its terminating state, I have to update the HAProxy Server, removing its address from the list. Can you confirm this? Or are we directly upstreaming to the Kubernetes Service rather than single Kubernetes Endpoints?

Drain support also adds terminating pods to the list of zero-weight available endpoints

Please, can you point me to the code where this should happen?

Because I started digging into the k8s.io/api types and there's no weight at all for endpoints; rather, there's the proof of what I'm saying (please note NotReadyAddresses and Addresses).

Item 2 above sounds to me like new functionality currently not supported by k8s

Indeed, and I wonder why you would try to implement it: the Ingress Controller should be responsible only for taking external traffic into the cluster. All the network mesh should be handled by the CNI and the Endpoint controller: this could break the Single Responsibility Principle.

currently I cannot agree that we should remove it altogether, because we'd break how drain support is supposed to work

It's actually working because, according to the code, it's already done by Kubernetes and the Endpoint controller.

@prometherion
Contributor Author

prometherion commented Apr 10, 2020

Just to be more clear, I'm referring to this snippet

func isTerminatingServicePod(svc *apiv1.Service, pod *apiv1.Pod) (termSvcPod bool) {
    termSvcPod = false
    // the pod must belong to the same namespace and match the service selector
    if svc.GetNamespace() != pod.GetNamespace() {
        return
    }
    for selectorLabel, selectorValue := range svc.Spec.Selector {
        if labelValue, present := pod.Labels[selectorLabel]; !present || selectorValue != labelValue {
            return
        }
    }
    // flagged as terminating: marked for deletion, not lost with its node,
    // and still holding an assigned IP
    if pod.DeletionTimestamp != nil && pod.Status.Reason != "NodeLost" && pod.Status.PodIP != "" {
        termSvcPod = true
    }
    return
}

We're flagging a Pod as a terminating service pod when all three of these conditions hold:

  • the Pod DeletionTimestamp is not nil
  • the Pod Status.Reason is not NodeLost
  • the Pod Status.PodIP is not an empty string

All three of these conditions are already handled by the Endpoint Controller: why replicate this in HAProxy by putting the weight to zero? These pods shouldn't get traffic at all because they could be unreachable, shutting down, or affected by any other issue.

Also, last but not least, although you mentioned sticky sessions, if my Pod is going to be shut down it should just complete the pending requests and not accept any new ones, even if I have a sticky session: that's the Kubernetes pattern.

@jcmoraisjr
Owner

Or are we directly upstreaming to Kubernetes Service rather than single Kubernetes Endpoints?

No, we are using the endpoint object all the time, except when drain support is enabled, which means "also add the terminating pods, with weight zero in haproxy". I'm not using this functionality myself, perhaps @brianloss and @gavinbunney can advocate on behalf of it. As I said before, if drain support isn't being used, the pod lister won't be used either.

Please, can you point me to the code where this should happen?

https://github.com/jcmoraisjr/haproxy-ingress/blob/v0.10-snapshot.5/pkg/converters/ingress/ingress.go#L299-L328

Just to be clear: we're not reinventing endpoint discovery, but instead adding endpoints to the backend which k8s considers dead. This is disabled by default, of course, and only makes sense if both sticky sessions and a long terminating status are in place.
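For context, a sketch of the idea (not the linked ingress.go code): build the backend server list from the Endpoints object and, when drain support is on, also add the not-ready addresses with weight zero. Terminating pods never appear in the Endpoints object at all, which is exactly why the real implementation also consults the pod lister; the haproxyServer type below is a made-up placeholder:

package drain

import (
    corev1 "k8s.io/api/core/v1"
)

// haproxyServer is an illustrative placeholder for a backend server entry.
type haproxyServer struct {
    IP     string
    Port   int32
    Weight int
}

// buildServers converts an Endpoints object into backend servers. Ready
// addresses receive full weight; with drain support, not-ready addresses are
// added with weight zero so sticky sessions can still resolve them.
func buildServers(ep *corev1.Endpoints, drainSupport bool) []haproxyServer {
    var servers []haproxyServer
    for _, subset := range ep.Subsets {
        for _, port := range subset.Ports {
            for _, addr := range subset.Addresses {
                servers = append(servers, haproxyServer{IP: addr.IP, Port: port.Port, Weight: 100})
            }
            if !drainSupport {
                continue
            }
            for _, addr := range subset.NotReadyAddresses {
                servers = append(servers, haproxyServer{IP: addr.IP, Port: port.Port, Weight: 0})
            }
        }
    }
    return servers
}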

@jcmoraisjr
Owner

We have some good news, listed in no special order:

  • Hosts and backends have their own entities
  • A nice refactor in the frontend maps writing
  • Removed the multi frontend configuration
  • Listers and informers moved to the new controller (PR to be merged)

All of them combined make it possible to start something related to model reuse, partial updates, and clean/dirty hosts and backends. Just cleaning up stuff, no performance improvement here.

Besides that, some queries were converted to maps; this, on top of the lister/informer refactor, reduced our parsing time by about 50%.
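As a trivial illustration of the query-to-map conversion (hypothetical helpers, not the actual diff):

package store

import (
    corev1 "k8s.io/api/core/v1"
)

// before: a linear scan for every lookup, O(n) per query
func findServiceLinear(services []*corev1.Service, namespace, name string) *corev1.Service {
    for _, svc := range services {
        if svc.Namespace == namespace && svc.Name == name {
            return svc
        }
    }
    return nil
}

// after: build the index once per sync, O(1) per query
func buildServiceIndex(services []*corev1.Service) map[string]*corev1.Service {
    index := make(map[string]*corev1.Service, len(services))
    for _, svc := range services {
        index[svc.Namespace+"/"+svc.Name] = svc
    }
    return index
}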

This finishes phase one. Now the real improvements start.

@jcmoraisjr jcmoraisjr added this to the v0.11 milestone May 10, 2020
@jcmoraisjr
Owner

Hi, fyi I've just tested #571 on our production cluster and here is the time comparison of parse_ingress: v0.10: 5000ms-7000ms; v0.11 full: 700ms-1200ms; v0.11 partial: 1ms-7ms. We have about 1500 namespaces, 4800 ingress and 4300 services atm.

@jcmoraisjr
Owner

Running #623 in production since last Sunday and it looks promising. I'll update this issue again as soon as we have a snapshot tag.

@jcmoraisjr
Owner

First v0.11 snapshot is tagged and the changelog updated. Thank you for all the help in this version. Let me know about the numbers of v0.11 on your cluster and if you have any problems. Closing.
