ServiceLoadBalancer + externalTrafficPolicy: Local = Connection Refused most of time #3785
Comments
This is the correct expectation, and this is how the implementation should work even today. Since you are using Antrea v1.6.1, you can exec into the antrea-agent Pods and check the assignment. An alternative is to look at the Agent logs for entries showing which Node acquired the external IP.
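A minimal sketch of the log-based check, assuming a default kube-system installation, the component=antrea-agent label, and an illustrative grep pattern (none of these are taken verbatim from this thread):

```bash
# Hedged sketch: scan each antrea-agent's logs for external-IP assignment messages.
# The namespace, label selector, and grep pattern are assumptions for illustration.
for pod in $(kubectl -n kube-system get pods -l component=antrea-agent -o name); do
  echo "== ${pod} =="
  kubectl -n kube-system logs "${pod}" | grep -i "external ip" | tail -n 5
done
```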
BTW, did you have the same issue with Antrea v1.6.0, or is it new to Antrea v1.6.1?
It appears to be acquired by worker5, which seems to be OK.
Pods are running on worker3 and worker5.
The Service is there.
From master1:
From node4 (not running the pod):
From node3 (running the pod):
I have never tested on 1.6.
I assume you have the following default configuration, but let us know if this is not the case:
All default.
@jsalatiel I gave it more thought and I think this is the expected behavior and not specific to the Antrea implementation.
The "right" way to test this is to try to access the LB IP (10.1.2.220) from a machine which is NOT a K8s Node in your cluster. When you access the Service from a Node (not common in real life) or a Pod, you should use the Cluster IP (or the Service DNS name when accessing from a Pod with Cluster DNS).
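For illustration, a sketch of the two access paths (the Service name my-service and port 80 are assumptions; only the IP comes from this issue):

```bash
# From a machine OUTSIDE the cluster: the LoadBalancer IP is the right target.
curl http://10.1.2.220/

# From a Node (or a Pod without cluster DNS): use the ClusterIP instead.
CLUSTER_IP=$(kubectl get svc my-service -o jsonpath='{.spec.clusterIP}')
curl "http://${CLUSTER_IP}/"
```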
If I understood the problem correctly, the ask is to select a Node with a backend Pod to assign the Service's LoadBalancer IP, when externalTrafficPolicy is Local. I think we do not implement this behavior today, but it should be a valid feature. I also saw MetalLB does respect externalTrafficPolicy.
@jianjuns Antrea's implementation does respect it: see antrea/pkg/agent/controller/serviceexternalip/controller.go, lines 363 to 371 in 2526b1f.
I think @antoninbas is correct. The access is supposed to fail if traffic towards a Service with Local externalTrafficPolicy reaches a Node that doesn't have any backends of the Service. I haven't tried, but I suppose MetalLB is the same.
I understand what @antoninbas and @tnqn are saying and it looks like the expected behaviour, but there is still something strange happening. Are there any extra requirements when running ServiceExternalIP on VM guests on VMware?
More debugging here:
ping will not work for the external IPs of Services managed by Antrea. The IP is virtually assigned on Nodes.
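A quick way to see where that virtual assignment lands (a sketch; run on any Node, output will vary):

```bash
# Hedged sketch: list every interface that carries the Service external IP on this
# Node; "virtually assigned" means it may only appear on a dummy/virtual interface.
ip -br addr | grep 10.1.2.220
```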
Ping does work (from servers outside the cluster, at least)! This is a new subnet, so I am absolutely sure there is no IP conflict, if that's what you mean. As soon as I remove the Service (which releases the loadBalancerIP), ping stops.
10.1.2.23 is node5's IP. Curious that when I remove the Service from the cluster and ping stops, the real IP of the node is exposed (good to know).
From the same subnet I can see the traffic going to the right node (node5) every time.
Found a nice comment here that may explain what was happening to me.
I missed this part! Yes, I agree it is expected that accessing a Service with externalTrafficPolicy Local from a Node without backend Pods should fail. I do not think MetalLB will behave any differently, as it just relies on kube-proxy for Service LB.
Nice finding! I was suspecting kube-proxy IPVS too when you mentioned kube-ipvs0. Also found an earlier K8s issue for this problem: kubernetes-sigs/kubespray#4788.
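For reference, the mitigation usually documented for kube-proxy in IPVS mode is to enable strictARP, so Nodes stop answering ARP for addresses that only exist on kube-ipvs0. A sketch of that procedure, assuming a kubeadm-style kube-proxy ConfigMap and DaemonSet (review the diff before applying, and note the Egress caveat discussed below):

```bash
# Hedged sketch: turn on strictARP in kube-proxy's IPVS settings.
kubectl -n kube-system get configmap kube-proxy -o yaml \
  | sed -e 's/strictARP: false/strictARP: true/' \
  | kubectl apply -n kube-system -f -

# Restart kube-proxy so the updated configuration is picked up.
kubectl -n kube-system rollout restart daemonset kube-proxy
```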
Yes. Would you create a PR to add it to docs/service-loadbalancer.md? Of course, no problem if you prefer to let us handle it. Thanks again for reporting and root-causing the issue!
I do not know how to do PRs, sorry.
No problem. I can make the change. In case you'd like to learn how to create a PR, check here: https://github.com/antrea-io/antrea/blob/main/CONTRIBUTING.md.
Thanks. Closing this.
We need to document this; something came up in the past already with kube-proxy IPVS: #3370. Although it may be a pain to identify when exactly it is required. And @tnqn pointed out in the past (#3370 (comment)) that strictARP mode may interfere with the Egress feature if we are not cautious.
I'll reopen this issue since there is a documentation change required. And I will assign @jianjuns since he volunteered :) |
That is actually a TODO (@xliuxu) with Egress. We can add it to the Service LB document too.
What must be done to not break Egress?
I think one workaround can be manually setting arp_ignore to 0 for antrea-gw0, e.g.:
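A minimal sketch of that workaround, assuming the standard per-interface sysctl key (run on each Node):

```bash
# Hedged sketch: relax ARP handling on antrea-gw0 only.
sysctl -w net.ipv4.conf.antrea-gw0.arp_ignore=0

# Optionally persist it across reboots via a sysctl drop-in file.
echo 'net.ipv4.conf.antrea-gw0.arp_ignore = 0' > /etc/sysctl.d/99-antrea-gw0.conf
```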
@xliuxu: do you have an idea?
Can that be set automatically by the node agent if it detects strictARP=true?
I think our plan is to remove the requirement for arp_ignore for Egress. If you do have use cases to have both Egress and Service LB working with kube-proxy IPVS, we can prioritize the fix. Another solution is to use AntreaProxy to replace kube-proxy: https://github.com/antrea-io/antrea/blob/main/docs/antrea-proxy.md#antreaproxy-with-proxyall. Not sure if that is what you want though.
Well, I do use Egress in production right now and I would start using ServiceExternalIP now, but apparently I cannot use both at the same time yet =) Moving to AntreaProxy is not in my plans for the current clusters.
Good to know you plan to use the feature in production! We can definitely prioritize the Egress fix. I created an issue to track that: #3804 |
Any chance that we could get this into 1.7?
1.7 might be hard, as we plan to freeze this week. But we can consider a patch after 1.7.
Describe the bug
I have a 5-node cluster with a Deployment with 2 replicas. The Deployment uses the ServiceExternalIP feature of Antrea.
I can see that the Service got an IP from the same node network.
But if I try to curl 10.1.2.220, it works from some remote endpoints but not from others.
It works just fine if I set externalTrafficPolicy=Cluster, but that way I will lose the client IP.
To Reproduce
Enable the ServiceExternalIP feature for any Service backed by a single-replica Deployment and use the YAMLs at the end of this bug report.
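A minimal sketch of such a Service, assuming an existing ExternalIPPool named service-external-ip-pool, the service.antrea.io/external-ip-pool annotation, and placeholder names and ports (my-service, app: my-app, 80/8080):

```bash
# Hedged sketch: a LoadBalancer Service using Antrea's ServiceExternalIP feature
# with externalTrafficPolicy: Local. Names, labels, and ports are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.antrea.io/external-ip-pool: "service-external-ip-pool"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
EOF
```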
Expected
I suppose the loadBalancer IP would be "acquired" by one of the nodes running the Pod, and it should work.
Actual behavior
Most of the time I get connection refused, and other times it just works. From the masters it works using the ClusterIP, but it fails using the LoadBalancer IP.
Versions:
kubectl version: 1.22.8
uname -r: 4.18.0-348.23.1.el8_5.x86_64

Which node does the LoadBalancer IP get assigned to? I can see the LoadBalancer IP assigned to the kube-ipvs0 interface on all servers, but I suppose only one is really using it, otherwise it would be an IP conflict situation, wouldn't it?
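A sketch of how to answer that question from the network side (interface names and the arping variant are assumptions):

```bash
# On any Node: kube-proxy in IPVS mode adds the LoadBalancer IP to the kube-ipvs0
# dummy interface everywhere, which is why it shows up on all servers.
ip -br addr show dev kube-ipvs0

# From a host OUTSIDE the cluster on the same subnet: see which MAC answers ARP
# for the VIP; the replying MAC identifies the Node actually handling the traffic.
arping -I eth0 -c 3 10.1.2.220
```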
I will check MetalLB to see if I get the same behaviour or not.