Antrea policies with exact FQDN doesn't work on IPv6 clusters #3873

Closed
tnqn opened this issue Jun 8, 2022 · 3 comments · Fixed by #3869
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@tnqn
Member

tnqn commented Jun 8, 2022

Describe the bug

I found this issue while investigating #3842. The antrea-agent logs show that DNS resolution never succeeded:

I0608 00:54:09.668460       1 reconciler.go:263] Reconciling rule dca91bcad598a398 of NetworkPolicy AntreaClusterNetworkPolicy:test-acnp-fqdn-cluster-svc
E0608 00:54:09.668864       1 fqdn.go:696] "DNS exchange failed" err="dial udp: address fd74:ca9b:172:17::a:53: too many colons in address"
E0608 00:54:09.669027       1 fqdn.go:639] "Error syncing FQDN, retrying" err="DNS request failed for at least one of type A or AAAA queries" fqdn="ipv6-svc.x-ft8kjhxm.svc.cluster.local"
E0608 00:54:14.670149       1 fqdn.go:696] "DNS exchange failed" err="dial udp: address fd74:ca9b:172:17::a:53: too many colons in address"
E0608 00:54:14.670184       1 fqdn.go:639] "Error syncing FQDN, retrying" err="DNS request failed for at least one of type A or AAAA queries" fqdn="ipv6-svc.x-ft8kjhxm.svc.cluster.local"

The problem is that an IPv6 address must be wrapped in "[]" when it is combined with a port in the network API (for example, "[fd74:ca9b:172:17::a]:53"). It seems that policies with an exact FQDN have never worked on IPv6 clusters. However, the test only started failing recently and was relatively stable before. I checked the log of a previous successful run and found that the first two probes failed but the third one got the expected response:

T0608 00:54:09.771974   29989 k8s_util.go:140] Running: kubectl exec y-ft8kjhxmb-64d4c7988c-nwsbg -c c80 -n y-ft8kjhxm -- /bin/sh -c for i in $(seq 1 3); do /agnhost connect ipv6-svc.x-ft8kjhxm.svc.cluster.local:80 --timeout=1s --protocol=tcp; done;
T0608 00:54:17.244073   29989 k8s_util.go:147] y-ft8kjhxm/b -> ipv6-svc.x-ft8kjhxm.svc.cluster.local: error when running command: err - command terminated with exit code 1 /// stdout -  /// stderr - TIMEOUT
TIMEOUT
REFUSED

The validation logic treated this as a success. My guess is that the third probe sometimes succeeded because the DNS responses to the first two probes were handled by the FQDN controller via packet-in messages, so the corresponding IP was added to the OpenFlow rules. However, I am not sure why the test was stable before but fails very frequently now.
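
The probe output above ("TIMEOUT", "TIMEOUT", "REFUSED") shows how a check that accepts mixed results can hide the DNS failure. The following is a minimal sketch of a stricter check, assuming the probe loop writes one result per line to stderr. `summarizeProbes` is a hypothetical helper introduced here for illustration; it is not Antrea's actual validation code.

```go
package main

import (
	"fmt"
	"strings"
)

// summarizeProbes is a hypothetical helper, not Antrea's test code.
// It takes the stderr of a probe loop (one result per line, such as
// "TIMEOUT" or "REFUSED") and returns the common result only when every
// attempt reported the same outcome.
func summarizeProbes(stderr string) (string, bool) {
	results := strings.Fields(stderr)
	if len(results) == 0 {
		return "", false
	}
	for _, r := range results[1:] {
		if r != results[0] {
			// Mixed outcomes: do not report any single result.
			return "", false
		}
	}
	return results[0], true
}

func main() {
	for _, out := range []string{
		"TIMEOUT\nTIMEOUT\nREFUSED", // pattern from the failing main-branch run
		"REFUSED\nREFUSED\nREFUSED", // pattern from the release-1.6 run
	} {
		result, consistent := summarizeProbes(out)
		fmt.Printf("result=%q consistent=%v\n", result, consistent)
	}
}
```

With a check like this, the mixed run would be flagged as inconsistent instead of being counted as a pass on the strength of its last probe.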

For comparison, when running the same test on the release-1.6 branch, all three probes returned "REFUSED":

T0608 13:31:25.058840   29880 k8s_util.go:139] Running: kubectl exec yb-5cd45ddd4-s7sv9 -c c80 -n y -- /bin/sh -c for i in $(seq 1 3); do /agnhost connect ipv6-svc.x.svc.cluster.local:80 --timeout=1s --protocol=tcp; done;
T0608 13:31:25.782748   29880 k8s_util.go:146] y/b -> ipv6-svc.x.svc.cluster.local: error when running command: err - command terminated with exit code 1 /// stdout -  /// stderr - REFUSED
REFUSED
REFUSED
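
The bracket requirement described above can be illustrated with Go's standard library. This is a minimal sketch, not the actual change made in #3869; `dnsServerAddr` and `isValidHostPort` are hypothetical helper names introduced here. `net.JoinHostPort` adds the brackets for IPv6 literals and leaves IPv4 addresses unchanged, whereas plain string concatenation produces the address that fails with "too many colons in address".

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// dnsServerAddr is a hypothetical helper: it builds the "host:port" string
// expected by Go's dial functions. net.JoinHostPort wraps IPv6 literals
// in "[]" automatically.
func dnsServerAddr(ip string, port int) string {
	return net.JoinHostPort(ip, strconv.Itoa(port))
}

// isValidHostPort is a hypothetical helper that reports whether addr can be
// parsed as "host:port" by the net package.
func isValidHostPort(addr string) bool {
	_, _, err := net.SplitHostPort(addr)
	return err == nil
}

func main() {
	ip := "fd74:ca9b:172:17::a" // DNS server address from the agent log

	naive := ip + ":53" // unbracketed form, as seen in the error message
	fmt.Println(naive, "valid:", isValidHostPort(naive))

	joined := dnsServerAddr(ip, 53)
	fmt.Println(joined, "valid:", isValidHostPort(joined))

	// IPv4 addresses are left unbracketed.
	fmt.Println(dnsServerAddr("10.96.0.10", 53))
}
```

Using one join function for both address families avoids special-casing IPv6 at each call site.
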
@XinShuYang
Contributor

XinShuYang commented Jun 8, 2022

The release-1.6 branch passes both the dual-stack and IPv6 e2e tests. I think it's possible that a code change between 1.6 and 1.7 introduced this issue. @tnqn

@tnqn
Member Author

tnqn commented Jun 8, 2022

@XinShuYang thanks for the information. What's stranger is that I can reproduce this error when running v1.6.0 in my local kind cluster, and it fails for the same reason as the main branch. I don't understand why running release-1.6 on CI returns three "REFUSED" results. Is there any K8s version difference between the release-1.6 and main branch runs?

@XinShuYang
Contributor

XinShuYang commented Jun 9, 2022

> @XinShuYang thanks for the information. What's stranger is that I can reproduce this error when running v1.6.0 in my local kind cluster, and it fails for the same reason as the main branch. I don't understand why running release-1.6 on CI returns three "REFUSED" results. Is there any K8s version difference between the release-1.6 and main branch runs?

All IPv6 tests run on a private testbed, so the K8s version is the same for both branches. The K8s version is 1.21.2 for IPv6 dual-stack and 1.18.20 for IPv6-only.
