Antrea GKE CI job keeps failing #1032

Closed
antoninbas opened this issue Aug 4, 2020 · 5 comments
Labels
area/test/infra: Issues or PRs related to test infrastructure (Jenkins configuration, Ansible playbook, Kind wrappers)
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@antoninbas
Contributor

Describe the bug
It seems that the job is failing most of the time. The last 6 runs failed, but each time a different set of tests failed:

Failed tests:
[sig-network] DNS should resolve DNS of partial qualified names for services [LinuxOnly] [Conformance]
Failed tests:
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should deny ingress access to updated pod [Feature:NetworkPolicy]
Failed tests:
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce policy based on PodSelector with MatchExpressions[Feature:NetworkPolicy]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce except clause while egress access to server in CIDR block [Feature:NetworkPolicy]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should stop enforcing policies after they are deleted [Feature:NetworkPolicy]

And so on...

Expected
Tests should pass consistently.

Actual behavior
The job is flaky, but it is never the same test failing.

Versions:
Antrea TOT

Additional context
Full build history: http://jenkins.antrea-ci.rocks/view/cloud/job/cloud-antrea-gke-conformance-net-policy/

@antoninbas antoninbas added kind/bug, priority/important-soon, and area/test/infra labels Aug 4, 2020
@antoninbas antoninbas added this to the Antrea v0.9.0 release milestone Aug 4, 2020
@Dyanngg
Contributor

Dyanngg commented Aug 6, 2020

Some findings after digging into the full logs of this particular failure: http://jenkins.antrea-ci.rocks/view/cloud/job/cloud-antrea-gke-conformance-net-policy/52/

It seems that the testcase [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce updated policy [Feature:NetworkPolicy] failed before any NetworkPolicy was applied.
In most of the testcases defined in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/network/network_policy.go, the initial setup phase brings up a server pod that listens on ports 80 and 81 in a new namespace, creates a client pod in that namespace, and verifies that the client pod can reach the server pod on both ports while no NetworkPolicies are applied at all. However, this initial step failed, based on the following log:

[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client 
  should enforce updated policy [Feature:NetworkPolicy]
  /workspace/anago-v1.18.0-beta.0.236+78ccbd44840fcd/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:685
[BeforeEach] [sig-network] NetworkPolicy [LinuxOnly]
  /workspace/anago-v1.18.0-beta.0.236+78ccbd44840fcd/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:178
STEP: Creating a kubernetes client
Aug  6 00:54:57.196: INFO: >>> kubeConfig: /tmp/kubeconfig-680195992
STEP: Building a namespace api object, basename network-policy
STEP: Binding the e2e-test-privileged-psp PodSecurityPolicy to the default service account in network-policy-4403
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [sig-network] NetworkPolicy [LinuxOnly]
  /workspace/anago-v1.18.0-beta.0.236+78ccbd44840fcd/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:52
[BeforeEach] NetworkPolicy between server and client
  /workspace/anago-v1.18.0-beta.0.236+78ccbd44840fcd/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:58
STEP: Creating a simple server that serves on port 80 and 81.
STEP: Creating a server pod server in namespace network-policy-4403
Aug  6 00:54:57.402: INFO: Created pod server-5wjbw
STEP: Creating a service svc-server for pod server in namespace network-policy-4403
Aug  6 00:54:57.448: INFO: Created service svc-server
STEP: Waiting for pod ready
Aug  6 00:54:57.458: INFO: The status of Pod server-5wjbw is Pending, waiting for it to be Running (with Ready = true)
Aug  6 00:54:59.464: INFO: The status of Pod server-5wjbw is Running (Ready = false)
Aug  6 00:55:01.493: INFO: The status of Pod server-5wjbw is Running (Ready = false)
Aug  6 00:55:03.463: INFO: The status of Pod server-5wjbw is Running (Ready = false)
Aug  6 00:55:05.465: INFO: The status of Pod server-5wjbw is Running (Ready = false)
Aug  6 00:55:07.464: INFO: The status of Pod server-5wjbw is Running (Ready = true)
STEP: Testing pods can connect to both ports when no policy is present.
STEP: Creating client pod client-can-connect-80 that should successfully connect to svc-server.
Aug  6 00:55:07.494: INFO: Waiting for client-can-connect-80-82ldp to complete.
Aug  6 00:55:19.506: INFO: Waiting for client-can-connect-80-82ldp to complete.
Aug  6 00:55:19.507: INFO: Waiting up to 5m0s for pod "client-can-connect-80-82ldp" in namespace "network-policy-4403" to be "Succeeded or Failed"
Aug  6 00:55:19.513: INFO: Pod "client-can-connect-80-82ldp": Phase="Failed", Reason="", readiness=false. Elapsed: 6.256655ms
Aug  6 00:55:19.552: FAIL: Pod client-can-connect-80-82ldp should be able to connect to service svc-server, but was not able to connect.
Pod logs:


 Current NetworkPolicies:
	[]

 Pods:
	[Pod: client-can-connect-80-82ldp, Status: &PodStatus{Phase:Failed,Conditions:[]PodCondition{PodCondition{Type:Initialized,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-08-06 00:55:07 +0000 UTC,Reason:,Message:,},PodCondition{Type:Ready,Status:False,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-08-06 00:55:18 +0000 UTC,Reason:ContainersNotReady,Message:containers with unready status: [client],},PodCondition{Type:ContainersReady,Status:False,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-08-06 00:55:18 +0000 UTC,Reason:ContainersNotReady,Message:containers with unready status: [client],},PodCondition{Type:PodScheduled,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-08-06 00:55:07 +0000 UTC,Reason:,Message:,},},Message:,Reason:,HostIP:10.138.0.25,PodIP:10.40.4.47,StartTime:2020-08-06 00:55:07 +0000 UTC,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:client,State:ContainerState{Waiting:nil,Running:nil,Terminated:&ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2020-08-06 00:55:08 +0000 UTC,FinishedAt:2020-08-06 00:55:18 +0000 UTC,ContainerID:docker://7d872b155f6fbbb5e2f25a08e5d82b94efca472136bdd843b81ff6a64a09c018,},},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:busybox:1.29,ImageID:docker-pullable://busybox@sha256:e004c2cc521c95383aebb1fb5893719aa7a8eae2e7a71f316a4410784edb00a9,ContainerID:docker://7d872b155f6fbbb5e2f25a08e5d82b94efca472136bdd843b81ff6a64a09c018,Started:*false,},},QOSClass:BestEffort,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{PodIP{IP:10.40.4.47,},},EphemeralContainerStatuses:[]ContainerStatus{},}
 Pod: server-5wjbw, Status: &PodStatus{Phase:Running,Conditions:[]PodCondition{PodCondition{Type:Initialized,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-08-06 00:54:57 +0000 UTC,Reason:,Message:,},PodCondition{Type:Ready,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-08-06 00:55:07 +0000 UTC,Reason:,Message:,},PodCondition{Type:ContainersReady,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-08-06 00:55:07 +0000 UTC,Reason:,Message:,},PodCondition{Type:PodScheduled,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-08-06 00:54:57 +0000 UTC,Reason:,Message:,},},Message:,Reason:,HostIP:10.138.15.235,PodIP:10.40.7.31,StartTime:2020-08-06 00:54:57 +0000 UTC,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:server-container-80,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2020-08-06 00:54:58 +0000 UTC,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:true,RestartCount:0,Image:us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.10,ImageID:docker-pullable://us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost@sha256:20ff7cb1c0960acec927b4d3c6e8a6a9e3758f7c227b1fd14608b967dd9a487d,ContainerID:docker://bd35add3570be2128c9cb69daf3d7129a055c50458386e0f5d3a2fd3de1ebda9,Started:*true,},ContainerStatus{Name:server-container-81,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2020-08-06 00:54:58 +0000 UTC,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:true,RestartCount:0,Image:us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.10,ImageID:docker-pullable://us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost@sha256:20ff7cb1c0960acec927b4d3c6e8a6a9e3758f7c227b1fd14608b967dd9a487d,ContainerID:docker://d9cc303c214f50aad45065add0c36aadca2d0ad44658325a207aa24cc2a3efc3,Started:*true,},},QOSClass:BestEffort,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{PodIP{IP:10.40.7.31,},},EphemeralContainerStatuses:[]ContainerStatus{},}
]

Full Stack Trace
k8s.io/kubernetes/test/e2e/network.checkConnectivity(0xc00128c000, 0xc001f06f20, 0xc000c8f800, 0xc0006b1b00)
	/workspace/anago-v1.18.0-beta.0.236+78ccbd44840fcd/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:1522 +0x4c1
k8s.io/kubernetes/test/e2e/network.testCanConnect(0xc00128c000, 0xc001f06f20, 0x4b7d5c9, 0x15, 0xc0006b1b00, 0x50)
	/workspace/anago-v1.18.0-beta.0.236+78ccbd44840fcd/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:1498 +0x218
k8s.io/kubernetes/test/e2e/network.glob..func13.2.1()
	/workspace/anago-v1.18.0-beta.0.236+78ccbd44840fcd/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:72 +0x248
k8s.io/kubernetes/test/e2e.RunE2ETests(0xc002803500)
	_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:119 +0x30a
k8s.io/kubernetes/test/e2e.TestE2E(0xc002803500)
	_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e_test.go:111 +0x2b
testing.tRunner(0xc002803500, 0x4d28498)
	/usr/local/go/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:960 +0x350
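
For reference, the "no policy" connectivity check that fails here boils down to the pattern sketched below. This is only a rough client-go approximation, not the actual e2e framework code: the kubeconfig path and the single nc probe are placeholders (the namespace, pod name, and busybox client image are taken from the log above; the real test runs its own probe/retry command in the busybox client pod).

// Minimal sketch, NOT the e2e framework code: create a client pod in the test
// namespace that probes svc-server on port 80 once, then wait for the pod to
// reach a terminal phase.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; the e2e run uses a generated temp file.
	config, err := clientcmd.BuildConfigFromFlags("", "/tmp/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ns := "network-policy-4403" // namespace created by the e2e framework in this run
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "client-can-connect-80"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "client",
				Image: "busybox:1.29",
				// Single probe of the server Service; exits non-zero on failure.
				// Placeholder for the real test's probe command.
				Command: []string{"nc", "-z", "-w", "5", "svc-server", "80"},
			}},
		},
	}
	if _, err := client.CoreV1().Pods(ns).Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Wait for the client pod to terminate: Succeeded means connectivity works
	// with no NetworkPolicy applied; Failed is what the flaky runs hit.
	for {
		p, err := client.CoreV1().Pods(ns).Get(context.TODO(), "client-can-connect-80", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			fmt.Println("client pod phase:", p.Status.Phase)
			break
		}
		time.Sleep(2 * time.Second)
	}
}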

The log is just from one run and we need more samples to confirm whether this is happening for each of the GKE conformance failures. Some suspects:

  1. Geneve tunnel issue (Netpol test suite randomly fails when run in Kind and with tunnel type set to Geneve #897): this could explain why the test is very flaky and occasionally succeeds. We run the GKE tests on a 9-node setup, so there is a lot of variance in where the test workloads can be scheduled.
  2. NetworkPolicy flows from previous testcases are not cleaned up in time, causing packets to be dropped. This is not very likely, however, as each testcase creates new namespaces and new pods, so the old rules should never apply to the new workloads.

@antoninbas should we try changing the tunnel type to VXLAN for GKE in CI and see whether the tests can pass consistently?

@antoninbas
Contributor Author

@Dyanngg you can experiment with VXLAN if you want, but if this started happening after the switch to Geneve, then I think it may indeed be connected to #897. If @srikartati is actively working on this, maybe we can wait and see if there is progress in the next couple of days.

@antoninbas
Contributor Author

Moving this to the v0.10.0 milestone. We are making progress on #897 but we don't have a solution yet.

@Dyanngg
Contributor

Dyanngg commented Aug 13, 2020

The positive news is that the GKE tests have been consistently passing this week. I experimented with VXLAN a week ago and it did not resolve the issue. In those test logs from last week I was still seeing NetworkPolicy tests fail because of the initial setup phase failure (the client pod could not reach the server pod without any policy enforced).

@Dyanngg
Contributor

Dyanngg commented Sep 10, 2020

Closing this issue as it is not occurring anymore.

@Dyanngg Dyanngg closed this as completed Sep 10, 2020