BugFix: Agent NetworkPolicyController handles ipBlocks in NP update. #1625
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1625 +/- ##
==========================================
- Coverage 63.31% 61.70% -1.62%
==========================================
Files 170 181 +11
Lines 14250 15430 +1180
==========================================
+ Hits 9023 9521 +498
- Misses 4292 4875 +583
- Partials 935 1034 +99
Flags with carried forward coverage won't be shown.
if len(newRule.To.IPBlocks) > 0 {
	to := ipBlocksToOFAddresses(newRule.To.IPBlocks, r.ipv4Enabled, r.ipv6Enabled)
	ofRule.To = append(ofRule.To, to...)
}
@suwang48404 if this is a bug fix, can we add a test case for this?
/cc @Dyanngg
@Dyanngg I saw the problem when working on a separate product. I think in my setup, the initial add rule failed because the underlying system was not ready, so the subsequent action became an update, and that is when this problem was observed.
But even just by eyeballing the code, it seems the update path misses ipBlocks in the egress direction, no?
@antoninbas I am not sure whether this bug can be triggered externally. It is not the normal workflow.
This might be by design. If the ipBlock CIDR is not changed before/after the rule update (a rule update can happen when the rule itself did not change, e.g. the Pods it selects are added/deleted), we do not need to process the ipBlock in the update
function. On the other hand, if an ipBlock CIDR is added/removed/modified in a rule, it becomes a NEW RULE, so the update logic won't be involved here. Hence, I tend to believe the current logic is solid.
You are correct, the IPBlock has not changed before/after the rule change because there is no rule change. The reason updateRule is called is that the initial AddRule failed because some underlying system was not ready, and the retry logic calls updateRule to (re)install the rule.
And when the rule is not created initially, update is used for the retry, and the egress rule has ipBlocks, we will have a problem, no?
Also, I put in the fix because I experienced the problem, unless you tell me that a change was made in v0.11 such that updateRule is no longer called when addRule initially fails. If that is the case, perhaps the if !exists {} block in updateRule should be removed instead? A sketch of the path I am describing follows.
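A minimal, self-contained sketch of what I mean, under the assumption that the rule is (re)installed through the update path; the types and helper names here (rule, ofRule, buildOFRule) are simplified stand-ins for illustration, not the actual reconciler code:

package main

import "fmt"

type rule struct {
	ToPods     []string // Pod IPs selected by the rule
	ToIPBlocks []string // CIDRs from ipBlock entries
}

type ofRule struct {
	To []string // addresses that end up in the conjunction's "to" clause
}

// buildOFRule mimics the "if !exists" branch of update: the rule was never
// installed, so the flows are built from scratch here.
func buildOFRule(r rule, includeIPBlocks bool) ofRule {
	of := ofRule{To: append([]string{}, r.ToPods...)}
	if includeIPBlocks && len(r.ToIPBlocks) > 0 {
		// The point of this PR: without this append, an egress rule that is
		// (re)installed through update silently drops its ipBlock CIDRs.
		of.To = append(of.To, r.ToIPBlocks...)
	}
	return of
}

func main() {
	r := rule{ToPods: []string{"10.10.1.16"}, ToIPBlocks: []string{"10.10.1.17/32"}}
	fmt.Println("without fix:", buildOFRule(r, false)) // ipBlock missing
	fmt.Println("with fix:   ", buildOFRule(r, true))  // ipBlock included
}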
@suwang48404 thanks for catching this. However, it's not supposed to append the IPBlock to every "member group". The reason we call "groupMembersByServices" is to make Pods that have the same named port definition fall into the same group so that they can share OpenFlow flows. As IPBlocks cannot resolve any named port, they should share the flows with the Pods that cannot resolve any named port. See the logic in https://github.com/vmware-tanzu/antrea/blob/370e2f4e011da38b0559adca702243646b5bbc48/pkg/agent/controller/networkpolicy/reconciler.go#L458-L486. A simplified illustration of that grouping follows.
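A simplified sketch (assumed types and names, not the real groupMembersByServices implementation) of the grouping described above: members group by the concrete port their named port resolves to, and ipBlocks join only the group whose named port could not be resolved, so they share that group's flows rather than being appended to every group:

package main

import "fmt"

type member struct {
	IP   string
	Port int // 0 means the named port could not be resolved for this member
}

// groupMembersByServices groups member IPs by resolved port, mimicking the
// idea that members sharing the same resolved port can share OpenFlow flows.
func groupMembersByServices(members []member) map[int][]string {
	groups := map[int][]string{}
	for _, m := range members {
		groups[m.Port] = append(groups[m.Port], m.IP)
	}
	return groups
}

func main() {
	members := []member{
		{IP: "10.10.1.16", Port: 443}, // resolves named port "https"
		{IP: "10.10.1.20", Port: 0},   // cannot resolve the named port
	}
	groups := groupMembersByServices(members)
	// ipBlocks cannot resolve named ports, so they join only the Port==0 group.
	groups[0] = append(groups[0], "10.10.1.17/32")
	fmt.Println(groups)
}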
Hi @tnqn, just to clarify: the bug is valid, but the fix is not correct because conjunction flows need to share the same matching criteria. We will need to come up with a better fix, right?
In that case, perhaps I should close this PR and file an issue instead? Is this hard to fix? And timeline-wise, how soon can it be done?
Hi @tnqn, just to clarify: the bug is valid, but the fix is not correct because conjunction flows need to share the same matching criteria. We will need to come up with a better fix, right?
Yes, the bug is valid but the fix will cause another issue.
In that case, perhaps I should close this PR and file an issue instead? Is this hard to fix? And timeline-wise, how soon can it be done?
I'm fine if you want to improve the fix in this PR, or file the issue and assign it to me. I think it's not hard to fix, but we do need to improve the unit tests to cover this scenario. I think we can get the fix into 0.12, which is supposed to go out next week; does that timeline work for you?
@tnqn 0.12 next week sounds super good. Thanks for helping out!!
Hi @suwang48404, could you update the PR description with how to reproduce this bug?
I tested with the ToT build using the following spec:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-np
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: client
    - ipBlock:
        cidr: "10.10.1.17/32"
    ports:
    - protocol: TCP
      port: https
where the client Pod has an IP address of 10.10.1.16 and port https maps to 443.
I applied this spec, deleted the client Pod to force a rule update event, and the result of the rule update looks correct to me.
Before client Pod deletion:
kubectl exec -it antrea-agent-zrvnl -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=50
cookie=0x4000000000000, duration=1152.720s, table=50, n_packets=4530, n_bytes=1086677, priority=210,ct_state=-new+est,ip actions=resubmit(,61)
cookie=0x4050000000000, duration=4.992s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_src=10.10.1.13 actions=conjunction(1,1/3),conjunction(2,1/3)
cookie=0x4050000000000, duration=4.992s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_src=10.10.1.14 actions=conjunction(1,1/3),conjunction(2,1/3)
cookie=0x4050000000000, duration=4.992s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_dst=10.10.1.16 actions=conjunction(2,2/3)
cookie=0x4050000000000, duration=4.983s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_dst=10.10.1.17 actions=conjunction(1,2/3)
cookie=0x4050000000000, duration=4.992s, table=50, n_packets=0, n_bytes=0, priority=200,tcp,tp_dst=443 actions=conjunction(2,3/3)
cookie=0x4050000000000, duration=4.993s, table=50, n_packets=0, n_bytes=0, priority=190,conj_id=2,ip actions=load:0x2->NXM_NX_REG5[],ct(commit,table=61,zone=65520,exec(load:0x2->NXM_NX_CT_LABEL[32..63]))
cookie=0x4050000000000, duration=4.985s, table=50, n_packets=0, n_bytes=0, priority=190,conj_id=1,ip actions=load:0x1->NXM_NX_REG5[],ct(commit,table=61,zone=65520,exec(load:0x1->NXM_NX_CT_LABEL[32..63]))
cookie=0x4000000000000, duration=1152.731s, table=50, n_packets=231, n_bytes=17094, priority=0 actions=resubmit(,60)
Something I found alarming is that it seems the policy has been interpreted as 'allow egress traffic to client Pods on port https, and allow egress traffic to 10.10.1.17/32 on any port', which I'm not sure is correct/by design. cc @abhiraut @tnqn @wenyingd
That being said, after an event which triggers a rule update (client Pod deleted), the flows get updated correctly:
kubectl exec -it antrea-agent-zrvnl -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=50
cookie=0x4000000000000, duration=1516.208s, table=50, n_packets=5963, n_bytes=1430937, priority=210,ct_state=-new+est,ip actions=resubmit(,61)
cookie=0x4050000000000, duration=368.480s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_src=10.10.1.14 actions=conjunction(1,1/3)
cookie=0x4050000000000, duration=368.480s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_src=10.10.1.13 actions=conjunction(1,1/3)
cookie=0x4050000000000, duration=368.471s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_dst=10.10.1.17 actions=conjunction(1,2/3)
cookie=0x4050000000000, duration=368.473s, table=50, n_packets=0, n_bytes=0, priority=190,conj_id=1,ip actions=load:0x1->NXM_NX_REG5[],ct(commit,table=61,zone=65520,exec(load:0x1->NXM_NX_CT_LABEL[32..63]))
cookie=0x4000000000000, duration=1516.219s, table=50, n_packets=303, n_bytes=22422, priority=0 actions=resubmit(,60)
and the isolation rules in table 60 also checked out. I also experimented with an egress rule having a regular port as opposed to a named port, and rule updates also seem to work as expected.
@wenyingd @Dyanngg In my setup, the initial rule creation failed, so the egress update path takes the if !exists {} branch. Does your test cover that path too? It seems odd if it works, because as you can see the ipBlock is not added, so how could it work?
See #1625 (comment). I think
I repurposed the networkPolicyController package in another product; it is not Antrea per se. In our use case, the underlying OVS going up and down is within the normal scope of operations. But regardless of where the use case may be, the NetworkPolicyController logic today seems able to handle rule creation failures, so I think this fix just makes it more solid, no? Or maybe you have a different perspective on why this fix should not go in?
I hear your point that this makes it more solid. On the other hand, given the premise that rule add is handled correctly (if an exception occurred in Antrea, reconcile should have been retried), processing the ipBlock in the update function seems unnecessary.
To summarize: if addRule fails for any reason (maybe OVS is not ready, maybe something else), the retry logic will call updateRule, and if the rule is an egress rule and contains ipBlocks, the destination ipBlock will be ignored. That is the bug this PR is trying to address, and it is applicable to the Antrea agent in a CNI deployment or other use cases.
da969e2 to 9be2744
@jianjuns @abhiraut @tnqn can you help review this? To summarize: if addRule fails for any reason (maybe OVS is not ready, maybe something else), the retry logic will call updateRule, and if the rule is an egress rule and contains ipBlocks, the destination ipBlock will be ignored. That is the bug this PR is trying to address, and it is applicable to the Antrea agent in a CNI deployment or other use cases.
I don't think the retry logic will call updateRule.
When installation fails, lastRealized should not contain the ruleID, and the next time the rule is synced, it should go through the add path again.
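A hedged, self-contained sketch of the sequencing being debated (simplified types and names, not the actual Reconcile/lastRealized implementation): whether the retry goes through the add path or the update path depends only on whether the rule ID stays recorded in lastRealized after a failed install:

package main

import (
	"errors"
	"fmt"
)

type reconciler struct {
	lastRealized    map[string]bool // ruleID -> remembered as realized
	installFails    bool            // simulate the underlying system not being ready
	keepIDOnFailure bool            // the behaviour the two sides of this thread disagree about
}

func (r *reconciler) install(id string) error {
	if r.installFails {
		return errors.New("underlying system not ready")
	}
	return nil
}

// reconcile returns which path ("add" or "update") handles the rule this sync.
func (r *reconciler) reconcile(id string) string {
	if _, exists := r.lastRealized[id]; !exists {
		if err := r.install(id); err == nil || r.keepIDOnFailure {
			r.lastRealized[id] = true
		}
		return "add"
	}
	return "update"
}

func main() {
	r := &reconciler{lastRealized: map[string]bool{}, installFails: true, keepIDOnFailure: true}
	fmt.Println(r.reconcile("rule-1")) // "add": first attempt, install fails
	r.installFails = false
	fmt.Println(r.reconcile("rule-1")) // "update": the retry path that drops egress ipBlocks
}

With keepIDOnFailure set to false, the second sync would print "add" instead, which is the behaviour described above.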
I may not have been clear regarding the ofFlow rule and the policy rule.
Hope this helps, and let me know if I missed anything. Thanks, Su
Thanks Su for walking through this. So yes, this is an issue which is hard to write a testcase for. I'm okay with this change, except that I believe that to fix it more properly, we need to address the TODO at L411.
That involves removing rule.ID from lastRealized if flow installation failed at any point. @tnqn
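A minimal sketch of that follow-up (assumed names, not the real code): drop the rule ID from lastRealized when flow installation fails, so the next sync retries through the add path rather than update:

package main

import (
	"errors"
	"fmt"
)

// addRule tentatively records the rule, then forgets it if installation failed,
// so the next sync retries via the add path instead of update.
func addRule(lastRealized map[string]bool, id string, installErr error) error {
	lastRealized[id] = true
	if installErr != nil {
		delete(lastRealized, id) // do not keep a rule whose flows never made it to OVS
		return installErr
	}
	return nil
}

func main() {
	lastRealized := map[string]bool{}
	_ = addRule(lastRealized, "rule-1", errors.New("ovs not ready"))
	fmt.Println(len(lastRealized)) // 0: the retry will go through add again
}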
I will let you decide what to do with L411. I am simply fixing a bug: whenever a rule's flows are created in update, the flows should include the ipBlock. It seems like an obvious omission, irrespective of L411.
BTW, I appreciate a detailed review process and Q&A, which help produce good quality code. But let us try not to send people on a fishing expedition; that just wastes time we don't have.
Apologies, that's not our intention here, but quality and robustness are our priority, especially when it comes to a feature like NetworkPolicy. And it's important that the folks who maintain this code, such as @Dyanngg, have a good understanding of the changes. Thanks for sharing the detailed sequence. Isn't it pretty straightforward to test this case, by returning an error in the mocked flow installation?
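One possible shape for such a test, as a sketch only; the fake client, helper, and names below are assumptions, not Antrea's real mock interfaces: fail the first installation, retry, and assert the egress ipBlock CIDR is present in what finally gets installed:

package reconciler_test

import (
	"errors"
	"testing"
)

type fakeOFClient struct {
	calls     int
	installed []string
}

// InstallFlows fails on the first call to simulate "underlying system not ready".
func (c *fakeOFClient) InstallFlows(addrs []string) error {
	c.calls++
	if c.calls == 1 {
		return errors.New("underlying system not ready")
	}
	c.installed = append([]string{}, addrs...)
	return nil
}

// installEgress mimics the fixed behaviour: Pod addresses and ipBlocks are
// both included whenever the flows are (re)installed.
func installEgress(c *fakeOFClient, podIPs, ipBlocks []string) error {
	return c.InstallFlows(append(append([]string{}, podIPs...), ipBlocks...))
}

func TestRetryKeepsEgressIPBlock(t *testing.T) {
	c := &fakeOFClient{}
	podIPs, ipBlocks := []string{"10.10.1.16"}, []string{"10.10.1.17/32"}

	if err := installEgress(c, podIPs, ipBlocks); err == nil {
		t.Fatal("expected the first installation to fail")
	}
	if err := installEgress(c, podIPs, ipBlocks); err != nil { // retry path
		t.Fatalf("retry failed: %v", err)
	}
	found := false
	for _, a := range c.installed {
		if a == "10.10.1.17/32" {
			found = true
		}
	}
	if !found {
		t.Errorf("egress ipBlock missing after retry: %v", c.installed)
	}
}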
I'm good with merging this, and I will update the logic for lastRealized as well as add a testcase for this particular scenario.
@Dyanngg @antoninbas thanks for helping out on testing and further improvements.
@Dyanngg does your proposed change address the issue fixed by this PR? I am assuming your change would remove the need for update() to create new ofFlow rules? If yes and you plan to work on it promptly, maybe we don't need this PR merged and can instead wait for your change along with companion test(s)? Let me know.
Thanks @suwang48404 for raising this bug. I left a comment at https://github.com/vmware-tanzu/antrea/pull/1625/files#r538961037; I guess it would cause duplicate conjunction clauses, so I am marking this as "request changes" to hold the merge for now.