
BugFix: Agent NetworkPolicyController handle ipBlocks in NP update. #1625

Closed
suwang48404 wants to merge 1 commit into master

Conversation

suwang48404 (Contributor)

No description provided.

@antrea-bot (Collaborator)

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-all-features-conformance: to trigger conformance tests with all alpha features enabled.
  • /skip-all-features-conformance: to skip conformance tests with all alpha features enabled.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

@codecov-io commented Dec 7, 2020

Codecov Report

Merging #1625 (9be2744) into master (9d3d10b) will decrease coverage by 1.61%.
The diff coverage is 53.40%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1625      +/-   ##
==========================================
- Coverage   63.31%   61.70%   -1.62%     
==========================================
  Files         170      181      +11     
  Lines       14250    15430    +1180     
==========================================
+ Hits         9023     9521     +498     
- Misses       4292     4875     +583     
- Partials      935     1034      +99     
Flag            Coverage Δ
kind-e2e-tests  51.84% <50.17%> (-3.55%) ⬇️
unit-tests      40.49% <16.90%> (-0.78%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
cmd/antrea-agent/agent.go 0.00% <0.00%> (ø)
cmd/antrea-agent/options.go 26.89% <0.00%> (+6.17%) ⬆️
.../agent/apiserver/handlers/networkpolicy/handler.go 58.33% <ø> (ø)
...gent/controller/noderoute/node_route_controller.go 46.06% <ø> (-0.41%) ⬇️
.../agent/flowexporter/connections/conntrack_linux.go 29.48% <0.00%> (+0.32%) ⬆️
pkg/agent/stats/collector.go 97.72% <ø> (ø)
pkg/antctl/antctl.go 100.00% <ø> (ø)
pkg/antctl/transform/addressgroup/transform.go 0.00% <0.00%> (ø)
pkg/antctl/transform/controllerinfo/transform.go 0.00% <ø> (ø)
pkg/antctl/transform/version/transform.go 0.00% <ø> (ø)
... and 79 more

Comment on lines +616 to +619
if len(newRule.To.IPBlocks) > 0 {
to := ipBlocksToOFAddresses(newRule.To.IPBlocks, r.ipv4Enabled, r.ipv6Enabled)
ofRule.To = append(ofRule.To, to...)
}
Contributor

@suwang48404 if this is a bug fix, can we add a test case for this?

/cc @Dyanngg

suwang48404 (Contributor, Author)

@Dyanngg I saw the problem while working on a separate product. I think in my setup the initial add of the rule failed because the underlying system was not ready, so the subsequent action became an update, and that is when the problem was observed.

But even just from eyeballing the code, it seems the update path misses ipBlocks in the egress direction, no?

suwang48404 (Contributor, Author)

@antoninbas I am not sure how this bug might be triggered externally. It is not the normal workflow.

Contributor

This might be by design. If the ipBlock CIDR is not changed before/after the rule update (a rule update can happen even when the rule itself did not change, e.g. when a Pod it selects is added or deleted), we do not need to process the ipBlock in the update function. On the other hand, if an ipBlock CIDR is added/removed/modified in a rule, it becomes a NEW rule, so the update logic won't be involved here. Hence, I tend to believe the current logic is solid.
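A minimal sketch of why an edited ipBlock surfaces as a new rule rather than an update, assuming (as this comment implies) that the rule ID is derived from the rule's content; the hashing details and names here are illustrative, not Antrea's actual code:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"strings"
)

// ruleID derives an identifier from the rule's content, so editing an ipBlock
// CIDR produces a different ID (a brand-new rule), while churn in the Pods a
// selector matches leaves the ID unchanged.
func ruleID(cidrs, ports []string) string {
	h := sha1.New()
	h.Write([]byte(strings.Join(cidrs, ",") + "|" + strings.Join(ports, ",")))
	return hex.EncodeToString(h.Sum(nil))[:12]
}

func main() {
	fmt.Println(ruleID([]string{"10.10.1.17/32"}, []string{"TCP/https"}))
	fmt.Println(ruleID([]string{"10.10.1.18/32"}, []string{"TCP/https"})) // different ID => new rule
}
```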

suwang48404 (Contributor, Author) commented Dec 8, 2020

You are correct: the IPBlock has not changed before/after the rule change, because there is no rule change. The reason updateRule is called is that the initial AddRule failed because some underlying system was not ready, and the retry logic calls updateRule to (re)install the rule.

And when the rule was not created initially, update is used for the retry, and the egress rule has ipBlocks, we will have a problem, no?

suwang48404 (Contributor, Author) commented Dec 8, 2020

Also, I put in the fix because I experienced the problem. That is, unless you tell me a change was made in v0.11 such that updateRule is no longer called when the initial addRule fails; if that is the case, perhaps the if !exist {} block in updateRule should be removed instead?

tnqn (Member)

@suwang48404 thanks for catching this. However, it's not supposed to append the IPBlock to every "member group". The reason we call "groupMembersByServices" is to make Pods that have the same named-port definition fall into the same group, so that they can share OpenFlow flows. As IPBlocks cannot resolve any named port, they should share the flows with the Pods that cannot resolve any named port. See the logic in https://github.com/vmware-tanzu/antrea/blob/370e2f4e011da38b0559adca702243646b5bbc48/pkg/agent/controller/networkpolicy/reconciler.go#L458-L486.
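A self-contained sketch of that grouping idea; the types and function names below are simplified assumptions for illustration, not Antrea's actual ones. Pods are bucketed by how the rule's named ports resolve for them, and ipBlocks, which can never resolve a named port, join only the bucket whose ports stayed unresolved:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type service struct {
	protocol string
	port     string // numeric ("443") or a named port ("https")
}

type pod struct {
	ip         string
	namedPorts map[string]int // e.g. "https" -> 443
}

// groupByResolvedServices buckets Pods by how the rule's named ports resolve
// for them, so Pods with identical resolutions share one set of conjunction flows.
func groupByResolvedServices(svcs []service, pods []pod) map[string][]string {
	groups := make(map[string][]string)
	for _, p := range pods {
		var parts []string
		for _, s := range svcs {
			port := s.port
			if n, ok := p.namedPorts[s.port]; ok {
				port = strconv.Itoa(n)
			}
			parts = append(parts, s.protocol+"/"+port)
		}
		key := strings.Join(parts, ";")
		groups[key] = append(groups[key], p.ip)
	}
	return groups
}

func main() {
	svcs := []service{{"TCP", "https"}}
	pods := []pod{
		{"10.10.1.16", map[string]int{"https": 443}}, // resolves https -> 443
		{"10.10.1.18", nil},                          // cannot resolve "https"
	}
	groups := groupByResolvedServices(svcs, pods)

	// ipBlocks can never resolve a named port, so they join only the group
	// whose named ports stayed unresolved -- not every group.
	unresolvedKey := "TCP/https"
	groups[unresolvedKey] = append(groups[unresolvedKey], "10.10.1.17/32")

	for key, addrs := range groups {
		fmt.Println(key, "->", addrs)
	}
}
```

Appending the ipBlock to every group, as the PR's fix does, would instead produce duplicate conjunction clauses across groups.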

suwang48404 (Contributor, Author) commented Dec 9, 2020

Hi @tnqn, just to clarify: the bug is valid, but the fix is not correct because conjunction flows need to share the same matching criteria. We will need to come up with a better fix, right?

In that case, perhaps I should close this PR and file an issue instead? Is this hard to fix, and timeline-wise, how soon can it be done?

tnqn (Member)

> Hi @tnqn, just to clarify: the bug is valid, but the fix is not correct because conjunction flows need to share the same matching criteria. We will need to come up with a better fix, right?

Yes, the bug is valid but the fix will cause another issue.

> In that case, perhaps I should close this PR and file an issue instead? Is this hard to fix, and timeline-wise, how soon can it be done?

I'm fine with you improving the fix in this PR, or filing an issue and assigning it to me. I think it's not hard to fix, but we do need to improve the unit tests to cover this scenario. I think we can get the fix into 0.12, which is supposed to ship next week; does that timeline work for you?

suwang48404 (Contributor, Author)

@tnqn 0.12 next week sounds great. Thanks for helping out!

Dyanngg (Contributor) left a comment

Hi @suwang48404, could you update the PR description with how to reproduce this bug?
I tested with the ToT build using the following spec:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-np
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: client
    - ipBlock:
        cidr: "10.10.1.17/32"
    ports:
    - protocol: TCP
      port: https

where the client Pod has an IP address of 10.10.1.16 and port https maps to 443.
I applied this spec, deleted the client Pod to force a rule update event, and the result of rule update looks correct to me.

Before client Pod deletion:

kubectl exec -it antrea-agent-zrvnl -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=50
 cookie=0x4000000000000, duration=1152.720s, table=50, n_packets=4530, n_bytes=1086677, priority=210,ct_state=-new+est,ip actions=resubmit(,61)
 cookie=0x4050000000000, duration=4.992s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_src=10.10.1.13 actions=conjunction(1,1/3),conjunction(2,1/3)
 cookie=0x4050000000000, duration=4.992s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_src=10.10.1.14 actions=conjunction(1,1/3),conjunction(2,1/3)
 cookie=0x4050000000000, duration=4.992s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_dst=10.10.1.16 actions=conjunction(2,2/3)
 cookie=0x4050000000000, duration=4.983s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_dst=10.10.1.17 actions=conjunction(1,2/3)
 cookie=0x4050000000000, duration=4.992s, table=50, n_packets=0, n_bytes=0, priority=200,tcp,tp_dst=443 actions=conjunction(2,3/3)
 cookie=0x4050000000000, duration=4.993s, table=50, n_packets=0, n_bytes=0, priority=190,conj_id=2,ip actions=load:0x2->NXM_NX_REG5[],ct(commit,table=61,zone=65520,exec(load:0x2->NXM_NX_CT_LABEL[32..63]))
 cookie=0x4050000000000, duration=4.985s, table=50, n_packets=0, n_bytes=0, priority=190,conj_id=1,ip actions=load:0x1->NXM_NX_REG5[],ct(commit,table=61,zone=65520,exec(load:0x1->NXM_NX_CT_LABEL[32..63]))
 cookie=0x4000000000000, duration=1152.731s, table=50, n_packets=231, n_bytes=17094, priority=0 actions=resubmit(,60)

Something I found alarming is that the policy seems to have been interpreted as 'allow egress traffic to client Pods on port https, and allow egress traffic to 10.10.1.17/32 on any port', which I'm not sure is correct/by design. cc @abhiraut @tnqn @wenyingd
That being said, after an event which triggers rule update (client Pod deleted), the flows get updated correctly:

kubectl exec -it antrea-agent-zrvnl -n kube-system -c antrea-ovs -- ovs-ofctl dump-flows br-int table=50
 cookie=0x4000000000000, duration=1516.208s, table=50, n_packets=5963, n_bytes=1430937, priority=210,ct_state=-new+est,ip actions=resubmit(,61)
 cookie=0x4050000000000, duration=368.480s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_src=10.10.1.14 actions=conjunction(1,1/3)
 cookie=0x4050000000000, duration=368.480s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_src=10.10.1.13 actions=conjunction(1,1/3)
 cookie=0x4050000000000, duration=368.471s, table=50, n_packets=0, n_bytes=0, priority=200,ip,nw_dst=10.10.1.17 actions=conjunction(1,2/3)
 cookie=0x4050000000000, duration=368.473s, table=50, n_packets=0, n_bytes=0, priority=190,conj_id=1,ip actions=load:0x1->NXM_NX_REG5[],ct(commit,table=61,zone=65520,exec(load:0x1->NXM_NX_CT_LABEL[32..63]))
 cookie=0x4000000000000, duration=1516.219s, table=50, n_packets=303, n_bytes=22422, priority=0 actions=resubmit(,60)

and the isolation rules in table 60 also checked out. I also experimented with an egress rule using a regular port as opposed to a named port, and rule updates also seemed to work as expected.

suwang48404 (Contributor, Author)

> Hi @suwang48404, could you update the PR description with how to reproduce this bug? I tested with the ToT build using the following spec: [...] I applied this spec, deleted the client Pod to force a rule update event, and the result of rule update looks correct to me. [...] rule updates also seemed to work as expected.

@wenyingd @Dyanngg In my setup the initial rule creation failed, so the egress update path takes the if !exists {} branch; does your test cover that path too? It seems odd that it would work, because as you can see the ipBlock is not added, so how could it work?

Dyanngg (Contributor) commented Dec 7, 2020

> @wenyingd @Dyanngg In my setup the initial rule creation failed, so the egress update path takes the if !exists {} branch; does your test cover that path too? [...]

See #1625 (comment). I think the initial rule creation failure is something we should look into instead.

suwang48404 (Contributor, Author) commented Dec 8, 2020

> > @wenyingd @Dyanngg In my setup the initial rule creation failed [...] so how could it work?
>
> See #1625 (comment). I think the initial rule creation failure is something we should look into instead.

I repurposed the networkPolicyController package in another product; it is not in Antrea per se. In our use case, the underlying OVS going up and down is within the normal scope of operations.

But regardless of where the use case may be, the NetworkPolicyController logic today seems intended to handle rule creation failures; I think this fix just makes it more solid, no? Or do you have a different perspective on why this fix should not go in?

Dyanngg (Contributor) commented Dec 8, 2020

> I repurposed the networkPolicyController package in another product; it is not in Antrea per se. [...] I think this fix just makes it more solid, no? Or do you have a different perspective on why this fix should not go in?

I hear your point that this makes it more solid. On the other hand, given the premise that rule add is handled correctly (if an exception occurred in Antrea, reconcile should have been retried), processing ipBlocks in the update function might be redundant and carry a performance cost. I'll let @tnqn and @wenyingd comment here, since I did not implement this function in the first place.

suwang48404 (Contributor, Author) commented Dec 8, 2020

> I hear your point that this makes it more solid. [...] I'll let @tnqn and @wenyingd comment here, since I did not implement this function in the first place.

To summarize: if addRule fails for any reason (maybe OVS is not ready, maybe something else), the retry logic will call updateRule, and if the rule is egress and contains ipBlocks, the destination ipBlocks will be ignored. That is the bug this PR is trying to address, and it applies to the Antrea agent in a CNI deployment as well as to other use cases.

suwang48404 (Contributor, Author)

@jianjuns @abhiraut @tnqn can you help review this?

To summarize: if addRule fails for any reason (maybe OVS is not ready, maybe something else), the retry logic will call updateRule, and if the rule is egress and contains ipBlocks, the destination ipBlocks will be ignored. That is the bug this PR is trying to address, and it applies to the Antrea agent in a CNI deployment as well as to other use cases.

Dyanngg (Contributor) commented Dec 8, 2020

> @jianjuns @abhiraut @tnqn can you help review this?
>
> To summarize: if addRule fails for any reason [...] the destination ipBlocks will be ignored. [...]

I don't think the retry logic will call update for the rule. If you look at L402:

// Record ofID only if its Openflow is installed successfully.
lastRealized.ofIDs[svcKey] = ofRule.FlowID

When installation fails, lastRealized should not contain the rule ID, so the next time the rule is synced, !exist will evaluate to true.

suwang48404 (Contributor, Author)

> I don't think the retry logic will call update for the rule. [...] When installation fails, lastRealized should not contain the rule ID, so the next time the rule is synced, !exist will evaluate to true.

I may not have been clear about the distinction between an ofFlow rule and a policy rule.
Let us walk through the steps (a minimal sketch of this failure path follows below):

  1. A new egress policy rule with a destination ipBlock is added via add() (L391).
  2. The policy rule is added to the cache via computeOFRulesForAdd (L413).
  3. If OVS is not ready, add() returns an error at L400 without adding the ofFlow rule to the cache.
  4. In the next reconcile cycle, update() (L256) is called, because the policy rule was added to the cache in step 2 and therefore exists.
  5. A new ofFlow rule is created in update() (L604) because the ofFlow rule was never installed and is therefore not in the cache, and the destination ipBlock is ignored. Hence the bug.

Hope this helps, and let me know if I missed anything.

Thx, Su
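A minimal, runnable sketch of the five steps above; the names (rule, reconciler, ovsReady) and structure are assumptions for illustration, not Antrea's exact code:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type rule struct{ id string }

type reconciler struct {
	lastRealizeds sync.Map // step 2: the policy rule is cached here even if install fails
	ovsReady      bool
}

func (r *reconciler) add(ru *rule) error {
	r.lastRealizeds.Store(ru.id, struct{}{}) // step 2
	if !r.ovsReady {
		return errors.New("ovs not ready") // step 3: flows never installed
	}
	return nil
}

func (r *reconciler) update(ru *rule) error {
	// Step 5: flows are rebuilt here; if this path omits egress ipBlocks,
	// the reinstalled rule is incomplete -- the bug in question.
	return nil
}

func (r *reconciler) reconcile(ru *rule) error {
	if _, exists := r.lastRealizeds.Load(ru.id); !exists {
		return r.add(ru) // step 1
	}
	return r.update(ru) // step 4: the retry lands here because the rule is cached
}

func main() {
	r := &reconciler{}
	ru := &rule{id: "egress-with-ipblock"}
	fmt.Println(r.reconcile(ru)) // add fails: "ovs not ready"
	r.ovsReady = true
	fmt.Println(r.reconcile(ru)) // retry takes update(), not add()
}
```

The key point is step 2: caching the policy rule before flow installation succeeds steers the retry into update().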

Dyanngg (Contributor) commented Dec 8, 2020

> I may not have been clear about the distinction between an ofFlow rule and a policy rule. Let us walk through the steps: [...] Hence the bug.
>
> Hope this helps, and let me know if I missed anything.
>
> Thx, Su

Thanks, Su, for walking through this. So yes, this is an issue, though one that is hard to write a test case for. I'm okay with this change, except that I believe to fix it more properly we need to address the TODO at L411:

// TODO: Handle the case that the following processing fails or partially succeeds.
r.lastRealizeds.Store(rule.ID, lastRealized)  

That involves removing rule.ID from lastRealizeds if flow installation fails at any point. @tnqn

suwang48404 (Contributor, Author)

> Thanks, Su, for walking through this. [...] I'm okay with this change, except that I believe to fix it more properly we need to address the TODO at L411. [...] That involves removing rule.ID from lastRealizeds if flow installation fails at any point. @tnqn

I will let you decide what to do with L411. I am simply fixing a bug: for as long as a rule's flows are created in update, those flows should include the ipBlocks. It seems like an obvious omission, irrespective of L411.

suwang48404 (Contributor, Author)

> I will let you decide what to do with L411. I am simply fixing a bug: for as long as a rule's flows are created in update, those flows should include the ipBlocks. It seems like an obvious omission, irrespective of L411.

BTW, I appreciate a detailed review process and Q&A, which help produce good-quality code. But let us try not to send people on a fishing expedition; that just wastes time we don't have.

antoninbas (Contributor)

> BTW, I appreciate a detailed review process and Q&A, which help produce good-quality code. But let us try not to send people on a fishing expedition; that just wastes time we don't have.

Apologies, that's not our intention here, but quality and robustness are our priority, especially when it comes to a feature like NetworkPolicy. And it's important that the folks who maintain this code, such as @Dyanngg, have a good understanding of the changes.

Thanks for sharing the detailed sequence. Isn't it pretty straightforward to test this case by returning an error from ofClient.InstallPolicyRuleFlows, given that this interface is mocked? A transient failure when adding OVS flows does seem like something we should accommodate.
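A rough sketch of the test being suggested, assuming gomock-generated mocks; the trimmed OFClient interface, the simplified InstallPolicyRuleFlows signature, and the generated NewMockOFClient constructor are all assumptions for illustration, not verified against the Antrea codebase:

```go
package networkpolicy_test

import (
	"errors"
	"testing"

	"github.com/golang/mock/gomock"
)

// OFClient is a trimmed, assumed version of the interface under discussion;
// Antrea's real method takes a richer policy-rule argument.
type OFClient interface {
	InstallPolicyRuleFlows(to []string) error
}

// NewMockOFClient would be produced by mockgen from the interface above
// (hypothetical here, not checked in).

func TestRetryReinstallsEgressIPBlocks(t *testing.T) {
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()
	mock := NewMockOFClient(ctrl)

	to := []string{"10.10.1.16", "10.10.1.17/32"} // Pod IP + egress ipBlock

	// First installation fails, as if OVS were not ready yet.
	mock.EXPECT().InstallPolicyRuleFlows(to).Return(errors.New("ovs not ready"))
	// The retry must pass the complete address list, ipBlock included;
	// gomock fails the test if the second call drops "10.10.1.17/32".
	mock.EXPECT().InstallPolicyRuleFlows(to).Return(nil)

	var err error
	for i := 0; i < 2; i++ { // stand-in for the reconciler's retry loop
		err = mock.InstallPolicyRuleFlows(to)
	}
	if err != nil {
		t.Fatalf("retry should succeed: %v", err)
	}
}
```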

Dyanngg (Contributor) left a comment

I'm good with merging this, and I will update the logic for lastRealized as well as add a test case for this particular scenario.

suwang48404 (Contributor, Author)

> I'm good with merging this, and I will update the logic for lastRealized as well as add a test case for this particular scenario.

@Dyanngg @antoninbas thanks for helping out on testing and further improvements.

suwang48404 (Contributor, Author)

@Dyanngg does your proposed change address the issue fixed by this PR? I am assuming your change would remove the need for update() to create new ofFlow rules? If yes, and you plan to work on it promptly, maybe we don't need this PR merged and can instead wait for your change along with companion test(s)? Let me know.

tnqn (Member) left a comment

Thanks @suwang48404 for raising this bug. I left a comment at https://github.com/vmware-tanzu/antrea/pull/1625/files#r538961037; I suspect the fix would cause duplicate conjunction clauses, so I am marking this as "request changes" to hold the merge for now.

suwang48404 (Contributor, Author)

@antoninbas @Dyanngg @tnqn

As the fix presented in this PR is incomplete, and @tnqn has graciously agreed to help via issue #1635, I am closing this PR. Thank you all for helping out.

suwang48404 closed this Dec 9, 2020