Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance error handling of policy reconciler #1667

Merged
merged 1 commit into from
Dec 21, 2020

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Dec 16, 2020

The reconciler might fail to install flows due to transient OVS error. It should add the IPBlocks to the PolicyRule when it retries, in which case "update" method is called. This patch fixes it and adds an unit test for it.

Fixes #1635

@tnqn
Copy link
Member Author

tnqn commented Dec 16, 2020

/test-all

@codecov-io
Copy link

codecov-io commented Dec 16, 2020

Codecov Report

Merging #1667 (6fa3bd0) into master (9d3d10b) will increase coverage by 1.10%.
The diff coverage is 59.43%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1667      +/-   ##
==========================================
+ Coverage   63.31%   64.42%   +1.10%     
==========================================
  Files         170      181      +11     
  Lines       14250    15494    +1244     
==========================================
+ Hits         9023     9982     +959     
- Misses       4292     4484     +192     
- Partials      935     1028      +93     
Flag Coverage Δ
e2e-tests 47.52% <38.95%> (?)
kind-e2e-tests 53.10% <51.75%> (-2.29%) ⬇️
unit-tests 40.82% <27.20%> (-0.46%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
cmd/antrea-agent/agent.go 0.00% <0.00%> (ø)
.../agent/apiserver/handlers/networkpolicy/handler.go 58.33% <ø> (ø)
...gent/controller/noderoute/node_route_controller.go 60.98% <0.00%> (+14.51%) ⬆️
pkg/agent/stats/collector.go 97.72% <ø> (ø)
pkg/antctl/antctl.go 100.00% <ø> (ø)
pkg/antctl/transform/addressgroup/transform.go 0.00% <0.00%> (ø)
pkg/antctl/transform/controllerinfo/transform.go 0.00% <ø> (ø)
pkg/antctl/transform/networkpolicy/transform.go 0.00% <0.00%> (ø)
pkg/antctl/transform/version/transform.go 44.82% <ø> (ø)
pkg/controller/networkpolicy/tier.go 90.00% <ø> (ø)
... and 105 more

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I need a clarification regarding the necessity of forgetRuleImmediately. I understand why it could be an "optimization" to release the rule ID immediately when there is an error, but I don't understand why it is necessary for code correctness.

pkg/agent/controller/networkpolicy/reconciler.go Outdated Show resolved Hide resolved
pkg/agent/controller/networkpolicy/reconciler_test.go Outdated Show resolved Hide resolved
@suwang48404
Copy link
Contributor

LGTM.

I suspect @antoninbas and @Dyanngg may be better reviewer than I here.

@tnqn
Copy link
Member Author

tnqn commented Dec 17, 2020

@antoninbas @suwang48404 I just realized the BatchInstallPolicyRuleFlows method is not atomical because it doesn't roll back some conjunction cache in agent upon failure and confirmed with @Dyanngg. So using it would introduce another bug if there is transient error in OVS. I don't think of how to make the method atomical with simple change. I have to use another approach to track each openflow rule's realization status... Let me know if you have good idea for this problem.

@tnqn tnqn force-pushed the policy-update branch 2 times, most recently from 45695a6 to fa9bb85 Compare December 17, 2020 10:41
@tnqn
Copy link
Member Author

tnqn commented Dec 17, 2020

/test-all

@tnqn tnqn force-pushed the policy-update branch 2 times, most recently from 34743cf to a12ab7b Compare December 17, 2020 12:44
@tnqn
Copy link
Member Author

tnqn commented Dec 17, 2020

/test-all

@antoninbas
Copy link
Contributor

@antoninbas @suwang48404 I just realized the BatchInstallPolicyRuleFlows method is not atomical because it doesn't roll back some conjunction cache in agent upon failure and confirmed with @Dyanngg. So using it would introduce another bug if there is transient error in OVS. I don't think of how to make the method atomical with simple change. I have to use another approach to track each openflow rule's realization status... Let me know if you have good idea for this problem.

I would think that addressing this TODO https://github.com/vmware-tanzu/antrea/blob/8b7430cc193bd657ca00277a417406ca9558df9b/pkg/agent/openflow/network_policy.go#L848 would be the right thing to do, but I am not super familiar with that code TBH.

It seems that you went with a different approach. Does it cover all possible error cases when adding flows, are just the one Su ran into?

@Dyanngg
Copy link
Contributor

Dyanngg commented Dec 17, 2020

It seems that you went with a different approach. Does it cover all possible error cases when adding flows, are just the one Su ran into?

I would say what Su discovered is the only possible error case I can think of as of now that the update function cannot handle. The update function has does all the computation as add except for the egress ipBlock. This was also the approach I had in my if I were to fix this issue.

@tnqn
Copy link
Member Author

tnqn commented Dec 18, 2020

@antoninbas I agree that making the BatchInstallPolicyRuleFlows method atomical would be the best. However I am not super familiar with that code either. It would take some time for me to figure out how to achieve it, and guess it's not a simple change. @Dyanngg might comment.

This patch fixes the case Su reported, it should be the only case could happen if InstallPolicyRuleFlows is atomical, unfortunally even that method is not at the moment. Besides, the update method calls multiple openflow methods, which could cause potential issues.
We may need to revise the openflow interfaces first, then we can switch to the atomical interfaces in reconciler, that will simplify the code in reconciler a lot I guess.

@antoninbas
Copy link
Contributor

@tnqn @Dyanngg thanks for the clarifications

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, otherwise LGTM

pkg/agent/controller/networkpolicy/reconciler_test.go Outdated Show resolved Hide resolved
pkg/agent/controller/networkpolicy/reconciler_test.go Outdated Show resolved Hide resolved
Copy link
Member Author

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoninbas addressed your comments, thanks for the review.

pkg/agent/controller/networkpolicy/reconciler_test.go Outdated Show resolved Hide resolved
pkg/agent/controller/networkpolicy/reconciler_test.go Outdated Show resolved Hide resolved
@tnqn
Copy link
Member Author

tnqn commented Dec 18, 2020

/test-all

antoninbas
antoninbas previously approved these changes Dec 18, 2020
pkg/agent/controller/networkpolicy/reconciler_test.go Outdated Show resolved Hide resolved
The reconciler might fail to install flows due to transient OVS error.
It should add the IPBlocks to the PolicyRule when it retries, in which
case "update" method is called. This patch fixes it and adds an unit
test for it.
@tnqn
Copy link
Member Author

tnqn commented Dec 21, 2020

/test-all

@antoninbas antoninbas merged commit f86c262 into antrea-io:master Dec 21, 2020
zyiou pushed a commit to zyiou/antrea that referenced this pull request Dec 21, 2020
The reconciler might fail to install flows due to transient OVS error.
It should add the IPBlocks to the PolicyRule when it retries, in which
case "update" method is called. This patch fixes it and adds an unit
test for it.
zyiou pushed a commit to zyiou/antrea that referenced this pull request Dec 21, 2020
The reconciler might fail to install flows due to transient OVS error.
It should add the IPBlocks to the PolicyRule when it retries, in which
case "update" method is called. This patch fixes it and adds an unit
test for it.
antoninbas pushed a commit that referenced this pull request Dec 23, 2020
The reconciler might fail to install flows due to transient OVS error.
It should add the IPBlocks to the PolicyRule when it retries, in which
case "update" method is called. This patch fixes it and adds an unit
test for it.
antoninbas pushed a commit to antoninbas/antrea that referenced this pull request Feb 11, 2021
The reconciler might fail to install flows due to transient OVS error.
It should add the IPBlocks to the PolicyRule when it retries, in which
case "update" method is called. This patch fixes it and adds an unit
test for it.
antoninbas pushed a commit that referenced this pull request Feb 11, 2021
The reconciler might fail to install flows due to transient OVS error.
It should add the IPBlocks to the PolicyRule when it retries, in which
case "update" method is called. This patch fixes it and adds an unit
test for it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Agent NetworkPolicyController reconciler to update egress rule ignores IP block
6 participants