-
Notifications
You must be signed in to change notification settings - Fork 387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support EgressIP assigning and failover in antrea-agent #2186
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2186 +/- ##
==========================================
+ Coverage 61.91% 64.38% +2.46%
==========================================
Files 281 283 +2
Lines 21771 22121 +350
==========================================
+ Hits 13480 14243 +763
+ Misses 6871 6409 -462
- Partials 1420 1469 +49
Flags with carried forward coverage won't be shown. Click here to find out more.
|
8536fd3
to
dbe3f3d
Compare
1f2aa4b
to
1e7621e
Compare
9f10586
to
386e98f
Compare
c39a1f3
to
7d59d3a
Compare
pkg/agent/memberlist/cluster_test.go
Outdated
} | ||
} | ||
assert.Equal(t, 1, len(actualNodes), "Selected Node num for Egress not match") | ||
assert.Equal(t, []string{tCase.expectedNode}, actualNodes, "Select Node for Egress not match") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test itself is O(n^2), I understand it needs to construct a fakeCluster for each Node to check whether a Node can select the Egress in current implementation. Why don't just change ShouldSelectNode
return the desired Node to make the test much easier. The client just needs to check whether the desired Node is itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cmd/antrea-agent/config.go
Outdated
@@ -98,6 +98,11 @@ type AgentConfig struct { | |||
// APIPort is the port for the antrea-agent APIServer to serve on. | |||
// Defaults to 10350. | |||
APIPort int `yaml:"apiPort,omitempty"` | |||
|
|||
// ClusterMembershipPort is the server port used by the antrea-agent to run a gossip-based cluster membership protocol. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You missed the comment: "Currently it's used only when the Egress feature is enabled."
pkg/apis/crd/v1alpha2/types.go
Outdated
// EgressStatus represents the current status of an Egress. | ||
type EgressStatus struct { | ||
// The name of the Node that holds the Egress IP. | ||
NodeName string `json:"nodeName"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think yes
5f1c4ec
to
dbc7e0d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the commit message which has out-of-date description about when the node will join and leave the cluster.
cmd/antrea-agent/config.go
Outdated
@@ -98,6 +98,11 @@ type AgentConfig struct { | |||
// APIPort is the port for the antrea-agent APIServer to serve on. | |||
// Defaults to 10350. | |||
APIPort int `yaml:"apiPort,omitempty"` | |||
|
|||
// ClusterMembershipPort is the server port used by the antrea-agent to run a gossip-based cluster membership protocol. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't mix the usage and the default value in same line.
ClusterMembershipPort is the server port used by the antrea-agent to run a gossip-based cluster membership protocol. Currently it's used only when the Egress feature is enabled.
Defaults to 10351.
pkg/agent/memberlist/cluster_test.go
Outdated
{ | ||
name: fmt.Sprintf("Recover to %d nodes", 10), | ||
nodes: nodes[:10], | ||
consistentHash: consistenthash.New(defaultVirtualNodeReplicas, nil), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by the test:
- what does expectEgressSeqSum mean?
- Does the 1st subtest have any difference from the 4th one given they have exactly same input?
- What is it verifying given there is no expected value and it doesn't call any function/method in source file?
I feel it's likely just testing consistentHash
directly which should be already covered by TestCluster_ShouldSelectEgress. If so, maybe just remove this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expectEgressSeqSum is the sum of all selected Egress sequence numbers. It ensures each Egress has been selected once. That is, Egresses distributed in different Nodes, no duplicate or missing items. Maybe I should rephrase nodeSelectedForEgress and use ShouldSelectEgress in this test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If verifying duplicate or missing items is the purpose, maybe you can just make a little change toTestCluster_ShouldSelectEgress
;
genNodes := func(n int) []string {
nodes := make([]string, n)
for i := 0; i < n; i++ {
nodes[i] = fmt.Sprintf("node-%d", i)
}
return nodes
}
fakeEIPName := "fakeExternalIPPool"
fakeEgress := &crdv1a2.Egress{
ObjectMeta: metav1.ObjectMeta{Name: "fakeEgress"},
Spec: crdv1a2.EgressSpec{ExternalIPPool: fakeEIPName, EgressIP: tCase.egressIP},
}
consistentHashMap := newNodeConsistentHashMap()
consistentHashMap.Add(genNodes(tCase.nodeNum)...)
fakeCluster := &Cluster{
consistentHashMap: map[string]*consistenthash.Map{fakeEIPName: consistentHashMap},
}
for i := 0; i < n; i++ {
node := fmt.Sprintf("node-%d", i)
fakeCluster.nodeName = node
selected, err := fakeCluster.ShouldSelectEgress(fakeEgress)
assert.NoError(t, err)
assert.Equal(t, node == tCase.expectedNode, selected, "Selected Node for Egress not match")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
// BenchmarkCluster_ShouldSelect/select_node_from_100_alive_nodes-16 13036746 103 ns/op | ||
// BenchmarkCluster_ShouldSelect/select_node_from_10_alive_nodes | ||
// BenchmarkCluster_ShouldSelect/select_node_from_10_alive_nodes-16 14923483 77.5 ns/op | ||
func BenchmarkCluster_ShouldSelect(b *testing.B) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark doesn't really measure a method in source code, it calls one step in ShouldSelectEgress
but also counts the time spent on creating a consistenthashMap. Can it focus one method i.e. ShouldSelectEgress? I don't think we need many input matrix either (as long as it's efficient in 1000 nodes, no one will worry about 10 nodes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Maybe name the patch and PR "Support EgressIP assigning and failover in antrea-agent" |
@jianjuns @antoninbas do you have more comments after the existing ones are resolved? If no, I will merge it after all comments are resolved so I can rebase #2345 and #2342 on it. |
adfca93
to
585a3ea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall
Integration test failed. |
@tnqn no further comments from me |
Fix integration test case.
|
1.Add memberlist cluster in antrea-agent A cluster will be created in the background when the Egress feature is turned on. And the local Node will join all the other K8s Nodes in a memberlist cluster. Each Node in the cluster holds the same consistent hash ring for each ExternalIPPool, in order to distribute egress IPs equally among the selected Nodes (which are part of the memberlist cluster). When a Node leaves the cluster, its IPs are redistributed. When a Node joins the cluster, it's added to the hash ring and a small fraction of IPs are re-assigned to that Node. 2.Assign EgressIP to owner Node and update Egress status Assign a owner node for Egress which with a valid externalIPPool. When Egress has been assigned a owner node and egressIP has assigned, the Egress status will updated, Egress status is the name of the Node that holds the Egress IP. Signed-off-by: wenqiq <wenqiq@vmware.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Some nits on comments.
pkg/agent/memberlist/cluster.go
Outdated
return nodes | ||
} | ||
|
||
// ShouldSelectEgress the local Node in the cluster holds the same consistent hash ring for each ExternalIPPool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably you missed a verb after ShouldSelectEgress
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-// ShouldSelectEgress the local Node in the cluster holds the same consistent hash ring for each ExternalIPPool,
-// when selecting an owner Node for an Egress, the local Node labels must match an ExternalIPPool nodeSelectors.
+// ShouldSelectEgress returns true if the local Node selected as the owner Node of the Egress,
+// the local Node in the cluster holds the same consistent hash ring for each ExternalIPPool,
pkg/agent/memberlist/cluster.go
Outdated
} | ||
|
||
// ShouldSelectEgress the local Node in the cluster holds the same consistent hash ring for each ExternalIPPool, | ||
// when selecting an owner Node for an Egress, the local Node labels must match an ExternalIPPool nodeSelectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to rephrase this line too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
pkg/agent/memberlist/cluster.go
Outdated
switch event { | ||
case memberlist.NodeJoin, memberlist.NodeLeave: | ||
// When a Node joins cluster, all matched ExternalIPPools consistentHash should be updated. | ||
// when a Node leave cluster, the Node may failed or deleted, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a Node leaves
may have failed or have been deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- // When a Node joins cluster, all matched ExternalIPPools consistentHash should be updated.
- // when a Node leave cluster, the Node may failed or deleted,
- // if Node has been deleted, affected ExternalIPPool should enqueue, deleteNode handler has processed
- // if the Node failed, ExternalIPPools consistentHash maybe changed, affected ExternalIPPool should enqueue.
+ // When a Node joins cluster, all matched ExternalIPPools consistentHash should be updated;
+ // when a Node leaves cluster, the Node may have failed or have been deleted,
+ // if the Node has been deleted, affected ExternalIPPool should be enqueued, and deleteNode handler has been executed,
+ // if the Node has failed, ExternalIPPools consistentHash maybe changed, and affected ExternalIPPool should be enqueued.
pkg/agent/memberlist/cluster.go
Outdated
node, event := nodeEvent.Node, nodeEvent.Event | ||
switch event { | ||
case memberlist.NodeJoin, memberlist.NodeLeave: | ||
// When a Node joins cluster, all matched ExternalIPPools consistentHash should be updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do not like to capitalize "when" in the next sentence, probably change "." to ";"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
pkg/agent/memberlist/cluster.go
Outdated
case memberlist.NodeJoin, memberlist.NodeLeave: | ||
// When a Node joins cluster, all matched ExternalIPPools consistentHash should be updated. | ||
// when a Node leave cluster, the Node may failed or deleted, | ||
// if Node has been deleted, affected ExternalIPPool should enqueue, deleteNode handler has processed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be enqueued
and deleteNode handler has been executed (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
pkg/agent/memberlist/cluster.go
Outdated
// When a Node joins cluster, all matched ExternalIPPools consistentHash should be updated. | ||
// when a Node leave cluster, the Node may failed or deleted, | ||
// if Node has been deleted, affected ExternalIPPool should enqueue, deleteNode handler has processed | ||
// if the Node failed, ExternalIPPools consistentHash maybe changed, affected ExternalIPPool should enqueue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and affected...
should be enqueued.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
pkg/agent/memberlist/cluster.go
Outdated
coreNode, err := c.nodeLister.Get(node.Name) | ||
if err != nil { | ||
if errors.IsNotFound(err) { | ||
// Node has been deleted, deleteNode handler has processed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and deleteNode handler has been executed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- // Node has been deleted, deleteNode handler has processed.
+ // Node has been deleted, and deleteNode handler has been executed.
pkg/agent/memberlist/cluster.go
Outdated
return err | ||
} | ||
|
||
// updateConsistentHash refresh the consistentHashMap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refreshes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- // updateConsistentHash refresh the consistentHashMap.
+ // updateConsistentHash refreshes the consistentHashMap.
pkg/agent/memberlist/cluster.go
Outdated
return err | ||
} | ||
aliveNodes := c.aliveNodes() | ||
// Node alive and Node labels matches with ExternalIPPool nodeSelector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matches with -> match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- // Node alive and Node labels matches with ExternalIPPool nodeSelector.
+ // Node alive and Node labels match ExternalIPPool nodeSelector.
go c.cluster.Run(stopCh) | ||
|
||
// The Egress has been deleted but assigned IP has not been deleted, | ||
// agent should delete those IPs when it starts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so agent..
Or change "," to "." in the previous line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- // The Egress has been deleted but assigned IP has not been deleted,
- // agent should delete those IPs when it starts.
+ // The Egress has been deleted but assigned IP has not been deleted, so agent should delete those IPs when it starts.
Signed-off-by: wenqiq <wenqiq@vmware.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-all |
It seems all comments have been addressed. I will merge this to unblock other PRs. If there are new comments, @wenqiq or I can address them in new PRs. |
Thanks for your professional and careful reviews, very helpful and inspiring for me, also thank @antoninbas and @jianjuns, no further comments from me at the moment. |
Support EgressIP assigning and failover in antrea-agent
A cluster will be created in the background when the Egress feature is turned on.
And the local Node will join all the other K8s Nodes in a memberlist cluster.
Each Node in the cluster holds the same consistent hash ring for each ExternalIPPool,
in order to distribute egress IPs equally among the selected Nodes (which are part of the
memberlist cluster). When a Node leaves the cluster, its IPs are redistributed.
When a Node joins the cluster, it's added to the hash ring and a small fraction of IPs are re-assigned to that Node.
For #2128