-
Notifications
You must be signed in to change notification settings - Fork 386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPPool counters #3072
IPPool counters #3072
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3072 +/- ##
==========================================
+ Coverage 64.40% 67.28% +2.88%
==========================================
Files 293 297 +4
Lines 43658 45365 +1707
==========================================
+ Hits 28116 30526 +2410
+ Misses 13253 12461 -792
- Partials 2289 2378 +89
|
Usage IPPoolUsage `json:"usage,omitempty"` | ||
} | ||
|
||
type IPPoolUsage struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we can combine this type with ExternalIPPoolUsage
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this is doable of course. IDK enough the external pool functionality to conclude whether these two feature will have the same counters forever.
@tnqn do you have an opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good suggestion. They should be same from pool's perspective, I can't think of any difference between the two cases for now.
I'm fine with making ExternalIPPoolStatus
use IPPoolUsage
if it doesn't impact API compatibility.
e42b216
to
02fffe9
Compare
Usage IPPoolUsage `json:"usage,omitempty"` | ||
} | ||
|
||
type IPPoolUsage struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good suggestion. They should be same from pool's perspective, I can't think of any difference between the two cases for now.
I'm fine with making ExternalIPPoolStatus
use IPPoolUsage
if it doesn't impact API compatibility.
pkg/ipam/poolallocator/allocator.go
Outdated
@@ -132,6 +132,9 @@ func (a *IPPoolAllocator) appendPoolUsage(ipPool *v1alpha2.IPPool, ip net.IP, st | |||
} | |||
|
|||
newPool.Status.IPAddresses = append(newPool.Status.IPAddresses, usageEntry) | |||
// CRD is not updated yet, therefore the newly allocated IP will not be reflected by Used(). Therefore +1 | |||
newPool.Status.Usage.Used = a.Used() + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use len(newPool.Status.IPAddresses)
?
pkg/ipam/poolallocator/allocator.go
Outdated
@@ -160,6 +163,10 @@ func (a *IPPoolAllocator) removePoolUsage(ipPool *v1alpha2.IPPool, ip net.IP) er | |||
|
|||
newPool.Status.IPAddresses = newList | |||
|
|||
// CRD is not updated yet, therefore the newly deleted allocation will not be reflected by Used(). Therefore -1 | |||
newPool.Status.Usage.Used = a.Used() - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
pkg/ipam/poolallocator/allocator.go
Outdated
@@ -384,3 +391,21 @@ func (a *IPPoolAllocator) HasContainer(containerID string) (bool, error) { | |||
} | |||
return false, nil | |||
} | |||
|
|||
func (a IPPoolAllocator) Used() int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is not needed if we just use len(newPool.Status.IPAddresses)
.
pkg/ipam/poolallocator/allocator.go
Outdated
_, allocators, _ := a.readPoolAndInitIPAllocators() | ||
total := 0 | ||
for _, allocator := range allocators { | ||
total += allocator.Total() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of reading the pool and init the allocators another time, could we use the existing allocators
to get Total
directly, maybe pass it as an argument of appendPoolUsage
and removePoolUsage
? And I think you have added ToTal
method to MultiIPAllocator
, we could call it directly?
8deb058
to
3010632
Compare
pkg/controller/ipam/metrics.go
Outdated
} | ||
|
||
func (c *AntreaIPAMMetricsHandler) Run(stopCh <-chan struct{}) { | ||
c.ipPoolInformer.Informer().AddEventHandlerWithResyncPeriod(cache.ResourceEventHandlerFuncs{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I notice many places in the code use AddEventHandlerWithResyncPeriod
while supplying 0 for resyncPeriod
, but I think you can just use AddEventHandler
without the resyncPeriod
parameter
pkg/controller/ipam/metrics.go
Outdated
func (c *AntreaIPAMMetricsHandler) updateIPPoolCounters(obj interface{}) { | ||
ipPool := obj.(*crdv1a2.IPPool) | ||
|
||
allocator, err := poolallocator.NewIPPoolAllocator(ipPool.Name, c.crdClient) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to change a bit once #3046 is merged (pool informer is already included in this class so not a big deal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybe it's better if I rebase on top of #3046
pkg/controller/ipam/metrics.go
Outdated
total := allocator.Total() | ||
|
||
// Used is gathered from IP allocation status within the CRD - as it can be set by each one of the agents | ||
used := len(ipPool.Status.IPAddresses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I'm not sure whether we want to count reserved addresses? Those would be the addresses reserved for StatefulSet but not actually allocated for Pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense that we count them. These numbers should reflect the usage, whether it's a reserved or allocated for a Pod.
Otherwise total minus used won't reflect the available IPs in the pool, right?
pkg/controller/ipam/metrics.go
Outdated
if pool.Status.Usage.Used == used && pool.Status.Usage.Total == total { | ||
return | ||
} | ||
pool.Status.Usage.Used = used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense to update these counters along with other Status updates (when addresses are actually allocated/released), rather then asynchronously here? This would also save the need for an extra CRD update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this actually, here is the problems though:
If I update the total number in the other status update, it won't catch IPPool creation or updates when an IPRange is added. So for the total it makes sense to use an informer.
As for updating the used count, I could indeed place it where the other updates are done. It will clutter the code though a bit.
Another option is using a mutating webhook, I didn't try it though.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this makes sense to me, as long as we perform the update here only if totals are changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess mutating webhook doesn't work here because "PUT/POST/PATCH requests to the custom resource ignore changes to the status stanza." Spec and Status cannot be updated via same request.
At least for spec update, asynchronous counter update might be the only way.
But does it need to introduce a metrics controller to do this? The name sounds it's for prometheus metrics. We don't expose other counters as prometheus metrics and no real requirements are asking for it, and it's kind of duplicated with the IPPool status. We already have a AntreaIPAMController which basically handles the IPPool Status stuff, I wonder we should just add the logic there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tnqn the change does expose Prometheus metrics as well as the stats via CRD. As this PR has been around for while, I can't recall why it's so. I can remove this.
This change has to be revised, maybe discard the Prometheus code as you suggest.
I think its relevant, yes |
build/yamls/antrea-kind.yml
Outdated
@@ -0,0 +1,3585 @@ | |||
apiVersion: apiextensions.k8s.io/v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this file should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, thanks!
This PR is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days |
b98890e
to
b04c91d
Compare
if err != nil { | ||
klog.Warningf("Failed to initialize allocator for IPPool %s", ipPool.Name) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code shouldn't continue to run if err is not nil, otherwise it would panic
pool.Status.Usage.Used = used | ||
pool.Status.Usage.Total = total | ||
|
||
_, err = c.crdClient.CrdV1alpha2().IPPools().UpdateStatus(context.TODO(), pool, metav1.UpdateOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- event handlers are executed sequentially, doing all jobs here may not scale well, especially when there is an external API call.
- Every update to a pool will call the event handler once, doing all jobs here is not as effecient as doing it in workers, as multiple updates to a pool will only trigger one processing if the time the events happen is close.
A more typical and efficient way is:
Event handlers extract the pool's key and enqueue it; multiple workers get keys from the queue and do the actual job.
if ipPool.Status.Usage.Used == used && ipPool.Status.Usage.Total == total { | ||
return nil | ||
} | ||
ipPool.Status.Usage.Used = used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Objects returned here must be treated as read-only as they are shared among all consumers.
ipPoolToUpdate := ipPool.DeepCopy()
ipPoolToUpdate.Status.Usage.Used = used
...
c.statusQueue.Add(ipPool.Name) | ||
} | ||
|
||
func (c *AntreaIPAMController) updateHandler(oldObj, newObj interface{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not registered to informer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM, but was confused by the new test
// This test verifies correct behavior in case of update conflict. Allocation should be retiried | ||
// Taking into account the latest status | ||
func TestAllocateConflict(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find how the test simulate a case of update conflict and how it is different from TestAllocateNext
: there is one line difference about whether check allocator is nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a merge error, that test was around in prior revisions in allocator_test.go. I'll remove it.
Add IPPools usage counters, and expose them via CRD. Signed-off-by: Kobi Samoray <ksamoray@vmware.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-all |
/test-e2e |
/skip-e2e which failed on a known flaky test ACNPFQDNPolicy |
Add IPPools usage counters, and expose them via CRD.
Signed-off-by: Kobi Samoray ksamoray@vmware.com