Flow exporter memory bloat when unable to connect with downstream collector #3972
Labels: area/flow-visibility/exporter, kind/bug, lifecycle/stale
Describe the bug
When the flow exporter is enabled but fails to connect to the downstream IPFIX collector, the antrea agent runs into various memory issues. In a scale setup with limited worker node memory, iptables can crash due to memory exhaustion, and a host reboot is needed to recover the node.
The issue was introduced as part of the priority queue refactor #2360, mainly in the following aspects:
Stale connection objects cannot be evicted from the buffer (ExpirePriorityQueue.items and ExpirePriorityQueue.KeyToItem) when the export process is dead.
Prior to the PQ refactor, the flowRecords map acted as the buffer between the connection store map and the flow export process. Even if the export process was dead, the store polling goroutine would periodically expire stale flow records.
With the current approach, cleanup only runs as part of sendFlowRecords (-> GetExpiredConns -> UpdateConnAndQueue -> RemoveItemFromMap), which cannot happen if the export process is dead.
Leaked pqItem in ExpirePriorityQueue due to uninitialized LastExportTime
During the connection store's periodic polling, connections determined to be stale are removed from the connection map.
However, conn.LastExportTime is only set during the sendFlowRecords workflow (in UpdateConnAndQueue). When the export process is dead, time.Since(conn.LastExportTime) always returns a very large duration because it is compared against the zero value, causing such a connection to be removed from the conn store map as soon as conntrack removes the flow.
In that case, duplicate entries with the same flow key are added to the PQ rather than the existing item being properly updated (note the difference between pq map size and pq queue size).
Before the PQ change, LastExportTime was set to conn.StartTime soon after the connection was dumped.
FlowExporter expiredConns slice could potentially exceed expected size limit of 128
expiredConns is initialized with an expected size limit of 128 to achieve a bounded time of holding the export lock.
However, during sendFlowRecords, if exp.exportConn encounters an error, the expiredConns slice won't have its offset reset. This could lead to 2 potential issues during the next iteration of sendFlowRecords: maxConnsToExport is not adjusted, and the underlying array is doubled in size to fit more than 128 items.
To Reproduce
Deploy antrea v1.6.0 with FlowExporter enabled and without a flow aggregator.
Expected
Flow Exporter memory should not bloat, and stale connection objects should be cleaned from memory after a certain period.
Actual behavior
See above
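As a rough illustration of the expiredConns growth described above, the sketch below assumes the buffer is a Go slice created with a capacity of 128 and normally reset by re-slicing to length zero. The fill helper and the int element type are hypothetical; the real code stores connection objects and has its own collection logic.

```go
package main

import "fmt"

// maxConnsToExport mirrors the size limit of 128 mentioned in this issue.
const maxConnsToExport = 128

// fill is a hypothetical helper that appends n placeholder items,
// standing in for collecting expired connections into the buffer.
func fill(buf []int, n int) []int {
	for i := 0; i < n; i++ {
		buf = append(buf, i)
	}
	return buf
}

func main() {
	expired := make([]int, 0, maxConnsToExport)

	// First iteration: the buffer fills up, then the export fails and
	// the slice is not reset.
	expired = fill(expired, maxConnsToExport)
	fmt.Println(len(expired), cap(expired)) // 128 128

	// Next iteration: appending to the full slice forces a reallocation,
	// so the backing array grows past the intended limit.
	expired = fill(expired, 1)
	fmt.Println(len(expired), cap(expired) > maxConnsToExport) // 129 true

	// Resetting the length on every path, including the error path,
	// keeps the number of buffered items bounded.
	expired = expired[:0]
	expired = fill(expired, maxConnsToExport)
	fmt.Println(len(expired)) // 128
}
```

Note that re-slicing to zero length bounds the number of items but does not shrink a backing array that has already grown.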
Versions: