-
Notifications
You must be signed in to change notification settings - Fork 390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the deadlock between the exporter and the conntrack polling go routines #2429
Fix the deadlock between the exporter and the conntrack polling go routines #2429
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix. The logic is much cleaner. This solution LGTM. Just some tiny comments.
} | ||
} else { | ||
// If the connection is in dying state and the corresponding flow records are already | ||
// exported, then update the DoneExport flag on the connection. | ||
if flowexporter.IsConnectionDying(conn) && record.DoneExport { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check whether connection is dying here again?
klog.V(2).Infof("Deleting the inactive flow records with key: %v from record map", key) | ||
if err := exp.flowRecords.DeleteFlowRecordWithoutLock(key); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function DeleteFlowRecordWithoutLock(key)
can be removed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Done.
Sorry I was intended to select |
Codecov Report
@@ Coverage Diff @@
## main #2429 +/- ##
==========================================
+ Coverage 59.63% 64.96% +5.32%
==========================================
Files 284 284
Lines 22178 25417 +3239
==========================================
+ Hits 13226 16512 +3286
+ Misses 7527 7358 -169
- Partials 1425 1547 +122
Flags with carried forward coverage won't be shown. Click here to find out more.
|
34e5dd3
to
2d2b19d
Compare
klog.V(2).Infof("Deleting the inactive flow records with key: %v from record map", key) | ||
if err := exp.flowRecords.DeleteFlowRecordWithoutLock(key); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Done.
func (fr *FlowRecords) AddOrUpdateFlowRecord(key flowexporter.ConnectionKey, conn *flowexporter.Connection) error { | ||
// If the connection is in dying state and the corresponding flow records are already | ||
// exported, then there is no need to add or update the record. | ||
if flowexporter.IsConnectionDying(conn) && conn.DoneExport { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check whether connection is dying here again?
Changed the name of the flag. Removed this extra check here and other places as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// SetDyingAndDoneExport sets the appropriate field of the record to true given | ||
// the connection key and record. It also updates the record map. Caller is expected | ||
// to grab lock to the record map. | ||
func (fr *FlowRecords) SetDyingAndDoneExport(connKey flowexporter.ConnectionKey, record flowexporter.FlowRecord) { | ||
record.DyingAndDoneExport = true | ||
|
||
fr.recordsMap[connKey] = record | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this function a bit confusing, and the name is not really helping. Since you only use it in one place, do you think you could inline the code instead, and not define this function at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to do that initially but could not as the flow record map is not public. I modified it in a different way.
Deadlock is due to the access of connection map from exporter goroutine waiting to acquire connection map lock to update the "DoneExport" flag of the stored connection. At the same time, the connection polling goroutine acquires the connection map lock waiting to acquire flow record map, which is acquired by the exporter goroutine. This was caught in scale testing. Resolved this through a temporary fix by adding the same flag in the flow record data struct. The connection and record deletion logic will be re-evaluated through PR antrea-io#2360 as it refactors the related code quite a bit. Signed-off-by: Srikar Tati <stati@vmware.com>
2d2b19d
to
c3cd2f4
Compare
/test-all |
/test-ipv6-e2e /test-ipv6-only-e2e |
/test-e2e /test-conformance /test-networkpolicy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once merged, please cherry pick to the release-1.2 branch if appropriate (see https://github.com/antrea-io/antrea/blob/main/docs/contributors/cherry-picks.md)
} | ||
// If the connection is in dying state or connection is not in conntrack | ||
// table, we set the DyingAndDoneExport flag to do the deletion later. | ||
record.DyingAndDoneExport = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a future improvement, maybe we should just change ForAllFlowRecordsDo
so that updateOrSendFlowRecord
uses a flow record pointer instead of a copy of stored flow record. This whole code is executed with the lock any way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Or I was thinking to store the pointer to the record in the record map instead of value.
As this code is being refactored and the flow record map would be mostly removed in PR #2360, I tried to keep the changes to a minimum.
@@ -83,6 +94,12 @@ func (fr *FlowRecords) AddFlowRecordToMap(connKey *flowexporter.ConnectionKey, r | |||
fr.recordsMap[*connKey] = *record | |||
} | |||
|
|||
// AddFlowRecordWithoutLock adds the flow record from record map given connection key. | |||
// Caller is expected to grab the lock the record map. | |||
func (fr *FlowRecords) AddFlowRecordWithoutLock(connKey *flowexporter.ConnectionKey, record *flowexporter.FlowRecord) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can address naming consistency with the function above (AddFlowRecordToMap
) in a separate PR, or implement the approach I suggested above if it is acceptable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this can be done. Hope you are ok with the explanation in the other comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once merged, please cherry pick to the release-1.2 branch if appropriate (see https://github.com/antrea-io/antrea/blob/main/docs/contributors/cherry-picks.md)
Thanks for the review.
Sure, will do that. We could cherrypick to the release-1.1 branch as well as bug is applicable in that branch too. Not sure if there is any user who may want to pick that patch release on release-1.1 branch.
} | ||
// If the connection is in dying state or connection is not in conntrack | ||
// table, we set the DyingAndDoneExport flag to do the deletion later. | ||
record.DyingAndDoneExport = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Or I was thinking to store the pointer to the record in the record map instead of value.
As this code is being refactored and the flow record map would be mostly removed in PR #2360, I tried to keep the changes to a minimum.
@@ -83,6 +94,12 @@ func (fr *FlowRecords) AddFlowRecordToMap(connKey *flowexporter.ConnectionKey, r | |||
fr.recordsMap[*connKey] = *record | |||
} | |||
|
|||
// AddFlowRecordWithoutLock adds the flow record from record map given connection key. | |||
// Caller is expected to grab the lock the record map. | |||
func (fr *FlowRecords) AddFlowRecordWithoutLock(connKey *flowexporter.ConnectionKey, record *flowexporter.FlowRecord) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this can be done. Hope you are ok with the explanation in the other comment.
Don't think this will affect integration tests (the testbed has been fixed now but I don't know of an easy way to re-trigger the tests), so I'll go ahead and merge this |
@srikartati please cherry-pick to release branches as needed |
Deadlock is due to the access of connection map from exporter goroutine waiting to acquire connection map lock to update the "DoneExport" flag of the stored connection. At the same time, the connection polling goroutine acquires the connection map lock waiting to acquire flow record map, which is acquired by the exporter goroutine.
This was caught in scale testing.
Resolved this through a temporary fix by adding the same flag in the flow record data struct.
The connection and record deletion logic will be re-evaluated through PR #2360 as it refactors the related code quite a bit.