-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: clear fast-failover status when node goes not ready #3340
fix: clear fast-failover status when node goes not ready #3340
Conversation
WalkthroughThe changes in this pull request introduce enhancements to the Changes
Assessment against linked issues
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
controller/node_controller.go (1)
2227-2227
: Fix typographical error in error messageThere's a typo in the error message format string. The string
"failed to get setting i%v"
has an unintended 'i' before%v
. It should be corrected to"failed to get setting %v"
.Apply this diff to fix the typo:
-return errors.Wrapf(err, "failed to get setting i%v", types.SettingNameRWXVolumeFastFailover) +return errors.Wrapf(err, "failed to get setting %v", types.SettingNameRWXVolumeFastFailover)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (3)
controller/node_controller.go
(2 hunks)controller/volume_controller.go
(1 hunks)datastore/kubernetes.go
(2 hunks)
🔇 Additional comments (5)
controller/volume_controller.go (1)
1483-1483
: Update function call to match new signature
The function ClearDelinquentAndStaleStateIfVolumeIsDelinquent
now accepts an additional nodeName
parameter. Passing an empty string (""
) is acceptable when no specific node filtering is needed, and ensures the function behaves as before.
datastore/kubernetes.go (2)
285-285
: Update method signature to include nodeName
parameter
The method ClearDelinquentAndStaleStateIfVolumeIsDelinquent
now accepts an additional nodeName
parameter, allowing for more granular control over clearing delinquent states based on the specified node.
304-307
: Verify nodeName
comparison logic
The conditional check if nodeName != "" && nodeName != holder { return nil }
ensures that the function only proceeds when nodeName
matches the lease holder (holder
). Verify that this logic correctly handles scenarios where delinquent states should only be cleared for a specific node.
controller/node_controller.go (2)
436-441
: Ensure proper handling of delinquent leases for not-ready nodes
The addition of nc.clearDelinquentLeasesIfNodeNotReady(node)
clears delinquent leases when a node becomes not ready. Verify that invoking this function at this point in syncNode
correctly handles all cases without unintended side effects, especially in environments with multiple RWX volumes.
2223-2254
: Implement clearDelinquentLeasesIfNodeNotReady
method
The new method clearDelinquentLeasesIfNodeNotReady
correctly checks if the RWX volume fast failover is enabled and if the node is down or deleted. It then iterates through share managers to clear delinquent leases associated with the node. The implementation is sound and follows best practices.
18d128f
to
5622b70
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
controller/node_controller.go (1)
2241-2251
: Optimize share manager iteration for relevancyConsider filtering the share managers to only those associated with volumes that have leases held by the not-ready node. This optimization can improve performance in environments with a large number of share managers by reducing unnecessary iterations.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (3)
controller/node_controller.go
(2 hunks)controller/volume_controller.go
(1 hunks)datastore/kubernetes.go
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- datastore/kubernetes.go
🔇 Additional comments (3)
controller/volume_controller.go (1)
1483-1483
: Updated method call to match new signature
The addition of the empty string parameter in the call to ClearDelinquentAndStaleStateIfVolumeIsDelinquent
aligns with the updated method signature that now accepts a nodeName
parameter. This ensures correct invocation of the method.
controller/node_controller.go (2)
436-441
: Appropriate error handling after clearing delinquent leases
In the syncNode
method, the error handling after calling clearDelinquentLeasesIfNodeNotReady
is correctly implemented. Logging the error and returning it ensures that issues during the clearance process are properly reported and handled.
2223-2254
: Correct implementation of clearDelinquentLeasesIfNodeNotReady
method
The new method clearDelinquentLeasesIfNodeNotReady
effectively checks if the RWX volume fast-failover feature is enabled and whether the node is down or deleted. It then iterates through share managers to clear delinquent leases associated with the specified node. Error handling and logging are properly managed, ensuring robustness.
5622b70
to
214dd0d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
controller/node_controller.go (1)
2229-2259
: Add documentation and consider bulk operations.The implementation looks good but could benefit from:
- Add a method comment explaining the purpose and behavior:
// clearDelinquentLeasesIfNodeNotReady clears RWX volume fast-failover status when a node becomes not ready. // This ensures that the standard logic for determining if a node is down/deleted is applied instead of // waiting for fast-failover, which is no longer possible once a node is not ready.
- Consider optimizing the share manager loop to use bulk operations if the datastore supports it:
// TODO: Consider implementing bulk clear operation in datastore to handle multiple leases efficiently
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (3)
controller/node_controller.go
(2 hunks)controller/volume_controller.go
(1 hunks)datastore/kubernetes.go
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- controller/volume_controller.go
- datastore/kubernetes.go
214dd0d
to
c1102a3
Compare
Address some rabbit review notes and rebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
controller/node_controller.go (2)
2246-2261
: Consider batch processing for better performance.The current implementation processes share managers one by one. For better performance, consider:
- Batch processing share managers that need lease clearing
- Using a single datastore call to clear multiple leases
Example approach:
// Group share managers by node smsByNode := map[string][]string{} for _, sm := range sms { smsByNode[node.Name] = append(smsByNode[node.Name], sm.Name) } // Clear leases in batches for nodeName, smNames := range smsByNode { if err := nc.ds.ClearDelinquentAndStaleStateForVolumes(smNames, nodeName); err != nil { // Handle error } }
2250-2264
: Consider using error aggregation for comprehensive error reporting.The current implementation only stores the first error encountered. Consider using error aggregation to preserve all errors for better debugging.
Example approach:
var errs []error for _, sm := range sms { if err := nc.ds.ClearDelinquentAndStaleStateIfVolumeIsDelinquent(sm.Name, node.Name); err != nil { nc.logger.WithError(err).Warnf("failed to clear delinquent lease for volume %v, node %v", sm.Name, node.Name) errs = append(errs, errors.Wrapf(err, "volume %v", sm.Name)) } } if len(errs) > 0 { return errors.Errorf("failed to clear delinquent leases: %v", errs) } return nil
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (3)
controller/node_controller.go
(2 hunks)controller/volume_controller.go
(1 hunks)datastore/kubernetes.go
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- controller/volume_controller.go
- datastore/kubernetes.go
🔇 Additional comments (3)
controller/node_controller.go (3)
437-442
: Improve error handling for lease clearing.
The error handling should be consistent with the comment in the code that states "if it's too late for fast failover to occur, we should use standard logic". Therefore, we should log the error but continue execution rather than returning it.
Apply this diff to improve error handling:
if err = nc.clearDelinquentLeasesIfNodeNotReady(node); err != nil {
- log.WithError(err).Warnf("Failed to clear delinquent leases for node %v", node.Name)
- return err
+ log.WithError(err).Warnf("Failed to clear delinquent leases for node %v, continuing since node is already not ready", node.Name)
}
2229-2264
: Add test coverage for the new functionality.
The new clearDelinquentLeasesIfNodeNotReady
method lacks test coverage. Consider adding tests for:
- Feature flag behavior (enabled/disabled)
- Node state handling (down/deleted vs. ready)
- Error handling scenarios
- Lease clearing with multiple share managers
Would you like me to help generate the test cases?
2229-2264
: Implementation aligns with PR objectives.
The code successfully implements the requirements to clear fast-failover status when a node becomes not ready:
- Respects the feature flag
- Properly checks node state
- Clears leases for all affected share managers
Run the following script to verify the implementation:
✅ Verification successful
Based on the gathered information, I can now generate the final response:
Implementation correctly handles lease clearing for node not-ready scenarios
The code implementation is well-structured and handles all necessary cases:
- Properly checks the RWX fast failover feature flag
- Correctly verifies node state using
IsNodeDownOrDeleted
- Implements proper lease clearing logic in
ClearDelinquentAndStaleStateIfVolumeIsDelinquent
:- Handles non-existent leases
- Respects node-specific clearing when node is specified
- Only clears truly delinquent leases (empty holder or zero acquire time)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the implementation handles the scenarios described in the PR
# Check if there are any other places in the codebase that modify delinquent leases
rg "ClearDelinquentAndStaleStateIfVolumeIsDelinquent|SetDelinquentLease"
# Check for related test coverage
rg -A 5 "TestClearDelinquentLeasesIfNodeNotReady|TestSetDelinquentLease"
Length of output: 623
Script:
#!/bin/bash
# Let's check the implementation details and test coverage
# Check the implementation in datastore
rg -A 10 "ClearDelinquentAndStaleStateIfVolumeIsDelinquent" datastore/
# Check volume controller usage context
rg -A 10 "ClearDelinquentAndStaleStateIfVolumeIsDelinquent" controller/volume_controller.go
# Look for related test files
fd "node.*test.go|volume.*test.go" | xargs rg "TestClearDelinquentLeases|TestSetDelinquentLease"
Length of output: 1547
Script:
#!/bin/bash
# Let's check the complete implementation in datastore and related lease handling
# Check the complete lease handling implementation
rg -A 20 "func.*ClearDelinquentAndStaleStateIfVolumeIsDelinquent" datastore/
# Check for lease-related structs and types
rg "type.*Lease" -A 5
# Look for any lease-related test files
fd "lease.*test.go" | xargs cat
Length of output: 57268
Script:
#!/bin/bash
# Let's check the complete implementation and related code
# Check the datastore implementation for lease clearing
rg -A 30 "func.*ClearDelinquentAndStaleStateIfVolumeIsDelinquent" datastore/
# Check for volume controller's usage
rg -A 10 "ClearDelinquentAndStaleStateIfVolumeIsDelinquent" controller/volume_controller.go
# Check for related functions in node controller
rg -A 10 "IsNodeDownOrDeleted|GetSettingAsBool" datastore/
Length of output: 12837
Signed-off-by: James Munson <james.munson@suse.com>
c1102a3
to
978f2d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
controller/node_controller.go (2)
437-442
: Improve error handling to avoid redundant loggingThe error is both logged and returned with the same message. Since the error will be logged by the caller's error handling, we can either:
- Return the error without logging
- Log with additional context but don't return the error
if err = nc.clearDelinquentLeasesIfNodeNotReady(node); err != nil { - log.WithError(err).Warnf("Failed to clear delinquent leases for node %v", node.Name) - return err + return errors.Wrapf(err, "failed to clear delinquent leases for node %v", node.Name) }
2229-2264
: LGTM! Implementation looks correct and well-structuredThe implementation correctly:
- Checks the feature flag
- Verifies node state
- Processes all share managers to clear delinquent leases
- Uses a robust error handling pattern
Consider deduplicating the error message:
if err != nil { nc.logger.WithError(err).Warnf("failed to clear delinquent lease for volume %v, node %v", sm.Name, node.Name) if storedError == nil { - storedError = errors.Wrapf(err, "failed to clear delinquent lease for volume %v, node %v", sm.Name, node.Name) + storedError = err } }Would you like me to help generate unit tests for this new method? The tests should cover:
- Feature flag disabled
- Node not down/deleted
- Error handling when clearing leases
- Success case with multiple share managers
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
controller/node_controller.go
(2 hunks)controller/volume_controller.go
(1 hunks)datastore/kubernetes.go
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- datastore/kubernetes.go
- controller/volume_controller.go
🔇 Additional comments (1)
controller/node_controller.go (1)
2229-2264
: Verify the usage of clearDelinquentLeasesIfNodeNotReady
Let's verify that this is the only place where delinquent leases are cleared for not-ready nodes.
✅ Verification successful
Implementation of clearDelinquentLeasesIfNodeNotReady is correctly isolated
The verification shows that:
- The
ClearDelinquentAndStaleStateIfVolumeIsDelinquent
is used in only three places:- The implementation in datastore
- In node_controller.go for handling not-ready nodes (our case)
- In volume_controller.go with an empty node name parameter (different use case)
- The node state check via
IsNodeDownOrDeleted
is used consistently across the codebase for various node state handling scenarios - This is the only place where node down status is directly tied to clearing delinquent leases for share managers
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for other places that might clear delinquent leases
# Check for similar functionality elsewhere
rg "ClearDelinquentAndStaleStateIfVolumeIsDelinquent.*node.*Name" --type go
# Check for other node state checks combined with lease operations
rg "IsNodeDownOrDeleted.*ClearDelinquent" --type go
Length of output: 387
Script:
#!/bin/bash
# Let's check where IsNodeDownOrDeleted is used to understand the node state handling
rg "IsNodeDownOrDeleted" -A 5 --type go
# Also check for any other lease clearing mechanisms
rg "ClearDelinquent" --type go
# Check where this function is called from
ast-grep --pattern 'clearDelinquentLeasesIfNodeNotReady($$$)'
Length of output: 10555
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@PhanLe1010 Could you take a look and see if there is any concern? Thank you. |
The idea sound good to me |
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9093
What this PR does / why we need it:
Do a fail-safe clearing of RWX fast-failover status still pending on a node that is NotReady. If it hasn't already happened, it is too late for fast failover, so let the normal IsNodeDownOrDeleted logic take over.
Special notes for your reviewer:
Additional documentation or context