fix: clear fast-failover status when node goes not ready #3340

james-munson · 2024-12-10T22:03:46Z

Which issue(s) this PR fixes:

What this PR does / why we need it:

Do a fail-safe clearing of RWX fast-failover status still pending on a node that is NotReady. If it hasn't already happened, it is too late for fast failover, so let the normal IsNodeDownOrDeleted logic take over.

Special notes for your reviewer:

Additional documentation or context

coderabbitai · 2024-12-10T22:03:54Z

Walkthrough

The changes in this pull request introduce enhancements to the NodeController, VolumeController, and DataStore classes, focusing on the management of delinquent leases and volume states. A new method, clearDelinquentLeasesIfNodeNotReady, is added to handle scenarios where a node is not ready, ensuring that delinquent leases are cleared appropriately. Additionally, modifications in the VolumeController improve error handling, logging, and state management during volume operations, particularly regarding RWX volumes and snapshot settings. The DataStore method for clearing delinquent states is also updated to include a node-specific parameter.

Changes

File	Change Summary
controller/node_controller.go	Added method `clearDelinquentLeasesIfNodeNotReady` to clear delinquent leases for not-ready nodes. Modified `syncNode` to invoke this method after readiness checks.
controller/volume_controller.go	Updated `handleDelinquentAndStaleStateForFaultedRWXVolume` to include an additional parameter. Enhanced logging in `requestRemountIfFileSystemReadOnly`. Adjusted `syncVolumeSnapshotSetting` for migration state checks. Refined `ReconcileVolumeState` for better state handling.
datastore/kubernetes.go	Modified `ClearDelinquentAndStaleStateIfVolumeIsDelinquent` to accept an additional `nodeName` parameter for targeted state clearing. Enhanced error handling with context.

Assessment against linked issues

Objective	Addressed	Explanation
Delinquent state should be cleared when holder node goes `NotReady` (#9093)	✅

Possibly related PRs

fix: detached volume replicas not evicted #3293: The changes in the main PR regarding the management of delinquent leases in the NodeController are related to the eviction logic in the VolumeEvictionController, as both involve handling states of volumes and nodes in the context of readiness and eviction processes.

Suggested reviewers

PhanLe1010
shuo-wu

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

controller/node_controller.go (1)
2227-2227: Fix typographical error in error message

There's a typo in the error message format string. The string "failed to get setting i%v" has an unintended 'i' before %v. It should be corrected to "failed to get setting %v".

Apply this diff to fix the typo:
-return errors.Wrapf(err, "failed to get setting i%v", types.SettingNameRWXVolumeFastFailover)
+return errors.Wrapf(err, "failed to get setting %v", types.SettingNameRWXVolumeFastFailover)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f1c4b43 and 18d128f.

📒 Files selected for processing (3)

controller/node_controller.go (2 hunks)
controller/volume_controller.go (1 hunks)
datastore/kubernetes.go (2 hunks)

🔇 Additional comments (5)

controller/volume_controller.go (1)

1483-1483: Update function call to match new signature

The function ClearDelinquentAndStaleStateIfVolumeIsDelinquent now accepts an additional nodeName parameter. Passing an empty string ("") is acceptable when no specific node filtering is needed, and ensures the function behaves as before.

datastore/kubernetes.go (2)

285-285: Update method signature to include nodeName parameter

The method ClearDelinquentAndStaleStateIfVolumeIsDelinquent now accepts an additional nodeName parameter, allowing for more granular control over clearing delinquent states based on the specified node.

304-307: Verify nodeName comparison logic

The conditional check if nodeName != "" && nodeName != holder { return nil } ensures that the function only proceeds when nodeName matches the lease holder (holder). Verify that this logic correctly handles scenarios where delinquent states should only be cleared for a specific node.

controller/node_controller.go (2)

436-441: Ensure proper handling of delinquent leases for not-ready nodes

The addition of nc.clearDelinquentLeasesIfNodeNotReady(node) clears delinquent leases when a node becomes not ready. Verify that invoking this function at this point in syncNode correctly handles all cases without unintended side effects, especially in environments with multiple RWX volumes.

2223-2254: Implement clearDelinquentLeasesIfNodeNotReady method

The new method clearDelinquentLeasesIfNodeNotReady correctly checks if the RWX volume fast failover is enabled and if the node is down or deleted. It then iterates through share managers to clear delinquent leases associated with the node. The implementation is sound and follows best practices.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

controller/node_controller.go (1)

2241-2251: Optimize share manager iteration for relevancy

Consider filtering the share managers to only those associated with volumes that have leases held by the not-ready node. This optimization can improve performance in environments with a large number of share managers by reducing unnecessary iterations.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 18d128f and 5622b70.

📒 Files selected for processing (3)

controller/node_controller.go (2 hunks)
controller/volume_controller.go (1 hunks)
datastore/kubernetes.go (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

datastore/kubernetes.go

🔇 Additional comments (3)

controller/volume_controller.go (1)

1483-1483: Updated method call to match new signature

The addition of the empty string parameter in the call to ClearDelinquentAndStaleStateIfVolumeIsDelinquent aligns with the updated method signature that now accepts a nodeName parameter. This ensures correct invocation of the method.

controller/node_controller.go (2)

436-441: Appropriate error handling after clearing delinquent leases

In the syncNode method, the error handling after calling clearDelinquentLeasesIfNodeNotReady is correctly implemented. Logging the error and returning it ensures that issues during the clearance process are properly reported and handled.

2223-2254: Correct implementation of clearDelinquentLeasesIfNodeNotReady method

The new method clearDelinquentLeasesIfNodeNotReady effectively checks if the RWX volume fast-failover feature is enabled and whether the node is down or deleted. It then iterates through share managers to clear delinquent leases associated with the specified node. Error handling and logging are properly managed, ensuring robustness.

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)

controller/node_controller.go (1)
2229-2259: Add documentation and consider bulk operations.

The implementation looks good but could benefit from:

Add a method comment explaining the purpose and behavior:
// clearDelinquentLeasesIfNodeNotReady clears RWX volume fast-failover status when a node becomes not ready.
// This ensures that the standard logic for determining if a node is down/deleted is applied instead of
// waiting for fast-failover, which is no longer possible once a node is not ready.
Consider optimizing the share manager loop to use bulk operations if the datastore supports it:
// TODO: Consider implementing bulk clear operation in datastore to handle multiple leases efficiently

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5622b70 and 214dd0d.

📒 Files selected for processing (3)

controller/node_controller.go (2 hunks)
controller/volume_controller.go (1 hunks)
datastore/kubernetes.go (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

controller/volume_controller.go
datastore/kubernetes.go

controller/node_controller.go

james-munson · 2024-12-11T19:46:38Z

Address some rabbit review notes and rebase.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

controller/node_controller.go (2)
2246-2261: Consider batch processing for better performance.

The current implementation processes share managers one by one. For better performance, consider:

Batch processing share managers that need lease clearing

Using a single datastore call to clear multiple leases

Example approach:
// Group share managers by node
smsByNode := map[string][]string{}
for _, sm := range sms {
    smsByNode[node.Name] = append(smsByNode[node.Name], sm.Name)
}

// Clear leases in batches
for nodeName, smNames := range smsByNode {
    if err := nc.ds.ClearDelinquentAndStaleStateForVolumes(smNames, nodeName); err != nil {
        // Handle error
    }
}
2250-2264: Consider using error aggregation for comprehensive error reporting.

The current implementation only stores the first error encountered. Consider using error aggregation to preserve all errors for better debugging.

Example approach:
var errs []error
for _, sm := range sms {
    if err := nc.ds.ClearDelinquentAndStaleStateIfVolumeIsDelinquent(sm.Name, node.Name); err != nil {
        nc.logger.WithError(err).Warnf("failed to clear delinquent lease for volume %v, node %v", sm.Name, node.Name)
        errs = append(errs, errors.Wrapf(err, "volume %v", sm.Name))
    }
}
if len(errs) > 0 {
    return errors.Errorf("failed to clear delinquent leases: %v", errs)
}
return nil

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 214dd0d and c1102a3.

📒 Files selected for processing (3)

controller/node_controller.go (2 hunks)
controller/volume_controller.go (1 hunks)
datastore/kubernetes.go (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

controller/volume_controller.go
datastore/kubernetes.go

🔇 Additional comments (3)

controller/node_controller.go (3)

437-442: Improve error handling for lease clearing.

The error handling should be consistent with the comment in the code that states "if it's too late for fast failover to occur, we should use standard logic". Therefore, we should log the error but continue execution rather than returning it.

Apply this diff to improve error handling:

 if err = nc.clearDelinquentLeasesIfNodeNotReady(node); err != nil {
-    log.WithError(err).Warnf("Failed to clear delinquent leases for node %v", node.Name)
-    return err
+    log.WithError(err).Warnf("Failed to clear delinquent leases for node %v, continuing since node is already not ready", node.Name)
 }

2229-2264: Add test coverage for the new functionality.

The new clearDelinquentLeasesIfNodeNotReady method lacks test coverage. Consider adding tests for:

Feature flag behavior (enabled/disabled)
Node state handling (down/deleted vs. ready)
Error handling scenarios
Lease clearing with multiple share managers

Would you like me to help generate the test cases?

2229-2264: Implementation aligns with PR objectives.

The code successfully implements the requirements to clear fast-failover status when a node becomes not ready:

Respects the feature flag
Properly checks node state
Clears leases for all affected share managers

Run the following script to verify the implementation:

✅ Verification successful

Based on the gathered information, I can now generate the final response:

Implementation correctly handles lease clearing for node not-ready scenarios

The code implementation is well-structured and handles all necessary cases:

Properly checks the RWX fast failover feature flag
Correctly verifies node state using IsNodeDownOrDeleted
Implements proper lease clearing logic in ClearDelinquentAndStaleStateIfVolumeIsDelinquent:
- Handles non-existent leases
- Respects node-specific clearing when node is specified
- Only clears truly delinquent leases (empty holder or zero acquire time)

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the implementation handles the scenarios described in the PR

# Check if there are any other places in the codebase that modify delinquent leases
rg "ClearDelinquentAndStaleStateIfVolumeIsDelinquent|SetDelinquentLease"

# Check for related test coverage
rg -A 5 "TestClearDelinquentLeasesIfNodeNotReady|TestSetDelinquentLease"

Length of output: 623

Script:

#!/bin/bash
# Let's check the implementation details and test coverage

# Check the implementation in datastore
rg -A 10 "ClearDelinquentAndStaleStateIfVolumeIsDelinquent" datastore/

# Check volume controller usage context
rg -A 10 "ClearDelinquentAndStaleStateIfVolumeIsDelinquent" controller/volume_controller.go

# Look for related test files
fd "node.*test.go|volume.*test.go" | xargs rg "TestClearDelinquentLeases|TestSetDelinquentLease"

Length of output: 1547

Script:

#!/bin/bash
# Let's check the complete implementation in datastore and related lease handling

# Check the complete lease handling implementation
rg -A 20 "func.*ClearDelinquentAndStaleStateIfVolumeIsDelinquent" datastore/

# Check for lease-related structs and types
rg "type.*Lease" -A 5

# Look for any lease-related test files
fd "lease.*test.go" | xargs cat

Length of output: 57268

Script:

#!/bin/bash
# Let's check the complete implementation and related code

# Check the datastore implementation for lease clearing
rg -A 30 "func.*ClearDelinquentAndStaleStateIfVolumeIsDelinquent" datastore/

# Check for volume controller's usage
rg -A 10 "ClearDelinquentAndStaleStateIfVolumeIsDelinquent" controller/volume_controller.go

# Check for related functions in node controller
rg -A 10 "IsNodeDownOrDeleted|GetSettingAsBool" datastore/

Length of output: 12837

Signed-off-by: James Munson <james.munson@suse.com>

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

controller/node_controller.go (2)
437-442: Improve error handling to avoid redundant logging

The error is both logged and returned with the same message. Since the error will be logged by the caller's error handling, we can either:

Return the error without logging

Log with additional context but don't return the error
 if err = nc.clearDelinquentLeasesIfNodeNotReady(node); err != nil {
-    log.WithError(err).Warnf("Failed to clear delinquent leases for node %v", node.Name)
-    return err
+    return errors.Wrapf(err, "failed to clear delinquent leases for node %v", node.Name)
 }
2229-2264: LGTM! Implementation looks correct and well-structured

The implementation correctly:

Checks the feature flag

Verifies node state

Processes all share managers to clear delinquent leases

Uses a robust error handling pattern

Consider deduplicating the error message:
 if err != nil {
     nc.logger.WithError(err).Warnf("failed to clear delinquent lease for volume %v, node %v", sm.Name, node.Name)
     if storedError == nil {
-        storedError = errors.Wrapf(err, "failed to clear delinquent lease for volume %v, node %v", sm.Name, node.Name)
+        storedError = err
     }
 }
Would you like me to help generate unit tests for this new method? The tests should cover:

Feature flag disabled

Node not down/deleted

Error handling when clearing leases

Success case with multiple share managers

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c1102a3 and 978f2d3.

📒 Files selected for processing (3)

controller/node_controller.go (2 hunks)
controller/volume_controller.go (1 hunks)
datastore/kubernetes.go (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

datastore/kubernetes.go
controller/volume_controller.go

🔇 Additional comments (1)

controller/node_controller.go (1)

2229-2264: Verify the usage of clearDelinquentLeasesIfNodeNotReady

Let's verify that this is the only place where delinquent leases are cleared for not-ready nodes.

✅ Verification successful

Implementation of clearDelinquentLeasesIfNodeNotReady is correctly isolated

The verification shows that:

The ClearDelinquentAndStaleStateIfVolumeIsDelinquent is used in only three places:
1. The implementation in datastore
2. In node_controller.go for handling not-ready nodes (our case)
3. In volume_controller.go with an empty node name parameter (different use case)
The node state check via IsNodeDownOrDeleted is used consistently across the codebase for various node state handling scenarios
This is the only place where node down status is directly tied to clearing delinquent leases for share managers

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other places that might clear delinquent leases

# Check for similar functionality elsewhere
rg "ClearDelinquentAndStaleStateIfVolumeIsDelinquent.*node.*Name" --type go

# Check for other node state checks combined with lease operations
rg "IsNodeDownOrDeleted.*ClearDelinquent" --type go

Length of output: 387

Script:

#!/bin/bash
# Let's check where IsNodeDownOrDeleted is used to understand the node state handling
rg "IsNodeDownOrDeleted" -A 5 --type go

# Also check for any other lease clearing mechanisms
rg "ClearDelinquent" --type go

# Check where this function is called from
ast-grep --pattern 'clearDelinquentLeasesIfNodeNotReady($$$)'

Length of output: 10555

derekbit

LGTM.

derekbit · 2024-12-13T05:48:11Z

@PhanLe1010 Could you take a look and see if there is any concern? Thank you.

PhanLe1010 · 2024-12-14T02:13:55Z

The idea sound good to me

james-munson requested review from derekbit and PhanLe1010 December 10, 2024 22:04

coderabbitai bot reviewed Dec 10, 2024

View reviewed changes

james-munson force-pushed the 9093-not-ready-fail-safe branch from 18d128f to 5622b70 Compare December 10, 2024 22:37

coderabbitai bot reviewed Dec 10, 2024

View reviewed changes

derekbit force-pushed the 9093-not-ready-fail-safe branch from 5622b70 to 214dd0d Compare December 11, 2024 15:36

coderabbitai bot reviewed Dec 11, 2024

View reviewed changes

controller/node_controller.go Show resolved Hide resolved

controller/node_controller.go Outdated Show resolved Hide resolved

james-munson mentioned this pull request Dec 11, 2024

[BUG] Fast-failover lease needs a fail-safe mechanism to clear delinquent state. longhorn/longhorn#9093

Closed

james-munson force-pushed the 9093-not-ready-fail-safe branch from 214dd0d to c1102a3 Compare December 11, 2024 19:45

coderabbitai bot reviewed Dec 11, 2024

View reviewed changes

fix: clear fast-failover status when node goes not ready

978f2d3

Signed-off-by: James Munson <james.munson@suse.com>

derekbit force-pushed the 9093-not-ready-fail-safe branch from c1102a3 to 978f2d3 Compare December 12, 2024 12:27

coderabbitai bot reviewed Dec 12, 2024

View reviewed changes

derekbit approved these changes Dec 13, 2024

View reviewed changes

derekbit merged commit f37529b into longhorn:master Dec 14, 2024
9 checks passed

derekbit assigned james-munson Dec 14, 2024

james-munson deleted the 9093-not-ready-fail-safe branch December 17, 2024 20:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: clear fast-failover status when node goes not ready #3340

fix: clear fast-failover status when node goes not ready #3340

james-munson commented Dec 10, 2024

coderabbitai bot commented Dec 10, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

james-munson commented Dec 11, 2024

coderabbitai bot left a comment

coderabbitai bot left a comment

derekbit left a comment

derekbit commented Dec 13, 2024

PhanLe1010 commented Dec 14, 2024

fix: clear fast-failover status when node goes not ready #3340

fix: clear fast-failover status when node goes not ready #3340

Conversation

james-munson commented Dec 10, 2024

Which issue(s) this PR fixes:

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

coderabbitai bot commented Dec 10, 2024 • edited Loading

Walkthrough

Changes

Assessment against linked issues

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

james-munson commented Dec 11, 2024

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

derekbit left a comment

Choose a reason for hiding this comment

derekbit commented Dec 13, 2024

PhanLe1010 commented Dec 14, 2024

coderabbitai bot commented Dec 10, 2024 •

edited

Loading