Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: clear fast-failover status when node goes not ready #3340

Merged
merged 1 commit into from
Dec 14, 2024

Conversation

james-munson
Copy link
Contributor

Which issue(s) this PR fixes:

Issue longhorn/longhorn#9093

What this PR does / why we need it:

Do a fail-safe clearing of RWX fast-failover status still pending on a node that is NotReady. If it hasn't already happened, it is too late for fast failover, so let the normal IsNodeDownOrDeleted logic take over.

Special notes for your reviewer:

Additional documentation or context

Copy link

coderabbitai bot commented Dec 10, 2024

Walkthrough

The changes in this pull request introduce enhancements to the NodeController, VolumeController, and DataStore classes, focusing on the management of delinquent leases and volume states. A new method, clearDelinquentLeasesIfNodeNotReady, is added to handle scenarios where a node is not ready, ensuring that delinquent leases are cleared appropriately. Additionally, modifications in the VolumeController improve error handling, logging, and state management during volume operations, particularly regarding RWX volumes and snapshot settings. The DataStore method for clearing delinquent states is also updated to include a node-specific parameter.

Changes

File Change Summary
controller/node_controller.go Added method clearDelinquentLeasesIfNodeNotReady to clear delinquent leases for not-ready nodes. Modified syncNode to invoke this method after readiness checks.
controller/volume_controller.go Updated handleDelinquentAndStaleStateForFaultedRWXVolume to include an additional parameter. Enhanced logging in requestRemountIfFileSystemReadOnly. Adjusted syncVolumeSnapshotSetting for migration state checks. Refined ReconcileVolumeState for better state handling.
datastore/kubernetes.go Modified ClearDelinquentAndStaleStateIfVolumeIsDelinquent to accept an additional nodeName parameter for targeted state clearing. Enhanced error handling with context.

Assessment against linked issues

Objective Addressed Explanation
Delinquent state should be cleared when holder node goes NotReady (#9093)

Possibly related PRs

  • fix: detached volume replicas not evicted #3293: The changes in the main PR regarding the management of delinquent leases in the NodeController are related to the eviction logic in the VolumeEvictionController, as both involve handling states of volumes and nodes in the context of readiness and eviction processes.

Suggested reviewers

  • PhanLe1010
  • shuo-wu

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
controller/node_controller.go (1)

2227-2227: Fix typographical error in error message

There's a typo in the error message format string. The string "failed to get setting i%v" has an unintended 'i' before %v. It should be corrected to "failed to get setting %v".

Apply this diff to fix the typo:

-return errors.Wrapf(err, "failed to get setting i%v", types.SettingNameRWXVolumeFastFailover)
+return errors.Wrapf(err, "failed to get setting %v", types.SettingNameRWXVolumeFastFailover)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f1c4b43 and 18d128f.

📒 Files selected for processing (3)
  • controller/node_controller.go (2 hunks)
  • controller/volume_controller.go (1 hunks)
  • datastore/kubernetes.go (2 hunks)
🔇 Additional comments (5)
controller/volume_controller.go (1)

1483-1483: Update function call to match new signature

The function ClearDelinquentAndStaleStateIfVolumeIsDelinquent now accepts an additional nodeName parameter. Passing an empty string ("") is acceptable when no specific node filtering is needed, and ensures the function behaves as before.

datastore/kubernetes.go (2)

285-285: Update method signature to include nodeName parameter

The method ClearDelinquentAndStaleStateIfVolumeIsDelinquent now accepts an additional nodeName parameter, allowing for more granular control over clearing delinquent states based on the specified node.


304-307: Verify nodeName comparison logic

The conditional check if nodeName != "" && nodeName != holder { return nil } ensures that the function only proceeds when nodeName matches the lease holder (holder). Verify that this logic correctly handles scenarios where delinquent states should only be cleared for a specific node.

controller/node_controller.go (2)

436-441: Ensure proper handling of delinquent leases for not-ready nodes

The addition of nc.clearDelinquentLeasesIfNodeNotReady(node) clears delinquent leases when a node becomes not ready. Verify that invoking this function at this point in syncNode correctly handles all cases without unintended side effects, especially in environments with multiple RWX volumes.


2223-2254: Implement clearDelinquentLeasesIfNodeNotReady method

The new method clearDelinquentLeasesIfNodeNotReady correctly checks if the RWX volume fast failover is enabled and if the node is down or deleted. It then iterates through share managers to clear delinquent leases associated with the node. The implementation is sound and follows best practices.

@james-munson james-munson force-pushed the 9093-not-ready-fail-safe branch from 18d128f to 5622b70 Compare December 10, 2024 22:37
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
controller/node_controller.go (1)

2241-2251: Optimize share manager iteration for relevancy

Consider filtering the share managers to only those associated with volumes that have leases held by the not-ready node. This optimization can improve performance in environments with a large number of share managers by reducing unnecessary iterations.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 18d128f and 5622b70.

📒 Files selected for processing (3)
  • controller/node_controller.go (2 hunks)
  • controller/volume_controller.go (1 hunks)
  • datastore/kubernetes.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • datastore/kubernetes.go
🔇 Additional comments (3)
controller/volume_controller.go (1)

1483-1483: Updated method call to match new signature

The addition of the empty string parameter in the call to ClearDelinquentAndStaleStateIfVolumeIsDelinquent aligns with the updated method signature that now accepts a nodeName parameter. This ensures correct invocation of the method.

controller/node_controller.go (2)

436-441: Appropriate error handling after clearing delinquent leases

In the syncNode method, the error handling after calling clearDelinquentLeasesIfNodeNotReady is correctly implemented. Logging the error and returning it ensures that issues during the clearance process are properly reported and handled.


2223-2254: Correct implementation of clearDelinquentLeasesIfNodeNotReady method

The new method clearDelinquentLeasesIfNodeNotReady effectively checks if the RWX volume fast-failover feature is enabled and whether the node is down or deleted. It then iterates through share managers to clear delinquent leases associated with the specified node. Error handling and logging are properly managed, ensuring robustness.

@derekbit derekbit force-pushed the 9093-not-ready-fail-safe branch from 5622b70 to 214dd0d Compare December 11, 2024 15:36
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
controller/node_controller.go (1)

2229-2259: Add documentation and consider bulk operations.

The implementation looks good but could benefit from:

  1. Add a method comment explaining the purpose and behavior:
// clearDelinquentLeasesIfNodeNotReady clears RWX volume fast-failover status when a node becomes not ready.
// This ensures that the standard logic for determining if a node is down/deleted is applied instead of
// waiting for fast-failover, which is no longer possible once a node is not ready.
  1. Consider optimizing the share manager loop to use bulk operations if the datastore supports it:
// TODO: Consider implementing bulk clear operation in datastore to handle multiple leases efficiently
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5622b70 and 214dd0d.

📒 Files selected for processing (3)
  • controller/node_controller.go (2 hunks)
  • controller/volume_controller.go (1 hunks)
  • datastore/kubernetes.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • controller/volume_controller.go
  • datastore/kubernetes.go

@james-munson
Copy link
Contributor Author

Address some rabbit review notes and rebase.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
controller/node_controller.go (2)

2246-2261: Consider batch processing for better performance.

The current implementation processes share managers one by one. For better performance, consider:

  1. Batch processing share managers that need lease clearing
  2. Using a single datastore call to clear multiple leases

Example approach:

// Group share managers by node
smsByNode := map[string][]string{}
for _, sm := range sms {
    smsByNode[node.Name] = append(smsByNode[node.Name], sm.Name)
}

// Clear leases in batches
for nodeName, smNames := range smsByNode {
    if err := nc.ds.ClearDelinquentAndStaleStateForVolumes(smNames, nodeName); err != nil {
        // Handle error
    }
}

2250-2264: Consider using error aggregation for comprehensive error reporting.

The current implementation only stores the first error encountered. Consider using error aggregation to preserve all errors for better debugging.

Example approach:

var errs []error
for _, sm := range sms {
    if err := nc.ds.ClearDelinquentAndStaleStateIfVolumeIsDelinquent(sm.Name, node.Name); err != nil {
        nc.logger.WithError(err).Warnf("failed to clear delinquent lease for volume %v, node %v", sm.Name, node.Name)
        errs = append(errs, errors.Wrapf(err, "volume %v", sm.Name))
    }
}
if len(errs) > 0 {
    return errors.Errorf("failed to clear delinquent leases: %v", errs)
}
return nil
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 214dd0d and c1102a3.

📒 Files selected for processing (3)
  • controller/node_controller.go (2 hunks)
  • controller/volume_controller.go (1 hunks)
  • datastore/kubernetes.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • controller/volume_controller.go
  • datastore/kubernetes.go
🔇 Additional comments (3)
controller/node_controller.go (3)

437-442: Improve error handling for lease clearing.

The error handling should be consistent with the comment in the code that states "if it's too late for fast failover to occur, we should use standard logic". Therefore, we should log the error but continue execution rather than returning it.

Apply this diff to improve error handling:

 if err = nc.clearDelinquentLeasesIfNodeNotReady(node); err != nil {
-    log.WithError(err).Warnf("Failed to clear delinquent leases for node %v", node.Name)
-    return err
+    log.WithError(err).Warnf("Failed to clear delinquent leases for node %v, continuing since node is already not ready", node.Name)
 }

2229-2264: Add test coverage for the new functionality.

The new clearDelinquentLeasesIfNodeNotReady method lacks test coverage. Consider adding tests for:

  1. Feature flag behavior (enabled/disabled)
  2. Node state handling (down/deleted vs. ready)
  3. Error handling scenarios
  4. Lease clearing with multiple share managers

Would you like me to help generate the test cases?


2229-2264: Implementation aligns with PR objectives.

The code successfully implements the requirements to clear fast-failover status when a node becomes not ready:

  1. Respects the feature flag
  2. Properly checks node state
  3. Clears leases for all affected share managers

Run the following script to verify the implementation:

✅ Verification successful

Based on the gathered information, I can now generate the final response:

Implementation correctly handles lease clearing for node not-ready scenarios

The code implementation is well-structured and handles all necessary cases:

  • Properly checks the RWX fast failover feature flag
  • Correctly verifies node state using IsNodeDownOrDeleted
  • Implements proper lease clearing logic in ClearDelinquentAndStaleStateIfVolumeIsDelinquent:
    • Handles non-existent leases
    • Respects node-specific clearing when node is specified
    • Only clears truly delinquent leases (empty holder or zero acquire time)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the implementation handles the scenarios described in the PR

# Check if there are any other places in the codebase that modify delinquent leases
rg "ClearDelinquentAndStaleStateIfVolumeIsDelinquent|SetDelinquentLease"

# Check for related test coverage
rg -A 5 "TestClearDelinquentLeasesIfNodeNotReady|TestSetDelinquentLease"

Length of output: 623


Script:

#!/bin/bash
# Let's check the implementation details and test coverage

# Check the implementation in datastore
rg -A 10 "ClearDelinquentAndStaleStateIfVolumeIsDelinquent" datastore/

# Check volume controller usage context
rg -A 10 "ClearDelinquentAndStaleStateIfVolumeIsDelinquent" controller/volume_controller.go

# Look for related test files
fd "node.*test.go|volume.*test.go" | xargs rg "TestClearDelinquentLeases|TestSetDelinquentLease"

Length of output: 1547


Script:

#!/bin/bash
# Let's check the complete implementation in datastore and related lease handling

# Check the complete lease handling implementation
rg -A 20 "func.*ClearDelinquentAndStaleStateIfVolumeIsDelinquent" datastore/

# Check for lease-related structs and types
rg "type.*Lease" -A 5

# Look for any lease-related test files
fd "lease.*test.go" | xargs cat

Length of output: 57268


Script:

#!/bin/bash
# Let's check the complete implementation and related code

# Check the datastore implementation for lease clearing
rg -A 30 "func.*ClearDelinquentAndStaleStateIfVolumeIsDelinquent" datastore/

# Check for volume controller's usage
rg -A 10 "ClearDelinquentAndStaleStateIfVolumeIsDelinquent" controller/volume_controller.go

# Check for related functions in node controller
rg -A 10 "IsNodeDownOrDeleted|GetSettingAsBool" datastore/

Length of output: 12837

Signed-off-by: James Munson <james.munson@suse.com>
@derekbit derekbit force-pushed the 9093-not-ready-fail-safe branch from c1102a3 to 978f2d3 Compare December 12, 2024 12:27
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
controller/node_controller.go (2)

437-442: Improve error handling to avoid redundant logging

The error is both logged and returned with the same message. Since the error will be logged by the caller's error handling, we can either:

  1. Return the error without logging
  2. Log with additional context but don't return the error
 if err = nc.clearDelinquentLeasesIfNodeNotReady(node); err != nil {
-    log.WithError(err).Warnf("Failed to clear delinquent leases for node %v", node.Name)
-    return err
+    return errors.Wrapf(err, "failed to clear delinquent leases for node %v", node.Name)
 }

2229-2264: LGTM! Implementation looks correct and well-structured

The implementation correctly:

  1. Checks the feature flag
  2. Verifies node state
  3. Processes all share managers to clear delinquent leases
  4. Uses a robust error handling pattern

Consider deduplicating the error message:

 if err != nil {
     nc.logger.WithError(err).Warnf("failed to clear delinquent lease for volume %v, node %v", sm.Name, node.Name)
     if storedError == nil {
-        storedError = errors.Wrapf(err, "failed to clear delinquent lease for volume %v, node %v", sm.Name, node.Name)
+        storedError = err
     }
 }

Would you like me to help generate unit tests for this new method? The tests should cover:

  1. Feature flag disabled
  2. Node not down/deleted
  3. Error handling when clearing leases
  4. Success case with multiple share managers
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c1102a3 and 978f2d3.

📒 Files selected for processing (3)
  • controller/node_controller.go (2 hunks)
  • controller/volume_controller.go (1 hunks)
  • datastore/kubernetes.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • datastore/kubernetes.go
  • controller/volume_controller.go
🔇 Additional comments (1)
controller/node_controller.go (1)

2229-2264: Verify the usage of clearDelinquentLeasesIfNodeNotReady

Let's verify that this is the only place where delinquent leases are cleared for not-ready nodes.

✅ Verification successful

Implementation of clearDelinquentLeasesIfNodeNotReady is correctly isolated

The verification shows that:

  • The ClearDelinquentAndStaleStateIfVolumeIsDelinquent is used in only three places:
    1. The implementation in datastore
    2. In node_controller.go for handling not-ready nodes (our case)
    3. In volume_controller.go with an empty node name parameter (different use case)
  • The node state check via IsNodeDownOrDeleted is used consistently across the codebase for various node state handling scenarios
  • This is the only place where node down status is directly tied to clearing delinquent leases for share managers
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other places that might clear delinquent leases

# Check for similar functionality elsewhere
rg "ClearDelinquentAndStaleStateIfVolumeIsDelinquent.*node.*Name" --type go

# Check for other node state checks combined with lease operations
rg "IsNodeDownOrDeleted.*ClearDelinquent" --type go

Length of output: 387


Script:

#!/bin/bash
# Let's check where IsNodeDownOrDeleted is used to understand the node state handling
rg "IsNodeDownOrDeleted" -A 5 --type go

# Also check for any other lease clearing mechanisms
rg "ClearDelinquent" --type go

# Check where this function is called from
ast-grep --pattern 'clearDelinquentLeasesIfNodeNotReady($$$)'

Length of output: 10555

Copy link
Member

@derekbit derekbit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@derekbit
Copy link
Member

@PhanLe1010 Could you take a look and see if there is any concern? Thank you.

@derekbit derekbit merged commit f37529b into longhorn:master Dec 14, 2024
9 checks passed
@PhanLe1010
Copy link
Contributor

The idea sound good to me

@james-munson james-munson deleted the 9093-not-ready-fail-safe branch December 17, 2024 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants