-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: data lost caused by Longhorn CSI plugin doing a wrong re-encryption of volume in rare race condition #3566
Conversation
WalkthroughThe changes modify the encryption process in the Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant EncryptVolume
participant isDeviceEncrypted
participant Logger
Caller->>EncryptVolume: Request encryption (devicePath)
EncryptVolume->>isDeviceEncrypted: Check if device is encrypted
isDeviceEncrypted-->>EncryptVolume: Returns (error or true/false)
alt Error during check
EncryptVolume->>Logger: Log warning
EncryptVolume-->>Caller: Return error
else Device already encrypted
EncryptVolume->>Logger: Log info (skip encryption)
EncryptVolume-->>Caller: Return nil
else Device not encrypted
EncryptVolume->>Logger: Proceed with encryption process
EncryptVolume-->>Caller: Continue with encryption
end
Possibly related PRs
Suggested reviewers
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Btw, please let me know if you have better idea than the |
@PhanLe1010 go fmt failed |
…ion of volume in rare race condition see more details at longhorn/longhorn-manager#3566 longhorn-10416 Signed-off-by: Phan Le <phan.le@suse.com>
…ion of volume in rare race condition This happens with encrypted volume only While Longhorn CSI plugin is doing NodeStageVolume, if the Longhorn engine crash right before GetDiskFormat (https://github.com/longhorn/longhorn-manager/blob/ 5f9ec86/csi/node_server.go#L486) and recover quickly after that, the following race condition can happen: 1. When engine is crash, the engine removes the Longhorn device at /dev/longhorn/volume-name 2. Because the device path is gone, getDiskFormat returns empty value for diskFormat and nil error. Ref https://github.com/longhorn/longhorn-manager/blob/ 5f9ec86/vendor/k8s.io/ mount-utils/mount_linux.go#L687-L693 3. Now engine is recovered, so it recreate the device at /dev/longhorn/volume-name 4. However, due to diskFormat having empty value, Longhorn re-encrypts the volume which wipes out the all data. Ref https://github.com/longhorn/ longhorn-manager/blob/5f9ec86d18612afaa9f97c78cee3e34f7b653ad6/ csi/node_server.go#L513-L517 5. Finally, Longhorn cannot detect a filesystem for the volume any more so it reformat the filesystem RCA The root cause of the issue is that getDiskFormat is using blkid. The blkid command cannot differentiate 2 cases: the device doesn't exist VS the device doesn't have filesystem inside it Proposal Use cryptsetup isLuks instead. can differentiate between: 1. The device is already encrypted -> return exit code 0 2. The device is not encrypted -> return exit code 1 3. All other errors -> return exit code 4 Ref: https://gitlab.com/cryptsetup/cryptsetup/-/blob/main/FAQ.md?plain=1#L2848 Now we can safely make decision of only doing encryption in the 2nd case longhorn-10416 Signed-off-by: Phan Le <phan.le@suse.com>
Fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
csi/crypto/crypto.go (2)
194-201
: LGTM! Critical fix for race condition.The new function correctly uses
cryptsetup isLuks
to accurately determine the encryption status of a device, addressing the root cause of data loss during re-encryption. The implementation properly handles namespace isolation and error propagation.This change is crucial as it:
- Prevents race conditions during engine crashes
- Provides reliable device encryption detection
- Avoids data loss from incorrect re-encryption
88-97
: LGTM! Enhanced error handling and data protection.The changes correctly prevent data loss by checking encryption status before proceeding. The error handling and logging provide good observability.
Consider enhancing the log message in line 90 to include more context:
- logrus.WithError(err).Warnf("Failed to check IsDeviceEncrypted before encrypting volume %v", devicePath) + logrus.WithError(err).Warnf("Failed to check encryption status before encrypting volume %v. This might indicate issues with cryptsetup or device access", devicePath)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
vendor/github.com/longhorn/go-common-libs/ns/crypto.go
is excluded by!vendor/**
📒 Files selected for processing (1)
csi/crypto/crypto.go
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Build binaries
- GitHub Check: Summary
🔇 Additional comments (1)
csi/crypto/crypto.go (1)
14-15
: LGTM!The import reorganization follows Go's standard grouping practices.
…ion of volume in rare race condition see more details at longhorn/longhorn-manager#3566 longhorn-10416 Signed-off-by: Phan Le <phan.le@suse.com>
@mergify backport v1.8.x v1.7.x |
✅ Backports have been created
|
…ion of volume in rare race condition see more details at longhorn/longhorn-manager#3566 longhorn-10416 Signed-off-by: Phan Le <phan.le@suse.com>
longhorn/longhorn#10416
RCA
This happens with encrypted volume only
While Longhorn CSI plugin is doing
NodeStageVolume
, if the Longhorn engine crash right before GetDiskFormat and recover quickly after that, the following race condition can happen:/dev/longhorn/volume-name
getDiskFormat
returns empty value fordiskFormat
andnil
error. Reflonghorn-manager/vendor/k8s.io/mount-utils/mount_linux.go
Lines 687 to 693 in 5f9ec86
/dev/longhorn/volume-name
diskFormat
having empty value, Longhorn re-encrypts the volume which wipes out the all data. Reflonghorn-manager/csi/node_server.go
Lines 513 to 517 in 5f9ec86
getDiskFormat
is usingblkid
. Theblkid
command cannot differentiate 2 cases: the device doesn't exist VS the device doesn't have filesystem inside itProposal
Use cryptsetup isLuks instead. It can differentiate between:
Ref: https://gitlab.com/cryptsetup/cryptsetup/-/blob/main/FAQ.md?plain=1#L2848
With this proposal, we can safely make decision of only doing encryption in the 2nd case
Additional documentation or context
We were discussing of the below ideas with @derekbit and @shuo-wu but it seems cannot remove the race condition 100%
Idea 1: Check if the device path exist before doing GetDiskFormat
-> Cannot avoid the race condition as the device can still disappear before GetDiskFormat and re-appear after GetDiskFormat
Idea 2: Check if the device path exist after doing GetDiskFormat
-> Cannot avoid the race condition as the device can still disappear before GetDiskFormat and re-appear after GetDiskFormat