-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: data lost caused by Longhorn CSI plugin doing a wrong re-encryption of volume in rare race condition (backport #3566) #3568
Conversation
Cherry-pick of 72c9173 has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
Important Review skippedBot user detected. To trigger a single review, invoke the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
This pull request is now in conflict. Could you fix it @mergify[bot]? 🙏 |
…ion of volume in rare race condition This happens with encrypted volume only While Longhorn CSI plugin is doing NodeStageVolume, if the Longhorn engine crash right before GetDiskFormat (https://github.com/longhorn/longhorn-manager/blob/ 5f9ec86/csi/node_server.go#L486) and recover quickly after that, the following race condition can happen: 1. When engine is crash, the engine removes the Longhorn device at /dev/longhorn/volume-name 2. Because the device path is gone, getDiskFormat returns empty value for diskFormat and nil error. Ref https://github.com/longhorn/longhorn-manager/blob/ 5f9ec86/vendor/k8s.io/ mount-utils/mount_linux.go#L687-L693 3. Now engine is recovered, so it recreate the device at /dev/longhorn/volume-name 4. However, due to diskFormat having empty value, Longhorn re-encrypts the volume which wipes out the all data. Ref https://github.com/longhorn/ longhorn-manager/blob/5f9ec86d18612afaa9f97c78cee3e34f7b653ad6/ csi/node_server.go#L513-L517 5. Finally, Longhorn cannot detect a filesystem for the volume any more so it reformat the filesystem RCA The root cause of the issue is that getDiskFormat is using blkid. The blkid command cannot differentiate 2 cases: the device doesn't exist VS the device doesn't have filesystem inside it Proposal Use cryptsetup isLuks instead. can differentiate between: 1. The device is already encrypted -> return exit code 0 2. The device is not encrypted -> return exit code 1 3. All other errors -> return exit code 4 Ref: https://gitlab.com/cryptsetup/cryptsetup/-/blob/main/FAQ.md?plain=1#L2848 Now we can safely make decision of only doing encryption in the 2nd case longhorn-10416 Signed-off-by: Phan Le <phan.le@suse.com> (cherry picked from commit 72c9173) # Conflicts: # csi/crypto/crypto.go
Signed-off-by: Derek Su <derek.su@suse.com>
Signed-off-by: Derek Su <derek.su@suse.com>
4643b9b
to
1d6bbad
Compare
longhorn/longhorn#10416
RCA
This happens with encrypted volume only
While Longhorn CSI plugin is doing
NodeStageVolume
, if the Longhorn engine crash right before GetDiskFormat and recover quickly after that, the following race condition can happen:/dev/longhorn/volume-name
getDiskFormat
returns empty value fordiskFormat
andnil
error. Reflonghorn-manager/vendor/k8s.io/mount-utils/mount_linux.go
Lines 687 to 693 in 5f9ec86
/dev/longhorn/volume-name
diskFormat
having empty value, Longhorn re-encrypts the volume which wipes out the all data. Reflonghorn-manager/csi/node_server.go
Lines 513 to 517 in 5f9ec86
getDiskFormat
is usingblkid
. Theblkid
command cannot differentiate 2 cases: the device doesn't exist VS the device doesn't have filesystem inside itProposal
Use cryptsetup isLuks instead. It can differentiate between:
Ref: https://gitlab.com/cryptsetup/cryptsetup/-/blob/main/FAQ.md?plain=1#L2848
With this proposal, we can safely make decision of only doing encryption in the 2nd case
Additional documentation or context
We were discussing of the below ideas with @derekbit and @shuo-wu but it seems cannot remove the race condition 100%
Idea 1: Check if the device path exist before doing GetDiskFormat
-> Cannot avoid the race condition as the device can still disappear before GetDiskFormat and re-appear after GetDiskFormat
Idea 2: Check if the device path exist after doing GetDiskFormat
-> Cannot avoid the race condition as the device can still disappear before GetDiskFormat and re-appear after GetDiskFormat
This is an automatic backport of pull request #3566 done by [Mergify](https://mergify.com).