Feature: Ability to repair defective on-disk data #7912
Comments
Good idea. A check of the file from the backup against other known-good parts of the file to be fixed should also be done. That way we would have some confidence that it's the right version of the backup file. This does add complexity, since the destination file may be compressed, deduplicated, and encrypted. I think we should consider that some serious recovery functions like this one should be put in a separate command. Perhaps using
The check of the supplied data would be implicit, as a successful repair could only happen if the still-existing on-disk checksum matches the file contents. I honestly don't care what the command ends up being called; the syntax was mainly for illustration.
I don't understand why you have to destroy snapshots. Can you explain?
@GregorKopka One big problem with functionality like this would be that when a checksum fails on a data block, you know that either the checksum or the data block is wrong, but not necessarily which one. Similarly, since the checksums are on the blocks as-written, and it's completely possible for you to compress a block with two different implementations of the same algorithm and get different results while still decompressing to the same block, you're not really going to be able to sanity-check this in any useful fashion, just run along and clobber all the blocks with the copies you're feeding in. (Not to mention you'd probably be clobbering the blocks in-place rather than CoW like everything else, and that's such a horrific can of worms.) You'd also really want to not pass the replacement data in via pipe, because then you can't have a sanity check that you're passing in data with the same expected length as the thing you're trying to repair.

@richardelling If you have a file in 30 snapshots that has a data block mangled, you get to nuke the snapshots if you want the error to go away. I believe the proposal is for the ability to hand something a copy that you promise is the intact version of the exact file and in-place overwrite it.
It is absurd to delete data just to make an error message go away. For an operations team, just annotate it as an exception (we don't care about this particular error message ever again). This works fine for the use case because the original data still exists, so this is not a data-loss event.
@rincebrain Metadata is always redundant (at least one more copy than the data has), so a defect there is less likely; it would also lead to a defect that can only be solved by destroying the dataset/pool. As ZFS tracks how (with what compression) blocks are written, it can use the same mode to process the replacement data and will end up with the correct checksum when the replacement data equals the original.

@richardelling While you might have a point, it doesn't account for users with access to the snapshot scheme - these can react quite differently than professionals when confronted with being unable to read a file. Surely the admin could restore the dataset from a ZFS-based backup (that kept the snapshot chain), but it would be way easier if such a defect could simply be repaired in place. I brought this up as I expect the problem to come up more often in the future, as small (non-redundant) pools (in clients, small system backups on USB drives, ...) are likely to become more common.
This is essentially how the self-healing functionality works today, with the exception that the replacement data is being provided by the user. There are some complications with encryption, compression, and deduplication as mentioned above, but none of them should be deal breakers. In the worst case, when the checksums differ, the attempted repair would fail. Limiting the repair functionality to level 0 data blocks would also be a good idea for safety.
Sorry? Do you have a link to some information about data block levels, please?
It is true that if you have the original file, then you'll know the correct data, and from the block pointer we know the compression and DVAs. So it is clearly possible to do this. But... some devices fail such that the LBA is not correctable. An obvious case is a disk with a
@richardelling I think everyone agrees that ignoring the error is better than destroying data. Do you agree that fixing the data (when possible) is better than ignoring the error? If so, there isn't really any disagreement here.
Bugs notwithstanding. The CLI design will be challenging, as will making it completely idiot-proof.
It's a little old but still accurate: http://www.giis.co.in/Zfs_ondiskformat.pdf - you want to look at pages 24 and 25. The specific concern is that because ZFS will trust the block contents as long as the checksum matches, we shouldn't allow it to overwrite any internal metadata. User file data will always be stored in the level 0 blocks.
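For reference, one way to see those block levels for a particular file is to dump its object with zdb; a rough sketch, assuming a made-up dataset tank/data and using the file's inode number (which on ZFS generally matches the object id):

# find the object number of the file (the inode number printed by ls -i)
ls -i /tank/data/somefile
# e.g. prints: 128  /tank/data/somefile

# dump that object, including its indirect (L1 and up) and level 0 (L0)
# block pointers with their DVAs, checksums and compression settings
zdb -ddddd tank/data 128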
There are a few suggested features for repair operations which would be reasonable to add to a new user-space tool. Such a tool wouldn't need to be completely idiot-proof, but it would be nice to have when you are otherwise out of options.
Yes. I intended it to rewrite only the checksum-failed data blocks (level 0, as I now know they are called) of the defective file with the supplied replacement data, but only if this leads to a correct checksum for the rewritten block. Feeding non-matching data thus can't lead to an on-disk change, which should be idiot-proof. Metadata will not be written or modified at all.
In case of a disk read error with no self-healing or backup data to restore from, I would like to have a command to force ZFS to zero out the bad block, causing the disk to remap the sector, and to recalculate and update the checksum. Or ZFS could do the remapping at a higher level, which may be better if there is free disk space, as the on-disk remapping is limited to the number of built-in reserved sectors.

Of course doing this would corrupt the file (but it is already corrupted at that point anyway). However, a lot of file formats can handle/recover from file corruption at the application level, like video files where playback would just blip and move on. Yet with the current behavior, where ZFS throws a read error, some applications just stop.

Currently, even when I accept the partial corruption of the file, there is no way that I know of to clear a file / ZFS from a bad sector (without mirror, raidz) except deleting the file and all related snapshots. Even if I zero out the bad sector manually using dd/hdparm and force the disk to remap, the ZFS checksum is still wrong and ZFS still errors out.
@aletus it is not required to change ZFS to implement this functionality. It can all be done in userland.
Hi @richardelling, would you give some direction on how to accomplish this in userland? As mentioned, a direct dd/hdparm write to the disk does not update the checksum, which causes scrub errors. And even if I am able to identify the byte offset into the file where the bad sector is and force a write into that section of the file, ZFS COW would actually redirect that write somewhere else and all the snapshots would still be corrupted and show errors on scrubbing. Am I misunderstanding how this works?
@aletus You cannot update a checksum (or by extension anything else) in an existing ZFS block. That is fundamentally part of how ZFS works. The change being requested here is to allow the user to provide the correct data, which matches the checksum. That is possible. What you are suggesting is (within reason) not.
Hi @rlaager. Now I am even more confused. @richardelling mentioned it is possible to do this in userland without any change in ZFS, although he didn't mention how. And you are saying it is not possible even within ZFS.
@aletus simply find the block in the file that is corrupted (dd can do this) and write a new block (dd can also do this). Step and repeat until dd passes.
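A minimal sketch of that userland approach, assuming a 128K recordsize and made-up file names; note, as the next comment points out, that this allocates new blocks rather than healing the old ones in place:

# recordsize of the dataset, check with: zfs get recordsize tank/data
RS=$((128 * 1024))
BAD=/tank/data/damaged_file          # file with the unreadable record
GOOD=/backup/damaged_file            # known-good copy of the same file
records=$(( ($(stat -c %s "$BAD") + RS - 1) / RS ))
for i in $(seq 0 $((records - 1))); do
    # a read that hits the corrupt block returns EIO and makes dd fail
    if ! dd if="$BAD" of=/dev/null bs=$RS skip=$i count=1 2>/dev/null; then
        echo "record $i unreadable, rewriting it from the backup copy"
        dd if="$GOOD" of="$BAD" bs=$RS skip=$i seek=$i count=1 conv=notrunc
    fi
done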
@richardelling that won't overwrite the old mangled block in historical copies, though, it'll just allocate a new one. Or are you suggesting reading until you get EIEIO and then digging through dmesg to see where it's complaining, then carefully having dd overwrite on the raw device?
@rincebrain I did what you mentioned: getting the bad sector number from dmesg and writing the block on the raw device using either dd or hdparm. However, that does not update the ZFS checksum, and I still end up with a checksum read error when I read the file back in the application layer. I don't know any way to avoid that read error. I really think we have to do this at the ZFS layer and have ZFS update the checksum for the zeroed-out block at the same time.
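For context, the raw-device workaround described above usually looks something like the sketch below (device name and sector number are made up); it forces the drive to remap the sector but, as discussed, leaves the ZFS checksum mismatched:

# sector number taken from the I/O error in dmesg, device name is an example
SECTOR=123456789
DEV=/dev/sdX

# confirm the sector really is unreadable
hdparm --read-sector $SECTOR $DEV

# overwrite it with zeroes so the drive remaps it
hdparm --yes-i-know-what-i-am-doing --write-sector $SECTOR $DEV

# or, roughly equivalently, with dd (512-byte logical sectors assumed)
dd if=/dev/zero of=$DEV bs=512 seek=$SECTOR count=1 conv=notrunc oflag=direct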
@aletus Yes, because if the block was compressed, or the location of the read errors was some metadata, or many other things, that wouldn't fly.

I don't think you understand what we're telling you. ZFS really, really does not have a mechanism for mutating extant data in-place, or repointing old things to new modified locations retroactively. So you could either go compute a checksum collision (lol) to overwrite the block with, or provide a valid copy of the data for it to compress and appropriately store. This thread is about the desire to hand such a valid data source to ZFS from userland. It's not likely to happen that someone will implement a whole indirection layer just so you can get incorrect data out of a file without zpool status reporting issues.
@rincebrain Understood - what I am asking for is not possible within the ZFS architecture and implementation. The comment from @richardelling had my hopes up :)

Just a summary of my understanding for those who later Google this: if you have disk errors / pending sectors waiting for remap on ZFS, you have no redundancy (mirrors, raidz), and you also have snapshots, there is currently no way to force a remap of those sectors or to clear the error from zpool scrub status, even accepting the loss of those files. So your scrubs will always show errors and your SMART status will always show pending sectors until you delete all the associated snapshots and the original file.

In my case I have automatic snapshots that get cleaned up after a year, so my theory is to use ddrescue to copy the file to a current "live" copy with the bad blocks zeroed out. This is a new duplicate copy of the file, but it is readable without disk errors, unlike the old one. Then I wait a year for the snapshots that reference the bad blocks and the old file to age out and get deleted. At that point the zpool status errors should clear up, if I understand things correctly.

By the way @rincebrain, my understanding is that ZFS stores two copies of metadata spread apart, so if a bad sector happens to be in the metadata section it should be able to correct itself using the other copy, right?
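A rough sketch of that salvage step (file names are examples): either GNU ddrescue or plain dd with conv=noerror,sync should produce a readable copy in which the unreadable records come back as zeroes.

# with GNU ddrescue, which skips what it cannot read and records progress in a mapfile
ddrescue /tank/data/damaged_file /tank/data/damaged_file.salvaged rescue.map

# or with plain dd: noerror keeps going on read errors, sync pads the failed
# reads with zeroes (block size matched to the dataset recordsize)
dd if=/tank/data/damaged_file of=/tank/data/damaged_file.salvaged bs=128k conv=noerror,sync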
@aletus You understand correctly, as long as there is redundancy (either from the vdev being a mirror/raidz, or from ZFS maintaining at least one more copy of metadata than of the data). That's the reason why metadata errors are less likely (than data block errors) - and more likely not to stem from corruption of a block already sitting on-disk (unless the drive(s) experience massive failures), but from garbage having been written in the first place (corrupted before it hits the storage medium by a defect in code, CPU, RAM, controller, cabling, ...).
See here for a rudimentary tool for ideas: https://www.joyent.com/blog/zfs-forensics-recovering-files-from-a-destroyed-zpool |
Hi, I know this issue is old, but I think it's still a thing. I recently fixed errors on my pool this way, so I thought I should share my prototype https://github.com/t-oster/zfs-repair-dataset. If anyone is interested, there is plenty of room for improvements (confirm checksum, skip healthy blocks, handle compression), but on uncompressed datasets on single-vdev pools it seems to work. |
I've implemented a corruption healing zfs receive, see #9372 |
Thank you for your work. |
I had a corrupted block in a mirror that I managed to manually repair after a couple weeks of tinkering. The block was both compressed and encrypted, which made the process more challenging. A takeaway from that experience is that for a repair feature like this to work, we might also need tools to inspect the corrupted data on disk. For encrypted datasets, we may need a tool that can decrypt a block (for inspection only) even if the MAC is invalid.

To show why these tools would be helpful, let me explain my specific corruption scenario. Scrubbing the pool revealed a single checksum failure (on both drives) in a file in an old snapshot. The file was a game asset, so it was non-essential data. However, it was in a filesystem with more important data, and there were many snapshots both before and after the corrupted snapshots, so deleting snapshots was not an acceptable solution. Leaving the corruption was also not acceptable (even though that file was deleted in later snapshots) because it caused

First, I located the corrupted part of the file (using

At this point, I needed to compress the recovered data and encrypt it using the block's parameters. I needed to compress the block exactly as it was compressed before. If the compressed block had even one bit flipped, the checksum and MAC would mismatch, and all I would know is that I got something wrong, but not what. In order to see whether I was getting the compression right, I wanted to decrypt the corrupt block on disk. I couldn't use

It was obvious that the decryption worked because the block was padded with zeroes at the end. (Plus, it turns out LZ4 looks different than random data, and you can develop an intuition for it by staring at LZ4 long enough!) Using the

Takeaways and implications for a repair tool:
The idea is that a zfs repair dataset filename < file command (as proposed at the top) would completely automate what you did manually, repairing any and all corrupt (according to the checksums in the metadata) on-disk data of that file - while not being able to further destroy data, as delivering the wrong source file would result in differing checksums. Surely you need a healthy source file that is, in the location(s) affected by on-disk corruption, identical to the data that was written back then - but how you construct that file is up to you; a sparse file that only contains valid data at the affected offsets would do.
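For illustration, such a sparse source file could be put together along these lines (offsets, record number, and file names are made up; the recovered chunk is assumed to be record-aligned):

# create an empty sparse file of the same length as the damaged file
truncate -s $(stat -c %s /tank/data/damaged_file) source.img

# drop the recovered data in at the offset of the damaged record
# (record 42 of a 128K-recordsize dataset in this example)
dd if=recovered_chunk.bin of=source.img bs=128k seek=42 conv=notrunc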
Yep, such a tool would have made my repair much easier. And in my specific scenario, I was able to find a matching block (same data, aligned to record size) simply by reading the blocks before and after the corrupted block. By all means, we should build the tool as you described! To clarify, when I said "Simply

It's possible that recovering smaller files would require more sophisticated tooling (like scanning byte-by-byte or decrypting partially corrupted blocks), but we can solve that problem better when we have a real example.
If it's "just" an L0 data block, you could probably reach out and invert the nopwrite logic, and then hook however corrective recv overwrites a thing in place.
For encryption-related maintenance tools, I would like to point to a use case for extracting (and perhaps injecting) IVs and other encryption-related metadata in #12649.
Originally I came up with this in "Automated orphan file recovery":
A zfs repair dataset filename < file function to fix on-disk data corruption would be a good thing. ZFS knows the checksums of the on-disk data, so it should be able to locate bad blocks in a file (even in a snapshot) and rewrite them, in place, with data verified as good against the known checksums.
No block pointer rewrite or anything, just restoring damaged sectors on-disk to the contents they should have (so the physical drives will remap them in case the sectors are pending reallocation).
I would guess this would be welcomed by quite a few, as it could be used to recover from data errors (using a backup of the affected files from that point in time) without having to destroy all snapshots referencing the bad blocks.
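A hypothetical usage example of the proposed command (dataset, paths, and backup location are made up), feeding a known-good copy of the file from a backup of roughly the same point in time on stdin:

zfs repair tank/data home/user/video.mkv < /backup/home/user/video.mkv

Since all snapshots reference the same damaged on-disk block, repairing that block in place would clear the error for the snapshots as well.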