Feature: Ability to repair defective on-disk data #7912
Comments
Good idea. A check of the file from the backup against other known-good parts of the file to be fixed should also be done. That way we would have some confidence that it's the right version of the backup file. This does add complexity, since the destination file may be compressed, deduplicated, and encrypted. I think we should consider that some serious recovery functions like this one should be put in a separate command. Perhaps using
The check of the supplied data would be implicit, as a successful repair could only happen if the still-existing on-disk checksum matches the file contents. I honestly don't care what the command ends up being called; the syntax was mainly for illustration.
I don't understand why you have to destroy snapshots. Can you explain?
@GregorKopka One big problem with functionality like this would be that when a checksum fails on a data block, you know that either the checksum or the data block is wrong, but not necessarily which one. Similarly, since the checksums are on the blocks as-written, and it's completely possible for you to compress a block with two different implementations of the same algorithm and get different results while still decompressing to the same block, you're not really going to be able to sanity-check this in any useful fashion, just run along and clobber all the blocks with the copies you're feeding in. (Not to mention you'd probably be clobbering the blocks in-place rather than CoW like everything else, and that's such a horrific can of worms.) You'd also really want to not pass the replacement data in via pipe, because then you can't have a sanity check that you're passing in data with the same expected length as the thing you're trying to repair.

@richardelling If you have a file in 30 snapshots that has a data block mangled, you get to nuke the snapshots if you want the error to go away. I believe the proposal is for the ability to hand something a copy that you promise is the intact version of the exact file and in-place overwrite it.
It is absurd to delete data just to make an error message go away. For an operations team, just annotate it as an exception (we don't care about this particular error message ever again). This works fine for the use case because the original data still exists, so this is not a data-loss event.
@rincebrain Metadata is always redundant (at least one more copy than the data has), so a defect there is less likely; it would also lead to a defect that can only be solved by destroying the dataset/pool. As ZFS tracks how (with what compression) blocks are written, it can use the same mode to process the replacement data and will end up with the correct checksum when the replacement data equals the original.

@richardelling While you might have a point, it doesn't account for users with access to the snapshot scheme - these can react quite differently than professionals when confronted with being unable to read a file. Surely the admin could restore the dataset from a ZFS-based backup (that kept the snapshot chain), but it would be way easier if such a defect could simply be repaired in place. I brought this up as I expect the problem to come up more often in the future, as small (non-redundant) pools (in clients, small system backups on USB drives, ...) are likely to become more common.
This is essentially how the self-healing functionality works today, with the exception that the replacement data is being provided by the user. There are some complications with encryption, compression, and deduplication as mentioned above, but none of them should be deal breakers. In the worst case, when the checksums differ, the attempted repair would fail. Limiting the repair functionality to level 0 data blocks would also be a good idea for safety.
Sorry? Do you have a link to some information about data block levels, please?
It is true that if you have the original file, then you'll know the correct data, and from the block pointer we know the compression and DVAs. So it is clearly possible to do this. But... some devices fail such that the LBA is not correctable. An obvious case is a disk with a
@richardelling I think everyone agrees that ignoring the error is better than destroying data. Do you agree that fixing the data (when possible) is better than ignoring the error? If so, there isn't really any disagreement here.
Bugs notwithstanding. The CLI design will be challenging, as will making it completely idiot-proof.
It's a little old but still accurate: http://www.giis.co.in/Zfs_ondiskformat.pdf - you want to look at pages 24 and 25. The specific concern is that because ZFS will trust the block contents as long as the checksum matches, we shouldn't allow it to overwrite any internal metadata. User file data will always be stored in the level 0 blocks.
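For reference, one way to see those block levels for a particular file is to dump its object with zdb; a rough sketch, assuming a made-up dataset tank/data and using the file's inode number (which on ZFS generally matches the object id):

# find the object number of the file (the inode number printed by ls -i)
ls -i /tank/data/somefile
# e.g. prints: 128  /tank/data/somefile

# dump that object, including its indirect (L1 and up) and level 0 (L0)
# block pointers with their DVAs, checksums and compression settings
zdb -ddddd tank/data 128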
There are a few suggested features for repair operations which would be reasonable to add to a new user-space tool. Such a tool wouldn't need to be completely idiot-proof, but it would be nice to have when you are otherwise out of options.
Yes. I intended it to rewrite only the checksum-failed data blocks (level 0, as I now know they are called) of the defective file with the supplied replacement data, but only if this leads to a correct checksum for the rewritten block. Feeding non-matching data thus can't lead to an on-disk change, which should be idiot-proof. Metadata will not be written or modified at all.
In case of a disk read error with no self-healing or backup data to restore from, I would like to have a command to force ZFS to zero out the bad block, causing the disk to remap the sector, and to recalculate and update the checksum. Or ZFS could do the remapping at a higher level, which may be better if there is free disk space, as the on-disk remapping is limited to the number of built-in reserved sectors.

Of course doing this would corrupt the file (but it is already corrupted at that point anyway). However, a lot of file formats can handle/recover from file corruption at the application level, like video files where playback would just blip and move on. Yet with the current behavior, where ZFS throws a read error, some applications just stop.

Currently, even when I accept the partial corruption of the file, there is no way that I know of to clear a file / ZFS from a bad sector (without mirror, raidz) except deleting the file and all related snapshots. Even if I zero out the bad sector manually using dd/hdparm and force the disk to remap, the ZFS checksum is still wrong and ZFS still errors out.
@aletus it is not required to change ZFS to implement this functionality. It can all be done in userland.
Hi @richardelling, would you give some direction on how to accomplish this in userland? As mentioned, a direct dd/hdparm write to the disk does not update the checksum, which causes scrub errors. And even if I am able to identify the byte offset into the file where the bad sector is and force a write into that section of the file, ZFS COW would actually redirect that write somewhere else and all the snapshots would still be corrupted and show errors on scrubbing. Am I misunderstanding how this works?
@aletus You cannot update a checksum (or by extension anything else) in an existing ZFS block. That is fundamentally part of how ZFS works. The change being requested here is to allow the user to provide the correct data, which matches the checksum. That is possible. What you are suggesting is (within reason) not.
Hi @rlaager. Now I am even more confused. @richardelling mentioned it is possible to do this in userland without any change in ZFS, although he didn't mention how. And you are saying it is not possible even within ZFS.
@aletus simply find the block in the file that is corrupted (dd can do this) and write a new block (dd can also do this). Step and repeat until dd passes.
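A minimal sketch of that userland approach, assuming a 128K recordsize and made-up file names; note, as the next comment points out, that this allocates new blocks rather than healing the old ones in place:

# recordsize of the dataset, check with: zfs get recordsize tank/data
RS=$((128 * 1024))
BAD=/tank/data/damaged_file          # file with the unreadable record
GOOD=/backup/damaged_file            # known-good copy of the same file
records=$(( ($(stat -c %s "$BAD") + RS - 1) / RS ))
for i in $(seq 0 $((records - 1))); do
    # a read that hits the corrupt block returns EIO and makes dd fail
    if ! dd if="$BAD" of=/dev/null bs=$RS skip=$i count=1 2>/dev/null; then
        echo "record $i unreadable, rewriting it from the backup copy"
        dd if="$GOOD" of="$BAD" bs=$RS skip=$i seek=$i count=1 conv=notrunc
    fi
done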
@richardelling that won't overwrite the old mangled block in historical copies, though, it'll just allocate a new one. Or are you suggesting reading until you get EIEIO and then digging through dmesg to see where it's complaining, then carefully having dd overwrite on the raw device?
@rincebrain I did what you mentioned: getting the bad sector number from dmesg and writing the block on the raw device using either dd or hdparm. However, that does not update the ZFS checksum, and I still end up with a checksum read error when I read the file back in the application layer. I don't know any way to avoid that read error. I really think we have to do this at the ZFS layer and have ZFS update the checksum for the zeroed-out block at the same time.
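For context, the raw-device workaround described above usually looks something like the sketch below (device name and sector number are made up); it forces the drive to remap the sector but, as discussed, leaves the ZFS checksum mismatched:

# sector number taken from the I/O error in dmesg, device name is an example
SECTOR=123456789
DEV=/dev/sdX

# confirm the sector really is unreadable
hdparm --read-sector $SECTOR $DEV

# overwrite it with zeroes so the drive remaps it
hdparm --yes-i-know-what-i-am-doing --write-sector $SECTOR $DEV

# or, roughly equivalently, with dd (512-byte logical sectors assumed)
dd if=/dev/zero of=$DEV bs=512 seek=$SECTOR count=1 conv=notrunc oflag=direct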
@aletus Yes, because if the block was compressed, or the location of the read errors was some metadata, or many other things, that wouldn't fly.

I don't think you understand what we're telling you. ZFS really, really does not have a mechanism for mutating extant data in-place, or repointing old things to new modified locations retroactively. So you could either go compute a checksum collision (lol) to overwrite the block with, or provide a valid copy of the data for it to compress and appropriately store. This thread is about the desire to hand such a valid data source to ZFS from userland. It's not likely to happen that someone will implement a whole indirection layer just so you can get incorrect data out of a file without zpool status reporting issues.
@rincebrain Understood - what I am asking for is not possible within the ZFS architecture and implementation. The comment from @richardelling had my hopes up :)

Just a summary of my understanding for those who later Google this: if you have disk errors / pending sectors waiting for remap on ZFS, you have no redundancy (mirrors, raidz), and you also have snapshots, there is currently no way to force a remap of those sectors or to clear the error from zpool scrub status, even accepting the loss of those files. So your scrubs will always show errors and your SMART status will always show pending sectors until you delete all the associated snapshots and the original file.

In my case I have automatic snapshots that get cleaned up after a year, so my theory is to use ddrescue to copy the file to a current "live" copy with the bad blocks zeroed out. This is a new duplicate copy of the file, but it is readable without disk errors, unlike the old one. Then I wait a year for the snapshots that reference the bad blocks and the old file to age out and get deleted. At that point the zpool status errors should clear up, if I understand things correctly.

By the way @rincebrain, my understanding is that ZFS stores two copies of metadata spread apart, so if a bad sector happens to be in the metadata section it should be able to correct itself using the other copy, right?
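A rough sketch of that salvage step (file names are examples): either GNU ddrescue or plain dd with conv=noerror,sync should produce a readable copy in which the unreadable records come back as zeroes.

# with GNU ddrescue, which skips what it cannot read and records progress in a mapfile
ddrescue /tank/data/damaged_file /tank/data/damaged_file.salvaged rescue.map

# or with plain dd: noerror keeps going on read errors, sync pads the failed
# reads with zeroes (block size matched to the dataset recordsize)
dd if=/tank/data/damaged_file of=/tank/data/damaged_file.salvaged bs=128k conv=noerror,sync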
@aletus You understand correctly, as long as there is redundancy (either from the vdev being a mirror/raidz, or from ZFS maintaining at least one more copy of metadata than of the data). That's the reason why metadata errors are less likely (than data block errors) - and more likely not to stem from corruption of a block already sitting on-disk (unless the drive(s) experience massive failures), but from garbage having been written in the first place (corrupted before it hits the storage medium by a defect in code, CPU, RAM, controller, cabling, ...).
See here for a rudimentary tool for ideas: https://www.joyent.com/blog/zfs-forensics-recovering-files-from-a-destroyed-zpool |
Hi, I know this issue is old, but I think it's still a thing. I recently fixed errors on my pool this way, so I thought I should share my prototype https://github.com/t-oster/zfs-repair-dataset. If anyone is interested, there is plenty of room for improvements (confirm checksum, skip healthy blocks, handle compression), but on uncompressed datasets on single-vdev pools it seems to work. |
I've implemented a corruption healing zfs receive, see #9372 |
Thank you for your work. |
I had a corrupted block in a mirror that I managed to manually repair after a couple weeks of tinkering. The block was both compressed and encrypted, which made the process more challenging. A takeaway from that experience is that for a repair feature like this to work, we might also need tools to inspect the corrupted data on disk. For encrypted datasets, we may need a tool that can decrypt a block (for inspection only) even if the MAC is invalid.

To show why these tools would be helpful, let me explain my specific corruption scenario. Scrubbing the pool revealed a single checksum failure (on both drives) in a file in an old snapshot. The file was a game asset, so it was non-essential data. However, it was in a filesystem with more important data, and there were many snapshots both before and after the corrupted snapshots, so deleting snapshots was not an acceptable solution. Leaving the corruption was also not acceptable (even though that file was deleted in later snapshots) because it caused

First, I located the corrupted part of the file (using

At this point, I needed to compress the recovered data and encrypt it using the block's parameters. I needed to compress the block exactly as it was compressed before. If the compressed block had even one bit flipped, the checksum and MAC would mismatch, and all I would know is that I got something wrong, but not what. In order to see whether I was getting the compression right, I wanted to decrypt the corrupt block on disk. I couldn't use

It was obvious that the decryption worked because the block was padded with zeroes at the end. (Plus, it turns out LZ4 looks different than random data, and you can develop an intuition for it by staring at LZ4 long enough!) Using the

Takeaways and implications for a repair tool:
The idea is that a zfs repair dataset filename < file command (as proposed at the top) would completely automate what you did manually, repairing any and all corrupt (according to the checksums in the metadata) on-disk data of that file - while not being able to further destroy data, as delivering the wrong source file would result in differing checksums. Surely you need a healthy source file that is, in the location(s) affected by on-disk corruption, identical to the data that was written back then - but how you construct that file is up to you; a sparse file that only contains valid data at the affected offsets would do.
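For illustration, such a sparse source file could be put together along these lines (offsets, record number, and file names are made up; the recovered chunk is assumed to be record-aligned):

# create an empty sparse file of the same length as the damaged file
truncate -s $(stat -c %s /tank/data/damaged_file) source.img

# drop the recovered data in at the offset of the damaged record
# (record 42 of a 128K-recordsize dataset in this example)
dd if=recovered_chunk.bin of=source.img bs=128k seek=42 conv=notrunc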
Yep, such a tool would have made my repair much easier. And in my specific scenario, I was able to find a matching block (same data, aligned to record size) simply by reading the blocks before and after the corrupted block. By all means, we should build the tool as you described! To clarify, when I said "Simply

It's possible that recovering smaller files would require more sophisticated tooling (like scanning byte-by-byte or decrypting partially corrupted blocks), but we can solve that problem better when we have a real example.
If it's "just" an L0 data block, you could probably reach out and invert the nopwrite logic, and then hook however corrective recv overwrites a thing in place.
For encryption-related maintenance tools, I would like to point to a use case for extracting (and perhaps injecting) IVs and other encryption-related metadata in #12649.
Originally I came up with this in "Automated orphan file recovery":
A zfs repair dataset filename < file function to fix on-disk data corruption would be a good thing. ZFS knows the checksums of the on-disk data, so it should be able to locate bad blocks in a file (even in a snapshot) and rewrite them, in place, with data verified as good against the known checksums.
No block pointer rewrite or anything, just restoring damaged sectors on-disk to the contents they should have (so the physical drives will remap them in case the sectors are pending reallocation).
I would guess this would be welcomed by quite a few, as it could be used to recover from data errors (using a backup of the affected files from that point in time) without having to destroy all snapshots referencing the bad blocks.
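A hypothetical usage example of the proposed command (dataset, paths, and backup location are made up), feeding a known-good copy of the file from a backup of roughly the same point in time on stdin:

zfs repair tank/data home/user/video.mkv < /backup/home/user/video.mkv

Since all snapshots reference the same damaged on-disk block, repairing that block in place would clear the error for the snapshots as well.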