Bad block pointer and poor user experience for recovering #12245
Comments
@tcaputi if your pool is still damaged I'd suggest applying the patch from #12054. While the original motivation behind that change was to resolve a deadlock when removing L2ARC devices, the fix also happens to move some block pointer verification code. Of course this doesn't address any of your points about this not being particularly easy to resolve, nor does it explain how the block pointer was damaged in the first place. But it's a start.
Thanks Brian. I'm gunna do some more debugging tonight and see what I can figure out. Is that patch in the latest release?
It just got merged to master, so it's not in 2.0.4. I just suggested it for inclusion in 2.0.5.
Got it. I'll cherry-pick it manually tonight and see if it helps. Thanks.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
This particular error path was addressed, as were a few others, but there's no doubt there are others we'll need to improve as they're discovered. We can open new issues as needed, closing.
System information
Describe the problem you're observing
My personal server has a bad block pointer. The server hosts several personal projects of mine, including backup services for devices on my network, a Minecraft server, a number of websites, and a service that downloads and categorizes tweets in a MySQL database for research projects. The Twitter service is the most intensive workload on the server.
The code is hitting zfs_panic_recover() so it seems that we somehow wrote out a correct checksum for this bad block pointer, which is very concerning. See the logs for more details.
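For readers unfamiliar with this code path, here is a minimal user-space sketch of the idea behind zfs_panic_recover(): the zfs_recover tunable decides whether the condition is reported as a warning or as a panic. This is an illustration under assumptions, not the OpenZFS source. The names sketch_level, sketch_level_for() and sketch_panic_recover() are hypothetical, and abort() stands in for a kernel panic.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the zfs_recover tunable: 0 = panic, 1 = warn and continue. */
static int zfs_recover = 0;

/* Hypothetical severity levels for this sketch only. */
enum sketch_level { SKETCH_WARN, SKETCH_PANIC };

/* Pick the severity of a "recoverable panic" from the tunable's value. */
static enum sketch_level
sketch_level_for(int recover)
{
	return (recover ? SKETCH_WARN : SKETCH_PANIC);
}

/*
 * User-space imitation of a recoverable panic: print the message, then
 * either return to the caller (warning) or abort the process (panic).
 */
static void
sketch_panic_recover(const char *fmt, ...)
{
	va_list ap;
	enum sketch_level level = sketch_level_for(zfs_recover);

	fputs(level == SKETCH_WARN ? "WARNING: " : "PANIC: ", stderr);
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);

	if (level == SKETCH_PANIC)
		abort();
}
```

In this sketch, setting zfs_recover to 1 only changes how the error is reported. It does not repair the damaged block pointer, so code that continues past the warning still has to cope with the bad data.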
In addition to the inconvenience of corrupted data, the experience of attempting to recover from this situation was not great: one has to set zfs_recover = 1 to attempt to resolve the issue, and it's not clear to the user what this will actually do.
Describe how to reproduce the problem
Unfortunately I don't know what caused this, but it does look like a code problem. We did see this at Datto several times in production and never 100% got to the bottom of it. It was simply too infrequent to debug and hard to identify when the problem was actually introduced.
Include any warning/errors/backtraces from the system logs
Backtraces from dmesg (note the WARNING messages):