
Assertion failure when importing mirrored pool #604

Closed
ryao opened this issue Mar 15, 2012 · 14 comments

@ryao
Contributor

ryao commented Mar 15, 2012

I have two pools across 6 drives. One is a mirrored pool called bpool and takes 64MB of each drive. The other is a raidz2 pool called rpool and takes the remainder. Both are 1MB aligned at both their start and their end. I had just booted the system after repartitioning (and restoring a snapshot) to align rpool's end at a 1MB boundary when I encountered an assertion failure:

[ 276.990683] SPLError: 14061:0:(zio.c:2893:zio_done()) ASSERTION(zio->io_vd == NULL && zio->io_bp != NULL) failed
[ 276.990726] SPLError: 14061:0:(zio.c:2893:zio_done()) SPL PANIC
[ 276.990751] SPL: Showing stack for process 14061
[ 276.990758] Pid: 14061, comm: z_rd_int/3 Tainted: P O 3.2.8 #12
[ 276.990763] Call Trace:
[ 276.990786] [] ? spl_debug_dumpstack+0x1d/0x40 [spl]
[ 276.990798] [] ? spl_debug_bug+0x73/0x100 [spl]
[ 276.990826] [] ? zio_worst_error+0x14f6/0x1ff0 [zfs]
[ 276.990849] [] ? zio_buf_free+0x220/0x11d0 [zfs]
[ 276.990869] [] ? zil_vdev_offline+0x2f6/0x6a0 [zfs]
[ 276.990899] [] ? vdev_queue_io_done+0x79/0x69e0 [zfs]
[ 276.990919] [] ? zio_execute+0xd7/0x8f0 [zfs]
[ 276.990932] [] ? __taskq_destroy+0xbf9/0x11d0 [spl]
[ 276.990941] [] ? __schedule+0x29e/0x740
[ 276.990950] [] ? try_to_wake_up+0x270/0x270
[ 276.990961] [] ? __taskq_destroy+0x980/0x11d0 [spl]
[ 276.990972] [] ? __taskq_destroy+0x980/0x11d0 [spl]
[ 276.990980] [] ? kthread+0x96/0xa0
[ 276.990989] [] ? kernel_thread_helper+0x4/0x10
[ 276.990997] [] ? kthread_worker_fn+0x180/0x180
[ 276.991033] [] ? gs_change+0xb/0xb
[ 276.991125] SPL: Dumping log to /tmp/spl-log.1331804541.14061

The contents of /tmp/spl-log.1331804541.14061 are available at the following address:

http://www.cs.stonybrook.edu/~ryao/spl-log.1331804541.14061
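For what it's worth, spelling out what the failed check asserts (a minimal sketch with stand-in types, not the real zio_t from zio.c): when a zio is its own logical I/O it is a pool-level I/O, so it should carry a block pointer and not be attached to any vdev.

    #include <assert.h>
    #include <stddef.h>

    /* Stand-in types, only to illustrate the invariant behind
     * ASSERTION(zio->io_vd == NULL && zio->io_bp != NULL): the panic above
     * means a zio reached zio_done() looking like a logical (pool-level)
     * I/O while still attached to a vdev or lacking a block pointer. */
    typedef struct vdev vdev_t;
    typedef struct blkptr blkptr_t;

    typedef struct zio {
        vdev_t      *io_vd;         /* NULL for pool-level (logical) I/O */
        blkptr_t    *io_bp;         /* block pointer being read or written */
        struct zio  *io_logical;    /* the logical I/O this zio belongs to */
        int          io_error;      /* result of the I/O */
    } zio_t;

    /* The shape of the check that fails in zio_done() on the error path. */
    void
    done_invariant(const zio_t *zio)
    {
        if (zio->io_error && zio == zio->io_logical)
            assert(zio->io_vd == NULL && zio->io_bp != NULL);
    }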

The exact sequence of events was as follows:

  1. Restore the backup of rpool from the backup pool (external drive) after repartitioning and recreating rpool
  2. Boot the system
  3. Encounter a failure due to rpool being faulted
  4. Reboot to the initramfs rescue mode
  5. Export rpool
  6. Notice that the backup pool (no longer connected) is listed, so export it
  7. Import rpool
  8. Resume the boot sequence
  9. Log into the system
  10. Recompile the initramfs and, while it is running, export bpool
  11. Import bpool, which hangs
  12. Run a zfs command
  13. Run dmesg and see the assertion failure

I exported bpool at step 10 because I wanted to re-import it to force an update of /etc/zfs/zpool.cache for the initramfs. I have /boot symlinked to /bpool/boot, but nothing should have touched it when I did the export.

@ryao
Contributor Author

ryao commented Mar 15, 2012

I should probably add that I used the Ubuntu LiveCD with the daily PPA to do the repartitioning. I had a failure occur when trying to import rpool in order to send a snapshot to the backup pool, before I had repartitioned the disk. The dmesg output is at the following URL:

http://paste.pocoo.org/show/565855/

After the import hung, I tried running "zfs" and "zpool" once or twice, which locked each terminal. :/

@behlendorf
Contributor

Interestingly, I saw this for the first time yesterday as well; I filed it as issue #602.

@ryao
Contributor Author

ryao commented Mar 15, 2012

Do you want to keep this one open until we can verify a fix for this resolves things on my system?

With that said, I just hit this again:

[ 177.536568] SPLError: 6297:0:(zio.c:2893:zio_done()) ASSERTION(zio->io_vd == NULL && zio->io_bp != NULL) failed
[ 177.536595] SPLError: 6297:0:(zio.c:2893:zio_done()) SPL PANIC
[ 177.536614] SPL: Showing stack for process 6297
[ 177.536616] Pid: 6297, comm: z_rd_int/1 Tainted: P O 3.2.9 #2
[ 177.536618] Call Trace:
[ 177.536628] [] ? spl_debug_dumpstack+0x1d/0x40 [spl]
[ 177.536631] [] ? spl_debug_bug+0x73/0x100 [spl]
[ 177.536642] [] ? zio_worst_error+0x14f6/0x1ff0 [zfs]
[ 177.536647] [] ? zio_buf_free+0x220/0x11d0 [zfs]
[ 177.536652] [] ? zil_vdev_offline+0x2f6/0x6a0 [zfs]
[ 177.536660] [] ? vdev_queue_io_done+0x79/0x69e0 [zfs]
[ 177.536665] [] ? zio_execute+0xd7/0x8f0 [zfs]
[ 177.536668] [] ? __taskq_destroy+0xbf9/0x11d0 [spl]
[ 177.536671] [] ? __schedule+0x29e/0x740
[ 177.536674] [] ? try_to_wake_up+0x270/0x270
[ 177.536677] [] ? __taskq_destroy+0x980/0x11d0 [spl]
[ 177.536680] [] ? __taskq_destroy+0x980/0x11d0 [spl]
[ 177.536682] [] ? kthread+0x96/0xa0
[ 177.536685] [] ? kernel_thread_helper+0x4/0x10
[ 177.536687] [] ? kthread_worker_fn+0x180/0x180
[ 177.536688] [] ? gs_change+0xb/0xb
[ 177.536757] SPL: Dumping log to /tmp/spl-log.1331849987.6297
[ 177.537705] SPLError: 6300:0:(zio.c:2893:zio_done()) ASSERTION(zio->io_vd == NULL && zio->io_bp != NULL) failed
[ 177.537746] SPLError: 6300:0:(zio.c:2893:zio_done()) SPL PANIC
[ 177.537771] SPL: Showing stack for process 6300
[ 177.537778] Pid: 6300, comm: z_rd_int/4 Tainted: P O 3.2.9 #2
[ 177.537783] Call Trace:
[ 177.537803] [] ? spl_debug_dumpstack+0x1d/0x40 [spl]
[ 177.537814] [] ? spl_debug_bug+0x73/0x100 [spl]
[ 177.537838] [] ? zio_worst_error+0x14f6/0x1ff0 [zfs]
[ 177.537860] [] ? zio_buf_free+0x220/0x11d0 [zfs]
[ 177.537881] [] ? zil_vdev_offline+0x2f6/0x6a0 [zfs]
[ 177.537910] [] ? vdev_queue_io_done+0x79/0x69e0 [zfs]
[ 177.537931] [] ? zio_execute+0xd7/0x8f0 [zfs]
[ 177.537943] [] ? __taskq_destroy+0xbf9/0x11d0 [spl]
[ 177.537951] [] ? __schedule+0x29e/0x740
[ 177.537959] [] ? try_to_wake_up+0x270/0x270
[ 177.537970] [] ? __taskq_destroy+0x980/0x11d0 [spl]
[ 177.537981] [] ? __taskq_destroy+0x980/0x11d0 [spl]
[ 177.537988] [] ? kthread+0x96/0xa0
[ 177.537997] [] ? kernel_thread_helper+0x4/0x10
[ 177.538026] [] ? kthread_worker_fn+0x180/0x180
[ 177.538034] [] ? gs_change+0xb/0xb
[ 177.538152] SPL: Dumping log to /tmp/spl-log.1331849987.6300

I had booted using a dracut-built initramfs when trying to debug the cause of a deadlock in a genkernel-built initramfs. bpool had not been mounted by dracut, so I ran zpool import bpool, which froze. I then checked dmesg and saw a double assertion failure.

The logs are available from the following URLs:

http://www.cs.stonybrook.edu/~ryao/spl-log.1331849987.6297
http://www.cs.stonybrook.edu/~ryao/spl-log.1331849987.6300

@behlendorf
Contributor

Sure, I don't mind leaving both open, although it might be a while before I get around to looking at this; I have a lot on my plate.

@ryao
Contributor Author

ryao commented Mar 15, 2012

@behlendorf My guess is that there is a race condition that triggers this when imports are fast. Before commit 613d88e, imports were slow enough that we avoided the unsafe window of the race, but after it was committed, imports of new pools became fast enough that we hit the unsafe window far more often.

I realize this is a bit of hand-waving, but my familiarity with the codebase is not good enough to go into more detail.

@behlendorf
Contributor

You may be right to blame 613d88e, but I suspect a different reason. One accidental side effect of this change that we're just noticing: if you create a new zpool with the updated alignment over an existing zpool with the old alignment, then one of the old vdev labels will not get overwritten and will appear at the beginning of partition 9. That may cause issues during import, since it will now detect the new pool in partition 1 and the old pool in partition 9.
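To illustrate why a leftover label can end up there, here is a simplified sketch assuming the standard on-disk layout of four 256 KB vdev labels, two at each end of the device; the 64 MB partition size below is only a hypothetical example, not taken from this pool.

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified sketch of vdev label placement (assumes the standard layout:
     * four 256 KB labels, L0/L1 at the front of the device, L2/L3 at the end).
     * When a pool is recreated with different partition alignment, the old
     * trailing labels can survive beyond the new partition's end and later be
     * detected as a stale pool, e.g. at the start of partition 9. */
    #define VDEV_LABEL_SIZE (256ULL * 1024)
    #define VDEV_LABELS     4

    static uint64_t
    label_offset(uint64_t psize, int l)
    {
        /* L0, L1 live at the front; L2, L3 at the back of the device. */
        return ((uint64_t)l * VDEV_LABEL_SIZE +
            (l < VDEV_LABELS / 2 ? 0 : psize - VDEV_LABELS * VDEV_LABEL_SIZE));
    }

    int
    main(void)
    {
        uint64_t psize = 64ULL * 1024 * 1024;   /* hypothetical 64 MB partition */

        for (int l = 0; l < VDEV_LABELS; l++)
            printf("L%d at byte offset %llu\n", l,
                (unsigned long long)label_offset(psize, l));
        return (0);
    }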

@ryao
Contributor Author

ryao commented Mar 16, 2012

This probably goes without saying, but a workaround for this issue is to compile without --enable-debug. I had been compiling both SPL and ZFS with --enable-debug, discovered that Ubuntu's daily PPA was unaffected, traced the difference to --enable-debug, and worked around the issue by disabling it.

So far, there has been no ill effect from disabling the assertions, although the cause of the assertion failure needs more attention. Time permitting, I will try to investigate it, although I make no promises.
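For anyone wondering why --enable-debug makes the difference: in a debug build the assertions are real checks that panic the thread, while in a default build they compile away entirely. A rough, self-contained sketch of the idea (not the exact SPL/ZFS macros; ZFS_DEBUG here is just an illustrative compile-time switch):

    #include <stdio.h>
    #include <stdlib.h>

    /* Rough sketch of how the debug assertions behave; not the exact SPL/ZFS
     * macro definitions.  With --enable-debug the condition is evaluated and a
     * failure panics the offending thread (the SPLError / SPL PANIC output
     * above); without it the check compiles to nothing, which is why the same
     * import succeeds on a non-debug build. */
    #ifdef ZFS_DEBUG
    #define ASSERT(cond)                                                \
        do {                                                            \
            if (!(cond)) {                                              \
                fprintf(stderr, "ASSERTION(" #cond ") failed\n");       \
                abort();        /* stands in for SPL PANIC */           \
            }                                                           \
        } while (0)
    #else
    #define ASSERT(cond)    ((void)0)       /* compiled out entirely */
    #endif

    int
    main(void)
    {
        int io_error = 5;       /* pretend an I/O failed with EIO */

        /* Aborts only when built with -DZFS_DEBUG (i.e. a debug build). */
        ASSERT(io_error == 0);
        printf("io_error = %d; reached the end, so the assertion was "
            "compiled out\n", io_error);
        return (0);
    }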

@behlendorf
Contributor

Yes, by default all the assertions are disabled since they do have a negative impact on performance. I just tend to always leave them enabled when working. I doubt there will be any ill effects from this particular error, but you're right that it still needs to be explained and fixed.

@behlendorf
Contributor

@gentoofan If you still have a pool which consistently hits this issue, could you try reverting commit 2c6d0b1 and see if it resolves the problem? I suspect this patch may have had unexpected consequences.

@behlendorf
Contributor

@gentoofan Or better yet, if you still have a test case, can you try the above fix, behlendorf/zfs@1813cc0, which I'm virtually certain will resolve the issue introduced by 2c6d0b1.
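For context on what went wrong with 2c6d0b1: a recycled zio can keep a stale io_logical pointer, so a vdev-level zio may later satisfy the "zio == zio->io_logical" test in zio_done() and trip the assertion above. A hypothetical sketch of the idea behind the fix, with stand-in types and a made-up helper name (not the actual patch):

    #include <stddef.h>

    /* Stand-in types and a hypothetical helper, not the real zio code.  The
     * point: whenever a zio structure is reused, io_logical must be
     * reinitialized; otherwise a vdev-level child can inherit a stale pointer,
     * look like the logical I/O in zio_done(), and trip the assertion. */
    typedef struct vdev vdev_t;
    typedef struct blkptr blkptr_t;

    typedef struct zio {
        vdev_t      *io_vd;
        blkptr_t    *io_bp;
        struct zio  *io_logical;
    } zio_t;

    void
    zio_reuse_init(zio_t *zio, vdev_t *vd, blkptr_t *bp, zio_t *parent_logical)
    {
        zio->io_vd = vd;
        zio->io_bp = bp;
        zio->io_logical = NULL;     /* never carry over a stale pointer */

        if (vd == NULL && bp != NULL)
            zio->io_logical = zio;              /* pool-level: its own logical */
        else if (parent_logical != NULL)
            zio->io_logical = parent_logical;   /* child: inherit from parent */
    }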

@ryao
Contributor Author

ryao commented Mar 21, 2012

I tried behlendorf/zfs@1813cc0, which works with --disable-debug, but if I compile with --enable-debug, the system will hang when attempting to load the zfs module. This is during the early boot process. I am booting off a ZFS rootfs.

I will try reverting 2c6d0b1 tomorrow.

@ryao
Contributor Author

ryao commented Mar 21, 2012

I tried reverting 2c6d0b1 and the result is the same as when I had applied behlendorf/zfs@1813cc0. I am not sure what is causing this and unfortunately, I have no way of finding out.

@behlendorf
Contributor

So with --enable-debug and the patch applied the system just hangs. That's odd; presumably this only occurs with the test case pool?

behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 21, 2012
This patch was slightly flawed and allowed for zio->io_logical
to potentially not be reinitialized for a new zio.  This could
lead to assertion failures in specific cases when debugging is
enabled (--enable-debug) and I/O errors are encountered.  It
may also have caused problems when issuing logical I/Os.

Since we want to make sure this workaround can be easily removed
in the future (when we have the real fix), I'm reverting this
change and applying a new version of the patch which includes
the zio->io_logical fix.

This reverts commit 2c6d0b1.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#602
Issue openzfs#604
@behlendorf
Contributor

Closing this issue; I'm quite sure this fixes the described assertion. If you can reproduce a new failure mode, let's open a new issue to track it.

pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Sep 26, 2023

Merge remote-tracking branch '6.0/stage' into 'master'