zpool core dump #536

Closed
behlendorf opened this issue Jan 17, 2012 · 15 comments

@behlendorf (Contributor)

zpool create grove329 /dev/mapper/grove329_1 /dev/mapper/grove329_2 /dev/mapper/grove329_3

After the node crashed and was power cycled:

zpool status showed no pools available.

zpool import showed the two pools it could see (grove329 and grove330), but they were showing up with devices like sde and sdh instead of the multipath devices, and the pools were listed as unavailable.

A zpool import -d /dev/mapper showed the proper pools, and all the devices showed up as online.

When I try a zpool import -d /dev/mapper grove329, the zpool command core dumps. I tried a couple of times, with and without the -f flag, and it core dumped both times. After this initial core dump, any subsequent zpool commands also dump core.

@prakashsurya (Member)

I'm getting weird behavior with the zpool and zfs commands; each invocation simply prints "Aborted":

grove329@surya1:zpool status
Aborted
grove329@surya1:zpool
Aborted
grove329@surya1:zpool import -d /dev/mapper 
Aborted
grove329@surya1:zfs status
Aborted
grove329@surya1:zfs
Aborted

@behlendorf (Contributor, Author)

Try running it under gdb and see what's happening. I occasionally see this sort of behavior in my automated test VMs too, but I've never been able to reproduce it on demand and run it to ground.

@prakashsurya (Member)

grove329@surya1:gdb zpool
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /sbin/zpool...(no debugging symbols found)...done.
(gdb) run status
Starting program: /sbin/zpool status
[Thread debugging using libthread_db enabled]
Detaching after fork from child process 6426.

Program received signal SIGABRT, Aborted.
0x00002aaaacd56885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install zfs-0.6.0-rc6_1chaos.ch5.x86_64
(gdb) bt
#0  0x00002aaaacd56885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00002aaaacd58065 in abort () at abort.c:92
#2  0x00002aaaabc21192 in make_dataset_handle_common (zhp=0x6223e0, zc=<value optimized out>) at ../../lib/libzfs/libzfs_dataset.c:426
#3  0x00002aaaabc211ef in make_dataset_handle_zc (hdl=0x61b060, zc=0x7fffffff7bc0) at ../../lib/libzfs/libzfs_dataset.c:473
#4  0x00002aaaabc2169b in zfs_iter_filesystems (zhp=0x61cdb0, func=0x2aaaac2e2270 <update_zfs_shares_cb>, data=0x7fffffffe220)
    at ../../lib/libzfs/libzfs_dataset.c:2500
#5  0x00002aaaac2e231a in update_zfs_shares_cb (zhp=0x61cdb0, pcookie=0x7fffffffe220) at ../../lib/libshare/libshare.c:242
#6  0x00002aaaabc1ed31 in zfs_iter_root (hdl=0x61b060, func=0x2aaaac2e2270 <update_zfs_shares_cb>, data=0x7fffffffe220)
    at ../../lib/libzfs/libzfs_config.c:365
#7  0x00002aaaac2e29d7 in update_zfs_shares (init_service=<value optimized out>) at ../../lib/libshare/libshare.c:326
#8  sa_init (init_service=<value optimized out>) at ../../lib/libshare/libshare.c:97
#9  0x00002aaaac2e29f0 in libshare_init () at ../../lib/libshare/libshare.c:113
#10 0x00002aaaac2e3a16 in __do_global_ctors_aux () from //lib64/libshare.so.1
#11 0x00002aaaac2e12f3 in _init () from //lib64/libshare.so.1
#12 0x00002aaaac8f1990 in ?? ()
#13 0x00002aaaaaab94a5 in call_init (main_map=0x2aaaaaccc188, argc=-1404156232, argv=0x7fffffffe2e8, env=0x7fffffffe300) at dl-init.c:70
#14 _dl_init (main_map=0x2aaaaaccc188, argc=-1404156232, argv=0x7fffffffe2e8, env=0x7fffffffe300) at dl-init.c:134
#15 0x00002aaaaaaabb3a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#16 0x0000000000000002 in ?? ()
#17 0x00007fffffffe5a9 in ?? ()
#18 0x00007fffffffe5b5 in ?? ()
#19 0x0000000000000000 in ?? ()

@behlendorf (Contributor, Author)

It looks like the zfs_dmustats.dds_type just isn't set yet. See the second abort in make_dataset_handle_common():426.

@prakashsurya (Member)

Here's the piece of code triggering the abort():


        if (zhp->zfs_dmustats.dds_is_snapshot)
                zhp->zfs_type = ZFS_TYPE_SNAPSHOT;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZVOL)
                zhp->zfs_type = ZFS_TYPE_VOLUME;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZFS)
                zhp->zfs_type = ZFS_TYPE_FILESYSTEM;
        else
                abort();        /* we should never see any other types */

@prakashsurya (Member)

This seems odd. Just above the previous code snippet, it executes this:

        /*
         * We've managed to open the dataset and gather statistics.  Determine
         * the high-level type.
         */
        if (zhp->zfs_dmustats.dds_type == DMU_OST_ZVOL)
                zhp->zfs_head_type = ZFS_TYPE_VOLUME;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZFS)
                zhp->zfs_head_type = ZFS_TYPE_FILESYSTEM;
        else
                abort();

Since essentially the same check on zhp->zfs_dmustats.dds_type is made here without abort() being called, I would think it should not abort on the later checks either. As far as I can tell, the same conditional is true here but false later. Unless there's another thread making changes to zhp, I don't follow how this could happen.

@prakashsurya (Member)

Just saw your comment, Brian. I guess I should have refreshed this page sooner. If it's simply that zfs_dmustats.dds_type isn't set, shouldn't it hit the first abort on line 417, rather than the second on 426?

@behlendorf (Contributor, Author)

Indeed, it should... clearly something more interesting is going on. I hadn't noticed the same check above. It's possible the zhp is being updated by multiple threads; I'd need to check the whole call path to see.

@prakashsurya (Member)

Although, since it's failing at the same place every time I run zpool or zfs, a thread race condition seems unlikely.

@behlendorf (Contributor, Author)

I'd add some debug code to see what's going on.
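
For example, something like this dropped in just before the two type checks would show which dds_type value is slipping through. This is purely hypothetical instrumentation using the fields from the snippets above, not code that exists in the tree:

        /* Hypothetical debug output: show which dds_type slips past the checks. */
        (void) fprintf(stderr, "DEBUG %s: dds_is_snapshot=%d dds_type=%d\n",
            zhp->zfs_name, (int)zhp->zfs_dmustats.dds_is_snapshot,
            (int)zhp->zfs_dmustats.dds_type);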

@prakashsurya (Member)

OK. If you point me to a branch whenever you get around to adding the debug code, I'll get it installed on grove.

@behlendorf (Contributor, Author)

Actually, I was hoping you would add some debug code. :) It might be a little while before I get a chance to look at this.

@prakashsurya (Member)

Oops. I read that as "I'll", not "I'd". Yeah, that works too; I'll dig into it some more.

@prakashsurya (Member)

OK. This confirms our suspicions. The abort is a result of remnants from a crashed zpios test leaving its DMU_OST_OTHER pool around, and the tools not correctly handling that zfs_dmustats.dds_type value.

It can be easily reproduced by running the zpios-sanity test, and concurrently running the zpool command. As long as zpool is run while the zpios pool is in place (i.e. before the test finishes and cleans up after itself), the abort will trigger.
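
For reference, the reproducer is just the following (assuming the in-tree zpios-sanity.sh script; the path may differ on an installed system):

    # shell 1: run the zpios sanity test, which creates a DMU_OST_OTHER pool
    ./scripts/zpios-sanity.sh

    # shell 2: while the zpios pool exists, any zpool/zfs invocation aborts
    zpool status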

@behlendorf (Contributor, Author)

It looks to me like it would be safe (and correct) to return -1 in the DMU_OST_OTHER case.
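
A sketch of what that might look like in the second snippet above; the earlier zfs_head_type check would presumably need the same treatment, and this isn't necessarily the final patch:

        if (zhp->zfs_dmustats.dds_is_snapshot)
                zhp->zfs_type = ZFS_TYPE_SNAPSHOT;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZVOL)
                zhp->zfs_type = ZFS_TYPE_VOLUME;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZFS)
                zhp->zfs_type = ZFS_TYPE_FILESYSTEM;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_OTHER)
                return (-1);    /* unsupported type, e.g. a zpios pool */
        else
                abort();        /* we should never see any other types */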
