zpool core dump #536

Closed
behlendorf opened this issue Jan 17, 2012 · 15 comments

@behlendorf (Contributor)

zpool create grove329 /dev/mapper/grove329_1 /dev/mapper/grove329_2 /dev/mapper/grove329_3

After the node crashed and was power cycled:

zpool status showed no pools available.

zpool import showed the two pools it could see (grove329 and grove330), but they were showing up with devices like sde and sdh instead of the multipath devices, and the pools were listed as unavailable.

A zpool import -d /dev/mapper showed the proper pools, and all the devices showed up as online.

When I try a zpool import -d /dev/mapper grove329, the zpool command core dumps. I tried a couple of times, with and without the -f flag, and it core dumped both times. After this initial core dump, any subsequent zpool commands also dump core.

@prakashsurya (Member)

I'm getting weird behavior with the zpool and zfs commands; each invocation simply prints "Aborted":

grove329@surya1:zpool status
Aborted
grove329@surya1:zpool
Aborted
grove329@surya1:zpool import -d /dev/mapper 
Aborted
grove329@surya1:zfs status
Aborted
grove329@surya1:zfs
Aborted

@behlendorf (Contributor, Author)

Try running it under gdb and see what's happening. I occasionally see this sort of behavior in my automated test VMs too, but I've never been able to reproduce it on demand and run it to ground.

@prakashsurya (Member)

grove329@surya1:gdb zpool
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /sbin/zpool...(no debugging symbols found)...done.
(gdb) run status
Starting program: /sbin/zpool status
[Thread debugging using libthread_db enabled]
Detaching after fork from child process 6426.

Program received signal SIGABRT, Aborted.
0x00002aaaacd56885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install zfs-0.6.0-rc6_1chaos.ch5.x86_64
(gdb) bt
#0  0x00002aaaacd56885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00002aaaacd58065 in abort () at abort.c:92
#2  0x00002aaaabc21192 in make_dataset_handle_common (zhp=0x6223e0, zc=<value optimized out>) at ../../lib/libzfs/libzfs_dataset.c:426
#3  0x00002aaaabc211ef in make_dataset_handle_zc (hdl=0x61b060, zc=0x7fffffff7bc0) at ../../lib/libzfs/libzfs_dataset.c:473
#4  0x00002aaaabc2169b in zfs_iter_filesystems (zhp=0x61cdb0, func=0x2aaaac2e2270 <update_zfs_shares_cb>, data=0x7fffffffe220)
    at ../../lib/libzfs/libzfs_dataset.c:2500
#5  0x00002aaaac2e231a in update_zfs_shares_cb (zhp=0x61cdb0, pcookie=0x7fffffffe220) at ../../lib/libshare/libshare.c:242
#6  0x00002aaaabc1ed31 in zfs_iter_root (hdl=0x61b060, func=0x2aaaac2e2270 <update_zfs_shares_cb>, data=0x7fffffffe220)
    at ../../lib/libzfs/libzfs_config.c:365
#7  0x00002aaaac2e29d7 in update_zfs_shares (init_service=<value optimized out>) at ../../lib/libshare/libshare.c:326
#8  sa_init (init_service=<value optimized out>) at ../../lib/libshare/libshare.c:97
#9  0x00002aaaac2e29f0 in libshare_init () at ../../lib/libshare/libshare.c:113
#10 0x00002aaaac2e3a16 in __do_global_ctors_aux () from //lib64/libshare.so.1
#11 0x00002aaaac2e12f3 in _init () from //lib64/libshare.so.1
#12 0x00002aaaac8f1990 in ?? ()
#13 0x00002aaaaaab94a5 in call_init (main_map=0x2aaaaaccc188, argc=-1404156232, argv=0x7fffffffe2e8, env=0x7fffffffe300) at dl-init.c:70
#14 _dl_init (main_map=0x2aaaaaccc188, argc=-1404156232, argv=0x7fffffffe2e8, env=0x7fffffffe300) at dl-init.c:134
#15 0x00002aaaaaaabb3a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#16 0x0000000000000002 in ?? ()
#17 0x00007fffffffe5a9 in ?? ()
#18 0x00007fffffffe5b5 in ?? ()
#19 0x0000000000000000 in ?? ()

@behlendorf (Contributor, Author)

It looks like the zfs_dmustats.dds_type just isn't set yet. See the second abort in make_dataset_handle_common():426.

@prakashsurya (Member)

Here's the piece of code triggering the abort():


        if (zhp->zfs_dmustats.dds_is_snapshot)
                zhp->zfs_type = ZFS_TYPE_SNAPSHOT;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZVOL)
                zhp->zfs_type = ZFS_TYPE_VOLUME;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZFS)
                zhp->zfs_type = ZFS_TYPE_FILESYSTEM;
        else
                abort();        /* we should never see any other types */

@prakashsurya (Member)

This seems odd. Just above the previous code snippet, it executes this:

        /*
         * We've managed to open the dataset and gather statistics.  Determine
         * the high-level type.
         */
        if (zhp->zfs_dmustats.dds_type == DMU_OST_ZVOL)
                zhp->zfs_head_type = ZFS_TYPE_VOLUME;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZFS)
                zhp->zfs_head_type = ZFS_TYPE_FILESYSTEM;
        else
                abort();

Since essentially the same check on zhp->zfs_dmustats.dds_type is made here without abort() being called, I would think it should not abort on the later checks either. As far as I can tell, the same conditional is true here but false later. Unless there's another thread making changes to zhp, I don't follow how this could happen.

@prakashsurya (Member)

Just saw your comment, Brian. I guess I should have refreshed this page sooner. If it's simply that zfs_dmustats.dds_type isn't set, shouldn't it hit the first abort on line 417, rather than the second on 426?

@behlendorf (Contributor, Author)

Indeed, it should... clearly something more interesting is going on. I hadn't noticed the same check above. It's possible the zhp is being updated by multiple threads; I'd need to check the whole call path to see.

@prakashsurya (Member)

Although, since it's failing at the same place every time I run zpool or zfs, a thread race condition seems unlikely.

@behlendorf (Contributor, Author)

I'd add some debug code to see what's going on.
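
For example, something like this dropped in just before the two type checks would show which dds_type value is slipping through. This is purely hypothetical instrumentation using the fields from the snippets above, not code that exists in the tree:

        /* Hypothetical debug output: show which dds_type slips past the checks. */
        (void) fprintf(stderr, "DEBUG %s: dds_is_snapshot=%d dds_type=%d\n",
            zhp->zfs_name, (int)zhp->zfs_dmustats.dds_is_snapshot,
            (int)zhp->zfs_dmustats.dds_type);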

@prakashsurya (Member)

OK. If you point me to a branch whenever you get around to adding the debug code, I'll get it installed on grove.

@behlendorf (Contributor, Author)

Actually, I was hoping you would add some debug code. :) It might be a little while before I get a chance to look at this.

@prakashsurya (Member)

Oops. I read that as "I'll", not "I'd". Yeah, that works too; I'll dig into it some more.

@prakashsurya (Member)

OK. This confirms our suspicions. The abort is a result of remnants from a crashed zpios test leaving its DMU_OST_OTHER pool around, and the tools not correctly handling that zfs_dmustats.dds_type value.

It can be easily reproduced by running the zpios-sanity test, and concurrently running the zpool command. As long as zpool is run while the zpios pool is in place (i.e. before the test finishes and cleans up after itself), the abort will trigger.
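
For reference, the reproducer is just the following (assuming the in-tree zpios-sanity.sh script; the path may differ on an installed system):

    # shell 1: run the zpios sanity test, which creates a DMU_OST_OTHER pool
    ./scripts/zpios-sanity.sh

    # shell 2: while the zpios pool exists, any zpool/zfs invocation aborts
    zpool status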

@behlendorf (Contributor, Author)

It looks to me like it would be safe (and correct) to return -1 in the DMU_OST_OTHER case.
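
A sketch of what that might look like in the second snippet above; the earlier zfs_head_type check would presumably need the same treatment, and this isn't necessarily the final patch:

        if (zhp->zfs_dmustats.dds_is_snapshot)
                zhp->zfs_type = ZFS_TYPE_SNAPSHOT;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZVOL)
                zhp->zfs_type = ZFS_TYPE_VOLUME;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_ZFS)
                zhp->zfs_type = ZFS_TYPE_FILESYSTEM;
        else if (zhp->zfs_dmustats.dds_type == DMU_OST_OTHER)
                return (-1);    /* unsupported type, e.g. a zpios pool */
        else
                abort();        /* we should never see any other types */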
