Implement zeventd daemon #2
Comments
Additionally, one case we need to make sure to handle here is detecting when a disk is going bad but has not yet failed. Often an educated guess about this can be made based on non-success sense codes returned by the drive during I/O. The Linux port promotes non-success sense codes to zevents for consumption by the zeventd. These events can be used to determine that the drive should be proactively failed and a hot spare kicked in.
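A toy sketch of that policy, assuming it lives in user space: count non-success sense-code events per vdev and, once a made-up threshold is hit, proactively replace the drive with a spare by shelling out to `zpool replace`. The threshold, table size, pool/vdev/spare names, and the use of system() are all illustrative assumptions, not anything the code does today.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define	MAX_VDEVS	64
#define	SENSE_ERR_LIMIT	8	/* hypothetical "drive is going bad" threshold */

static struct {
	char	name[128];
	int	errors;
} vdev_errs[MAX_VDEVS];

/* Record one non-success sense-code event against a vdev. */
static void
note_sense_error(const char *pool, const char *vdev, const char *spare)
{
	for (int i = 0; i < MAX_VDEVS; i++) {
		if (vdev_errs[i].name[0] == '\0')
			strncpy(vdev_errs[i].name, vdev,
			    sizeof (vdev_errs[i].name) - 1);
		if (strcmp(vdev_errs[i].name, vdev) != 0)
			continue;
		if (++vdev_errs[i].errors < SENSE_ERR_LIMIT)
			return;

		/* Threshold reached: kick in the hot spare. */
		char cmd[512];
		snprintf(cmd, sizeof (cmd), "zpool replace %s %s %s",
		    pool, vdev, spare);
		(void) system(cmd);
		vdev_errs[i].errors = 0;
		return;
	}
}

int main(void)
{
	/* Simulate nine sense-code events against the same (made up) vdev. */
	for (int i = 0; i < 9; i++)
		note_sense_error("tank", "/dev/sdb", "/dev/sdc");
	return (0);
}
```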
The zeventd daemon also looks like the correct way to automatically handle autoexpand. See my first comment in issue #120 for how this should be done.
I'd like to work on this.
Wonderful, we'd love the help. This is a daemon we know we're going to need but just haven't found the time to write yet. However, we have given it some thought, so let me sketch out what we're thinking and we can start a discussion plus implementation from there. At a high level, here's how the zeventd needs to work.

The daemon will most likely be started by the zfs init.d script at system boot time. It will need some minimal configuration from an /etc/zfs/zeventd.conf file. It would be reasonable to simply have the init script source this file and pass any needed configuration in as command line options. Alternately, a parser could be added to zeventd to read in the config file.

Once started, the zeventd needs to consume events posted from the kernel via the /dev/zfs ioctl interface. These events are nvlists which are packed in the kernel, then passed via ioctl() to user space where they are unpacked. The zpool events command shows exactly how this can be done. Depending on the configuration options, the event should then be sent to syslog.

Next, we want to add the ability for zeventd to exec an arbitrary helper application. Typically this will just be a script, which allows us to easily customize the behavior. The zeventd should pass the full event to stdin of the helper in an easily parsed form, for example the verbose zpool events -v form. We can then provide a default helper script which parses this and handles the common cases.
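Since the helper interface is the piece most people would customize, here is a minimal sketch of how zeventd could hand one formatted event to a helper on its stdin, using only POSIX calls. The helper path and the literal event text are placeholders; in a real daemon the text would come from the unpacked nvlist (e.g. formatted like zpool events -v).

```c
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

static int
run_helper(const char *helper, const char *event_text)
{
	int fds[2];
	pid_t pid;
	int status;

	if (pipe(fds) != 0 || (pid = fork()) < 0)
		return (-1);

	if (pid == 0) {
		/* Child: read the event from stdin, exec the helper. */
		dup2(fds[0], STDIN_FILENO);
		close(fds[0]);
		close(fds[1]);
		execl(helper, helper, (char *)NULL);
		_exit(127);		/* exec failed */
	}

	/* Parent: write the formatted event, then wait for the helper. */
	close(fds[0]);
	(void) write(fds[1], event_text, strlen(event_text));
	close(fds[1]);			/* helper sees EOF when we're done */
	waitpid(pid, &status, 0);
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}

int main(void)
{
	/* Hypothetical helper path and event text, for illustration only. */
	return (run_helper("/etc/zfs/zeventd-helper.sh",
	    "class = ereport.fs.zfs.io\npool = tank\n"));
}
```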
For all of the above we should try to be consistent with the Solaris FMA behavior as much as possible. If people want different behavior they can of course tweak the script to their liking. That is exactly why we're making the policy bit a script and not placing it in the heart of zeventd. We can also provide some helper functions in the script to make customization easy, and list all the possible events and what they mean. The easiest way to implement the core of the zeventd will be as a user space daemon written in C. This way the daemon can easily leverage certain existing libraries in the zfs package to simplify the kernel ioctl, the nvlist manipulation, and event formatting. There are already helper functions to do all these things, so it should be relatively straightforward.
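As a rough illustration of the "user space daemon written in C" part, a bare-bones daemonization skeleton using only POSIX calls; nothing in it is ZFS-specific, error handling is omitted for brevity, and the event loop it would drive is left as a placeholder.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

static void
daemonize(void)
{
	if (fork() > 0)		/* parent exits, child carries on */
		exit(0);
	setsid();		/* become session leader, drop the tty */
	if (fork() > 0)		/* second fork: can never reacquire a tty */
		exit(0);

	umask(0);
	(void) chdir("/");

	/* Point the standard descriptors at /dev/null. */
	int fd = open("/dev/null", O_RDWR);
	dup2(fd, STDIN_FILENO);
	dup2(fd, STDOUT_FILENO);
	dup2(fd, STDERR_FILENO);
	if (fd > STDERR_FILENO)
		close(fd);
}

int main(void)
{
	daemonize();
	for (;;) {
		/* event consumption via /dev/zfs would go here */
		sleep(60);
	}
}
```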
Sounds good to me. I will set up a stubbed-out server in zfs/cmds/zeventd and will work towards having a parser for a config file that won't do much at first, then follow on with reading events from /dev/zfs, etc. I'm only cursorily familiar with the behavior of Solaris FMA. I have a dev environment set up under openSUSE 11.3 now, have built both zfs and spl, and have created a zpool that mounted. Will follow up with more later. Any suggested resources to look at to learn more about Solaris FMA?
Sounds like a plan. To be honest I'm not all that familiar with Solaris FMA either, but I don't think we need/want to replicate it entirely. At least initially we just want to make sure we have enough infrastructure in place to handle the common events like disk failure, autoreplace, and autoexpand. We can determine if we need a more complicated analysis engine later.
Quick questions: Do zfs events have some sort of computationally unique IDs? Can individual events be cleared, as opposed to clearing all events? Consider the following: zeventd is running, recognizes a disk failure event, and runs a script that is configured to launch in the event of such a failure; presumably this script attempts to move a hot spare into place. Some time later, an admin restarts zeventd using the init.d script. zeventd rereads events and sees the same series that it acted upon earlier. That would be bad! This is moot if doing a zpool_next_event(...) using blocking I/O will only read new events and not old ones (which looks like what happens in my initial testing). Initially, I was thinking that zeventd could keep a log of the events it has acted upon, using some unique ID associated with each event (an ID produced by the underlying ZFS kernel module), and not act on any event that it has already logged. And/or zeventd could remove events after it has acted on them (I don't like this idea; I'd rather leave the event log intact). Perhaps zeventd could inject log events as it acts on other events, so that the chain of event -> action/resolution is maintained in the zfs event log itself. This would also be a way for it to keep track of events that it has already acted upon.
There is supposed to be an ENA (Error Numeric Association) nvpair in the nvlist which is close to this. However, I just noticed there is not an ENA for all event types. This field is not added for events generated by zfs_post_common(); we may need to add this. If you look at the header in module/zfs/zfs_fm.c you can read a pretty good description of how ENAs are supposed to be used under Solaris FMA. The way events work at the moment is that the kernel maintains a list of the last 64 events. These events are then accessed via an ioctl() on /dev/zfs. Each process, when it opens /dev/zfs, is allowed to read events as a private stream. This allows multiple consumers of /dev/zfs events to operate concurrently without destroying each other's events. However, when the device is closed and reopened, the process will start getting events again from the oldest to the newest. You're absolutely right that this could be very bad if the zeventd was restarted and reprocessed the same events. Once we get the ENA added to all events you could start tracking those in a log file. Or you could try something even simpler to start with: all the events are time stamped by the kernel when they were generated. If you simply store the time stamp of the last event processed by zeventd, then you would know which events to skip if the daemon was restarted. Events older than the logged time stamp must already have been processed and can be ignored. There may be a small race here with something like ntp adjusting your clock, but that seems like an exceptionally rare case.
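A minimal sketch of the "remember the last processed timestamp" idea, assuming a simple state file. The file location and the event representation are invented for illustration; a real zeventd would pull the timestamp out of the event nvlist posted by the kernel.

```c
#include <stdio.h>
#include <stdint.h>

#define	STATE_FILE	"/var/run/zeventd.state"	/* hypothetical path */

struct zevent {
	uint64_t	ev_time;	/* kernel timestamp for the event */
	/* ... unpacked nvlist payload would live here ... */
};

static uint64_t
load_last_timestamp(void)
{
	FILE *fp = fopen(STATE_FILE, "r");
	unsigned long long ts = 0;

	if (fp != NULL) {
		if (fscanf(fp, "%llu", &ts) != 1)
			ts = 0;
		fclose(fp);
	}
	return ((uint64_t)ts);
}

static void
save_last_timestamp(uint64_t ts)
{
	FILE *fp = fopen(STATE_FILE, "w");

	if (fp != NULL) {
		fprintf(fp, "%llu\n", (unsigned long long)ts);
		fclose(fp);
	}
}

/* Returns nonzero if the event is newer than anything already processed. */
static int
event_is_new(const struct zevent *ev, uint64_t last_seen)
{
	return (ev->ev_time > last_seen);
}

int main(void)
{
	struct zevent ev = { .ev_time = 42 };	/* stand-in for a real event */
	uint64_t last = load_last_timestamp();

	if (event_is_new(&ev, last)) {
		/* act on the event, then remember how far we got */
		save_last_timestamp(ev.ev_time);
	}
	return (0);
}
```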
Ahh yes, I had noticed the ENA and thought it might be this. I like your suggestion of storing a high resolution copy of the time as an initial start and look forward to having ENA on all events.
Seems to me like udev or udisks already provide what you are looking for, and you should route events through them (making zeventd, if needed, a consumer of those interfaces).
I have been busy with a job transition but am coming back to this. Re: udev and udisks: udev is for devices that the kernel notices are new or have been removed, and is not appropriate for passing messages for zeventd. zeventd would most likely be a path for messages to end up in udisks.
Hi Brian,
No, I can't see this code being finished by -rc9. But you're right, it absolutely needs to happen, and sooner would be better than later. We just haven't had enough time with the other outstanding issues to get to this. You could probably cobble together some scripts to handle this job if you're desperate and need it right away.
Hi Brian, while composing some C code for a zeventd, the time shifts are driving me crazy. There are always time shifts within a system. On startup there is a time shift when the timezone gets applied, and startup is the most common point for vdev failures. If we think about traveling around with a zpool, then there are lots of them. Also, NTP time shifts occur quite regularly.
@pyavdr Concerning the time shift problem for ENAs, we could perhaps include the current value of the kernel's monotonically increasing clock along with the timestamp. Or, even simpler, a monotonically increasing count of ENAs which have been generated since the zfs module was loaded (although that might break down if you were to unload/reload the module). Really, the contents of the event are entirely up to us, so don't feel too locked in to what's there now. Offhand I'm not exactly sure which events should be used to trigger a replacement. I'd need to look at it carefully; it wouldn't be the worst idea to look and see what Solaris does in their FMA. I haven't had a chance to seriously look into either #250 or #544 yet. However, you're right that getting them fixed is a bit of a prerequisite for this work.
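To make that suggestion concrete, here is a sketch of a restart-safe event identifier built from a per-load identifier plus a monotonically increasing sequence number, so wall-clock adjustments (NTP, timezone) never reorder or collide IDs. The names and the use of CLOCK_MONOTONIC are assumptions for the example; nothing like this exists in the code yet.

```c
#include <stdint.h>
#include <time.h>

struct zevent_id {
	uint64_t	eid_load;	/* fixed at module load time */
	uint64_t	eid_seq;	/* increments for every posted event */
};

static uint64_t zevent_load_id;
static uint64_t zevent_seq;		/* would be atomic in the kernel */

static void
zevent_id_init(void)
{
	struct timespec ts;

	/* CLOCK_MONOTONIC is immune to NTP steps; good enough to tag a load. */
	clock_gettime(CLOCK_MONOTONIC, &ts);
	zevent_load_id = (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
	zevent_seq = 0;
}

static struct zevent_id
zevent_id_next(void)
{
	struct zevent_id id = { zevent_load_id, ++zevent_seq };
	return (id);
}

int main(void)
{
	zevent_id_init();
	struct zevent_id id = zevent_id_next();
	return (id.eid_seq == 1 ? 0 : 1);
}
```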
Brian, while searching I found this: http://gdamore.blogspot.com/2010/07/zfs-disk-monitoring.html Is there any chance you can contact him to ask for the code of his monitor?
Garrett's been busy. It sounds like he's willing to share the code with anyone, and it's certainly worth taking a look at. Although, since it's an FMA module, I'm not sure how useful it will be.
Brian, while testing around failed disks, the zpool events for some cases are missing. In my VMware SUSE guest I created a quite large pool with 16 vdevs and a zfs dataset on it. Next I removed one vdev with the VMware device manager and copied a file to the pool. After that the pool shows state (zpool status) "degraded", with the missing vdev marked first as "faulted" and then "removed". But there is only a simple "state change" event in the zpool events.
As of zfs-0.4.9 the infrastructure is in place to monitor events generated by zfs from user space. Currently the only consumer of this ABI is the 'zpool events' command, which can be used to print the events, which are just nvlists. What we need now is a zeventd daemon which uses this interface to monitor zevents. Initially this daemon can simply forward the events cleanly off to syslog, but longer term we want to be able to have specific events trigger actions. For example, in the case of a disk failure event you may want to send an email or perhaps kick off a suite of diagnostics against the newly failed drive.
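As a rough shape for the "forward to syslog, then trigger actions" behavior described above, a sketch using syslog(3). The dispatch logic, the action stub, and the sample event text are placeholders, though the ereport.fs.zfs.* class names do match what zpool events prints.

```c
#include <string.h>
#include <syslog.h>

static void
act_on_disk_failure(const char *event_text)
{
	/* e.g. send mail or exec a diagnostics helper (not shown). */
	(void) event_text;
}

static void
zeventd_handle(const char *class, const char *event_text)
{
	/* Always forward the raw event to syslog first. */
	syslog(LOG_NOTICE, "zevent %s: %s", class, event_text);

	/* Then apply whatever policy is configured for this class. */
	if (strcmp(class, "ereport.fs.zfs.io") == 0 ||
	    strcmp(class, "ereport.fs.zfs.checksum") == 0)
		act_on_disk_failure(event_text);
}

int main(void)
{
	openlog("zeventd", LOG_PID, LOG_DAEMON);
	/* A real daemon would be driven by events read from /dev/zfs. */
	zeventd_handle("ereport.fs.zfs.io", "pool=tank vdev=/dev/sdb");
	closelog();
	return (0);
}
```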