
PANIC: zfs: adding existent segment to range tree #15030

Open
reefland opened this issue Jul 3, 2023 · 39 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments


reefland commented Jul 3, 2023

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 22.04.1
Kernel Version 5.15.0-75-generic
Architecture x86_64
OpenZFS Version 2.1.5

Describe the problem you're observing

Randomly (not at boot or pool import), the system shows panic messages (see below) on the console. The system seemed otherwise operational, but upon trying to reboot I was unable to get a clean shutdown. There were several Failed unmounting messages referencing /root and /var/log, along with a failed start of the Journal Service. Eventually it hung at systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID ... I had to kill the power after waiting ~20 minutes.

Upon reboot, everything seems fine: no issues with pool import, and zpool status was good. After an hour or so I would see the first panic message, and usually within 24 to 48 hours I had to reboot again; it would not be a clean shutdown and required a manual power off.

The PANIC message I have is very close to #13483; however, that discussion appears to be about pool import issues. I'm not having any issues with pool import, so I'm documenting a new issue, as this panic seems to have a different cause.

Include any warning/errors/backtraces from the system logs

Jul 02 16:09:43 k3s06 kernel: PANIC: zfs: adding existent segment to range tree (offset=279691ea000 size=1000)
Jul 02 16:09:43 k3s06 kernel: Showing stack for process 1111
Jul 02 16:09:43 k3s06 kernel: CPU: 10 PID: 1111 Comm: txg_sync Tainted: P           O      5.15.0-75-generic #82-Ubuntu
Jul 02 16:09:43 k3s06 kernel: Hardware name: ASRock B660M Steel Legend/B660M Steel Legend, BIOS 10.02 02/10/2023
Jul 02 16:09:43 k3s06 kernel: Call Trace:
Jul 02 16:09:43 k3s06 kernel:  <TASK>
Jul 02 16:09:43 k3s06 kernel:  show_stack+0x52/0x5c
Jul 02 16:09:43 k3s06 kernel:  dump_stack_lvl+0x4a/0x63
Jul 02 16:09:43 k3s06 kernel:  dump_stack+0x10/0x16
Jul 02 16:09:43 k3s06 kernel:  spl_dumpstack+0x29/0x2f [spl]
Jul 02 16:09:43 k3s06 kernel:  vcmn_err.cold+0x60/0x78 [spl]
Jul 02 16:09:43 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 02 16:09:43 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 02 16:09:43 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? __raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 16:09:43 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 02 16:09:43 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 16:09:43 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 02 16:09:43 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 02 16:09:43 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 02 16:09:43 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 02 16:09:43 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 02 16:09:43 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 02 16:09:43 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 02 16:09:43 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 02 16:09:43 k3s06 kernel:  kthread+0x127/0x150
Jul 02 16:09:43 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 02 16:09:43 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 02 16:09:43 k3s06 kernel:  </TASK>


Jul 02 16:13:19 k3s06 kernel: INFO: task txg_sync:1111 blocked for more than 120 seconds.
Jul 02 16:13:19 k3s06 kernel:       Tainted: P           O      5.15.0-75-generic #82-Ubuntu
Jul 02 16:13:19 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 02 16:13:19 k3s06 kernel: task:txg_sync        state:D stack:    0 pid: 1111 ppid:     2 flags:0x00004004
Jul 02 16:13:19 k3s06 kernel: Call Trace:
Jul 02 16:13:19 k3s06 kernel:  <TASK>
Jul 02 16:13:19 k3s06 kernel:  __schedule+0x24e/0x590
Jul 02 16:13:19 k3s06 kernel:  schedule+0x69/0x110
Jul 02 16:13:19 k3s06 kernel:  vcmn_err.cold+0x76/0x78 [spl]
Jul 02 16:13:19 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 02 16:13:19 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 02 16:13:19 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? __raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 16:13:19 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 02 16:13:19 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 16:13:19 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 02 16:13:19 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 02 16:13:19 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 02 16:13:19 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 02 16:13:19 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 02 16:13:19 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 02 16:13:19 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 02 16:13:19 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 02 16:13:19 k3s06 kernel:  kthread+0x127/0x150
Jul 02 16:13:19 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 02 16:13:19 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 02 16:13:19 k3s06 kernel:  </TASK>

NOTE: The system is using ZFS Boot Menu, so this is ZFS on Root, but it does not require the commonly used separate BOOT and ROOT pools. This method allows for a single ZFS pool (with encryption).

After reboot, status is good:

$ zpool status
  pool: k3s06
 state: ONLINE
  scan: scrub repaired 0B in 08:41:02 with 0 errors on Sun Jun 11 09:05:03 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        k3s06                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-WDC_WD80EFZZ-68BTXN0_WD-CA109YVK-part3  ONLINE       0     0     0
            ata-WDC_WD80EFZZ-68BTXN0_WD-CA1040NK-part3  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            ata-WDC_WD80EFAX-68KNBN0_VGJJ0Y5G-part3     ONLINE       0     0     0
            ata-WDC_WD120EFBX-68B0EN0_5QJ6U62B-part3    ONLINE       0     0     0
        cache
          nvme0n1                                       ONLINE       0     0     0

errors: No known data errors

I tried to issue a sudo zpool scrub k3s06 and it hung (did not return to the prompt). I opened another SSH session; zpool status did not indicate a scrub in progress. I checked the console and it had panic messages scrolling on the screen.

Jul 02 19:45:26 k3s06 kernel: PANIC: zfs: adding existent segment to range tree (offset=279691ea000 size=1000)
Jul 02 19:45:26 k3s06 kernel: Showing stack for process 1341
Jul 02 19:45:26 k3s06 kernel: CPU: 2 PID: 1341 Comm: txg_sync Tainted: P           O      5.15.0-75-generic #82-Ubuntu
Jul 02 19:45:26 k3s06 kernel: Hardware name: ASRock B660M Steel Legend/B660M Steel Legend, BIOS 10.02 02/10/2023
Jul 02 19:45:26 k3s06 kernel: Call Trace:
Jul 02 19:45:26 k3s06 kernel:  <TASK>
Jul 02 19:45:26 k3s06 kernel:  show_stack+0x52/0x5c
Jul 02 19:45:26 k3s06 kernel:  dump_stack_lvl+0x4a/0x63
Jul 02 19:45:26 k3s06 kernel:  dump_stack+0x10/0x16
Jul 02 19:45:26 k3s06 kernel:  spl_dumpstack+0x29/0x2f [spl]
Jul 02 19:45:26 k3s06 kernel:  vcmn_err.cold+0x60/0x78 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 02 19:45:26 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? kmem_cache_free+0x24f/0x290
Jul 02 19:45:26 k3s06 kernel:  ? percpu_counter_dec+0x10/0x20 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? spl_kmem_cache_free+0xda/0x140 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 19:45:26 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 02 19:45:26 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 19:45:26 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 02 19:45:26 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 02 19:45:26 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 02 19:45:26 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 02 19:45:26 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 02 19:45:26 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 02 19:45:26 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 02 19:45:26 k3s06 kernel:  kthread+0x127/0x150
Jul 02 19:45:26 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 02 19:45:26 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 02 19:45:26 k3s06 kernel:  </TASK>
Jul 02 19:49:02 k3s06 kernel: INFO: task txg_sync:1341 blocked for more than 120 seconds.
Jul 02 19:49:02 k3s06 kernel:       Tainted: P           O      5.15.0-75-generic #82-Ubuntu
Jul 02 19:49:02 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 02 19:49:02 k3s06 kernel: task:txg_sync        state:D stack:    0 pid: 1341 ppid:     2 flags:0x00004004
Jul 02 19:49:02 k3s06 kernel: Call Trace:
Jul 02 19:49:02 k3s06 kernel:  <TASK>
Jul 02 19:49:02 k3s06 kernel:  __schedule+0x24e/0x590
Jul 02 19:49:02 k3s06 kernel:  schedule+0x69/0x110
Jul 02 19:49:02 k3s06 kernel:  vcmn_err.cold+0x76/0x78 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 02 19:49:02 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? kmem_cache_free+0x24f/0x290
Jul 02 19:49:02 k3s06 kernel:  ? percpu_counter_dec+0x10/0x20 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? spl_kmem_cache_free+0xda/0x140 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 19:49:02 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 02 19:49:02 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 19:49:02 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 02 19:49:02 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 02 19:49:02 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 02 19:49:02 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 02 19:49:02 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 02 19:49:02 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 02 19:49:02 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 02 19:49:02 k3s06 kernel:  kthread+0x127/0x150
Jul 02 19:49:02 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 02 19:49:02 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 02 19:49:02 k3s06 kernel:  </TASK>

I did another unclean shutdown and rebooted, then tried some of the suggestions from #13483:

echo 1 > /sys/module/zfs/parameters/zil_replay_disable
echo 1 > /sys/module/zfs/parameters/zfs_recover
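As an aside (my own sketch, not from the original report): these sysfs writes do not survive a reboot. One way to make them persistent on Ubuntu is setting them as module options, assuming the standard /etc/modprobe.d mechanism and a ZFS-on-root setup that loads the module from the initramfs:

```shell
# Sketch, not from the report: persist the tunables as zfs module options.
# zfs_recover=1 downgrades "adding existent segment" panics to warnings;
# zil_replay_disable=1 skips ZIL replay on import (can drop recent sync writes).
printf 'options zfs zfs_recover=1\noptions zfs zil_replay_disable=1\n' \
  | sudo tee /etc/modprobe.d/zfs-recover.conf
sudo update-initramfs -u   # ZFS-on-root loads the module from the initramfs
```

Both tunables are intended for recovery only and should be reverted once the pool is repaired.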

Then I issued another scrub, which completed successfully:

$ zpool status
  pool: k3s06
 state: ONLINE
  scan: scrub repaired 0B in 09:07:12 with 0 errors on Mon Jul  3 05:54:45 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        k3s06                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-WDC_WD80EFZZ-68BTXN0_WD-CA109YVK-part3  ONLINE       0     0     0
            ata-WDC_WD80EFZZ-68BTXN0_WD-CA1040NK-part3  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            ata-WDC_WD80EFAX-68KNBN0_VGJJ0Y5G-part3     ONLINE       0     0     0
            ata-WDC_WD120EFBX-68B0EN0_5QJ6U62B-part3    ONLINE       0     0     0
        cache
          nvme0n1                                       ONLINE       0     0     0

errors: No known data errors

As expected with zfs_recover enabled, the panic messages became warnings:

Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=279691ea000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=27691830000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=276917fa000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=279691bc000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28d90833000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=276917fb000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28d907b0000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=197224c4000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=291de47e000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28a52005000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2c13ab29000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2c118da9000 size=2000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=27ecdfbe000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=27341d0b000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2915ddb2000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28d4ea04000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=196d5824000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2529b350000 size=f000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=279350b0000 size=1000)
Jul 03 01:00:59 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=27ac475e000 size=1000)
Jul 03 01:00:59 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2824d6d6000 size=1000)
Jul 03 01:00:59 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2824d072000 size=1000)
Jul 03 01:00:59 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28afcbcf000 size=1000)
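A small helper (my own, not from the issue) for summarizing these warnings out of the kernel log; it assumes `journalctl -k` output in the format shown above:

```shell
# Summarize duplicate-segment warnings: count occurrences per segment size.
# Normally you'd feed this from `journalctl -k`; a sample line stands in here.
sample='Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=279691ea000 size=1000)'
printf '%s\n' "$sample" \
  | grep -o 'offset=[0-9a-f]* size=[0-9a-f]*' \
  | awk '{print $2}' | sort | uniq -c
```

Most of the segments above are a single 4 KiB block (size=1000 hex), which a summary like this makes easy to spot across a long log.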

This seems to run clean as well:

sudo zdb -AAA -b k3s06

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 1, metaslab 464 of 465 ...
8.91T completed (10118MB/s) estimated time remaining: 0hr 00min 00sec        
        No leaks (block sum matches space maps exactly)

        bp count:              68746233
        ganged count:                 0
        bp logical:       9955783857664      avg: 144819
        bp physical:      9789308441088      avg: 142397     compression:   1.02
        bp allocated:     9798470672384      avg: 142531     compression:   1.02
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:     9798686769152     used: 61.46%
        Embedded log class         192512     used:  0.00%

        additional, non-pointer bps of type 0:      10841
        Dittoed blocks on same vdev: 77922

This time I was able to get a clean shutdown. Rebooted without the tunables.

The system has now been running 3+ hours under normal load, and no panic messages have been seen yet. Will keep monitoring.

@reefland reefland added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jul 3, 2023

reefland commented Jul 4, 2023

Got about 12 hours out of it before it got wonky and panicked again.

The first few entries seem to be related to ext4; the system does have one zvol formatted with ext4.

Jul 04 00:44:50 k3s06 kernel: INFO: task python3:584178 blocked for more than 120 seconds.
Jul 04 00:44:50 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:44:50 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:44:50 k3s06 kernel: task:python3         state:D stack:    0 pid:584178 ppid:579848 flags:0x00004000
Jul 04 00:44:50 k3s06 kernel: Call Trace:
Jul 04 00:44:50 k3s06 kernel:  <TASK>
Jul 04 00:44:50 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:44:50 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:44:50 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:44:50 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:44:50 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:44:50 k3s06 kernel:  ? dentry_lru_isolate+0xcf/0x140
Jul 04 00:44:50 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:44:50 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:44:50 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:44:50 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:44:50 k3s06 kernel: RIP: 0033:0x7f03aa537fac
Jul 04 00:44:50 k3s06 kernel: RSP: 002b:00007f039defe4a8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Jul 04 00:44:50 k3s06 kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f03aa537fac
Jul 04 00:44:50 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
Jul 04 00:44:50 k3s06 kernel: RBP: 00007f039deffb38 R08: 0000000000000000 R09: 0000000000000000
Jul 04 00:44:50 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f02ec89a008
Jul 04 00:44:50 k3s06 kernel: R13: 0000000000000000 R14: 0000000000000408 R15: 0000000000000000
Jul 04 00:44:50 k3s06 kernel:  </TASK>
Jul 04 00:44:50 k3s06 kernel: INFO: task node:615723 blocked for more than 120 seconds.
Jul 04 00:44:50 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:44:50 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:44:50 k3s06 kernel: task:node            state:D stack:    0 pid:615723 ppid:615555 flags:0x00000004
Jul 04 00:44:50 k3s06 kernel: Call Trace:
Jul 04 00:44:50 k3s06 kernel:  <TASK>
Jul 04 00:44:50 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:44:50 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:44:50 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:44:50 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:44:50 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:44:50 k3s06 kernel:  ? __filemap_fdatawait_range+0x3a/0x120
Jul 04 00:44:50 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:44:50 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:44:50 k3s06 kernel:  vfs_fsync_range+0x46/0x90
Jul 04 00:44:50 k3s06 kernel:  ? __fget_light+0x39/0x90
Jul 04 00:44:50 k3s06 kernel:  __x64_sys_fsync+0x38/0x70
Jul 04 00:44:50 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:44:50 k3s06 kernel: RIP: 0033:0x7f8a92ba2a97
Jul 04 00:44:50 k3s06 kernel: RSP: 002b:00007f8a82b19890 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 04 00:44:50 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f8a92ba2a97
Jul 04 00:44:50 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000013
Jul 04 00:44:50 k3s06 kernel: RBP: 00007f8a82b198e0 R08: 0000000000000000 R09: 0000000000000001
Jul 04 00:44:50 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8a82b19960
Jul 04 00:44:50 k3s06 kernel: R13: 0000000000000002 R14: 00007f8a6413cb50 R15: 00007f8a64016238
Jul 04 00:44:50 k3s06 kernel:  </TASK>
Jul 04 00:44:50 k3s06 kernel: INFO: task WTCheck.tThread:616605 blocked for more than 120 seconds.
Jul 04 00:44:50 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:44:50 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:44:50 k3s06 kernel: task:WTCheck.tThread state:D stack:    0 pid:616605 ppid:616510 flags:0x00000000
Jul 04 00:44:50 k3s06 kernel: Call Trace:
Jul 04 00:44:50 k3s06 kernel:  <TASK>
Jul 04 00:44:50 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:44:50 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:44:50 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:44:50 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:44:50 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:44:50 k3s06 kernel:  ? wbc_detach_inode+0x142/0x210
Jul 04 00:44:50 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:44:50 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:44:50 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:44:50 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? sysvec_apic_timer_interrupt+0x4e/0x90
Jul 04 00:44:50 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:44:50 k3s06 kernel: RIP: 0033:0x7fcd4954ffc7
Jul 04 00:44:50 k3s06 kernel: RSP: 002b:00007fcd4357a7b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jul 04 00:44:50 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000093 RCX: 00007fcd4954ffc7
Jul 04 00:44:50 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000093 RDI: 0000000000000093
Jul 04 00:44:50 k3s06 kernel: RBP: 00007fcd4357a810 R08: 0000000000000020 R09: 0000000000000000
Jul 04 00:44:50 k3s06 kernel: R10: 0000000000000017 R11: 0000000000000293 R12: 000000000000000a
Jul 04 00:44:50 k3s06 kernel: R13: 0000000000000001 R14: 00005630ccb039c0 R15: 00005630c5c46269
Jul 04 00:44:50 k3s06 kernel:  </TASK>
Jul 04 00:46:51 k3s06 kernel: INFO: task python3:584178 blocked for more than 241 seconds.
Jul 04 00:46:51 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:46:51 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:46:51 k3s06 kernel: task:python3         state:D stack:    0 pid:584178 ppid:579848 flags:0x00004000
Jul 04 00:46:51 k3s06 kernel: Call Trace:
Jul 04 00:46:51 k3s06 kernel:  <TASK>
Jul 04 00:46:51 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:46:51 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:46:51 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:46:51 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:46:51 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:46:51 k3s06 kernel:  ? dentry_lru_isolate+0xcf/0x140
Jul 04 00:46:51 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:46:51 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:46:51 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:46:51 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:46:51 k3s06 kernel: RIP: 0033:0x7f03aa537fac
Jul 04 00:46:51 k3s06 kernel: RSP: 002b:00007f039defe4a8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Jul 04 00:46:51 k3s06 kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f03aa537fac
Jul 04 00:46:51 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
Jul 04 00:46:51 k3s06 kernel: RBP: 00007f039deffb38 R08: 0000000000000000 R09: 0000000000000000
Jul 04 00:46:51 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f02ec89a008
Jul 04 00:46:51 k3s06 kernel: R13: 0000000000000000 R14: 0000000000000408 R15: 0000000000000000
Jul 04 00:46:51 k3s06 kernel:  </TASK>
Jul 04 00:46:51 k3s06 kernel: INFO: task node:615723 blocked for more than 241 seconds.
Jul 04 00:46:51 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:46:51 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:46:51 k3s06 kernel: task:node            state:D stack:    0 pid:615723 ppid:615555 flags:0x00000004
Jul 04 00:46:51 k3s06 kernel: Call Trace:
Jul 04 00:46:51 k3s06 kernel:  <TASK>
Jul 04 00:46:51 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:46:51 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:46:51 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:46:51 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:46:51 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:46:51 k3s06 kernel:  ? __filemap_fdatawait_range+0x3a/0x120
Jul 04 00:46:51 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:46:51 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:46:51 k3s06 kernel:  vfs_fsync_range+0x46/0x90
Jul 04 00:46:51 k3s06 kernel:  ? __fget_light+0x39/0x90
Jul 04 00:46:51 k3s06 kernel:  __x64_sys_fsync+0x38/0x70
Jul 04 00:46:51 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:46:51 k3s06 kernel: RIP: 0033:0x7f8a92ba2a97
Jul 04 00:46:51 k3s06 kernel: RSP: 002b:00007f8a82b19890 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 04 00:46:51 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f8a92ba2a97
Jul 04 00:46:51 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000013
Jul 04 00:46:51 k3s06 kernel: RBP: 00007f8a82b198e0 R08: 0000000000000000 R09: 0000000000000001
Jul 04 00:46:51 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8a82b19960
Jul 04 00:46:51 k3s06 kernel: R13: 0000000000000002 R14: 00007f8a6413cb50 R15: 00007f8a64016238
Jul 04 00:46:51 k3s06 kernel:  </TASK>
Jul 04 00:46:51 k3s06 kernel: INFO: task WTCheck.tThread:616605 blocked for more than 241 seconds.
Jul 04 00:46:51 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:46:51 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:46:51 k3s06 kernel: task:WTCheck.tThread state:D stack:    0 pid:616605 ppid:616510 flags:0x00000000
Jul 04 00:46:51 k3s06 kernel: Call Trace:
Jul 04 00:46:51 k3s06 kernel:  <TASK>
Jul 04 00:46:51 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:46:51 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:46:51 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:46:51 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:46:51 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:46:51 k3s06 kernel:  ? wbc_detach_inode+0x142/0x210
Jul 04 00:46:51 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:46:51 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:46:51 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:46:51 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? sysvec_apic_timer_interrupt+0x4e/0x90
Jul 04 00:46:51 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:46:51 k3s06 kernel: RIP: 0033:0x7fcd4954ffc7
Jul 04 00:46:51 k3s06 kernel: RSP: 002b:00007fcd4357a7b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jul 04 00:46:51 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000093 RCX: 00007fcd4954ffc7
Jul 04 00:46:51 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000093 RDI: 0000000000000093
Jul 04 00:46:51 k3s06 kernel: RBP: 00007fcd4357a810 R08: 0000000000000020 R09: 0000000000000000
Jul 04 00:46:51 k3s06 kernel: R10: 0000000000000017 R11: 0000000000000293 R12: 000000000000000a
Jul 04 00:46:51 k3s06 kernel: R13: 0000000000000001 R14: 00005630ccb039c0 R15: 00005630c5c46269
Jul 04 00:46:51 k3s06 kernel:  </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task jbd2/rbd7-8:571525 blocked for more than 120 seconds.
Jul 04 00:48:52 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:jbd2/rbd7-8     state:D stack:    0 pid:571525 ppid:     2 flags:0x00004000
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel:  <TASK>
Jul 04 00:48:52 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel:  jbd2_journal_commit_transaction+0x307/0x17a0
Jul 04 00:48:52 k3s06 kernel:  ? __update_idle_core+0x93/0x120
Jul 04 00:48:52 k3s06 kernel:  ? psi_task_switch+0xc6/0x220
Jul 04 00:48:52 k3s06 kernel:  ? wait_woken+0x70/0x70
Jul 04 00:48:52 k3s06 kernel:  kjournald2+0xa9/0x280
Jul 04 00:48:52 k3s06 kernel:  ? wait_woken+0x70/0x70
Jul 04 00:48:52 k3s06 kernel:  ? load_superblock.part.0+0xc0/0xc0
Jul 04 00:48:52 k3s06 kernel:  kthread+0x127/0x150
Jul 04 00:48:52 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 04 00:48:52 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 04 00:48:52 k3s06 kernel:  </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task python3:584178 blocked for more than 362 seconds.
Jul 04 00:48:52 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:python3         state:D stack:    0 pid:584178 ppid:579848 flags:0x00004000
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel:  <TASK>
Jul 04 00:48:52 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:48:52 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:48:52 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:48:52 k3s06 kernel:  ? dentry_lru_isolate+0xcf/0x140
Jul 04 00:48:52 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:48:52 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:48:52 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:48:52 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:48:52 k3s06 kernel: RIP: 0033:0x7f03aa537fac
Jul 04 00:48:52 k3s06 kernel: RSP: 002b:00007f039defe4a8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Jul 04 00:48:52 k3s06 kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f03aa537fac
Jul 04 00:48:52 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
Jul 04 00:48:52 k3s06 kernel: RBP: 00007f039deffb38 R08: 0000000000000000 R09: 0000000000000000
Jul 04 00:48:52 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f02ec89a008
Jul 04 00:48:52 k3s06 kernel: R13: 0000000000000000 R14: 0000000000000408 R15: 0000000000000000
Jul 04 00:48:52 k3s06 kernel:  </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task node:615723 blocked for more than 362 seconds.
Jul 04 00:48:52 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:node            state:D stack:    0 pid:615723 ppid:615555 flags:0x00000004
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel:  <TASK>
Jul 04 00:48:52 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:48:52 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:48:52 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:48:52 k3s06 kernel:  ? __filemap_fdatawait_range+0x3a/0x120
Jul 04 00:48:52 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:48:52 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:48:52 k3s06 kernel:  vfs_fsync_range+0x46/0x90
Jul 04 00:48:52 k3s06 kernel:  ? __fget_light+0x39/0x90
Jul 04 00:48:52 k3s06 kernel:  __x64_sys_fsync+0x38/0x70
Jul 04 00:48:52 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:48:52 k3s06 kernel: RIP: 0033:0x7f8a92ba2a97
Jul 04 00:48:52 k3s06 kernel: RSP: 002b:00007f8a82b19890 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 04 00:48:52 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f8a92ba2a97
Jul 04 00:48:52 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000013
Jul 04 00:48:52 k3s06 kernel: RBP: 00007f8a82b198e0 R08: 0000000000000000 R09: 0000000000000001
Jul 04 00:48:52 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8a82b19960
Jul 04 00:48:52 k3s06 kernel: R13: 0000000000000002 R14: 00007f8a6413cb50 R15: 00007f8a64016238
Jul 04 00:48:52 k3s06 kernel:  </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task WTJourn.Flusher:616604 blocked for more than 120 seconds.
Jul 04 00:48:52 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:WTJourn.Flusher state:D stack:    0 pid:616604 ppid:616510 flags:0x00000000
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel:  <TASK>
Jul 04 00:48:52 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel:  ? update_sd_lb_stats.constprop.0+0xf2/0x390
Jul 04 00:48:52 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel:  wait_transaction_locked+0x8e/0xd0
Jul 04 00:48:52 k3s06 kernel:  ? wait_woken+0x70/0x70
Jul 04 00:48:52 k3s06 kernel:  add_transaction_credits+0xdb/0x2d0
Jul 04 00:48:52 k3s06 kernel:  start_this_handle+0xfd/0x510
Jul 04 00:48:52 k3s06 kernel:  ? kmem_cache_alloc+0x1ab/0x2f0
Jul 04 00:48:52 k3s06 kernel:  ? jbd2__journal_start+0x95/0x1f0
Jul 04 00:48:52 k3s06 kernel:  jbd2__journal_start+0x103/0x1f0
Jul 04 00:48:52 k3s06 kernel:  __ext4_journal_start_sb+0x10b/0x130
Jul 04 00:48:52 k3s06 kernel:  ext4_dirty_inode+0x3a/0x80
Jul 04 00:48:52 k3s06 kernel:  __mark_inode_dirty+0x5b/0x330
Jul 04 00:48:52 k3s06 kernel:  generic_update_time+0x6c/0xd0
Jul 04 00:48:52 k3s06 kernel:  file_update_time+0x127/0x140
Jul 04 00:48:52 k3s06 kernel:  file_modified+0x27/0x40
Jul 04 00:48:52 k3s06 kernel:  ext4_buffered_write_iter+0x5f/0x180
Jul 04 00:48:52 k3s06 kernel:  ext4_file_write_iter+0x43/0x60
Jul 04 00:48:52 k3s06 kernel:  new_sync_write+0x111/0x1a0
Jul 04 00:48:52 k3s06 kernel:  vfs_write+0x1d5/0x270
Jul 04 00:48:52 k3s06 kernel:  __x64_sys_pwrite64+0x96/0xc0
Jul 04 00:48:52 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? restore_fpregs_from_fpstate+0x4e/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? switch_fpu_return+0x4e/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? exit_to_user_mode_prepare+0x96/0xb0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? exit_to_user_mode_prepare+0x96/0xb0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:48:52 k3s06 kernel: RIP: 0033:0x7fcd4983c13f
Jul 04 00:48:52 k3s06 kernel: RSP: 002b:00007fcd43d7b570 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
Jul 04 00:48:52 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007fcd4983c13f
Jul 04 00:48:52 k3s06 kernel: RDX: 0000000000000500 RSI: 00005630cbdec000 RDI: 0000000000000010
Jul 04 00:48:52 k3s06 kernel: RBP: 00005630cbdec000 R08: 0000000000000000 R09: 0000000000000500
Jul 04 00:48:52 k3s06 kernel: R10: 0000000000b6d300 R11: 0000000000000293 R12: 0000000000000500
Jul 04 00:48:52 k3s06 kernel: R13: 0000000000b6d300 R14: 00005630c8968b40 R15: 0000000000b6d300
Jul 04 00:48:52 k3s06 kernel:  </TASK>

The ZFS panic does not show up until a bit later:

Jul 04 01:20:35 k3s06 kernel: PANIC: zfs: adding existent segment to range tree (offset=2941f81f000 size=1000)
Jul 04 01:20:35 k3s06 kernel: Showing stack for process 1111
Jul 04 01:20:35 k3s06 kernel: CPU: 18 PID: 1111 Comm: txg_sync Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 01:20:35 k3s06 kernel: Hardware name: ASRock B660M Steel Legend/B660M Steel Legend, BIOS 10.02 02/10/2023
Jul 04 01:20:35 k3s06 kernel: Call Trace:
Jul 04 01:20:35 k3s06 kernel:  <TASK>
Jul 04 01:20:35 k3s06 kernel:  show_stack+0x52/0x5c
Jul 04 01:20:35 k3s06 kernel:  dump_stack_lvl+0x4a/0x63
Jul 04 01:20:35 k3s06 kernel:  dump_stack+0x10/0x16
Jul 04 01:20:35 k3s06 kernel:  spl_dumpstack+0x29/0x2f [spl]
Jul 04 01:20:35 k3s06 kernel:  vcmn_err.cold+0x60/0x78 [spl]
Jul 04 01:20:35 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 04 01:20:35 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 04 01:20:35 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? __raw_spin_unlock+0x9/0x10 [zfs]
Jul 04 01:20:35 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 04 01:20:35 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 04 01:20:35 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 04 01:20:35 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 04 01:20:35 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 04 01:20:35 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 04 01:20:35 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 04 01:20:35 k3s06 kernel:  kthread+0x127/0x150
Jul 04 01:20:35 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 04 01:20:35 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 04 01:20:35 k3s06 kernel:  </TASK>

@AdamLantos

AdamLantos commented Jul 17, 2023

I was having similar symptoms on a TrueNAS Scale system. My backup pool of 2 disks got corrupted somehow. Scrub is clean, but any time I push a particular dataset to it, it panics. I was able to reproduce the panic by manually deleting one snapshot from the dataset. Export and reimport work, but deleting snapshots immediately panics the system and needs a hard reset, since a normal shutdown never completes after a ZFS panic.

Since the backup pool is somewhat important for me, I enabled zfs_recover parameter and deleted the offending dataset and replicated from scratch. For now the system appears to behave, but I'm not sure if I have confidence in the pool any more. I will keep the system running with zfs_recover enabled for a few days, and will wipe/recreate the pool when this reoccurs.

@reefland
Author

I left it running for about a week with zil_replay_disable and zfs_recover set. All that time, did not have a panic or warning message. Rebooted and completed a full scrub without a panic message.

I've had no more zfs_panic_recover or WARNING: zfs: adding existent segment to range tree since then... close to 2 weeks now.

I assume this is from ZFS constantly rewriting spacemaps during condensing operations. Running the pool in recovery mode for a while eventually removed the duplicated range descriptions during the rewrite operations... or something like that.
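For anyone wanting to try the same workaround, here is a minimal sketch of enabling these tunables on Linux. The `/sys/module/zfs/parameters/` paths assume the standard OpenZFS module parameter layout; verify the names on your system before relying on them:

```shell
# Enable recovery mode at runtime (convert the PANIC into a WARNING)
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover
# Skip ZIL replay on the next import
echo 1 | sudo tee /sys/module/zfs/parameters/zil_replay_disable

# To have the settings survive a reboot, set them as module options
# (filename is arbitrary; any /etc/modprobe.d/*.conf works):
cat <<'EOF' | sudo tee /etc/modprobe.d/zfs-recover.conf
options zfs zfs_recover=1 zil_replay_disable=1
EOF
```

Remember to revert both tunables once the pool has been stable for a while; running with ZIL replay disabled long-term risks losing recent synchronous writes after a crash.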

@satmandu
Contributor

satmandu commented Aug 14, 2023

I'm also seeing this on a pool using kernel 6.5-rc6. Here is my zdb output: https://dpaste.com/DL9KS6SQ5

I've set boot variables for vfs.zfs.zil.replay_disable=1 and vfs.zfs.zfs_recover=1 for now, so hopefully that helps?

@satmandu
Contributor

@reefland Is there a way to force spacemap condensing to happen faster to try to resolve this?

@reefland
Author

@satmandu - sorry, no idea. No clue how to monitor the spacemap condensing operation - trial and error. Hopefully someone more knowledgeable can answer.

  • Six weeks now without a PANIC
  • Completed an 11-hour scrub without issue: scan: scrub repaired 0B in 11:13:59 with 0 errors on Sun Aug 13 11:38:00 2023

@satmandu
Contributor

Thanks. My pool has clean scrubs, but my system still panics if I mount the pool without the special zfs options... but I also haven't had the pool mounted with the special options for more than 16 hours...

@rincebrain
Contributor

rincebrain commented Aug 23, 2023

If rewriting the spacemaps removes the assertion, that might suggest you still have lurking excitement eventually, since in theory that should be equally upset by duplicate range insertion, no?

My wild guess would be when it threw out the log after you imported the pool and unlocked the datasets with replaying disabled, it no longer had a pending insertion that was invalid. But that's just a wild guess.

@satmandu
Contributor

I hit this again... and had to set zfs.zil.replay_disable=1 and zfs.zfs_recover=1 as kernel cmdline arguments to get boot to succeed.

@reefland
Author

I've had this multiple times on different systems (it has not recurred on the same system). It's just a process I follow now to recover the system. At least it's a workaround. I'm no closer to finding a cause or pattern.

@satmandu
Contributor

I destroyed a zfs subvolume (that may have had this issue because that's where intense writes from docker were happening) and still have the issue when importing the parent volume.

I will copy all the data to a fresh volume and recreate the volume.

If I delete all the data and the volume still has these issues, could someone look at an image of the disk to help figure out how to repair it, even if the root cause of how this damage is being created isn't immediately identified?

Would that be useful at all?

@reefland
Author

@satmandu - I can't comment on your disk image question.

But I've had to destroy a number of ZVOLs (extract data, delete ZVOL, recreate, restore, etc) related to this as well. I've since switched all ZVOLs to forced sync writes. Yeah, performance impact, but I haven't had any ZVOLs get corrupted since.

I was frequently getting this after a ZFS panic:

errors: Permanent errors have been detected in the following files:
        rpool/rancher:<0x1> 

Multiple scrubs were not able to fix it. I'm using CRI / K3s instead of Docker, but we're probably using ZVOLs for similar reasons.
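Forcing synchronous writes as described above is a one-line property change. A sketch, with a placeholder dataset name (substitute your own ZVOL path):

```shell
# Force all writes to the ZVOL to be synchronous; trades throughput
# for durability. "pool/my-zvol" is a placeholder name.
zfs set sync=always pool/my-zvol

# Confirm the property took effect
zfs get sync pool/my-zvol
```

The property takes effect immediately and is inherited by child datasets, so it can also be set at a parent dataset if all ZVOLs under it should behave the same way.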

@satmandu
Contributor

I wonder if we can create a test case that amplifies reads/writes to replicate the issue.

Or maybe if there is an existing test case we should be trying on our hardware?

This sounds like it should be replicable in a test environment where one is reaching the I/O limits of a disk subsystem or a disk.
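One way to approximate that kind of load, sketched with fio. This is only an illustration of a stress pattern, not a known reproducer; it assumes fio is installed and that the target is a scratch ZVOL you can safely overwrite:

```shell
# Hammer a scratch ZVOL with small fsync-heavy random writes.
# /dev/zvol/pool/testvol is a placeholder; DO NOT point this at a
# device holding data you care about.
fio --name=zvol-stress \
    --filename=/dev/zvol/pool/testvol \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --fsync=1 --direct=1 \
    --time_based --runtime=600
```

Running this concurrently with snapshot creation and deletion on the same pool would come closest to the workloads described in this thread.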

@kaitallaoua

I've also encountered this when moving an NVMe drive to another system; I don't know what I did wrong. The only thing I can think of is inserting it into the new system while the PSU was plugged in but the PC was off. I forgot to unplug it. I also noticed getting I/O errors; I don't think the drive is faulty, but I haven't tested it yet. The computer passed memtest.

@dodexahedron
Contributor

dodexahedron commented Nov 5, 2024

I've been encountering this on 2.3.0-rc2 after deleting large files (4+GB each) from a pool with dedup active.

PANIC: zfs: adding existent segment to range tree (offset=78cca32000 size=6000)
Showing stack for process 11146
CPU: 0 UID: 0 PID: 11146 Comm: txg_sync Tainted: P           OE      6.11.5-1.el8.elrepo.x86_64 #1
Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: Supermicro AS -2014TP-HTR/H12SST-PS, BIOS 2.8 01/19/2024
Call Trace:
 <TASK>
 dump_stack_lvl+0x60/0x80
 vcmn_err+0xca/0x110 [spl]
 ? bt_grow_leaf+0xd0/0x160 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? arc_access+0x295/0x4a0 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? arc_buf_access+0xf8/0x180 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? dbuf_read+0x287/0x640 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? mutex_lock+0xe/0x30
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 zfs_panic_recover+0x6d/0x90 [zfs]
 range_tree_add_impl+0x244/0xe30 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? dbuf_read+0x287/0x640 [zfs]
 ? srso_return_thunk+0x5/0x5f
 range_tree_remove_xor_add_segment+0x513/0x610 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 range_tree_remove_xor_add+0x6e/0x170 [zfs]
 metaslab_sync+0x277/0x950 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? dmu_tx_create_dd+0x7c/0xc0 [zfs]
 ? srso_return_thunk+0x5/0x5f
 vdev_sync+0x6b/0x1e0 [zfs]
 spa_sync_iterate_to_convergence+0x185/0x2f0 [zfs]
 spa_sync+0x4a3/0x980 [zfs]
 txg_sync_thread+0x206/0x3a0 [zfs]
 ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
 ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
 thread_generic_wrapper+0x59/0x70 [spl]
 kthread+0xce/0x100
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x30/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

After it occurs, zpool scrub hangs and ends up causing these:

INFO: task txg_sync:11146 blocked for more than 122 seconds.
      Tainted: P           OE      6.11.5-1.el8.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:txg_sync        state:D stack:0     pid:11146 tgid:11146 ppid:2      flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x225/0x5b0
 schedule+0x23/0xa0
 vcmn_err+0xd7/0x110 [spl]
 ? bt_grow_leaf+0xd0/0x160 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? arc_access+0x295/0x4a0 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? arc_buf_access+0xf8/0x180 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? dbuf_read+0x287/0x640 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? mutex_lock+0xe/0x30
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 zfs_panic_recover+0x6d/0x90 [zfs]
 range_tree_add_impl+0x244/0xe30 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? dbuf_read+0x287/0x640 [zfs]
 ? srso_return_thunk+0x5/0x5f
 range_tree_remove_xor_add_segment+0x513/0x610 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 range_tree_remove_xor_add+0x6e/0x170 [zfs]
 metaslab_sync+0x277/0x950 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? dmu_tx_create_dd+0x7c/0xc0 [zfs]
 ? srso_return_thunk+0x5/0x5f
 vdev_sync+0x6b/0x1e0 [zfs]
 spa_sync_iterate_to_convergence+0x185/0x2f0 [zfs]
 spa_sync+0x4a3/0x980 [zfs]
 txg_sync_thread+0x206/0x3a0 [zfs]
 ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
 ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
 thread_generic_wrapper+0x59/0x70 [spl]
 kthread+0xce/0x100
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x30/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

...

INFO: task zpool:876884 blocked for more than 614 seconds.
      Tainted: P           OE      6.11.5-1.el8.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:zpool           state:D stack:0     pid:876884 tgid:876884 ppid:875951 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x225/0x5b0
 schedule+0x23/0xa0
 io_schedule+0x42/0x70
 cv_wait_common+0xab/0x130 [spl]
 ? __pfx_autoremove_wake_function+0x10/0x10
 txg_wait_synced_impl+0xfc/0x140 [zfs]
 txg_wait_synced+0xc/0x40 [zfs]
 dsl_sync_task_common+0x1bf/0x2a0 [zfs]
 ? __pfx_dsl_scan_setup_sync+0x10/0x10 [zfs]
 ? __pfx_dsl_scan_setup_check+0x10/0x10 [zfs]
 ? __pfx_dsl_scan_setup_check+0x10/0x10 [zfs]
 ? __pfx_dsl_scan_setup_sync+0x10/0x10 [zfs]
 dsl_sync_task+0x16/0x20 [zfs]
 dsl_scan+0xaa/0x220 [zfs]
 zfs_ioc_pool_scrub+0x9c/0x160 [zfs]
 zfsdev_ioctl_common+0x297/0x760 [zfs]
 ? srso_return_thunk+0x5/0x5f
 zfsdev_ioctl+0x4f/0xd0 [zfs]
 __x64_sys_ioctl+0x93/0xd0
 do_syscall_64+0x60/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fd909c3922b
RSP: 002b:00007ffe74bfbde8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000005a57 RCX: 00007fd909c3922b
RDX: 00007ffe74bfbe10 RSI: 0000000000005a57 RDI: 0000000000000004
RBP: 00007ffe74bff3f0 R08: 0000000000000000 R09: 0000000016fb9880
R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffe74bfbe10
R13: 0000000000005a57 R14: 0000000016f91c00 R15: 0000000000000000
 </TASK>

These continue indefinitely and are only resolved by a reboot, after which a scrub works as normal and doesn't find any problems.

Seeing as how a txg sync never happened, though, the pool may be consistent, but that suggests data loss, no?

Other pool activity appears to continue while those hung tasks remain hung forever.

If it helps at all, there are SHA512 and BLAKE3 ZAPs on that pool, and they're quite big for the pool, so I'm trying to get rid of the old SHA512 data so a big chunk can go away. It was twice as large before I got rid of the SHA256 ZAP that was also there from even earlier.

Here's a zdb -DD on that pool:

DDT-sha512-zap-duplicate: 5950988 entries, size 14914019328 on disk, 4815867904 in core
DDT-sha512-zap-unique: 39829815 entries, size 93688590336 on disk, 30253473792 in core
DDT-blake3-zap-duplicate: 903049 entries, size 413208576 on disk, 268533760 in core
DDT-blake3-zap-unique: 11498977 entries, size 4799926272 on disk, 3097395200 in core
DDT-log-blake3-0: 18590 log entries
DDT-log-blake3-1: 87105 log entries

@dodexahedron
Contributor

dodexahedron commented Nov 6, 2024

Confirmed that the delete of two CentOS ISOs (11GB each) that caused the most recent instance of this did not actually get committed to the pool, so must have been in that hung txg.

Upon further testing, it happens when trying to remove one of those files, in particular.

Reboot after the hang and the files are back.

Removed the "safe" one and that was fine. Ran a zpool sync. No problem.

Remove the other one, and it immediately panicked exactly as above, and that txg never got committed.

Reboot after that, and the problem file is back again. Deleted it, and the same thing happened again. So I can recreate this panic on demand, with next to no load on the system at all.

Anything y'all want me to look at specifically?

@dodexahedron
Contributor

Also note that the file appears to be gone after deleting it, even though that txg is uncommitted. It only reappears after reboot and import of the pool.

@Neurrone

Neurrone commented Jan 3, 2025

This reliably triggers for me if I try to delete a specific snapshot. Any ideas on how to fix this?

I'll try doing a scrub but from the above, it seems like that won't help.

[  591.982595] PANIC: zfs: adding existent segment to range tree (offset=1265b374000 size=7a000)
[  591.982604] Showing stack for process 1211
[  591.982608] CPU: 13 PID: 1211 Comm: txg_sync Tainted: P           O       6.8.12-5-pve #1
[  591.982614] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING WIFI II, BIOS 3607 03/22/2024
[  591.982618] Call Trace:
[  591.982622]  <TASK>
[  591.982626]  dump_stack_lvl+0x76/0xa0
[  591.982632]  dump_stack+0x10/0x20
[  591.982637]  vcmn_err+0xdb/0x130 [spl]
[  591.982649]  zfs_panic_recover+0x75/0xa0 [zfs]
[  591.982749]  range_tree_add_impl+0x27f/0x11c0 [zfs]
[  591.982845]  range_tree_remove_xor_add_segment+0x543/0x5a0 [zfs]
[  591.982932]  ? dmu_buf_rele+0x3b/0x50 [zfs]
[  591.983022]  range_tree_remove_xor_add+0x10c/0x1f0 [zfs]
[  591.983113]  metaslab_sync+0x27f/0x950 [zfs]
[  591.983203]  ? ktime_get_raw_ts64+0x41/0xd0
[  591.983210]  ? mutex_lock+0x12/0x50
[  591.983215]  vdev_sync+0x73/0x4d0 [zfs]
[  591.983303]  ? spa_log_sm_set_blocklimit+0x17/0xc0 [zfs]
[  591.983392]  ? srso_alias_return_thunk+0x5/0xfbef5
[  591.983397]  ? mutex_lock+0x12/0x50
[  591.983402]  spa_sync+0x62e/0x1050 [zfs]
[  591.983491]  ? srso_alias_return_thunk+0x5/0xfbef5
[  591.983495]  ? spa_txg_history_init_io+0x120/0x130 [zfs]
[  591.983583]  txg_sync_thread+0x207/0x3a0 [zfs]
[  591.983668]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[  591.983748]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[  591.983757]  thread_generic_wrapper+0x5f/0x70 [spl]
[  591.983764]  kthread+0xf2/0x120
[  591.983769]  ? __pfx_kthread+0x10/0x10
[  591.983774]  ret_from_fork+0x47/0x70
[  591.983779]  ? __pfx_kthread+0x10/0x10
[  591.983783]  ret_from_fork_asm+0x1b/0x30
[  591.983790]  </TASK>
[  738.223428] INFO: task txg_sync:1211 blocked for more than 122 seconds.
[  738.223444]       Tainted: P           O       6.8.12-5-pve #1
[  738.223450] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  738.223455] task:txg_sync        state:D stack:0     pid:1211  tgid:1211  ppid:2      flags:0x00004000
[  738.223465] Call Trace:
[  738.223470]  <TASK>
[  738.223479]  __schedule+0x401/0x15e0
[  738.223494]  schedule+0x33/0x110
[  738.223502]  vcmn_err+0xe8/0x130 [spl]
[  738.223524]  zfs_panic_recover+0x75/0xa0 [zfs]
[  738.223723]  range_tree_add_impl+0x27f/0x11c0 [zfs]
[  738.223868]  range_tree_remove_xor_add_segment+0x543/0x5a0 [zfs]
[  738.223976]  ? dmu_buf_rele+0x3b/0x50 [zfs]
[  738.224093]  range_tree_remove_xor_add+0x10c/0x1f0 [zfs]
[  738.224194]  metaslab_sync+0x27f/0x950 [zfs]
[  738.224284]  ? ktime_get_raw_ts64+0x41/0xd0
[  738.224293]  ? mutex_lock+0x12/0x50
[  738.224298]  vdev_sync+0x73/0x4d0 [zfs]
[  738.224386]  ? spa_log_sm_set_blocklimit+0x17/0xc0 [zfs]
[  738.224477]  ? srso_alias_return_thunk+0x5/0xfbef5
[  738.224484]  ? mutex_lock+0x12/0x50
[  738.224488]  spa_sync+0x62e/0x1050 [zfs]
[  738.224578]  ? srso_alias_return_thunk+0x5/0xfbef5
[  738.224583]  ? spa_txg_history_init_io+0x120/0x130 [zfs]
[  738.224671]  txg_sync_thread+0x207/0x3a0 [zfs]
[  738.224757]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[  738.224848]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[  738.224857]  thread_generic_wrapper+0x5f/0x70 [spl]
[  738.224865]  kthread+0xf2/0x120
[  738.224873]  ? __pfx_kthread+0x10/0x10
[  738.224877]  ret_from_fork+0x47/0x70
[  738.224884]  ? __pfx_kthread+0x10/0x10
[  738.224887]  ret_from_fork_asm+0x1b/0x30
[  738.224895]  </TASK>
[  861.103822] INFO: task txg_sync:1211 blocked for more than 245 seconds.
[  861.103838]       Tainted: P           O       6.8.12-5-pve #1
[  861.103843] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  861.103849] task:txg_sync        state:D stack:0     pid:1211  tgid:1211  ppid:2      flags:0x00004000
[  861.103858] Call Trace:
[  861.103863]  <TASK>
[  861.103870]  __schedule+0x401/0x15e0
[  861.103885]  schedule+0x33/0x110
[  861.103892]  vcmn_err+0xe8/0x130 [spl]
[  861.103918]  zfs_panic_recover+0x75/0xa0 [zfs]
[  861.104157]  range_tree_add_impl+0x27f/0x11c0 [zfs]
[  861.104302]  range_tree_remove_xor_add_segment+0x543/0x5a0 [zfs]
[  861.104450]  ? dmu_buf_rele+0x3b/0x50 [zfs]
[  861.104540]  range_tree_remove_xor_add+0x10c/0x1f0 [zfs]
[  861.104638]  metaslab_sync+0x27f/0x950 [zfs]
[  861.104728]  ? ktime_get_raw_ts64+0x41/0xd0
[  861.104737]  ? mutex_lock+0x12/0x50
[  861.104742]  vdev_sync+0x73/0x4d0 [zfs]
[  861.104830]  ? spa_log_sm_set_blocklimit+0x17/0xc0 [zfs]
[  861.104930]  ? srso_alias_return_thunk+0x5/0xfbef5
[  861.104938]  ? mutex_lock+0x12/0x50
[  861.104942]  spa_sync+0x62e/0x1050 [zfs]
[  861.105031]  ? srso_alias_return_thunk+0x5/0xfbef5
[  861.105036]  ? spa_txg_history_init_io+0x120/0x130 [zfs]
[  861.105124]  txg_sync_thread+0x207/0x3a0 [zfs]
[  861.105210]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[  861.105291]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[  861.105299]  thread_generic_wrapper+0x5f/0x70 [spl]
[  861.105307]  kthread+0xf2/0x120
[  861.105315]  ? __pfx_kthread+0x10/0x10
[  861.105320]  ret_from_fork+0x47/0x70
[  861.105326]  ? __pfx_kthread+0x10/0x10
[  861.105330]  ret_from_fork_asm+0x1b/0x30
[  861.105337]  </TASK>
[  983.984297] INFO: task txg_sync:1211 blocked for more than 368 seconds.
[  983.984313]       Tainted: P           O       6.8.12-5-pve #1
[  983.984319] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  983.984324] task:txg_sync        state:D stack:0     pid:1211  tgid:1211  ppid:2      flags:0x00004000
[  983.984334] Call Trace:
[  983.984339]  <TASK>
[  983.984345]  __schedule+0x401/0x15e0
[  983.984358]  schedule+0x33/0x110
[  983.984365]  vcmn_err+0xe8/0x130 [spl]
[  983.984382]  zfs_panic_recover+0x75/0xa0 [zfs]
[  983.984610]  range_tree_add_impl+0x27f/0x11c0 [zfs]
[  983.984812]  range_tree_remove_xor_add_segment+0x543/0x5a0 [zfs]
[  983.985060]  ? dmu_buf_rele+0x3b/0x50 [zfs]
[  983.985326]  range_tree_remove_xor_add+0x10c/0x1f0 [zfs]
[  983.985577]  metaslab_sync+0x27f/0x950 [zfs]
[  983.985778]  ? ktime_get_raw_ts64+0x41/0xd0
[  983.985790]  ? mutex_lock+0x12/0x50
[  983.985796]  vdev_sync+0x73/0x4d0 [zfs]
[  983.985887]  ? spa_log_sm_set_blocklimit+0x17/0xc0 [zfs]
[  983.985989]  ? srso_alias_return_thunk+0x5/0xfbef5
[  983.985997]  ? mutex_lock+0x12/0x50
[  983.986002]  spa_sync+0x62e/0x1050 [zfs]
[  983.986092]  ? srso_alias_return_thunk+0x5/0xfbef5
[  983.986096]  ? spa_txg_history_init_io+0x120/0x130 [zfs]
[  983.986184]  txg_sync_thread+0x207/0x3a0 [zfs]
[  983.986270]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[  983.986351]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[  983.986359]  thread_generic_wrapper+0x5f/0x70 [spl]
[  983.986367]  kthread+0xf2/0x120
[  983.986375]  ? __pfx_kthread+0x10/0x10
[  983.986379]  ret_from_fork+0x47/0x70
[  983.986386]  ? __pfx_kthread+0x10/0x10
[  983.986390]  ret_from_fork_asm+0x1b/0x30
[  983.986398]  </TASK>

@Neurrone

Neurrone commented Jan 3, 2025

Unfortunately, a scrub didn't help. The scrub shows no errors, but the issue persists.

@Neurrone

Neurrone commented Jan 3, 2025

In case this is useful:

> zdb -y tank
Verifying deleted livelist entries
Verifying metaslab entries
verifying concrete vdev 0, metaslab 11 of 1164 ...ASSERT at cmd/zdb/zdb.c:482:verify_livelist_allocs()
((size) >> (9)) - (0) < 1ULL << (24) (0x10459f8 < 0x1000000)
  PID: 46353     COMM: zdb
  TID: 46353     NAME: zdb
Call trace:
  /lib/x86_64-linux-gnu/libzpool.so.5(libspl_assertf+0x157) [0x79be2fc38627]
  zdb(+0xe5d0) [0x61b154a645d0]
  /lib/x86_64-linux-gnu/libzpool.so.5(space_map_iterate+0x32f) [0x79be2fabe24f]
  zdb(+0x13a7f) [0x61b154a69a7f]
  zdb(+0x212ae) [0x61b154a772ae]
  zdb(+0xafd1) [0x61b154a60fd1]
  /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x79be2f2e224a]
  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x79be2f2e2305]
  zdb(+0xc7f1) [0x61b154a627f1]
zdb(+0x13db3)[0x61b154a69db3]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x79be2f2f7050]
/lib/x86_64-linux-gnu/libc.so.6(+0x8aebc)[0x79be2f345ebc]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x12)[0x79be2f2f6fb2]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x79be2f2e1472]
/lib/x86_64-linux-gnu/libzpool.so.5(+0x57a97)[0x79be2f993a97]
zdb(+0xe5d0)[0x61b154a645d0]
/lib/x86_64-linux-gnu/libzpool.so.5(space_map_iterate+0x32f)[0x79be2fabe24f]
zdb(+0x13a7f)[0x61b154a69a7f]
zdb(+0x212ae)[0x61b154a772ae]
zdb(+0xafd1)[0x61b154a60fd1]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x79be2f2e224a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x79be2f2e2305]
zdb(+0xc7f1)[0x61b154a627f1]
Aborted

This pool seems to be beyond recovery, so I'll try destroying and rebuilding it.

@dodexahedron
Contributor

Unfortunately, a scrub didn't help. The scrub shows no errors, but the issue persists.

Not only does a scrub not find it (since the checksums are "correct"), it can also destroy data from older snapshots, apparently, as well as cause snapshots to be unremovable, filesystems to be unmountable, and all sorts of other weird stuff.

It's very hard to trace the cause, as there may be multiple bugs at play, with multiple different as well as related causes.

I do have an easier time eventually reproducing the bug on test pools that have dedup enabled and have any one or more of the following situations on top of it:

  • Had legacy dedup before upgrade, and now have fast dedup using a different dedup checksum algo
  • Started with fast dedup on 2.3, and then changed dedup checksum algo
  • Started without dedup on any version, but then enabled dedup on an existing dataset, with any algo.
  • Changed compression algo or level on a dataset subject to dedup while fast dedup is active
  • Changed recordsize - especially reductions that are likely to result in multiple records per file for new writes.
  • Created by zfs receive from a send of a dataset that had recordsize changes in its past.

And other similar situations. In short, it seems at least from my un-scientific observations (since I haven't been able to explicitly pinpoint the origin of the failure), that there may be some code in the fast dedup that isn't accounting for the possibility of different compression/checksum/encryption algos, record sizes, etc. Perhaps it's some sort of off-by-one or other length-related miscalculation? I'm of course totally spitballing there.
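To make those suspect transitions concrete, here is a throwaway repro sketch. Everything in it is an assumption on my part: the pool name and backing-file path are made up, none of these steps is a confirmed trigger, and it needs root plus a working OpenZFS install; it merely automates the dedup/checksum transition described above.

```shell
# Hypothetical repro harness for the dedup transitions listed above.
# Pool name and backing file are made up; nothing here is a confirmed
# trigger -- it only automates the suspect configuration changes.
ddt_repro() {
    pool=${1:-ddtrepro}
    img=${2:-/var/tmp/ddtrepro.img}
    if [ "$(id -u)" -ne 0 ] || ! command -v zpool >/dev/null 2>&1; then
        echo "SKIP: needs root and an OpenZFS install"
        return 0
    fi
    truncate -s 2G "$img"
    zpool create -O dedup=on -O checksum=sha256 "$pool" "$img" 2>/dev/null \
        || { echo "SKIP: zpool create failed (no ZFS module?)"; return 0; }
    # Write data, then duplicate it so the DDT gets real hits.
    dd if=/dev/urandom of="/$pool/blob" bs=1M count=64 2>/dev/null
    cp "/$pool/blob" "/$pool/blob.dup"
    # Suspect transition: switch the checksum (and hence dedup) algo
    # after deduplicated data already exists, then rewrite the same data.
    zfs set checksum=blake3 "$pool"
    cp "/$pool/blob" "/$pool/blob.dup2"
    zpool sync "$pool"
    echo "DONE: watch dmesg for 'adding existent segment to range tree'"
}
```

After running it, delete files or snapshots and sync again while watching the kernel log; clean up with `zpool destroy ddtrepro` when done.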

Also, I've anecdotally encountered this more often when blake3 is the checksum/dedup algo in use. Skein and the SHA variants, across almost the same number of test runs, have hit it less often, but the set of test data that's based on is small enough that it's not that strong an indicator. I just happened to also notice a few other bugs involving blake3, so it seemed worth mentioning.

The test suite being run is not very rigorous, and in dedicated testing I have only once reproduced the issue without a forced ungraceful halt/panic, among dozens of test batteries that otherwise involved hard-stopping a VM, killing a network connection, or other ways of simulating actual hard system failures.

The thing is that it is similar to behavior I've encountered for years with dedup on ZFS, since the 0.x days, but which has until now usually required some combination of time, high load, and I don't know what other triggers to cause, because it's had delayed onset seeming to be as much as a year from the time data was originally written. Or at least I think it's delayed onset, given the data that has been destroyed by the failures. I suppose it could be happening later on, but again there's been no way I am familiar with to trace the origin of it. I've NEVER been able to trace it deterministically to more than dedup, of some variety, being in use, and something causing one or more of the txg commit queue tasks to hard lock up. The above list is even just me trying to filter out trends...

If there's any specific things I can do to provide helpful data on this, I'm all ears!

@dodexahedron
Contributor

dodexahedron commented Jan 11, 2025

This pool seems to be beyond recovery, so I'll try destroying and rebuilding it.

With this issue, considering it's a silent corruption thing, that is probably the best course of action anyway.

But there's no guarantee you won't hit it again, if using the same configuration for everything. 🤷‍♂️

@Neurrone

Not only does a scrub not find it, since the checksums are "correct..." It also can destroy data from older snapshots, apparently, as well as cause snapshots to be unremovable, file systems to not be un-mountable, and all sorts of other weird stuff.

Yikes. I didn't leave the pool around long enough to find out, I destroyed and recreated from backup when I saw the scrub didn't help.

My pool is a very standard configuration, essentially the defaults with encryption. I've not changed anything else or enabled deduplication or changed the record size.

For now, I'm running zdb -y daily to detect similar corruption. If this happens a second time, though, I'll have to abandon ZFS and look for alternatives. I understand this is a difficult issue to debug, but the lack of comments from the developers for years is very concerning. I too would be happy to run commands to gather data if it happens again.
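A minimal sketch of that daily check (my own wrapper, not the actual script; it assumes `zdb` is in PATH, and the `ZDB_CMD` override exists only so the function can be exercised on a machine without a pool):

```shell
# check_pool: run `zdb -y` against a pool and report pass/fail, suitable
# for a daily cron job. ZDB_CMD is a hypothetical override used only to
# test the wrapper on a machine without a real pool.
check_pool() {
    pool=$1
    log=$(mktemp)
    if ${ZDB_CMD:-zdb -y} "$pool" >"$log" 2>&1; then
        echo "OK: $pool passed zdb -y"
        rm -f "$log"
        return 0
    fi
    # zdb aborts on the failed ASSERT, so any nonzero exit is the signal.
    echo "ALERT: zdb -y failed for $pool; output kept at $log"
    return 1
}
```

Cron can then do something like `check_pool tank || mail -s 'zdb -y failed' admin@example.com` (hypothetical address). Note that later in the thread it's pointed out that a zdb -y failure by itself isn't proof of pool damage.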

@aerusso
Contributor

aerusso commented Jan 11, 2025

@Neurrone Could you give us the version of ZFS you encountered this on? What Linux (or BSD) distribution, and how old the zpool was (i.e., roughly what versions of ZFS the pool was used with)?

It would be helpful for understanding if this is some old Ubuntu-specific bug from the 2.1 era, or if this is definitely due to a modern ZFS problem.

@dodexahedron
Contributor

@Neurrone Could you give us the version of ZFS you encountered this on? What Linux (or BSD) distribution, and how old the zpool was (i.e., roughly what versions of ZFS the pool was used with)?

It would be helpful for understanding if this is some old Ubuntu-specific bug from the 2.1 era, or if this is definitely due to a modern ZFS problem.

FWIW, I've encountered the specific behavior of the posted issue on:

  • CentOS 8 running 6.11 and 6.12 mainline kernels from elrepo
  • RHEL 8 running 6.10 and 6.11 mainline kernels from elrepo
  • Ubuntu 24.10 running xanmod-x64v3 6.11 and 6.12 kernels
  • Ubuntu 24.04 running ubuntu linux-generic-hwe-24.04 kernels (6.8 series)
  • Ubuntu 24.10 running ubuntu linux-generic-hwe-24.04 kernels (6.11 series)

@aerusso
Contributor

aerusso commented Jan 12, 2025

@dodexahedron Has this only affected encrypted datasets? And are these datasets new, or do they "go way back", perhaps being products of zfs send from much earlier zpools? There has been significant recent progress on replicating other hard-to-debug issues. It's possible this issue is related, or at least susceptible to similar approaches.

Also, the ZFS kernel version (NOT just the userspace version) is very helpful for this kind of triage. You can get that with zfs --version, just give both lines.

I think that encryption may not be seen as a priority by the big players in the ZFS development space. Issues that only affect encrypted datasets probably do not reflect on the overall status of the rest of the project (or at least I have not seen any such pernicious problems outside encryption).

@dodexahedron
Contributor

No, I honestly don't think encryption has anything at all to do with it, in the end.

Long-winded details interspersed with direct answers to your questions, though:

Last system to encounter this stack trace was running zfs-2.3.0-rc4/zfs-kmod-2.3.0-rc4, built on the local machine with the same major/minor version of gcc (13.2) as the kernel, and with no extra parameters to ./configure. That's a CentOS Stream 8 box on kernel Linux nunya.biz 6.12.8-1.el8.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC

I have not yet seen this stack trace on test machines with rc5 and otherwise identical, but I do still consistently see the other bad behavior noted earlier.

Pools on the test machines are never from sends/receives, for these. They are always original pools, and are pretty young by the time they hit this, on the test systems. However, a much older pool that also was never received from elsewhere, originally, had thrown this stack trace as well, plus the data loss and pool dysfunction noted, which was what started me looking into it and testing with this stack trace in mind, specifically.

But it non-deterministically occurs along with other dedup-related issues, which I suspect are probably closely related anyway and this is just a new symptom from a new feature. They're lockups too, but with different stack traces. So seems like race conditions/concurrency issues by my guess, and I have a hunch this is the result of damage caused by the other problems in the first place.

I can now somewhat reliably recreate them on brand-new pools by running certain workloads on top of them. It only ever happens when dedup is in the picture, under very bursty IOP load with a high-ish amount of sync IO from the consumers (iSCSI and NFS serving up VMDKs for VMs on ESXi hosts). The time it takes to hit a problem is variable, but it usually happens at some point within the next day or two (sometimes much quicker) with the same workload and configuration. The easiest way I've found to make it happen is to throw a large MSSQL database on top of it and then do something like rebuilding all indexes on a 20GB table, which definitely causes a ton of IO, a lot of it sync.

While I have had encrypted datasets hit this problem, I haven't bothered to keep testing with encryption, since it happens just as easily without it. So I think encryption is a red herring here as well; if anything, it's just exacerbating things when it's also in play.

Once a lockup occurs, I've tried to revive it in various ways, most recently by writing a 1 to spl_taskq_kick. But that doesn't result in any changes to any threads, and in fact no other visible change in the state of the machine at all, so something is definitely hard locked. It always requires a hard power cycle to successfully reboot.

However, I should note that the extent of the data destruction seems significantly smaller (as far as I can tell) with rc5, though anything written from the time of the lockup forward is of course still lost. That data never gets replayed, and I suspect it never actually made it to the ZIL in the first place. Upon reboot, the pool imports like nothing happened and life goes on until the missing data is noticed (which, with iSCSI on top of zvols, is sometimes easier to spot, given ZFS has no clue whether what was handed to it is valid at a higher level).

@dodexahedron
Contributor

Oh, also: no errors of any kind are reported by the system, the hardware, or ZFS leading up to this. The first indication is the stuck IO, then, upon investigation, the hung task, along with the inability to run pretty much any zfs/zpool command, to unload the zfs module, or even to stop other modules with indirect hard dependencies on it (like scst, serving up those zvols), and other issues stemming from having storage locked up in a way the rest of the system can't really see.

@Neurrone

@aerusso The pool was created on TrueNas Scale 24.04 with ZFS 2.2.4-2 in May 2024.

I migrated to Proxmox 8.3 and reimported this pool in Dec 2024. In Proxmox, the version of ZFS is zfs-2.2.6-pve1 and zfs-kmod-2.2.6-pve1.

@Neurrone

Would it be helpful to consider another angle of attack: allowing for better recovery if this corruption happens, as suggested in #13995?

@Neurrone

Neurrone commented Jan 19, 2025

After I recreated the pool that got corrupted, it's happening again, but this time I have advance warning because of the script I run daily to check for metaslab corruption.

I'm at my wits' end about what else to try, since disabling encryption wouldn't help, per the comments above.

I'd be happy to invest the time to work with the ZFS devs to get to the bottom of this if it's prioritized. Otherwise, I'll be migrating everything off ZFS to BTRFS, since I can't live in fear of my data being corrupted every month, or keep absorbing the downtime of recreating the pool from backups with no fix in sight.

> zdb -y tank
Verifying deleted livelist entries
Verifying metaslab entries
verifying concrete vdev 0, metaslab 169 of 1164 ...ASSERT at cmd/zdb/zdb.c:482:verify_livelist_allocs()
((size) >> (9)) - (0) < 1ULL << (24) (0x1e87a28 < 0x1000000)
  PID: 522372    COMM: zdb
  TID: 522372    NAME: zdb
Call trace:
  /lib/x86_64-linux-gnu/libzpool.so.5(libspl_assertf+0x157) [0x78482f957627]
  /usr/sbin/zdb(+0xe5d0) [0x5b362da4a5d0]
  /lib/x86_64-linux-gnu/libzpool.so.5(space_map_iterate+0x32f) [0x78482f7dd24f]
  /usr/sbin/zdb(+0x13a7f) [0x5b362da4fa7f]
  /usr/sbin/zdb(+0x212ae) [0x5b362da5d2ae]
  /usr/sbin/zdb(+0xafd1) [0x5b362da46fd1]
  /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x78482f00124a]
  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x78482f001305]
  /usr/sbin/zdb(+0xc7f1) [0x5b362da487f1]
/usr/sbin/zdb(+0x13db3)[0x5b362da4fdb3]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x78482f016050]
/lib/x86_64-linux-gnu/libc.so.6(+0x8aebc)[0x78482f064ebc]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x12)[0x78482f015fb2]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x78482f000472]
/lib/x86_64-linux-gnu/libzpool.so.5(+0x57a97)[0x78482f6b2a97]
/usr/sbin/zdb(+0xe5d0)[0x5b362da4a5d0]
/lib/x86_64-linux-gnu/libzpool.so.5(space_map_iterate+0x32f)[0x78482f7dd24f]
/usr/sbin/zdb(+0x13a7f)[0x5b362da4fa7f]
/usr/sbin/zdb(+0x212ae)[0x5b362da5d2ae]
/usr/sbin/zdb(+0xafd1)[0x5b362da46fd1]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x78482f00124a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x78482f001305]
/usr/sbin/zdb(+0xc7f1)[0x5b362da487f1]
Aborted

The pool is a mirror of Seagate Exos X20 20TB HDDs with encryption enabled and all other settings unchanged from defaults. No dedup is being used. It is storing bulk media and is a backup target for my VMs.

@Neurrone

Neurrone commented Jan 24, 2025

I noticed the pool has the block cloning feature enabled, but the tunable is set to off by default in Proxmox, so block cloning shouldn't be happening.

> cat /sys/module/zfs/parameters/zfs_bclone_enabled
0

So whatever this is shouldn't involve block cloning.
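That check can be scripted; here is a sketch (the `BCLONE_PARAM` override is my own invention so the function can be tested off a ZFS box; the real path is the module parameter shown above):

```shell
# Succeeds when the runtime tunable has block cloning disabled. With
# zfs_bclone_enabled=0, clone requests fall back to ordinary copies even
# if the pool's block_cloning feature flag reads "enabled".
# BCLONE_PARAM is a hypothetical override for testing without ZFS.
bclone_disabled() {
    param=${BCLONE_PARAM:-/sys/module/zfs/parameters/zfs_bclone_enabled}
    [ -r "$param" ] && [ "$(cat "$param")" = "0" ]
}
```

On a pool, remember the distinction: the feature flag only becomes "active" once a clone has actually been made, so "enabled" plus a tunable of 0 means no cloned blocks should exist.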

@putnam

putnam commented Jan 29, 2025

I haven't run into any panics, but running zdb -y on my pool crashes in the same way when it eventually fails the same ASSERT:

# zdb -y tank
Verifying deleted livelist entries
Verifying metaslab entries
verifying concrete vdev 0, metaslab 7862 of 12806 ...ASSERT at cmd/zdb/zdb.c:482:verify_livelist_allocs()
((size) >> (9)) - (0) < 1ULL << (24) (0x12e2598 < 0x1000000)
  PID: 53758     COMM: zdb
  TID: 53758     NAME: zdb
Call trace:
  /lib/x86_64-linux-gnu/libzpool.so.5(libspl_assertf+0x159) [0x7f9d46ddfe59]
  zdb(+0xe860) [0x55c20fe4b860]
  /lib/x86_64-linux-gnu/libzpool.so.5(space_map_iterate+0x341) [0x7f9d46c601a1]
  zdb(+0x13d07) [0x55c20fe50d07]
  zdb(+0x21a7c) [0x55c20fe5ea7c]
  zdb(+0xc2f8) [0x55c20fe492f8]
  /lib/x86_64-linux-gnu/libc.so.6(+0x29d68) [0x7f9d4634cd68]
  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f9d4634ce25]
  zdb(+0xc8d1) [0x55c20fe498d1]
zdb(+0x14035) [0x55c20fe51035]
/lib/x86_64-linux-gnu/libc.so.6(+0x3fda0) [0x7f9d46362da0]
/lib/x86_64-linux-gnu/libc.so.6(+0x93dcc) [0x7f9d463b6dcc]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x12) [0x7f9d46362d02]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3) [0x7f9d4634b4f0]
/lib/x86_64-linux-gnu/libzpool.so.5(+0x57af4) [0x7f9d46b31af4]
zdb(+0xe860) [0x55c20fe4b860]
/lib/x86_64-linux-gnu/libzpool.so.5(space_map_iterate+0x341) [0x7f9d46c601a1]
zdb(+0x13d07) [0x55c20fe50d07]
zdb(+0x21a7c) [0x55c20fe5ea7c]
zdb(+0xc2f8) [0x55c20fe492f8]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d68) [0x7f9d4634cd68]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f9d4634ce25]
zdb(+0xc8d1) [0x55c20fe498d1]
Aborted

I hope I don't have any corruption!

I have encryption enabled, I regularly create and cull snapshots with Sanoid, and I do NOT have dedup enabled. I haven't run into the bug that OP has. I am hoping perhaps it's a missed case in zdb rather than corruption.

zfs-2.2.7-2
zfs-kmod-2.2.7-2

@StanleyPain

All of my pools have this issue. Including one that I just created a week ago. This is across 3 different computers. WTH is going on here?

@silenius

same here (FreeBSD 14.2-RELEASE):

~/ sudo zdb -vy zroot_srv 
Verifying deleted livelist entries
Verifying metaslab entries
verifying concrete vdev 0, metaslab 269 of 1846 ...ASSERT at /usr/src/sys/contrib/openzfs/cmd/zdb/zdb.c:482:verify_livelist_allocs()
((size) >> (9)) - (0) < 1ULL << (24) (0x11459c0 < 0x1000000)
  PID: 84880     COMM: zdb
  TID: 100856    NAME: 
zsh: abort      sudo zdb -vy zroot_srv

~/ zfs -V  
zfs-2.2.6-FreeBSD_g33174af15
zfs-kmod-2.2.6-FreeBSD_g33174af15

@Neurrone

I don't know if that assert failing is definite confirmation of corruption; it seemed to be for my pool when it crashed last month.

Hoping a ZFS dev can chime in to confirm either way.

@jacobblock

FWIW: I'm on FreeBSD 13.3-RELEASE-p2. Pool has been the same since... FreeBSD 9? maybe.

$ sudo zdb -U /data/zfs/zpool.cache -y tank
Verifying deleted livelist entries
Verifying metaslab entries
verifying concrete vdev 5, metaslab 4 of 931 ...((size) >> (9)) - (0) < 1ULL << (24) (0x14f2c68 < 0x1000000)
ASSERT at cmd/zdb/zdb.c:478:verify_livelist_allocs()Abort trap

Features:

$ zpool get all | grep feature
tank         feature@async_destroy          enabled                        local
tank         feature@empty_bpobj            active                         local
tank         feature@lz4_compress           active                         local
tank         feature@multi_vdev_crash_dump  enabled                        local
tank         feature@spacemap_histogram     active                         local
tank         feature@enabled_txg            active                         local
tank         feature@hole_birth             active                         local
tank         feature@extensible_dataset     active                         local
tank         feature@embedded_data          active                         local
tank         feature@bookmarks              enabled                        local
tank         feature@filesystem_limits      enabled                        local
tank         feature@large_blocks           enabled                        local
tank         feature@large_dnode            enabled                        local
tank         feature@sha512                 enabled                        local
tank         feature@skein                  enabled                        local
tank         feature@edonr                  enabled                        local
tank         feature@userobj_accounting     active                         local
tank         feature@encryption             active                         local
tank         feature@project_quota          active                         local
tank         feature@device_removal         active                         local
tank         feature@obsolete_counts        active                         local
tank         feature@zpool_checkpoint       enabled                        local
tank         feature@spacemap_v2            active                         local
tank         feature@allocation_classes     enabled                        local
tank         feature@resilver_defer         enabled                        local
tank         feature@bookmark_v2            enabled                        local
tank         feature@redaction_bookmarks    enabled                        local
tank         feature@redacted_datasets      enabled                        local
tank         feature@bookmark_written       enabled                        local
tank         feature@log_spacemap           active                         local
tank         feature@livelist               enabled                        local
tank         feature@device_rebuild         enabled                        local
tank         feature@zstd_compress          enabled                        local
tank         feature@draid                  enabled                        local
tank         feature@zilsaxattr             active                         local
tank         feature@head_errlog            active                         local
tank         feature@blake3                 enabled                        local
tank         feature@block_cloning          enabled                        local
tank         feature@vdev_zaps_v2           active                         local
$ zfs -V
zfs-2.2.4-1
zfs-kmod-v2024052400-zfs_e4631d089

@silenius

I've opened https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=284437

@robn
Member

robn commented Jan 29, 2025

Hi to people who have arrived here via @Neurrone's blog post!

Please be aware that zdb -y failing or crashing is not itself an indicator of anything of interest. zdb by itself is not a general-purpose diagnostic tool and isn't particularly useful without enough knowledge of OpenZFS internals to understand its results (including if it crashes!). See zdb(8) for more info.

If your pool is running fine, then there's almost certainly no problem.
