zpool remove causes pool halt #11880

Closed
slavonnet opened this issue Apr 11, 2021 · 7 comments
Labels
Status: Triage Needed (new issue which needs to be triaged) · Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@slavonnet

Version from git:

commit e778b0485baaddd55284b138a517e0a028bd15eb (HEAD -> master, origin/master, origin/HEAD)
Author: Ryan Moeller <ryan@iXsystems.com>
Date:   Wed Apr 7 19:23:57 2021 -0400

    Ratelimit deadman zevents as with delay zevents

    Just as delay zevents can flood the zevent pipe when a vdev becomes
    unresponsive, so do the deadman zevents.

    Ratelimit deadman zevents according to the same tunable as for delay
    zevents.

    Enable deadman tests on FreeBSD and add a test for deadman event
    ratelimiting.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Don Brady <don.brady@delphix.com>
    Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
    Closes #11786

CentOS 8 Stream
kernel 5.11.8-1.el8.elrepo.x86_64 (kernel-ml elrepo)

[root@vm2 ~]# zpool list -v
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
p            16.4T  11.4T  5.01T        -         -    37%    69%  1.24x    ONLINE  -
  raidz1     12.7T  11.3T  1.44T        -         -    37%  88.7%      -  ONLINE
    sdl          -      -      -        -         -      -      -      -  ONLINE
    sdm          -      -      -        -         -      -      -      -  ONLINE
    sdo          -      -      -        -         -      -      -      -  ONLINE
    sdp          -      -      -        -         -      -      -      -  ONLINE
    sdq          -      -      -        -         -      -      -      -  ONLINE
    sds          -      -      -        -         -      -      -      -  ONLINE
    sdt          -      -      -        -         -      -      -      -  ONLINE
special          -      -      -        -         -      -      -      -  -
  mirror      928G  13.0G   915G        -         -     2%  1.39%      -  ONLINE
    sdb          -      -      -        -         -      -      -      -  ONLINE
    sdc          -      -      -        -         -      -      -      -  ONLINE
  mirror      928G  13.0G   915G        -         -     2%  1.39%      -  ONLINE
    sdd          -      -      -        -         -      -      -      -  ONLINE
    sde          -      -      -        -         -      -      -      -  ONLINE
  mirror      928G  13.0G   915G        -         -     2%  1.40%      -  ONLINE
    sdh          -      -      -        -         -      -      -      -  ONLINE
    sdi          -      -      -        -         -      -      -      -  ONLINE
  mirror      928G  13.4G   915G        -         -     2%  1.44%      -  ONLINE
    sdj          -      -      -        -         -      -      -      -  ONLINE
    sdk          -      -      -        -         -      -      -      -  ONLINE
logs             -      -      -        -         -      -      -      -  -
  mirror      238G  2.24M   238G        -         -     0%  0.00%      -  ONLINE
    nvme0n1      -      -      -        -         -      -      -      -  ONLINE
    nvme1n1      -      -      -        -         -      -      -      -  ONLINE
cache            -      -      -        -         -      -      -      -  -
  sdf        3.64T  1.27T  2.37T        -         -     0%  34.8%      -  ONLINE
  sdg        3.64T  1.43T  2.20T        -         -     0%  39.4%      -  ONLINE
  sdn         224G      0   224G        -         -     0%  0.00%      -  ONLINE
  sdr         224G      0   224G        -         -     0%  0.00%      -  ONLINE

[root@vm2 zfs]# lsscsi
[0:0:0:0]    disk    ATA      Samsung SSD 850  3B6Q  /dev/sdb
[0:0:1:0]    disk    ATA      Samsung SSD 850  3B6Q  /dev/sdc
[0:0:2:0]    disk    ATA      Samsung SSD 850  3B6Q  /dev/sdd
[0:0:3:0]    disk    ATA      Samsung SSD 850  3B6Q  /dev/sde
[0:0:4:0]    disk    ATA      Samsung SSD 850  2B6Q  /dev/sdf
[0:0:5:0]    disk    ATA      Samsung SSD 850  2B6Q  /dev/sdg
[0:0:6:0]    disk    ATA      Samsung SSD 850  3B6Q  /dev/sdh
[0:0:7:0]    disk    ATA      Samsung SSD 850  3B6Q  /dev/sdi
[2:0:0:0]    disk    ATA      KINGSTON SV300S3 BBF0  /dev/sda
[9:0:0:0]    disk    ATA      Samsung SSD 850  3B6Q  /dev/sdj
[9:0:1:0]    disk    ATA      Samsung SSD 850  3B6Q  /dev/sdk
[9:0:2:0]    disk    ATA      ST2000DM001-1ER1 CC26  /dev/sdl
[9:0:3:0]    disk    ATA      ST2000DM001-1ER1 CC26  /dev/sdm
[9:0:4:0]    disk    ATA      INTEL SSDSC2CW24 400i  /dev/sdn
[9:0:5:0]    disk    ATA      ST2000DM001-1CH1 CC29  /dev/sdo
[9:0:6:0]    disk    ATA      ST2000DM001-1CH1 CC29  /dev/sdp
[9:0:7:0]    disk    ATA      WDC WD20EZRZ-00Z 0A80  /dev/sdq
[9:0:8:0]    disk    ATA      INTEL SSDSC2CW24 400i  /dev/sdr
[9:0:9:0]    disk    ATA      ST3000DM001-1CH1 CC27  /dev/sds
[9:0:10:0]   disk    ATA      ST3000DM001-1CH1 CC27  /dev/sdt
[9:0:11:0]   enclosu LSI      SAS2X28          0e12  -
[N:0:1:1]    disk    Samsung SSD 950 PRO 256GB__1               /dev/nvme0n1
[N:1:1:1]    disk    Samsung SSD 950 PRO 256GB__1               /dev/nvme1n1

/> ls
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 3]
  | | o- zvoLinux ................................................................ [/dev/p/zvolLinux (20.0TiB) write-thru activated]
  | | | o- alua ................................................................................................... [ALUA Groups: 1]
  | | |   o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  | | o- zvoWinServers ...................................................... [/dev/p/zvolWinServers (20.0TiB) write-thru activated]
  | | | o- alua ................................................................................................... [ALUA Groups: 1]
  | | |   o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  | | o- zvolVDI ................................................................... [/dev/p/zvolVDI (20.0TiB) write-thru activated]
  | |   o- alua ................................................................................................... [ALUA Groups: 1]
  | |     o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 1]
  | o- iqn.2003-01.org.linux-iscsi.vm2.x8664:sn.b07943625401 ............................................................. [TPGs: 1]
  |   o- tpg1 ............................................................................................... [no-gen-acls, no-auth]
  |     o- acls .......................................................................................................... [ACLs: 4]
  |     | o- iqn.1991-05.com.microsoft:vhost10.dev.obs.group ...................................................... [Mapped LUNs: 3]
  |     | | o- mapped_lun0 ............................................................................... [lun0 block/zvolVDI (rw)]
  |     | | o- mapped_lun1 ......................................................................... [lun1 block/zvoWinServers (rw)]
  |     | | o- mapped_lun2 .............................................................................. [lun2 block/zvoLinux (rw)]
  |     | o- iqn.1991-05.com.microsoft:vhost11.dev.obs.group ...................................................... [Mapped LUNs: 3]
  |     | | o- mapped_lun0 ............................................................................... [lun0 block/zvolVDI (rw)]
  |     | | o- mapped_lun1 ......................................................................... [lun1 block/zvoWinServers (rw)]
  |     | | o- mapped_lun2 .............................................................................. [lun2 block/zvoLinux (rw)]
  |     | o- iqn.1991-05.com.microsoft:vhost8.dev.obs.group ....................................................... [Mapped LUNs: 3]
  |     | | o- mapped_lun0 ............................................................................... [lun0 block/zvolVDI (rw)]
  |     | | o- mapped_lun1 ......................................................................... [lun1 block/zvoWinServers (rw)]
  |     | | o- mapped_lun2 .............................................................................. [lun2 block/zvoLinux (rw)]
  |     | o- iqn.1991-05.com.microsoft:vhost9.dev.obs.group ....................................................... [Mapped LUNs: 3]
  |     |   o- mapped_lun0 ............................................................................... [lun0 block/zvolVDI (rw)]
  |     |   o- mapped_lun1 ......................................................................... [lun1 block/zvoWinServers (rw)]
  |     |   o- mapped_lun2 .............................................................................. [lun2 block/zvoLinux (rw)]
  |     o- luns .......................................................................................................... [LUNs: 3]
  |     | o- lun0 .............................................................. [block/zvolVDI (/dev/p/zvolVDI) (default_tg_pt_gp)]
  |     | o- lun1 ................................................. [block/zvoWinServers (/dev/p/zvolWinServers) (default_tg_pt_gp)]
  |     | o- lun2 ........................................................... [block/zvoLinux (/dev/p/zvolLinux) (default_tg_pt_gp)]
  |     o- portals .................................................................................................... [Portals: 1]
  |       o- 0.0.0.0:3260 ..................................................................................................... [OK]
  o- loopback ......................................................................................................... [Targets: 0]
  o- srpt ............................................................................................................. [Targets: 0]

Trying to remove the cache disks caused a system halt, and all the VMs went down because of I/O errors:

[root@vm2 ~]# zpool remove p sdf sdg

dmesg:

[235095.570660]  sdn: sdn1 sdn9
[235095.586773]  sdn: sdn1 sdn9
[235095.767672]  sdr: sdr1 sdr9
[235095.768943]  sdr: sdr1 sdr9
[235302.732771] INFO: task l2arc_feed:687 blocked for more than 122 seconds.
[235302.735416]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[235302.738209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[235302.740936] task:l2arc_feed      state:D stack:    0 pid:  687 ppid:     2 flags:0x00004004
[235302.740941] Call Trace:
[235302.740945]  __schedule+0x39b/0x820
[235302.740955]  schedule+0x3c/0xa0
[235302.740958]  schedule_preempt_disabled+0xa/0x10
[235302.740961]  __mutex_lock.isra.11+0x2de/0x4e0
[235302.740964]  ? __set_current_blocked+0x36/0x50
[235302.740969]  l2arc_feed_thread+0xe1/0x5b0 [zfs]
[235302.741049]  ? l2arc_remove_vdev+0x210/0x210 [zfs]
[235302.741108]  ? __thread_exit+0x20/0x20 [spl]
[235302.741117]  thread_generic_wrapper+0x6f/0x80 [spl]
[235302.741125]  kthread+0x116/0x130
[235302.741131]  ? kthread_park+0x80/0x80
[235302.741134]  ret_from_fork+0x1f/0x30
[235302.741165] INFO: task agents:3722 blocked for more than 122 seconds.
[235302.743756]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[235302.746479] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[235302.749419] task:agents          state:D stack:    0 pid: 3722 ppid:     1 flags:0x00000000
[235302.749423] Call Trace:
[235302.749425]  __schedule+0x39b/0x820
[235302.749429]  schedule+0x3c/0xa0
[235302.749431]  schedule_preempt_disabled+0xa/0x10
[235302.749433]  __mutex_lock.isra.11+0x2de/0x4e0
[235302.749436]  spa_open_common+0x5f/0x4c0 [zfs]
[235302.749531]  spa_get_stats+0x4d/0x530 [zfs]
[235302.749623]  ? __kmalloc_node+0x15d/0x320
[235302.749628]  ? spl_kmem_alloc+0xb6/0x100 [spl]
[235302.749635]  zfs_ioc_pool_stats+0x34/0x80 [zfs]
[235302.749731]  zfsdev_ioctl_common+0x692/0x960 [zfs]
[235302.749835]  zfsdev_ioctl+0x4f/0xe0 [zfs]
[235302.749925]  __x64_sys_ioctl+0x84/0xc0
[235302.749932]  do_syscall_64+0x33/0x40
[235302.749936]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[235302.749939] RIP: 0033:0x7fafa9ecb62b
[235302.749941] RSP: 002b:00007fafa7b265f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[235302.749944] RAX: ffffffffffffffda RBX: 00007faf98049690 RCX: 00007fafa9ecb62b
[235302.749945] RDX: 00007fafa7b26620 RSI: 0000000000005a05 RDI: 0000000000000009
[235302.749947] RBP: 00007fafa7b29c10 R08: 00007faf98071680 R09: 00007faf98000080
[235302.749948] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fafa7b26620
[235302.749950] R13: 000055f2d39dc1f0 R14: 0000000000000000 R15: 00007fafa7b29c34
[235425.605345] INFO: task l2arc_feed:687 blocked for more than 245 seconds.
[235425.608465]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[235425.611795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[235425.615410] task:l2arc_feed      state:D stack:    0 pid:  687 ppid:     2 flags:0x00004004
[235425.615416] Call Trace:
[235425.615420]  __schedule+0x39b/0x820
[235425.615431]  schedule+0x3c/0xa0
[235425.615433]  schedule_preempt_disabled+0xa/0x10
[235425.615436]  __mutex_lock.isra.11+0x2de/0x4e0
[235425.615439]  ? __set_current_blocked+0x36/0x50
[235425.615444]  l2arc_feed_thread+0xe1/0x5b0 [zfs]
[235425.615526]  ? l2arc_remove_vdev+0x210/0x210 [zfs]
[235425.615585]  ? __thread_exit+0x20/0x20 [spl]
[235425.615593]  thread_generic_wrapper+0x6f/0x80 [spl]
[235425.615601]  kthread+0x116/0x130
[235425.615607]  ? kthread_park+0x80/0x80
[235425.615609]  ret_from_fork+0x1f/0x30
[235425.615640] INFO: task agents:3722 blocked for more than 245 seconds.
[235425.619452]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[235425.623471] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[235425.627784] task:agents          state:D stack:    0 pid: 3722 ppid:     1 flags:0x00000000
[235425.627788] Call Trace:
[235425.627790]  __schedule+0x39b/0x820
[235425.627793]  schedule+0x3c/0xa0
[235425.627795]  schedule_preempt_disabled+0xa/0x10
[235425.627798]  __mutex_lock.isra.11+0x2de/0x4e0
[235425.627801]  spa_open_common+0x5f/0x4c0 [zfs]
[235425.627895]  spa_get_stats+0x4d/0x530 [zfs]
[235425.627987]  ? __kmalloc_node+0x15d/0x320
[235425.627992]  ? spl_kmem_alloc+0xb6/0x100 [spl]
[235425.627999]  zfs_ioc_pool_stats+0x34/0x80 [zfs]
[235425.628095]  zfsdev_ioctl_common+0x692/0x960 [zfs]
[235425.628192]  zfsdev_ioctl+0x4f/0xe0 [zfs]
[235425.628282]  __x64_sys_ioctl+0x84/0xc0
[235425.628288]  do_syscall_64+0x33/0x40
[235425.628292]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[235425.628295] RIP: 0033:0x7fafa9ecb62b
[235425.628298] RSP: 002b:00007fafa7b265f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[235425.628300] RAX: ffffffffffffffda RBX: 00007faf98049690 RCX: 00007fafa9ecb62b
[235425.628302] RDX: 00007fafa7b26620 RSI: 0000000000005a05 RDI: 0000000000000009
[235425.628303] RBP: 00007fafa7b29c10 R08: 00007faf98071680 R09: 00007faf98000080
[235425.628304] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fafa7b26620
[235425.628305] R13: 000055f2d39dc1f0 R14: 0000000000000000 R15: 00007fafa7b29c34
[235671.350481] INFO: task l2arc_feed:687 blocked for more than 122 seconds.
[235671.355031]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[235671.359802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[235671.364842] task:l2arc_feed      state:D stack:    0 pid:  687 ppid:     2 flags:0x00004004
[235671.364847] Call Trace:
[235671.364851]  __schedule+0x39b/0x820
[235671.364861]  schedule+0x3c/0xa0
[235671.364863]  schedule_preempt_disabled+0xa/0x10
[235671.364866]  __mutex_lock.isra.11+0x2de/0x4e0
[235671.364870]  ? __set_current_blocked+0x36/0x50
[235671.364874]  l2arc_feed_thread+0xe1/0x5b0 [zfs]
[235671.364955]  ? l2arc_remove_vdev+0x210/0x210 [zfs]
[235671.365013]  ? __thread_exit+0x20/0x20 [spl]
[235671.365023]  thread_generic_wrapper+0x6f/0x80 [spl]
[235671.365031]  kthread+0x116/0x130
[235671.365037]  ? kthread_park+0x80/0x80
[235671.365039]  ret_from_fork+0x1f/0x30
[235794.223048] INFO: task l2arc_feed:687 blocked for more than 245 seconds.
[235794.228309]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[235794.233755] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[235794.239468] task:l2arc_feed      state:D stack:    0 pid:  687 ppid:     2 flags:0x00004004
[235794.239472] Call Trace:
[235794.239477]  __schedule+0x39b/0x820
[235794.239487]  schedule+0x3c/0xa0
[235794.239489]  schedule_preempt_disabled+0xa/0x10
[235794.239493]  __mutex_lock.isra.11+0x2de/0x4e0
[235794.239496]  ? __set_current_blocked+0x36/0x50
[235794.239500]  l2arc_feed_thread+0xe1/0x5b0 [zfs]
[235794.239582]  ? l2arc_remove_vdev+0x210/0x210 [zfs]
[235794.239641]  ? __thread_exit+0x20/0x20 [spl]
[235794.239650]  thread_generic_wrapper+0x6f/0x80 [spl]
[235794.239658]  kthread+0x116/0x130
[235794.239663]  ? kthread_park+0x80/0x80
[235794.239666]  ret_from_fork+0x1f/0x30
[235917.095768] INFO: task l2arc_feed:687 blocked for more than 368 seconds.
[235917.101703]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[235917.107860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[235917.114294] task:l2arc_feed      state:D stack:    0 pid:  687 ppid:     2 flags:0x00004004
[235917.114299] Call Trace:
[235917.114303]  __schedule+0x39b/0x820
[235917.114313]  schedule+0x3c/0xa0
[235917.114315]  schedule_preempt_disabled+0xa/0x10
[235917.114319]  __mutex_lock.isra.11+0x2de/0x4e0
[235917.114322]  ? __set_current_blocked+0x36/0x50
[235917.114326]  l2arc_feed_thread+0xe1/0x5b0 [zfs]
[235917.114408]  ? l2arc_remove_vdev+0x210/0x210 [zfs]
[235917.114468]  ? __thread_exit+0x20/0x20 [spl]
[235917.114478]  thread_generic_wrapper+0x6f/0x80 [spl]
[235917.114486]  kthread+0x116/0x130
[235917.114491]  ? kthread_park+0x80/0x80
[235917.114494]  ret_from_fork+0x1f/0x30
[236000.546648] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236000.553270] iSCSI Login negotiation failed.
[236020.513439] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236020.520286] iSCSI Login negotiation failed.
[236039.968241] INFO: task zvol:680 blocked for more than 122 seconds.
[236039.975374]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[236039.982752] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[236039.990426] task:zvol            state:D stack:    0 pid:  680 ppid:     2 flags:0x00004000
[236039.990429] Call Trace:
[236039.990431]  __schedule+0x39b/0x820
[236039.990434]  ? __kmalloc+0x3f5/0x450
[236039.990438]  schedule+0x3c/0xa0
[236039.990441]  cv_wait_common+0xf8/0x130 [spl]
[236039.990447]  ? wait_woken+0x80/0x80
[236039.990452]  spa_config_enter+0xed/0x100 [zfs]
[236039.990548]  zfs_blkptr_verify+0x3dc/0x440 [zfs]
[236039.990645]  zio_read+0x42/0xc0 [zfs]
[236039.990740]  ? arc_read+0x12e0/0x12e0 [zfs]
[236039.990798]  arc_read+0x96c/0x12e0 [zfs]
[236039.990857]  ? dbuf_rele_and_unlock+0x660/0x660 [zfs]
[236039.990918]  ? _cond_resched+0x15/0x30
[236039.990921]  dbuf_read_impl.constprop.35+0x29f/0x6b0 [zfs]
[236039.990985]  dbuf_read+0x1b2/0x530 [zfs]
[236039.991047]  dmu_tx_check_ioerr+0x61/0xc0 [zfs]
[236039.991117]  dmu_tx_count_write+0xb9/0x170 [zfs]
[236039.991186]  dmu_tx_hold_write_by_dnode+0x35/0x50 [zfs]
[236039.991261]  zvol_write.isra.21+0x16d/0x480 [zfs]
[236039.991351]  zvol_write_task+0x15/0x30 [zfs]
[236039.991439]  taskq_thread+0x2d5/0x490 [spl]
[236039.991448]  ? wake_up_q+0xa0/0xa0
[236039.991453]  ? zvol_write.isra.21+0x480/0x480 [zfs]
[236039.991542]  ? task_done+0x90/0x90 [spl]
[236039.991550]  kthread+0x116/0x130
[236039.991553]  ? kthread_park+0x80/0x80
[236039.991555]  ret_from_fork+0x1f/0x30
[236039.991559] INFO: task l2arc_feed:687 blocked for more than 491 seconds.
[236039.999436]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[236040.007545] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[236040.015956] task:l2arc_feed      state:D stack:    0 pid:  687 ppid:     2 flags:0x00004004
[236040.015959] Call Trace:
[236040.015960]  __schedule+0x39b/0x820
[236040.015963]  schedule+0x3c/0xa0
[236040.015965]  schedule_preempt_disabled+0xa/0x10
[236040.015968]  __mutex_lock.isra.11+0x2de/0x4e0
[236040.015970]  ? __set_current_blocked+0x36/0x50
[236040.015972]  l2arc_feed_thread+0xe1/0x5b0 [zfs]
[236040.016032]  ? l2arc_remove_vdev+0x210/0x210 [zfs]
[236040.016091]  ? __thread_exit+0x20/0x20 [spl]
[236040.016099]  thread_generic_wrapper+0x6f/0x80 [spl]
[236040.016106]  kthread+0x116/0x130
[236040.016109]  ? kthread_park+0x80/0x80
[236040.016112]  ret_from_fork+0x1f/0x30
[236040.016121] INFO: task mmp:2917 blocked for more than 122 seconds.
[236040.024680]       Tainted: P           O      5.11.8-1.el8.elrepo.x86_64 #1
[236040.033460] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[236040.042534] task:mmp             state:D stack:    0 pid: 2917 ppid:     2 flags:0x00004000
[236040.042537] Call Trace:
[236040.042539]  __schedule+0x39b/0x820
[236040.042542]  schedule+0x3c/0xa0
[236040.042544]  cv_wait_common+0xf8/0x130 [spl]
[236040.042550]  ? wait_woken+0x80/0x80
[236040.042553]  spa_config_enter+0xed/0x100 [zfs]
[236040.042648]  vdev_count_leaves+0x20/0x50 [zfs]
[236040.042744]  mmp_thread+0x359/0x710 [zfs]
[236040.042830]  ? mmp_write_uberblock+0x730/0x730 [zfs]
[236040.042916]  ? __thread_exit+0x20/0x20 [spl]
[236040.042924]  thread_generic_wrapper+0x6f/0x80 [spl]
[236040.042932]  kthread+0x116/0x130
[236040.042935]  ? kthread_park+0x80/0x80
[236040.042937]  ret_from_fork+0x1f/0x30
[236040.992229] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236041.001443] iSCSI Login negotiation failed.
[236060.959020] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236060.968222] iSCSI Login negotiation failed.
[236080.925810] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236080.934751] iSCSI Login negotiation failed.
[236100.892602] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236100.901363] iSCSI Login negotiation failed.
[236116.251634] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236116.260173] iSCSI Login negotiation failed.
[236131.610666] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236131.619013] tx_data returned -104, expecting 48.
[236131.627224] iSCSI Login negotiation failed.
[236146.969699] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236146.977792] tx_data returned -104, expecting 48.
[236146.985737] iSCSI Login negotiation failed.
[236162.328822] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236162.336662] tx_data returned -104, expecting 48.
[236162.344344] iSCSI Login negotiation failed.
[236177.687838] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236177.695372] tx_data returned -104, expecting 48.
[236177.702763] iSCSI Login negotiation failed.
[236193.047045] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236193.054310] tx_data returned -104, expecting 48.
[236193.061423] iSCSI Login negotiation failed.
[236208.406071] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236208.413246] tx_data returned -104, expecting 48.
[236208.420413] iSCSI Login negotiation failed.
[236223.765103] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236223.772198] tx_data returned -104, expecting 48.
[236223.779150] iSCSI Login negotiation failed.
[236239.124135] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236239.130965] tx_data returned -104, expecting 48.
[236239.137637] iSCSI Login negotiation failed.
[236254.483210] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236254.489764] tx_data returned -104, expecting 48.
[236254.496174] iSCSI Login negotiation failed.
[236269.842450] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236269.848799] tx_data returned -104, expecting 48.
[236269.855099] iSCSI Login negotiation failed.
[236285.201475] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236285.207839] tx_data returned -104, expecting 48.
[236285.214170] iSCSI Login negotiation failed.
[236300.560515] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236300.566908] tx_data returned -104, expecting 48.
[236300.573254] iSCSI Login negotiation failed.
[236315.919540] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236315.925948] tx_data returned -104, expecting 48.
[236315.932296] iSCSI Login negotiation failed.
[236331.278580] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236331.284997] tx_data returned -104, expecting 48.
[236331.291346] iSCSI Login negotiation failed.
[236346.637847] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236346.644258] tx_data returned -104, expecting 48.
[236346.650618] iSCSI Login negotiation failed.
[236361.996879] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236362.003287] tx_data returned -104, expecting 48.
[236362.009647] iSCSI Login negotiation failed.
[236377.355912] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236377.362353] tx_data returned -104, expecting 48.
[236377.368715] iSCSI Login negotiation failed.
[236392.714951] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236392.721373] tx_data returned -104, expecting 48.
[236392.727718] iSCSI Login negotiation failed.
[236408.073976] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236408.080398] tx_data returned -104, expecting 48.
[236408.086752] iSCSI Login negotiation failed.
[236423.433008] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236423.439415] tx_data returned -104, expecting 48.
[236423.445760] iSCSI Login negotiation failed.
[236438.792041] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236438.798457] tx_data returned -104, expecting 48.
[236438.804809] iSCSI Login negotiation failed.
[236454.151316] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236454.157740] tx_data returned -104, expecting 48.
[236454.164110] iSCSI Login negotiation failed.
[236469.510349] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236469.516785] tx_data returned -104, expecting 48.
[236469.523154] iSCSI Login negotiation failed.
[236484.869381] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236484.875799] tx_data returned -104, expecting 48.
[236484.882170] iSCSI Login negotiation failed.
[236500.228413] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236500.234847] tx_data returned -104, expecting 48.
[236500.241223] iSCSI Login negotiation failed.
[236515.587446] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236515.593891] tx_data returned -104, expecting 48.
[236515.600264] iSCSI Login negotiation failed.
[236530.946478] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236530.952909] tx_data returned -104, expecting 48.
[236530.959285] iSCSI Login negotiation failed.
[236546.305650] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236546.312099] tx_data returned -104, expecting 48.
[236546.318477] iSCSI Login negotiation failed.
[236561.664784] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236561.671214] tx_data returned -104, expecting 48.
[236561.677579] iSCSI Login negotiation failed.
[236577.023705] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236577.030127] tx_data returned -104, expecting 48.
[236577.036493] iSCSI Login negotiation failed.
[236592.382849] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236592.389277] tx_data returned -104, expecting 48.
[236592.395634] iSCSI Login negotiation failed.
[236607.741889] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236607.748314] tx_data returned -104, expecting 48.
[236607.754680] iSCSI Login negotiation failed.
[236623.100915] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236623.107353] tx_data returned -104, expecting 48.
[236623.113730] iSCSI Login negotiation failed.
[236638.459987] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236638.466403] tx_data returned -104, expecting 48.
[236638.472783] iSCSI Login negotiation failed.
[236653.819230] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236653.825668] tx_data returned -104, expecting 48.
[236653.832039] iSCSI Login negotiation failed.
[236669.178255] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236669.184695] tx_data returned -104, expecting 48.
[236669.191064] iSCSI Login negotiation failed.
[236684.537285] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236684.543725] tx_data returned -104, expecting 48.
[236684.550093] iSCSI Login negotiation failed.
[236699.896317] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236699.902736] tx_data returned -104, expecting 48.
[236699.909100] iSCSI Login negotiation failed.
[236715.255349] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236715.261769] tx_data returned -104, expecting 48.
[236715.268128] iSCSI Login negotiation failed.
[236730.614623] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236730.621041] tx_data returned -104, expecting 48.
[236730.627404] iSCSI Login negotiation failed.
[236745.973654] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[236745.980066] tx_data returned -104, expecting 48.
[236745.986421] iSCSI Login negotiation failed.
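The traces show l2arc_feed and, later, zvol and mmp stuck in D state waiting on a mutex while l2arc_remove_vdev appears on the stack, so the cache-device removal path seems to be holding a lock the rest of the pool needs. If this reproduces, here is a hedged diagnostic sketch (standard Linux/ZFS interfaces, nothing specific to this bug; it assumes a root shell still responds and sysrq is enabled):

echo w > /proc/sysrq-trigger           # dump all blocked (D-state) task stacks to dmesg
cat /proc/687/stack                    # kernel stack of a stuck thread (PID from the hung-task report)
cat /proc/spl/kstat/zfs/dbgmsg         # ZFS internal debug log, if zfs_dbgmsg_enable=1
dmesg > /root/hang-$(date +%s).log     # preserve the reports for the issue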

slavonnet added the Status: Triage Needed and Type: Defect labels on Apr 11, 2021
@slavonnet (Author)

Tried setting these SSDs offline and removing one (after a reboot).

This is the perf top output during the hang:

Samples: 173K of event 'cycles', 4000 Hz, Event count (approx.): 35526978272 lost: 0/0 drop: 0/0
Overhead  Shared Object             Symbol
  22,30%  [kernel]                  [k] osq_lock
  11,83%  [kernel]                  [k] vdev_get_child_stat_ex.constprop.0
   6,14%  [kernel]                  [k] memcpy
   5,94%  [kernel]                  [k] mutex_lock
   5,83%  [kernel]                  [k] _raw_spin_lock
   3,99%  [kernel]                  [k] __mutex_lock.isra.11
   3,76%  [kernel]                  [k] mutex_spin_on_owner
   3,21%  [kernel]                  [k] mutex_unlock
   1,75%  [kernel]                  [k] native_write_msr
   1,68%  [kernel]                  [k] __cv_broadcast
   1,56%  [kernel]                  [k] spa_config_enter
   1,51%  [kernel]                  [k] memset
   1,31%  [kernel]                  [k] cv_wait_common
   1,17%  [kernel]                  [k] osq_unlock
   1,13%  [kernel]                  [k] _raw_spin_lock_irqsave
   1,04%  [kernel]                  [k] spa_config_exit
   1,01%  [kernel]                  [k] vdev_get_stats_ex_impl
   0,85%  [kernel]                  [k] __schedule
   0,61%  [kernel]                  [k] __bitmap_weight
   0,60%  [kernel]                  [k] menu_select
   0,49%  [kernel]                  [k] _raw_spin_lock_irq
   0,45%  [kernel]                  [k] update_rq_clock
   0,43%  [kernel]                  [k] do_idle
   0,42%  [kernel]                  [k] txg_dispatch_callbacks
   0,40%  [kernel]                  [k] __switch_to
   0,36%  [kernel]                  [k] read_tsc
   0,35%  [kernel]                  [k] __wake_up_common
   0,32%  [kernel]                  [k] recalc_sigpending
   0,32%  [kernel]                  [k] native_sched_clock
   0,31%  [kernel]                  [k] cpuacct_charge
   0,29%  [kernel]                  [k] txg_quiesce
   0,28%  [kernel]                  [k] dequeue_entity
   0,28%  [kernel]                  [k] _cond_resched
   0,27%  [kernel]                  [k] ttwu_queue_wakelist
   0,27%  [kernel]                  [k] resched_curr
   0,27%  [kernel]                  [k] rcu_all_qs
   0,27%  [kernel]                  [k] finish_task_switch
   0,25%  [kernel]                  [k] _raw_spin_unlock_irqrestore
   0,25%  [kernel]                  [k] update_load_avg
   0,24%  [kernel]                  [k] prepare_to_wait_exclusive
   0,23%  [kernel]                  [k] available_idle_cpu
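The profile is dominated by osq_lock, mutex_spin_on_owner, and vdev_get_child_stat_ex, i.e. CPUs spinning on a mutex around vdev stat collection rather than doing I/O, which matches the spa_config_enter waits in the stack traces above. A sketch for attributing those cycles to their call paths, using only stock perf (not specific to this bug):

perf record -a -g -- sleep 10          # 10 s of system-wide samples with call graphs
perf report --stdio | head -50         # see which callers funnel into osq_lock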

@slavonnet (Author)

[root@vm2 ~]# zpool status
  pool: p
 state: SUSPENDED
status: The pool is suspended because multihost writes failed or were delayed;
        another system could import the pool undetected.
action: Make sure the pool's devices are connected, then reboot your system and
        import the pool.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-MM
config:

        NAME         STATE     READ WRITE CKSUM
        p            ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            sdl      ONLINE       0     0     0
            sdm      ONLINE       0     0     0
            sdo      ONLINE       0     0     0
            sdp      ONLINE       0     0     0
            sdq      ONLINE       0     0     0
            sds      ONLINE       0     0     0
            sdt      ONLINE       0     0     0
        special
          mirror-2   ONLINE       0     0     0
            sdb      ONLINE       0     0     0
            sdc      ONLINE       0     0     0
          mirror-3   ONLINE       0     0     0
            sdd      ONLINE       0     0     0
            sde      ONLINE       0     0     0
          mirror-4   ONLINE       0     0     0
            sdh      ONLINE       0     0     0
            sdi      ONLINE       0     0     0
          mirror-5   ONLINE       0     0     0
            sdj      ONLINE       0     0     0
            sdk      ONLINE       0     0     0
        logs
          mirror-1   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0
        cache
          sdg        OFFLINE      0     0     0
          sdn        ONLINE       0     0     0
          sdr        ONLINE       0     0     0
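Per ZFS-8000-MM the safe path is a reboot and re-import. If the devices are actually healthy and no other host can have imported the pool in the meantime, zpool clear can be tried first; it attempts to resume a suspended pool in place. A hedged sketch:

zpool clear p                          # attempt to resume the suspended pool
zpool status p                         # confirm it left the SUSPENDED state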

@slavonnet (Author)

[root@vm2 ~]# zpool get all
NAME  PROPERTY                       VALUE                          SOURCE
p     size                           16.4T                          -
p     capacity                       69%                            -
p     altroot                        -                              default
p     health                         SUSPENDED                      -
p     guid                           14578718927571756337           -
p     version                        -                              default
p     bootfs                         -                              default
p     delegation                     on                             default
p     autoreplace                    on                             local
p     cachefile                      -                              default
p     failmode                       wait                           default
p     listsnapshots                  off                            default
p     autoexpand                     off                            default
p     dedupratio                     1.24x                          -
p     free                           5.01T                          -
p     allocated                      11.4T                          -
p     readonly                       off                            -
p     ashift                         13                             local
p     comment                        -                              default
p     expandsize                     -                              -
p     freeing                        0                              -
p     fragmentation                  37%                            -
p     leaked                         0                              -
p     multihost                      on                             local
p     checkpoint                     -                              -
p     load_guid                      8467617560320844006            -
p     autotrim                       on                             local
p     compatibility                  off                            default
p     feature@async_destroy          enabled                        local
p     feature@empty_bpobj            active                         local
p     feature@lz4_compress           active                         local
p     feature@multi_vdev_crash_dump  enabled                        local
p     feature@spacemap_histogram     active                         local
p     feature@enabled_txg            active                         local
p     feature@hole_birth             active                         local
p     feature@extensible_dataset     active                         local
p     feature@embedded_data          active                         local
p     feature@bookmarks              enabled                        local
p     feature@filesystem_limits      enabled                        local
p     feature@large_blocks           enabled                        local
p     feature@large_dnode            enabled                        local
p     feature@sha512                 enabled                        local
p     feature@skein                  enabled                        local
p     feature@edonr                  active                         local
p     feature@userobj_accounting     active                         local
p     feature@encryption             enabled                        local
p     feature@project_quota          active                         local
p     feature@device_removal         enabled                        local
p     feature@obsolete_counts        enabled                        local
p     feature@zpool_checkpoint       enabled                        local
p     feature@spacemap_v2            active                         local
p     feature@allocation_classes     active                         local
p     feature@resilver_defer         enabled                        local
p     feature@bookmark_v2            enabled                        local
p     feature@redaction_bookmarks    enabled                        local
p     feature@redacted_datasets      enabled                        local
p     feature@bookmark_written       enabled                        local
p     feature@log_spacemap           active                         local
p     feature@livelist               enabled                        local
p     feature@device_rebuild         enabled                        local
p     feature@zstd_compress          active                         local
p     feature@draid                  enabled                        local
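Of all these properties, the ones that interact with the hang are multihost=on (MMP writes are what suspended the pool once the config lock stalled), autotrim=on, and failmode=wait. A quick check of just those three, using standard zpool get syntax:

zpool get multihost,autotrim,failmode p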

@slavonnet (Author)

Again the system halted and required a hard reboot.

Then I tried:

[root@vm2 ~]# zpool set multihost=off p
[root@vm2 ~]# zpool set autotrim=off p
[root@vm2 ~]# zpool set autoreplace=off p
[root@vm2 ~]# zpool remove p sdg

The SSD was removed without any problems.
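For anyone hitting the same hang, a consolidated sketch of the workaround above; re-enabling the properties afterwards is an assumption on my part, not something confirmed in this thread:

zpool set multihost=off p              # stop MMP writes contending for the config lock
zpool set autotrim=off p
zpool set autoreplace=off p
zpool remove p sdg                     # cache-device removal now completes
zpool set autoreplace=on p
zpool set autotrim=on p
zpool set multihost=on p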

@AmkG

AmkG commented Apr 14, 2021

This seems to be a common issue when removing cache vdevs:

There may be older issues buried even deeper in the issue list...

@h1z1

h1z1 commented May 2, 2021

failmode wait default

Is that not doing what you expect?
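For context: failmode=wait (the default, quoted above) blocks all I/O to a suspended pool until it recovers, which is exactly the hang the iSCSI initiators saw; failmode=continue instead returns EIO on new write I/O while still allowing reads from healthy devices. Whether failed rather than blocked I/O is preferable for the guests is a judgment call; a sketch per zpoolprops(8):

zpool set failmode=continue p          # fail new writes with EIO instead of blocking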

@devZer0

devZer0 commented Oct 26, 2021

@slavonnet, does the fix work for you? Can you close this issue? Thank you!
