VERIFY3(sa.sa_magic == SA_MAGIC) failed - ZFS 2.1.0/2.1.3, CentOS 8.5 #13144

Closed
doma2203 opened this issue Feb 23, 2022 · 14 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@doma2203

doma2203 commented Feb 23, 2022

Type Version/Name
Distribution Name CentOS
Distribution Version 8.5.2111
Kernel Version 4.18.0-348.2.1.el8_5
Architecture x86_64
OpenZFS Version zfs-2.1.0 / zfs-2.1.2-27_g1009e609 (both compiled from source)
Additional information This problem occurs interchangeably with #13143.

Describe the problem you're observing

We have an additional issue during heavy metadata-intensive I/O on the same pool as in #13143 (single pool, single dataset, LUN from an all-flash NVMe disk array):

VERIFY3(sa.sa_magic == SA_MAGIC) failed (8 == 3100762)
PANIC at zfs_quota.c:89:zpl_get_file_info()

which causes the ZFS threads to hang. Consequently, the ZFS pool does not perform any read/write operations and the system load increases. This pool acts as the Lustre MDT, so the mdt threads also hang and the whole cluster becomes unresponsive. The ZFS pool works again after a reboot of the node, but with each successive reboot the read/write problems appear at an earlier stage, until even the simplest tar crashes the whole filesystem.

Up to this point we tried to:

  • run zpool scrub on the pool - the scrub does not detect any errors
  • mount the pool natively using zfs mount - the pool has no problem accessing the ZFS data and the ZFS quotas seem to be set correctly
  • update ZFS from 2.1.0 to 2.1.3 (from the staging branch) - no effect
  • change the dnodesize and xattr properties from their defaults (legacy, on) to dnodesize=auto and xattr=sa - no effect
  • create a new zfs pool with the proper properties (as above) and rewrite the existing pool using zfs send | zfs receive (see the sketch below) - this helped, but only for the first 10 minutes of heavy load on the pool; after that the problem started to reappear
  • disable all additional parameters applied to the ZFS module - no effect
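
A minimal sketch of the send/receive migration mentioned above (pool, dataset, snapshot and device names are illustrative, not the exact ones we used; the -o overrides are just one way to apply the changed properties to the received copy, they could equally be set with zfs set afterwards):

# create the replacement pool (device path is an example)
zpool create mdt0-new /dev/disk/by-id/nvme-mdt-lun

# snapshot the source dataset and replicate it, applying the
# changed properties on the receive side
zfs snapshot mdt0-bkp/mdt0@migrate
zfs send -R mdt0-bkp/mdt0@migrate | \
    zfs receive -o dnodesize=auto -o xattr=sa mdt0-new/mdt0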

Describe how to reproduce the problem

The problem is originally triggered by ADF (a quantum chemistry HPC code) on the Lustre MDT with ZFS. It can also be recreated by running tar on the kernel sources in a loop (see the sketch below).
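
A rough sketch of such a loop, run on a Lustre client (the mount point and tarball path are illustrative):

#!/bin/sh
# Repeatedly unpack and delete a kernel source tree on the Lustre
# filesystem; this generates a metadata-heavy create/unlink load on the MDT.
cd /lustre/ascratch/stress || exit 1
while true; do
    tar xf /tmp/linux-5.15.tar
    rm -rf linux-5.15
done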

Include any warning/errors/backtraces from the system logs

Feb 16 21:51:53 ascratch-mds01 kernel: VERIFY3(sa.sa_magic == SA_MAGIC) failed (8 == 3100762)
Feb 16 21:51:53 ascratch-mds01 kernel: PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 16 21:51:53 ascratch-mds01 kernel: Showing stack for process 30151
Feb 16 21:51:53 ascratch-mds01 kernel: CPU: 8 PID: 30151 Comm: mdt00_096 Tainted: P          IOE    --------- -  - 4.18.0-348.2.1.el8_5.x86_64 #1
Feb 16 21:51:53 ascratch-mds01 kernel: Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 7.99 03/11/2021
Feb 16 21:51:53 ascratch-mds01 kernel: Call Trace:
Feb 16 21:51:53 ascratch-mds01 kernel: dump_stack+0x5c/0x80
Feb 16 21:51:53 ascratch-mds01 kernel: spl_panic+0xd3/0xfb [spl]
Feb 16 21:51:53 ascratch-mds01 kernel: ? sg_init_table+0x11/0x30
Feb 16 21:51:54 ascratch-mds01 kernel: ? __sg_alloc_table+0x6e/0x170
Feb 16 21:51:54 ascratch-mds01 kernel: ? sg_alloc_table+0x1f/0x50
Feb 16 21:51:54 ascratch-mds01 kernel: ? sg_init_one+0x80/0x80
Feb 16 21:51:54 ascratch-mds01 kernel: ? _cond_resched+0x15/0x30
Feb 16 21:51:54 ascratch-mds01 kernel: ? _cond_resched+0x15/0x30
Feb 16 21:51:54 ascratch-mds01 kernel: ? mutex_lock+0xe/0x30
Feb 16 21:51:54 ascratch-mds01 kernel: ? spl_kmem_cache_alloc+0x5d/0x160 [spl]
Feb 16 21:51:54 ascratch-mds01 kernel: ? dbuf_rele_and_unlock+0x13d/0x6a0 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: ? kmem_cache_alloc+0x12e/0x270
Feb 16 21:51:54 ascratch-mds01 kernel: ? __cv_init+0x3d/0x60 [spl]
Feb 16 21:51:54 ascratch-mds01 kernel: zpl_get_file_info+0x1ea/0x230 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dmu_objset_userquota_get_ids+0x1f8/0x480 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dnode_setdirty+0x2f/0xe0 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dnode_allocate+0x11d/0x180 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dmu_object_alloc_impl+0x32c/0x3c0 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dmu_object_alloc_dnsize+0x1c/0x30 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: __osd_object_create+0x78/0x120 [osd_zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: osd_mkreg+0x98/0x250 [osd_zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: ? __osd_xattr_declare_set+0x190/0x260 [osd_zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: osd_create+0x2c6/0xc90 [osd_zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: ? __kmalloc_node+0x10e/0x2f0
Feb 16 21:51:54 ascratch-mds01 kernel: lod_sub_create+0x244/0x4a0 [lod]
Feb 16 21:51:54 ascratch-mds01 kernel: lod_create+0x4b/0x330 [lod]
@doma2203 doma2203 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 23, 2022
@rincebrain
Contributor

Can you share zfs get all on one of the datasets being used?

#11433 and #12659 seem potentially relevant, though neither of them involve Lustre in the callstack, so perhaps not.

@doma2203
Author

We saw those issues, but we do not have encryption enabled on this dataset. Here is the output of zfs get all:

NAME           PROPERTY              VALUE                                  SOURCE
mdt0-bkp/mdt0  type                  filesystem                             -
mdt0-bkp/mdt0  creation              Mon Feb 21  0:47 2022                  -
mdt0-bkp/mdt0  used                  323G                                   -
mdt0-bkp/mdt0  available             13.5T                                  -
mdt0-bkp/mdt0  referenced            321G                                   -
mdt0-bkp/mdt0  compressratio         1.00x                                  -
mdt0-bkp/mdt0  mounted               no                                     -
mdt0-bkp/mdt0  quota                 none                                   default
mdt0-bkp/mdt0  reservation           none                                   default
mdt0-bkp/mdt0  recordsize            1M                                     local
mdt0-bkp/mdt0  mountpoint            /mdt0-bkp/mdt0                         default
mdt0-bkp/mdt0  sharenfs              off                                    default
mdt0-bkp/mdt0  checksum              on                                     default
mdt0-bkp/mdt0  compression           off                                    default
mdt0-bkp/mdt0  atime                 on                                     default
mdt0-bkp/mdt0  devices               on                                     default
mdt0-bkp/mdt0  exec                  on                                     default
mdt0-bkp/mdt0  setuid                on                                     default
mdt0-bkp/mdt0  readonly              off                                    default
mdt0-bkp/mdt0  zoned                 off                                    default
mdt0-bkp/mdt0  snapdir               hidden                                 default
mdt0-bkp/mdt0  aclmode               discard                                default
mdt0-bkp/mdt0  aclinherit            restricted                             default
mdt0-bkp/mdt0  createtxg             130                                    -
mdt0-bkp/mdt0  canmount              off                                    local
mdt0-bkp/mdt0  xattr                 sa                                     local
mdt0-bkp/mdt0  copies                1                                      default
mdt0-bkp/mdt0  version               5                                      -
mdt0-bkp/mdt0  utf8only              off                                    -
mdt0-bkp/mdt0  normalization         none                                   -
mdt0-bkp/mdt0  casesensitivity       sensitive                              -
mdt0-bkp/mdt0  vscan                 off                                    default
mdt0-bkp/mdt0  nbmand                off                                    default
mdt0-bkp/mdt0  sharesmb              off                                    default
mdt0-bkp/mdt0  refquota              none                                   default
mdt0-bkp/mdt0  refreservation        none                                   default
mdt0-bkp/mdt0  guid                  5116653551129842537                    -
mdt0-bkp/mdt0  primarycache          all                                    default
mdt0-bkp/mdt0  secondarycache        all                                    default
mdt0-bkp/mdt0  usedbysnapshots       1.87G                                  -
mdt0-bkp/mdt0  usedbydataset         321G                                   -
mdt0-bkp/mdt0  usedbychildren        0B                                     -
mdt0-bkp/mdt0  usedbyrefreservation  0B                                     -
mdt0-bkp/mdt0  logbias               latency                                default
mdt0-bkp/mdt0  objsetid              515                                    -
mdt0-bkp/mdt0  dedup                 off                                    default
mdt0-bkp/mdt0  mlslabel              none                                   default
mdt0-bkp/mdt0  sync                  standard                               default
mdt0-bkp/mdt0  dnodesize             auto                                   local
mdt0-bkp/mdt0  refcompressratio      1.00x                                  -
mdt0-bkp/mdt0  written               23.8G                                  -
mdt0-bkp/mdt0  logicalused           312G                                   -
mdt0-bkp/mdt0  logicalreferenced     311G                                   -
mdt0-bkp/mdt0  volmode               default                                default
mdt0-bkp/mdt0  filesystem_limit      none                                   default
mdt0-bkp/mdt0  snapshot_limit        none                                   default
mdt0-bkp/mdt0  filesystem_count      none                                   default
mdt0-bkp/mdt0  snapshot_count        none                                   default
mdt0-bkp/mdt0  snapdev               hidden                                 default
mdt0-bkp/mdt0  acltype               off                                    default
mdt0-bkp/mdt0  context               none                                   default
mdt0-bkp/mdt0  fscontext             none                                   default
mdt0-bkp/mdt0  defcontext            none                                   default
mdt0-bkp/mdt0  rootcontext           none                                   default
mdt0-bkp/mdt0  relatime              off                                    default
mdt0-bkp/mdt0  redundant_metadata    all                                    default
mdt0-bkp/mdt0  overlay               on                                     default
mdt0-bkp/mdt0  encryption            off                                    default
mdt0-bkp/mdt0  keylocation           none                                   default
mdt0-bkp/mdt0  keyformat             none                                   default
mdt0-bkp/mdt0  pbkdf2iters           0                                      default
mdt0-bkp/mdt0  special_small_blocks  0                                      default
mdt0-bkp/mdt0  lustre:version        1                                      received
mdt0-bkp/mdt0  lustre:failover.node  <lustre_nids>  received
mdt0-bkp/mdt0  lustre:fsname         ascratch                               received
mdt0-bkp/mdt0  lustre:mgsnode        <lustre_nids>  received
mdt0-bkp/mdt0  lustre:svname         ascratch-MDT0000                       received
mdt0-bkp/mdt0  lustre:flags          4353                                   received
mdt0-bkp/mdt0  lustre:index          0                                      received

@rincebrain
Contributor

The other thing I'd suggest is to also report this issue to wherever Lustre tracks bugs, and reference this bug. Since Lustre reaches in and directly calls things in OpenZFS code, there's a nonzero chance that this particular case can't be hit without Lustre. At that point the question becomes whether Lustre is violating some expectation of using those calls, or whether the implementation is making some additional assumption which is true without Lustre's usage but false there - and the behavior should be fixed wherever it makes the most sense.

You could try setting ZFS_DEBUG_MODIFY (0x10, or 16) in zfs_flags, which should make it check the ARC buffers for modification in between changes any time something "legitimately" goes to change them, and see if it notices something breaking earlier. (You could also set 0x2/0x4, ZFS_DEBUG_DBUF_VERIFY / ZFS_DEBUG_DNODE_VERIFY, but those require you compile with --enable-debug to do something useful.)
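
For reference, a minimal sketch of how these bits can be set (assuming the usual in-tree module parameter path; the flags can also be given at module load time):

# ZFS_DEBUG_DBUF_VERIFY (0x2) + ZFS_DEBUG_DNODE_VERIFY (0x4) + ZFS_DEBUG_MODIFY (0x10) = 0x16 (22 decimal)
echo 22 > /sys/module/zfs/parameters/zfs_flags

# or persistently across module reloads
echo "options zfs zfs_flags=22" > /etc/modprobe.d/zfs-debug.conf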

@doma2203
Author

doma2203 commented Feb 23, 2022

@rincebrain, thank you for the suggestion that it could be a Lustre bug; we've already opened an issue in the Lustre tracker: https://jira.whamcloud.com/browse/LU-15586, and we'll try debugging the ARC buffers as well.

@PaulZ-98
Contributor

PaulZ-98 commented Mar 1, 2022

If Lustre doesn't use the ZPL, is it weird that we're going off into zpl_get_file_info? @behlendorf

@behlendorf
Contributor

@PaulZ-98 since Lustre writes its data out in a format which is compatible with the ZPL on-disk format, it does leverage a few of the zpl_* callbacks. This is one of those cases where we can use the same ZPL function. Looking through the stacks of the linked issues, this strikes me as probably an OpenZFS issue, though I wouldn't rule out that Lustre might play a part in it.

What is strange here is it looks like we're either getting a damaged data buffer passed, or potentially one which wasn't re-initialized. As @rincebrain mentioned, setting ZFS_DEBUG_MODIFY | ZFS_DEBUG_DBUF_VERIFY | ZFS_DEBUG_DNODE_VERIFY in a debug build might be a good way to help narrow it down.

It can also be recreated by running tar on the kernel sources in a loop.

Can you expand on this a bit? Are you able to reproduce this issue with a new pool?

@doma2203
Author

doma2203 commented Apr 4, 2022

@behlendorf we finally got the debug stack running, but we had some difficulties using a ZFS debug build with Lustre. We could not load osd_zfs due to the unresolved zfs_refcount_add symbol:

kernel: [74690.082105]  osd_zfs: Unknown symbol zfs_refcount_add (err 0)

We updated the stack hoping this would fix the module problem: we bumped ZFS to the official 2.1.3 release, the kernel to 4.18.0-348.7.1.el8_5 and Lustre to the latest master (2.15.0_RC2_38_g8e8bbc0). Finally, we solved it with a workaround that forcibly exports the symbols:

diff --git a/module/zfs/refcount.c b/module/zfs/refcount.c
index 354e021d9..92d641b68 100644
--- a/module/zfs/refcount.c
+++ b/module/zfs/refcount.c
@@ -38,6 +38,14 @@ int reference_history = 3; /* tunable */
 static kmem_cache_t *reference_cache;
 static kmem_cache_t *reference_history_cache;
 
+#if defined(_KERNEL)
+EXPORT_SYMBOL(zfs_refcount_add);
+EXPORT_SYMBOL(zfs_refcount_create);
+#endif
+
+
+
+
 void
 zfs_refcount_init(void)
 {

We also set ZFS_DEBUG_MODIFY | ZFS_DEBUG_DBUF_VERIFY | ZFS_DEBUG_DNODE_VERIFY (0x16) in zfs_flags. When we mounted the targets via the Lustre ZFS mount we hit the following panic:

[87796.641026] Kernel panic - not syncing: dirtying dbuf obj=c80092 lvl=0 blkid=10 but not tx_held

[87796.677146] CPU: 18 PID: 16207 Comm: mount.lustre Kdump: loaded Tainted: P          IOE    --------- -  - 4.18.0-348.7.1.el8_5.x86_64 #1
[87796.715272] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 7.99 03/11/2021
[87796.735224] Call Trace:
[87796.750373]  dump_stack+0x5c/0x80
[87796.766219]  panic+0xe7/0x2a9
[87796.781619]  dmu_tx_dirty_buf+0x117/0x3f0 [zfs]
[87796.798682]  ? rrw_enter_read_impl+0x125/0x220 [zfs]
[87796.815679]  dbuf_dirty+0x5e/0x1530 [zfs]
[87796.831998]  ? dbuf_read+0x139/0x680 [zfs]
[87796.847813]  dmu_write_impl+0x44/0x150 [zfs]
[87796.863641]  dmu_write_by_dnode+0x8e/0xe0 [zfs]
[87796.879597]  osd_write+0x118/0x3a0 [osd_zfs]
[87796.895392]  dt_record_write+0x32/0x110 [obdclass]
[87796.911190]  llog_osd_write_rec+0xd06/0x1ae0 [obdclass]
[87796.927699]  llog_write_rec+0x3f6/0x530 [obdclass]
[87796.943235]  llog_write+0x4df/0x550 [obdclass]
[87796.958243]  llog_process_thread+0xb8e/0x1aa0 [obdclass]
[87796.974051]  ? llog_process_or_fork+0x5e/0x560 [obdclass]
[87796.990079]  ? kmem_cache_alloc_trace+0x131/0x270
[87797.004891]  ? llog_write+0x550/0x550 [obdclass]
[87797.019560]  llog_process_or_fork+0x1c1/0x560 [obdclass]
[87797.034729]  llog_backup+0x354/0x520 [obdclass]
[87797.048909]  mgc_llog_local_copy+0x110/0x420 [mgc]
[87797.063504]  mgc_process_cfg_log+0x971/0xd80 [mgc]
[87797.077634]  mgc_process_log+0x6c3/0x800 [mgc]
[87797.091414]  ? config_log_add+0x3f5/0xa00 [mgc]
[87797.104982]  mgc_process_config+0xb53/0xe60 [mgc]
[87797.118529]  lustre_process_log+0x5fa/0xad0 [obdclass]
[87797.132327]  ? server_register_mount+0x4d1/0x740 [obdclass]
[87797.146529]  server_start_targets+0x1504/0x3010 [obdclass]
[87797.160454]  ? strlcpy+0x2d/0x40
[87797.171898]  ? class_config_dump_handler+0x730/0x730 [obdclass]
[87797.186054]  ? mgc_set_info_async+0x539/0xad0 [mgc]
[87797.198949]  ? mgc_set_info_async+0x539/0xad0 [mgc]
[87797.211583]  ? lustre_start_mgc+0xf7c/0x27c0 [obdclass]
[87797.224758]  server_fill_super+0x8ea/0x10d0 [obdclass]
[87797.237408]  lustre_fill_super+0x3a1/0x3f0 [lustre]
[87797.249568]  ? ll_inode_destroy_callback+0x120/0x120 [lustre]
[87797.262647]  mount_nodev+0x48/0xa0
[87797.273054]  legacy_get_tree+0x27/0x40
[87797.283602]  vfs_get_tree+0x25/0xb0
[87797.294051]  do_mount+0x2e2/0x950
[87797.303806]  ksys_mount+0xb6/0xd0
[87797.313335]  __x64_sys_mount+0x21/0x30
[87797.323140]  do_syscall_64+0x5b/0x1a0
[87797.332660]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[87797.343521] RIP: 0033:0x7fb3c21f892e

The panic is the same (except for the object id) whether it's the osd or mdt mount. We also tried to mount the pool natively using zfs mount, and it works.

@behlendorf
Contributor

Unfortunately, the stack trace here seems to indicate Lustre isn't quite calling the ZFS interfaces correctly. In particular, this call path appears to be missing a hold. That's something we'll need to look into, but it's unlikely to be related to the issue you hit.

We could not load osd_zfs due to the unresolved zfs_refcount_add symbol

Sorry about the build issue. This was fixed in the master branch but never backported to the 2.1.x releases. I've gone ahead and opened a new PR to make sure it's in the next point release.

@doma2203
Author

doma2203 commented Apr 5, 2022

@behlendorf, thank you for the information; we have two questions:

  1. Do you suspect that the problem behind the latest panic, caused by the Lustre-ZFS interface, is specific to the 2.1.x versions and won't show up in 2.0.x?
  2. If not, do you think you'll address this interface problem in the near future, or is it rather a long-term plan?

We are planning the next steps and we are not sure which option will be the best one:

  • We can revert to the previous version without debug and collect other information
  • We can downgrade ZFS to 2.0.7 with debug and zfs_flags on the MDS and wait for the next SA_MAGIC issue.
  • ...or we can totally drop the zfs 2.1.x and draid@lustre idea and try again in the future?

@behlendorf
Contributor

Do you suspect that the problem behind the latest panic, caused by the Lustre-ZFS interface, is specific to the 2.1.x versions and won't show up in 2.0.x?

I'd expect you to see this with the 2.0.x releases as well. The issue here is that Lustre is running afoul of some long-standing correctness checks ZFS performs only when debugging is enabled. The upstream Lustre+ZFS testing must (reasonably) be using a production build instead of a debug build of ZFS in their CI, which would explain why the panic and the debug build issue went unnoticed.

We are planning the next steps and we are not sure which option will be the best one:

Another option to consider would be to disable this specific debugging check in ZFS. As I mentioned, it's always disabled in production builds so there's no harm in skipping it, and doing so would allow you to run with debugging. You can disable the problematic check with a relatively small change like the one below. My expectation is that this should allow you to use a debug build of ZFS with Lustre as originally intended.

diff --git a/include/sys/dmu_tx.h b/include/sys/dmu_tx.h
index 71a9ac7ca..a5e781aea 100644
--- a/include/sys/dmu_tx.h
+++ b/include/sys/dmu_tx.h
@@ -162,7 +162,7 @@ void dmu_tx_add_new_object(dmu_tx_t *tx, dnode_t *dn);
 void dmu_tx_dirty_buf(dmu_tx_t *tx, struct dmu_buf_impl *db);
 void dmu_tx_hold_space(dmu_tx_t *tx, uint64_t space);
 
-#ifdef ZFS_DEBUG
+#ifdef ZFS_DEBUG_DX_TX_DIRTY_BUF
 #define        DMU_TX_DIRTY_BUF(tx, db)        dmu_tx_dirty_buf(tx, db)
 #else
 #define        DMU_TX_DIRTY_BUF(tx, db)
diff --git a/module/zfs/dmu_tx.c b/module/zfs/dmu_tx.c
index fe9860066..3e73f3623 100644
--- a/module/zfs/dmu_tx.c
+++ b/module/zfs/dmu_tx.c
@@ -570,7 +570,7 @@ dmu_tx_hold_space(dmu_tx_t *tx, uint64_t space)
        }
 }
 
-#ifdef ZFS_DEBUG
+#ifdef ZFS_DEBUG_DX_TX_DIRTY_BUF
 void
 dmu_tx_dirty_buf(dmu_tx_t *tx, dmu_buf_impl_t *db)
 {

If you're then still able to reproduce the issue, that would be very helpful. Just for reference, we're running a stack very close to this (RHEL 8.5, Lustre 2.14, and zfs-2.1 with dRAID) and have not observed this issue. It seems reasonable that it may somehow be related to your metadata-intensive workload, but it's hard to say exactly how without more information or a consistent reproducer.

@doma2203
Author

doma2203 commented Apr 13, 2022

@behlendorf we built and ran ZFS with Lustre successfully (thanks a lot!) and after running ADF we hit a brand new panic:

Apr 13 23:32:48 ascratch-mds01 kernel: [127008.190617] VERIFY(bcmp(dn->dn_phys, &dnode_phys_zero, sizeof (dnode_phys_t)) == 0) failed
Apr 13 23:32:48 ascratch-mds01 kernel: [127008.211314] PANIC at dnode.c:599:dnode_allocate()
Apr 13 23:32:48 ascratch-mds01 kernel: [127008.228216] Showing stack for process 1453529
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.244697] CPU: 16 PID: 1453529 Comm: mdt01_047 Kdump: loaded Tainted: P          IOE    --------- -  - 4.18.0-348.7.1.el8_5.x86_64 #1
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.281051] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 7.99 03/11/2021
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.300228] Call Trace:
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.314774]  dump_stack+0x5c/0x80
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.329961]  spl_panic+0xd3/0xfb [spl]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.345379]  ? __switch_to_asm+0x41/0x70
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.361200]  ? dnode_cons+0x2a1/0x2b0 [zfs]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.376866]  ? _cond_resched+0x15/0x30
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.391936]  ? _cond_resched+0x15/0x30
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.406626]  ? mutex_lock+0xe/0x30
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.421145]  ? aggsum_add+0x1bb/0x250 [zfs]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.435910]  ? _cond_resched+0x15/0x30
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.450000]  ? _cond_resched+0x15/0x30
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.463780]  ? down_read+0xe/0xa0
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.476884]  dnode_allocate+0x795/0x9a0 [zfs]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.490880]  dmu_object_alloc_impl+0x3ee/0x560 [zfs]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.505211]  ? dbuf_verify.part.21+0x485/0x9a0 [zfs]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.519781]  dmu_object_alloc_dnsize+0x1c/0x30 [zfs]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.533974]  __osd_object_create+0x78/0x120 [osd_zfs]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.548050]  osd_mkreg+0x98/0x250 [osd_zfs]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.561072]  osd_create+0x2c6/0xcf0 [osd_zfs]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.574194]  lod_sub_create+0x24a/0x4e0 [lod]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.587309]  lod_create+0x4d/0x3a0 [lod]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.599845]  mdd_create_object_internal+0xa9/0x310 [mdd]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.613576]  mdd_create_object+0x9a/0x920 [mdd]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.626369]  mdd_create+0xe2e/0x1ae0 [mdd]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.638559]  ? mdd_links_rename+0x440/0x440 [mdd]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.651371]  mdt_reint_open+0x28c2/0x3160 [mdt]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.663942]  ? check_unlink_entry+0x14/0xc0 [obdclass]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.676907]  ? mdt_root_squash+0x1e/0x3f0 [mdt]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.689147]  mdt_reint_rec+0x117/0x270 [mdt]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.700986]  mdt_reint_internal+0x4bc/0x7d0 [mdt]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.713224]  mdt_intent_open+0x137/0x420 [mdt]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.725365]  ? mdt_intent_fixup_resent+0x1f0/0x1f0 [mdt]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.737955]  mdt_intent_opc+0x12c/0xbf0 [mdt]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.749419]  mdt_intent_policy+0x207/0x3a0 [mdt]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.761070]  ldlm_lock_enqueue+0x4e4/0xa80 [ptlrpc]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.772933]  ldlm_handle_enqueue0+0x634/0x1530 [ptlrpc]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.785093]  tgt_enqueue+0xa4/0x210 [ptlrpc]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.796047]  tgt_request_handle+0xc93/0x1a40 [ptlrpc]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.807620]  ? ptlrpc_nrs_req_get_nolock0+0xfb/0x1f0 [ptlrpc]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.819754]  ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.831748]  ptlrpc_main+0xc06/0x1560 [ptlrpc]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.842134]  ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.853173]  kthread+0x116/0x130
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.862081]  ? kthread_flush_work_fn+0x10/0x10
Apr 13 23:32:49 ascratch-mds01 kernel: [127008.872231]  ret_from_fork+0x1f/0x40

From the client perspective we observed that ADF internally writes a bunch of data (between mmap and munmap) and calls ftruncate to extend the file by 64 KB in a loop (a minimal sketch of the pattern follows the trace):

00:18:52.967776 mmap(NULL, 148766720, PROT_READ|PROT_WRITE, MAP_SHARED, 54, 0) = 0x1541924d2000 <0.000023>
00:18:52.967846 munmap(0x1541924d2000, 148766720) = 0 <0.000011>
00:18:52.967868 ftruncate(54, 148832256) = 0 <0.000241>
00:18:52.968120 mmap(NULL, 148832256, PROT_READ|PROT_WRITE, MAP_SHARED, 54, 0) = 0x1541924c2000 <0.000024>
00:18:52.968189 munmap(0x1541924c2000, 148832256) = 0 <0.000013>
00:18:52.968213 ftruncate(54, 148897792) = 0 <0.000266>
00:18:52.968492 mmap(NULL, 148897792, PROT_READ|PROT_WRITE, MAP_SHARED, 54, 0) = 0x1541924b2000 <0.000021>
00:18:52.968564 munmap(0x1541924b2000, 148897792) = 0 <0.000010>
00:18:52.968590 ftruncate(54, 148963328) = 0 <0.000338>
00:18:52.968960 mmap(NULL, 148963328, PROT_READ|PROT_WRITE, MAP_SHARED, 54, 0) = 0x1541924a2000 <0.000023>
00:18:52.969040 munmap(0x1541924a2000, 148963328) = 0 <0.000011>
00:18:52.969062 ftruncate(54, 149028864) = 0 <0.000423>
00:18:52.969498 mmap(NULL, 149028864, PROT_READ|PROT_WRITE, MAP_SHARED, 54, 0) = 0x154192492000 <0.000023>
00:18:52.969580 munmap(0x154192492000, 149028864) = 0 <0.000012>
00:18:52.969602 ftruncate(54, 149094400) = 0 <0.000218>
00:18:52.969830 mmap(NULL, 149094400, PROT_READ|PROT_WRITE, MAP_SHARED, 54, 0) = 0x154192482000 <0.000023>
00:18:52.969902 munmap(0x154192482000, 149094400) = 0 <0.000010>
00:18:52.969936 ftruncate(54, 149159936) = 0 <0.000315>
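
For anyone trying to reproduce this pattern outside ADF, a minimal C sketch of the same grow-by-64 KB loop (file path, sizes and iteration count are illustrative; this is our reading of the strace above, not ADF's actual code):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK (64 * 1024)   /* the file grows by 64 KB per iteration */
#define ROUNDS 10000

int main(void)
{
        /* path on the Lustre client is illustrative */
        int fd = open("/lustre/ascratch/stress/adf-like.dat",
            O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        off_t size = CHUNK;
        if (ftruncate(fd, size) != 0) { perror("ftruncate"); return 1; }

        for (int i = 0; i < ROUNDS; i++) {
                /* map the whole file, write into the new tail, unmap -
                 * mirroring the mmap/munmap pairs in the strace above */
                char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
                if (p == MAP_FAILED) { perror("mmap"); return 1; }
                memset(p + size - CHUNK, 0xab, CHUNK);
                munmap(p, size);

                /* extend the file by 64 KB, as in the ftruncate calls above */
                size += CHUNK;
                if (ftruncate(fd, size) != 0) { perror("ftruncate"); return 1; }
        }

        close(fd);
        return 0;
}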

If you still suspect metadata issues, we can also share output from our collectd Lustre metadata plugin and the corresponding graphs if it helps you diagnose the problem.

@doma2203
Author

doma2203 commented Apr 25, 2022

@behlendorf, do you need more information or tests from our side?

@doma2203 doma2203 closed this as completed May 9, 2022
@doma2203
Author

doma2203 commented May 9, 2022

We decided to close both issues and reformat the filesystem to the supported stack.

@behlendorf
Contributor

@doma2203 thanks for the update and the debug information above.
