ZTS with reference_tracking_enable makes the magic smoke come out #12589

Closed
rincebrain opened this issue Sep 26, 2021 · 5 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@rincebrain
Contributor

rincebrain commented Sep 26, 2021

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Debian |
| Distribution Version | 11 |
| Kernel Version | 5.10.0-8-amd64 |
| Architecture | x86_64 |
| OpenZFS Version | ce2bdce |

Describe the problem you're observing

I recently had occasion to want to try reference_tracking_enable on sparc64, so I flipped it on, and the magic smoke came out.

Thinking it might have been one of my local changes (my reason for flipping it on was to print state when an error condition happened, and I had added code to do that, not that it got reached yet...), I tested on vanilla git master and an x86_64 VM, and yup, still boom.

(Sorry the dump isn't from x86_64, kdump is still not the most reliable thing on the planet...)

Describe how to reproduce the problem

# echo 1 > /sys/module/zfs/parameters/reference_tracking_enable
# scripts/zfs-tests.sh -T zfs_receive

Include any warning/errors/backtraces from the system logs

[ 1407.047437] Kernel panic - not syncing: No such hold 0000000000000000 on refcount fffff8001169fec8
[ 1407.165346] CPU: 0 PID: 14272 Comm: diff Tainted: P           OE     5.10.0-8-sparc64 #1 Debian 5.10.46-4
[ 1407.291245] Call Trace:
[ 1407.323406] [<0000000000be2644>] panic+0xec/0x340
[ 1407.386284] [<0000000010889a30>] zfs_refcount_remove_many+0x2d0/0x2e0 [zfs]
[ 1407.478379] [<0000000010889a54>] zfs_refcount_remove+0x14/0x40 [zfs]
[ 1407.562388] [<00000000108269d0>] dmu_zfetch_stream_done+0x10/0x40 [zfs]
[ 1407.649844] [<00000000107fe550>] dbuf_prefetch_impl+0x90/0x600 [zfs]
[ 1407.733862] [<0000000010827588>] dmu_zfetch_run+0x188/0x380 [zfs]
[ 1407.814454] [<0000000010806a14>] dmu_buf_hold_array_by_dnode+0x134/0x6a0 [zfs]
[ 1407.909921] [<0000000010807c0c>] dmu_read_uio_dnode+0x2c/0x180 [zfs]
[ 1407.993937] [<0000000010807d8c>] dmu_read_uio_dbuf+0x2c/0x60 [zfs]
[ 1408.075659] [<00000000109251f0>] zfs_read+0x130/0x320 [zfs]
[ 1408.149382] [<000000001097263c>] zpl_iter_read+0xbc/0x200 [zfs]
[ 1408.227261] [<00000000006632a8>] new_sync_read+0xe8/0x1a0
[ 1408.298275] [<0000000000665358>] vfs_read+0xd8/0x180
[ 1408.363577] [<00000000006656ac>] ksys_read+0x4c/0xe0
[ 1408.428882] [<0000000000665754>] sys_read+0x14/0x40
[ 1408.493053] [<0000000000406174>] linux_sparc_syscall+0x34/0x44
[ 1408.569792] Press Stop-A (L1-A) from sun keyboard or send break
@rincebrain rincebrain added the Type: Defect Incorrect behavior (e.g. crash, hang) label Sep 26, 2021
@rincebrain
Contributor Author

Ah yes, that VM was running a dirty build too. I wish I'd finished that PR to tag such builds visibly.

Oh well. One round of building on a vanilla clone later, it'll repro on ce2bdce too, I shouldn't wonder...

@rincebrain
Contributor Author

Yup, it repros fine on ce2bdce. The test run gets to:

Test: /home/rich/zfs_vanilla_really/tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_006_pos (run as root) [00:01] [PASS]
Test: /home/rich/zfs_vanilla_really/tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_007_neg (run as root) [00:00] [PASS]
Test: /home/rich/zfs_vanilla_really/tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_008_pos (run as root) [00:03] [PASS]
Test: /home/rich/zfs_vanilla_really/tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_009_neg (run as root) [00:01] [PASS]
client_loop: send disconnect: Broken pipe

@behlendorf
Contributor

That's troubling. After resolving this issue, it would be nice to enable reference tracking by default in the CI to catch this kind of thing early (assuming the overhead isn't cost-prohibitive).

@rincebrain
Contributor Author

I have a simple fix for the panic. Whoever wrote the dmu_zfetch code to use refcount seems to have assumed what I would have: that "refcount_add_many" adds N references. With reference_tracking enabled, it instead adds one reference with magic number N, which you cannot decrement with refcount_remove. Even with the fix, though, one set of tests still won't run.

Notably, pool_checkpoint's setup hits the 45m timeout and dies outright, presumably because the overhead of pool nesting updating all the references sucks.

So I guess you could make a runfile that just excludes pool_checkpoint and run that with reference_tracking_enable=1 after I submit that fix. I'll even go try cutting a PR for a GitHub Action to do it...

@behlendorf
Contributor

That's great, and what I would have assumed as well... I'm looking forward to seeing the fix.

Alternatively, it looks like it would be safe to disable it in setup.sh for the pool_checkpoint tests and then re-enable it afterwards in the cleanup.

rincebrain added a commit to rincebrain/zfs that referenced this issue Dec 21, 2021
refcount_add_many(foo,N) is not the same as
for (i=0; i < N; i++) { refcount_add(foo); }

Unfortunately, this is only actually true with debug kernels and
reference_tracking_enable=1.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#12589 
Closes openzfs#12602