fix a subtle bug in dnode_next_offset() with txg > 0 #11200

Open · avg-I wants to merge 1 commit into master from 11196-dnode_next_offset
Conversation

@avg-I commented Nov 13, 2020 (Contributor)

Only stop the search when the upward search cannot find a matching
block pointer in the requested direction from the current offset.
That definitely means that the search is exhausted.

But if the downward search cannot find a suitable block at the
requested level, then cycle back to the upward search for farther
offsets.
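
In other words, the intended control flow is roughly the following (a condensed sketch, not the actual diff; dnode_next_offset_level() is the real helper, everything around it is paraphrased, and details such as offset advancement are omitted):

```c
/*
 * Condensed sketch of the intended search loop (paraphrased, not the
 * actual OpenZFS code).
 */
static int
next_offset_sketch(dnode_t *dn, int flags, uint64_t *offset,
    int minlvl, uint64_t blkfill, uint64_t txg)
{
	int lvl = minlvl;
	int error;

	for (;;) {
		error = dnode_next_offset_level(dn, flags, offset,
		    lvl, blkfill, txg);
		if (error == ESRCH) {
			if (lvl == dn->dn_nlevels)
				break;	/* upward search exhausted: done */
			lvl++;		/* cycle back to the upward search */
		} else if (error != 0) {
			break;		/* hard error */
		} else if (lvl == minlvl) {
			break;		/* found a match at the target level */
		} else {
			lvl--;		/* descend toward the target level */
		}
	}
	return (error);
}
```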

This should fix #11196

Signed-off-by: Andriy Gapon agapon@panzura.com

@avg-I commented Nov 16, 2020 (Contributor, Author)

I reproduced the test failure (zdb crash).
It seems to be another special case where dnode_next_offset_level does not change the offset while returning ESRCH. The special case is "recent", so I didn't have it on my original test system.
I'm thinking about how best to handle it.

@behlendorf added the "Status: Work in Progress (Not yet ready for general review)" label on Nov 17, 2020
@avg-I commented Dec 1, 2020 (Contributor, Author)

@ahrens, @behlendorf, I think that the latest test failures for this PR actually uncover a pre-existing bug in dnode_next_offset related to large dnodes.

The failures are an assertion that txg != 0, hit after a level=1 search finds an L0 block that should contain a hole but the subsequent search in that L0 block does not find any. I think that previously we would just return ESRCH to the caller, but my change treats that situation as impossible. I am not sure whether I was right about that.

I analyzed the problematic scenario and I see that it happens, for instance, in this call chain:
dmu_object_alloc_dnsize -> dmu_object_alloc_impl -> dmu_object_alloc_impl -> dmu_object_next -> dnode_next_offset.
dmu_object_next calls dnode_next_offset with minlvl=0, blkfill=32, txg=0 and flags=1 (find a hole).
dnode_next_offset first calls dnode_next_offset_level with lvl=0 to examine dnodes in the current L0 block of the meta-dnode. dnode_next_offset_level does not find a hole and returns ESRCH. It is then called again, now with lvl=1, and looks for an L0 block where fill <= 31. It finds such a block, for instance one with a fill of 5. So dnode_next_offset_level is called again with lvl=0 and an appropriate starting offset to find an unallocated dnode in that L0 block.
But it does not find any!
That happens because, while the block has only 5 dnodes, some of them take several slots and together they fill all 32 slots in the block.
I am not sure whether the fill of L0 meta-dnode blocks should reflect the number of allocated dnodes or the number of allocated slots.
In my tests I see the former.
So far I have reproduced the problem only once locally, and the meta-dnode was encrypted; I am not sure whether that played a role.
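
To make the arithmetic concrete, here is a toy illustration in plain C (not ZFS code) of how 5 dnodes can consume all 32 slots of an L0 meta-dnode block, so a fill of 5 advertises free space that does not exist:

```c
/*
 * Toy illustration (not ZFS code): why fill < blkfill does not imply
 * a free dnode slot when dnodes can span multiple 512-byte slots.
 */
#include <stdio.h>

int main(void) {
	/* Slot counts per dnode are made up but plausible: e.g. a 4K
	 * dnode occupies 8 slots, a 2K dnode occupies 4 slots. */
	const int slots_per_dnode[] = { 8, 8, 8, 4, 4 };
	const int fill = sizeof (slots_per_dnode) / sizeof (slots_per_dnode[0]);
	int used = 0;

	for (int i = 0; i < fill; i++)
		used += slots_per_dnode[i];

	/* fill = 5 satisfies "fill <= 31", so the level-1 search picks
	 * this block, yet all 32 slots are in use: no hole exists. */
	printf("fill = %d, slots used = %d of 32\n", fill, used);
	return 0;
}
```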

@avg-I commented Dec 1, 2020 (Contributor, Author)

Further looking through the code, I see two things.

First, yes, the fill count is the number of dnodes. That can be seen in dbuf_write_ready, where the fill is incremented once per dnode.

Second, it looks like the call chain I mentioned is the only one where the minimum level is zero and the search is for a hole. I see that dmu_object_alloc_impl has reasonable handling for a dmu_object_next error.
I am not sure whether searches for holes at higher levels can result in any incorrect or sub-optimal behavior.

I think it could have been reasonable to change the meaning of the dnode fill count from the number of dnodes to the number of slots. But I haven't even tried checking where else that fill count is used; I am sure that something depends on its current meaning.

So, for the time being, I am just going to relax the assertion and implement the backward-compatible behavior.
That is, just return ESRCH without attempting to advance the search in whatever direction is requested.
(It seems that it's always forward for the hole search.)
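
The fallback I have in mind would look something like this (a sketch of the idea only, not the literal change):

```c
/*
 * Sketch only.  If the level-0 hole search inside the L0 block chosen
 * by the level-1 search comes up empty (multi-slot dnodes can occupy
 * every slot despite fill < blkfill), report ESRCH to the caller
 * instead of asserting or advancing the search.
 */
if (error == ESRCH && lvl == minlvl && minlvl == 0 &&
    (flags & DNODE_FIND_HOLE) != 0) {
	/* dmu_object_next() callers already handle ESRCH. */
	return (SET_ERROR(ESRCH));
}
```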

@avg-I commented Dec 3, 2020 (Contributor, Author)

The latest panics on all Linux flavors (except for Ubuntu 20.04 x86_64, which looks unrelated) point to a potential problem in dnode_sync_free_range_impl().
That function is called without dn_mtx held.
In the if (trunc) block at the end, the function sets dn_maxblkid and then does some validation without acquiring either dn_mtx or dn_struct_rwlock. I am not sure what the rules for modifying dn_maxblkid are, but that code looks potentially racy.
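
If dn_mtx is indeed the right lock here, the shape of a fix might be something like this (a purely hypothetical sketch; I have not checked the lock ordering in this path):

```c
/* Hypothetical sketch only, not a verified patch. */
if (trunc) {
	mutex_enter(&dn->dn_mtx);
	/* Serialize the maxblkid update with its other writers. */
	dn->dn_phys->dn_maxblkid = (blkid == 0 ? 0 : blkid - 1);
	mutex_exit(&dn->dn_mtx);

	/* The dnode_next_offset() sanity check would then run only
	 * after the update is visible. */
}
```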

An example of the stack trace:

[13491.586253] Call Trace:
[13491.590968]  [<ffffffffa2f81400>] dump_stack+0x19/0x1b
[13491.597048]  [<ffffffffc063f72b>] spl_dumpstack+0x2b/0x30 [spl]
[13491.603540]  [<ffffffffc063f7f9>] spl_panic+0xc9/0x110 [spl]
[13491.609977]  [<ffffffffc07f4b66>] ? dbuf_rele_and_unlock+0x246/0x610 [zfs]
[13491.616972]  [<ffffffffc07eed99>] ? do_raw_spin_unlock+0x9/0x10 [zfs]
[13491.623757]  [<ffffffffc07eeda9>] ? __raw_spin_unlock+0x9/0x10 [zfs]
[13491.630549]  [<ffffffffc07eedb9>] ? spin_unlock+0x9/0x10 [zfs]
[13491.637040]  [<ffffffffc07f8f20>] ? dbuf_read+0x4c0/0x680 [zfs]
[13491.643481]  [<ffffffffc07f52ed>] ? dbuf_rele+0x4d/0x80 [zfs]
[13491.649789]  [<ffffffffc082a1b5>] ? dnode_next_offset_level+0x4b5/0x800 [zfs]
[13491.656829]  [<ffffffffc082fa28>] dnode_next_offset+0x3b8/0x3c0 [zfs]
[13491.663618]  [<ffffffffc083187d>] dnode_sync_free_range_impl+0x25d/0x3c0 [zfs]
[13491.674324]  [<ffffffffc0831a4a>] dnode_sync_free_range+0x6a/0xd0 [zfs]
[13491.681230]  [<ffffffffc08319e0>] ? dnode_sync_free_range_impl+0x3c0/0x3c0 [zfs]
[13491.691909]  [<ffffffffc088a005>] range_tree_walk+0x75/0xb0 [zfs]
[13491.698541]  [<ffffffffc08330aa>] dnode_sync+0x3fa/0xe60 [zfs]
[13491.705076]  [<ffffffffc080da81>] dmu_objset_sync_dnodes+0x51/0x240 [zfs]
[13491.712038]  [<ffffffffc080dc99>] sync_dnodes_task+0x29/0x50 [zfs]
[13491.718712]  [<ffffffffc064769e>] taskq_thread+0x2ee/0x520 [spl]
[13491.725193]  [<ffffffffa28daf40>] ? wake_up_state+0x20/0x20
[13491.731567]  [<ffffffffc06473b0>] ? taskq_thread_spawn+0x60/0x60 [spl]
[13491.738273]  [<ffffffffa28c5c21>] kthread+0xd1/0xe0
[13491.744523]  [<ffffffffa28c5b50>] ? insert_kthread_work+0x40/0x40
[13491.750958]  [<ffffffffa2f94df7>] ret_from_fork_nospec_begin+0x21/0x21
[13491.757684]  [<ffffffffa28c5b50>] ? insert_kthread_work+0x40/0x40

@avg-I commented Dec 4, 2020 (Contributor, Author)

The FreeBSD head ztest crash also looks like a race.
I have been able to reproduce it a few times locally, and in all cases the crash happens at an offset just past an L0 hole.
In all cases that L0 hole has a birth time, and the birth time is equal to the currently syncing txg.
I suspect a scenario where dnode_next_offset_level(lvl=1) finds a non-empty L0 block.
Then, after that call returns and before dnode_next_offset_level(lvl=0) is called, the last dnode in that block gets freed.
So dnode_next_offset_level(lvl=0) fails and the assert is hit.

Stack:

#0  thr_kill () at thr_kill.S:4
#1  0x00000008008ec784 in __raise (s=s@entry=6) at /usr/src/lib/libc/gen/raise.c:52
#2  0x00000008009a0199 in abort () at /usr/src/lib/libc/stdlib/abort.c:67
#3  0x000000080078d078 in libspl_assertf (file=<optimized out>, func=<optimized out>, line=<optimized out>, line@entry=2622, format=<optimized out>) at assert.c:45
#4  0x00000008003fa911 in libspl_assert (buf=0x0, file=0x297b5 <error: Cannot access memory at address 0x297b5>, func=0x6 <error: Cannot access memory at address 0x6>, line=0, line@entry=2622) at ../../lib/libspl/include/assert.h:46
#5  0x00000008004005e2 in dnode_next_offset (dn=0x801ae4000, flags=0, offset=offset@entry=0x7fffd918ae20, minlvl=minlvl@entry=0, blkfill=blkfill@entry=32, txg=txg@entry=0) at ../../module/zfs/dnode.c:2622
#6  0x00000008003e1d8f in dmu_object_next (os=os@entry=0x806faf400, objectp=objectp@entry=0x7fffd918af00, hole=hole@entry=B_FALSE, txg=txg@entry=0) at ../../module/zfs/dmu_object.c:441
#7  0x0000000000214dc8 in ztest_verify_dnode_bt (zd=0x801023eb8, id=<optimized out>) at ztest.c:5510
#8  0x000000000021dc78 in ztest_execute (test=test@entry=35, zi=zi@entry=0x221950 <ztest_info+1120>, id=id@entry=1) at ztest.c:6705
#9  0x000000000021d768 in ztest_thread (arg=<optimized out>) at ztest.c:6752
#10 0x000000080081c74b in thread_start (curthread=0x805361500) at /usr/src/lib/libthr/thread/thr_create.c:292

@avg-I commented Jun 24, 2021 (Contributor, Author)

@behlendorf, @ahrens, from your experience, how would you interpret the test results?

The checkstyle failure is because I didn't add a sign-off to the latest commit in this PR.

The two FreeBSD failures are for the same test, cli_root/zfs_receive/receive-o-x_props_override.
I don't think that the change should have impacted that test.

The CentOS Stream 8 failure is suspicious, though.

[10287.079604] VERIFY3(spa->spa_checkpoint_info.sci_dspace == ckpoint_sm_space_sum) failed (212029440 == 379577856)
[10287.082610] PANIC at spa_checkpoint.c:276:spa_checkpoint_accounting_verify()
[10287.084528] Showing stack for process 1889638
[10287.085772] CPU: 0 PID: 1889638 Comm: txg_sync Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-310.el8.x86_64 #1
[10287.088490] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[10287.089999] Call Trace:
[10287.090600]  dump_stack+0x5c/0x80
[10287.091413]  spl_panic+0xd3/0xfb [spl]
[10287.092323]  ? arch_local_irq_save+0xe/0x20 [spl]
[10287.093456]  ? spl_kmem_cache_free+0x5c/0x1b0 [spl]
[10287.094706]  ? space_map_incremental_destroy+0x425/0x530 [zfs]
[10287.096143]  ? spa_checkpoint_discard_thread_sync+0x250/0x250 [zfs]
[10287.097682]  spa_checkpoint_discard_thread_sync+0x8b/0x250 [zfs]
[10287.099151]  dsl_sync_task_sync+0xaf/0x120 [zfs]
[10287.100307]  dsl_pool_sync+0x4d6/0x690 [zfs]
[10287.101383]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[10287.102790]  spa_sync+0x2d4/0x820 [zfs]
[10287.103760]  txg_sync_thread+0x2bf/0x400 [zfs]

http://build.zfsonlinux.org/builders/CentOS%20Stream%208%20x86_64%20%28TEST%29/builds/2192/steps/shell_4/logs/console

@behlendorf commented (Contributor)
@avg-I these are all known unrelated failures. The FreeBSD receive-o-x_props_override test failures appear to have been caused by something that changed in the FreeBSD environment in the last few months. They haven't yet been seriously investigated to my knowledge, but we did add an exception for them in the master branch until they're resolved. The CentOS 8 Stream panic is also something we've seen before and is unrelated.

There's a good chance that if you add your signed-off-by and rebase this on the latest master you'll be able to get a clean CI run.

Only stop the search when the upward search cannot find a matching
block pointer in the requested direction from the current offset.
That definitely means that the search is exhausted.

But if the downward search cannot find a suitable block at the
requested level, then cycle back to the upward search for farther
offsets.

This also accounts for a special case, introduced with the large dnode
support, in the search for an unallocated dnode.  That search is a
level-0 search in a meta-dnode.  It can fail without trying a
subsequent L0 block if the current L0 block is full and at least one
dnode occupies more than one slot.  So, from the fill count perspective
the block appears to have some free space, but actually it does not have
any.

Also, do not ascend needlessly beyond the maximum level that covers all
possible offsets.  Going above that level is bound to be fruitless.
This change reworks a change to dnode_next_offset_level made in
commit 031d7c2.  After this change the span can never be too
large for a shift.

The described situation can happen with a (meta-)dnode like this:

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K   575K     512   608K    3.78  DMU dnode

That is, dn_datablkshift = 14, dn_indblkshift = 17, dn_nlevels = 6
and 64-bit offsets.  Level 5 already covers all possible disk offsets.
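
To make the level arithmetic concrete, here is a worked example in plain C (not ZFS source) using the span formula from dnode_next_offset_level and the shifts from the zdb line above; level 5 is the first whose block-pointer span exceeds 64 bits, so ascending above it gains nothing:

```c
/*
 * Worked example: span = (lvl - 1) * epbs + datablkshift, with
 * dn_datablkshift = 14 and epbs = dn_indblkshift - SPA_BLKPTRSHIFT
 * = 17 - 3 = 14, as in the dnode shown above.
 */
#include <stdio.h>

int main(void) {
	const int datablkshift = 14, epbs = 14;

	for (int lvl = 1; lvl <= 6; lvl++) {
		int span = (lvl - 1) * epbs + datablkshift;
		printf("level %d: one blkptr spans 2^%d bytes%s\n",
		    lvl, span,
		    span >= 64 ? " (covers any 64-bit offset)" : "");
	}
	return 0;
}
```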

In addition, explicitly handle too-far offsets when working at the top
level.  For lower levels we try to hold a dbuf at the correct level and
offset.  If the offset is too far, then the hold fails and we handle
that accordingly.  But at the top level we never checked that the
requested offset is within the range that can be covered by the dnode's
block pointers.

Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
@avg-I force-pushed the 11196-dnode_next_offset branch from 3a6eca8 to 7cb437a on June 24, 2021 20:41
@avg-I commented Jun 25, 2021 (Contributor, Author)

@behlendorf, thank you for the advice!
Just as you said.

@behlendorf left a comment (Contributor)

Very subtle indeed. Are there any test cases we should consider adding to make sure we're exercising these corner cases? Unfortunately, it seems like it would be fairly difficult to set up some of these scenarios.

@behlendorf added the "Status: Code Review Needed (Ready for review and testing)" label and removed the "Status: Work in Progress (Not yet ready for general review)" label on Jan 21, 2022
@behlendorf left a comment (Contributor)

@avg-I your analysis of this bug looks right to me, as does the proposed fix. But given how subtle this whole search function is, it would be nice if we could really verify it's right by adding some kind of correctness test for dnode_next_offset(). It seems like it'd be fairly straightforward to test the txg == 0 case with something like a SEEK_HOLE test case (which is something). But you're right, exercising the txg != 0 case is a lot trickier.
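
For the txg == 0 side, a minimal SEEK_HOLE exercise might look like this (a userland C sketch of a hypothetical test, not an existing ZTS test; the mountpoint and the 128K recordsize are assumptions):

```c
/*
 * Hypothetical SEEK_HOLE check.  Assumes a dataset mounted at
 * /testpool/fs with the default 128K recordsize.
 */
#define _GNU_SOURCE	/* SEEK_HOLE on Linux/glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
	static char buf[128 * 1024];	/* one full record */
	int fd = open("/testpool/fs/seek_hole_test",
	    O_CREAT | O_RDWR | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Write record 0 and record 2, leaving record 1 as a hole. */
	pwrite(fd, buf, sizeof (buf), 0);
	pwrite(fd, buf, sizeof (buf), 2 * (off_t)sizeof (buf));
	fsync(fd);	/* get the layout settled before seeking */

	/* lseek(SEEK_HOLE) exercises the txg == 0 search path. */
	off_t hole = lseek(fd, 0, SEEK_HOLE);
	printf("first hole at %lld, expected %lld\n",
	    (long long)hole, (long long)sizeof (buf));
	close(fd);
	return 0;
}
```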

```c
	if ((flags & DNODE_FIND_BACKWARDS) != 0) {
		int epbs, span;

		epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
```
nit: epbs is already declared and set correctly at the top of the function.

```c
		int epbs, span;

		epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
		span = (lvl - 1) * epbs + dn->dn_datablkshift;
```
Suggested change:
```diff
-		span = (lvl - 1) * epbs + dn->dn_datablkshift;
+		int span = (lvl - 1) * epbs + dn->dn_datablkshift;
```

Labels
Status: Code Review Needed (Ready for review and testing)

Development
Successfully merging this pull request may close issue #11196: subtle bug in dnode_next_offset() with txg > 0.

3 participants