fix a subtle bug in dnode_next_offset() with txg > 0 #11200

Open · avg-I wants to merge 1 commit into master from 11196-dnode_next_offset
Conversation

@avg-I commented Nov 13, 2020 (Contributor)

Only stop the search when the upward search cannot find a matching
block pointer in the requested direction from the current offset.
That definitely means that the search is exhausted.

But if the downward search cannot find a suitable block at the
requested level, then cycle back to the upward search for farther
offsets.
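
In other words, the intended control flow is roughly the following (a condensed sketch, not the actual diff; dnode_next_offset_level() is the real helper, everything around it is paraphrased, and details such as offset advancement are omitted):

```c
/*
 * Condensed sketch of the intended search loop (paraphrased, not the
 * actual OpenZFS code).
 */
static int
next_offset_sketch(dnode_t *dn, int flags, uint64_t *offset,
    int minlvl, uint64_t blkfill, uint64_t txg)
{
	int lvl = minlvl;
	int error;

	for (;;) {
		error = dnode_next_offset_level(dn, flags, offset,
		    lvl, blkfill, txg);
		if (error == ESRCH) {
			if (lvl == dn->dn_nlevels)
				break;	/* upward search exhausted: done */
			lvl++;		/* cycle back to the upward search */
		} else if (error != 0) {
			break;		/* hard error */
		} else if (lvl == minlvl) {
			break;		/* found a match at the target level */
		} else {
			lvl--;		/* descend toward the target level */
		}
	}
	return (error);
}
```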

This should fix #11196

Signed-off-by: Andriy Gapon agapon@panzura.com

@avg-I commented Nov 16, 2020 (Contributor, Author)

I reproduced the test failure (zdb crash).
It seems to be another special case where dnode_next_offset_level does not change the offset while returning ESRCH. The special case is "recent", so I didn't have it on my original test system.
I'm thinking about how best to handle it.

@behlendorf added the "Status: Work in Progress (Not yet ready for general review)" label on Nov 17, 2020
@avg-I commented Dec 1, 2020 (Contributor, Author)

@ahrens, @behlendorf, I think that the latest test failures for this PR actually uncover a pre-existing bug in dnode_next_offset related to large dnodes.

The failures are an assertion that txg != 0, hit after a level=1 search finds an L0 block that should contain a hole but the subsequent search in that L0 block does not find any. I think that previously we would just return ESRCH to the caller, but my change treats that situation as impossible. I am not sure whether I was right about that.

I analyzed the problematic scenario and I see that it happens, for instance, in this call chain:
dmu_object_alloc_dnsize -> dmu_object_alloc_impl -> dmu_object_alloc_impl -> dmu_object_next -> dnode_next_offset.
dmu_object_next calls dnode_next_offset with minlvl=0, blkfill=32, txg=0 and flags=1 (find a hole).
dnode_next_offset first calls dnode_next_offset_level with lvl=0 to examine dnodes in the current L0 block of the meta-dnode. dnode_next_offset_level does not find a hole and returns ESRCH. It is then called again, now with lvl=1, and looks for an L0 block where fill <= 31. It finds such a block, for instance one with a fill of 5. So dnode_next_offset_level is called again with lvl=0 and an appropriate starting offset to find an unallocated dnode in that L0 block.
But it does not find any!
That happens because, while the block has only 5 dnodes, some of them take several slots and together they fill all 32 slots in the block.
I am not sure whether the fill of L0 meta-dnode blocks should reflect the number of allocated dnodes or the number of allocated slots.
In my tests I see the former.
So far I have reproduced the problem only once locally, and the meta-dnode was encrypted; I am not sure whether that played a role.
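
To make the arithmetic concrete, here is a toy illustration in plain C (not ZFS code) of how 5 dnodes can consume all 32 slots of an L0 meta-dnode block, so a fill of 5 advertises free space that does not exist:

```c
/*
 * Toy illustration (not ZFS code): why fill < blkfill does not imply
 * a free dnode slot when dnodes can span multiple 512-byte slots.
 */
#include <stdio.h>

int main(void) {
	/* Slot counts per dnode are made up but plausible: e.g. a 4K
	 * dnode occupies 8 slots, a 2K dnode occupies 4 slots. */
	const int slots_per_dnode[] = { 8, 8, 8, 4, 4 };
	const int fill = sizeof (slots_per_dnode) / sizeof (slots_per_dnode[0]);
	int used = 0;

	for (int i = 0; i < fill; i++)
		used += slots_per_dnode[i];

	/* fill = 5 satisfies "fill <= 31", so the level-1 search picks
	 * this block, yet all 32 slots are in use: no hole exists. */
	printf("fill = %d, slots used = %d of 32\n", fill, used);
	return 0;
}
```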

@avg-I commented Dec 1, 2020 (Contributor, Author)

Further looking through the code, I see two things.

First, yes, the fill count is the number of dnodes. That can be seen in dbuf_write_ready, where the fill is incremented once per dnode.

Second, it looks like the call chain I mentioned is the only one where the minimum level is zero and the search is for a hole. I see that dmu_object_alloc_impl has reasonable handling for a dmu_object_next error.
I am not sure whether searches for holes at higher levels can result in any incorrect or sub-optimal behavior.

I think it could have been reasonable to change the meaning of the dnode fill count from the number of dnodes to the number of slots. But I haven't even tried checking where else that fill count is used; I am sure that something depends on its current meaning.

So, for the time being, I am just going to relax the assertion and implement the backward-compatible behavior.
That is, just return ESRCH without attempting to advance the search in whatever direction is requested.
(It seems that it's always forward for the hole search.)
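
The fallback I have in mind would look something like this (a sketch of the idea only, not the literal change):

```c
/*
 * Sketch only.  If the level-0 hole search inside the L0 block chosen
 * by the level-1 search comes up empty (multi-slot dnodes can occupy
 * every slot despite fill < blkfill), report ESRCH to the caller
 * instead of asserting or advancing the search.
 */
if (error == ESRCH && lvl == minlvl && minlvl == 0 &&
    (flags & DNODE_FIND_HOLE) != 0) {
	/* dmu_object_next() callers already handle ESRCH. */
	return (SET_ERROR(ESRCH));
}
```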

@avg-I commented Dec 3, 2020 (Contributor, Author)

The latest panics on all Linux flavors (except for Ubuntu 20.04 x86_64, which looks unrelated) point to a potential problem in dnode_sync_free_range_impl().
That function is called without dn_mtx held.
In the if (trunc) block at the end, the function sets dn_maxblkid and then does some validation without acquiring either dn_mtx or dn_struct_rwlock. I am not sure what the rules for modifying dn_maxblkid are, but that code looks potentially racy.
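
If dn_mtx is indeed the right lock here, the shape of a fix might be something like this (a purely hypothetical sketch; I have not checked the lock ordering in this path):

```c
/* Hypothetical sketch only, not a verified patch. */
if (trunc) {
	mutex_enter(&dn->dn_mtx);
	/* Serialize the maxblkid update with its other writers. */
	dn->dn_phys->dn_maxblkid = (blkid == 0 ? 0 : blkid - 1);
	mutex_exit(&dn->dn_mtx);

	/* The dnode_next_offset() sanity check would then run only
	 * after the update is visible. */
}
```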

An example of the stack trace:

[13491.586253] Call Trace:
[13491.590968]  [<ffffffffa2f81400>] dump_stack+0x19/0x1b
[13491.597048]  [<ffffffffc063f72b>] spl_dumpstack+0x2b/0x30 [spl]
[13491.603540]  [<ffffffffc063f7f9>] spl_panic+0xc9/0x110 [spl]
[13491.609977]  [<ffffffffc07f4b66>] ? dbuf_rele_and_unlock+0x246/0x610 [zfs]
[13491.616972]  [<ffffffffc07eed99>] ? do_raw_spin_unlock+0x9/0x10 [zfs]
[13491.623757]  [<ffffffffc07eeda9>] ? __raw_spin_unlock+0x9/0x10 [zfs]
[13491.630549]  [<ffffffffc07eedb9>] ? spin_unlock+0x9/0x10 [zfs]
[13491.637040]  [<ffffffffc07f8f20>] ? dbuf_read+0x4c0/0x680 [zfs]
[13491.643481]  [<ffffffffc07f52ed>] ? dbuf_rele+0x4d/0x80 [zfs]
[13491.649789]  [<ffffffffc082a1b5>] ? dnode_next_offset_level+0x4b5/0x800 [zfs]
[13491.656829]  [<ffffffffc082fa28>] dnode_next_offset+0x3b8/0x3c0 [zfs]
[13491.663618]  [<ffffffffc083187d>] dnode_sync_free_range_impl+0x25d/0x3c0 [zfs]
[13491.674324]  [<ffffffffc0831a4a>] dnode_sync_free_range+0x6a/0xd0 [zfs]
[13491.681230]  [<ffffffffc08319e0>] ? dnode_sync_free_range_impl+0x3c0/0x3c0 [zfs]
[13491.691909]  [<ffffffffc088a005>] range_tree_walk+0x75/0xb0 [zfs]
[13491.698541]  [<ffffffffc08330aa>] dnode_sync+0x3fa/0xe60 [zfs]
[13491.705076]  [<ffffffffc080da81>] dmu_objset_sync_dnodes+0x51/0x240 [zfs]
[13491.712038]  [<ffffffffc080dc99>] sync_dnodes_task+0x29/0x50 [zfs]
[13491.718712]  [<ffffffffc064769e>] taskq_thread+0x2ee/0x520 [spl]
[13491.725193]  [<ffffffffa28daf40>] ? wake_up_state+0x20/0x20
[13491.731567]  [<ffffffffc06473b0>] ? taskq_thread_spawn+0x60/0x60 [spl]
[13491.738273]  [<ffffffffa28c5c21>] kthread+0xd1/0xe0
[13491.744523]  [<ffffffffa28c5b50>] ? insert_kthread_work+0x40/0x40
[13491.750958]  [<ffffffffa2f94df7>] ret_from_fork_nospec_begin+0x21/0x21
[13491.757684]  [<ffffffffa28c5b50>] ? insert_kthread_work+0x40/0x40

@avg-I commented Dec 4, 2020 (Contributor, Author)

The FreeBSD head ztest crash also looks like a race.
I have been able to reproduce it a few times locally, and in all cases the crash happens at an offset just past an L0 hole.
In all cases that L0 hole has a birth time, and the birth time is equal to the currently syncing txg.
I suspect a scenario where dnode_next_offset_level(lvl=1) finds a non-empty L0 block.
Then, after that call returns and before dnode_next_offset_level(lvl=0) is called, the last dnode in that block gets freed.
So dnode_next_offset_level(lvl=0) fails and the assert is hit.

Stack:

#0  thr_kill () at thr_kill.S:4
#1  0x00000008008ec784 in __raise (s=s@entry=6) at /usr/src/lib/libc/gen/raise.c:52
#2  0x00000008009a0199 in abort () at /usr/src/lib/libc/stdlib/abort.c:67
#3  0x000000080078d078 in libspl_assertf (file=<optimized out>, func=<optimized out>, line=<optimized out>, line@entry=2622, format=<optimized out>) at assert.c:45
#4  0x00000008003fa911 in libspl_assert (buf=0x0, file=0x297b5 <error: Cannot access memory at address 0x297b5>, func=0x6 <error: Cannot access memory at address 0x6>, line=0, line@entry=2622) at ../../lib/libspl/include/assert.h:46
#5  0x00000008004005e2 in dnode_next_offset (dn=0x801ae4000, flags=0, offset=offset@entry=0x7fffd918ae20, minlvl=minlvl@entry=0, blkfill=blkfill@entry=32, txg=txg@entry=0) at ../../module/zfs/dnode.c:2622
#6  0x00000008003e1d8f in dmu_object_next (os=os@entry=0x806faf400, objectp=objectp@entry=0x7fffd918af00, hole=hole@entry=B_FALSE, txg=txg@entry=0) at ../../module/zfs/dmu_object.c:441
#7  0x0000000000214dc8 in ztest_verify_dnode_bt (zd=0x801023eb8, id=<optimized out>) at ztest.c:5510
#8  0x000000000021dc78 in ztest_execute (test=test@entry=35, zi=zi@entry=0x221950 <ztest_info+1120>, id=id@entry=1) at ztest.c:6705
#9  0x000000000021d768 in ztest_thread (arg=<optimized out>) at ztest.c:6752
#10 0x000000080081c74b in thread_start (curthread=0x805361500) at /usr/src/lib/libthr/thread/thr_create.c:292

@avg-I commented Jun 24, 2021 (Contributor, Author)

@behlendorf, @ahrens, from your experience, how would you interpret the test results?

The checkstyle failure is because I didn't add a sign-off to the latest commit in this PR.

The two FreeBSD failures are for the same test, cli_root/zfs_receive/receive-o-x_props_override.
I don't think that the change should have impacted that test.

The CentOS Stream 8 failure is suspicious, though.

[10287.079604] VERIFY3(spa->spa_checkpoint_info.sci_dspace == ckpoint_sm_space_sum) failed (212029440 == 379577856)
[10287.082610] PANIC at spa_checkpoint.c:276:spa_checkpoint_accounting_verify()
[10287.084528] Showing stack for process 1889638
[10287.085772] CPU: 0 PID: 1889638 Comm: txg_sync Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-310.el8.x86_64 #1
[10287.088490] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[10287.089999] Call Trace:
[10287.090600]  dump_stack+0x5c/0x80
[10287.091413]  spl_panic+0xd3/0xfb [spl]
[10287.092323]  ? arch_local_irq_save+0xe/0x20 [spl]
[10287.093456]  ? spl_kmem_cache_free+0x5c/0x1b0 [spl]
[10287.094706]  ? space_map_incremental_destroy+0x425/0x530 [zfs]
[10287.096143]  ? spa_checkpoint_discard_thread_sync+0x250/0x250 [zfs]
[10287.097682]  spa_checkpoint_discard_thread_sync+0x8b/0x250 [zfs]
[10287.099151]  dsl_sync_task_sync+0xaf/0x120 [zfs]
[10287.100307]  dsl_pool_sync+0x4d6/0x690 [zfs]
[10287.101383]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[10287.102790]  spa_sync+0x2d4/0x820 [zfs]
[10287.103760]  txg_sync_thread+0x2bf/0x400 [zfs]

http://build.zfsonlinux.org/builders/CentOS%20Stream%208%20x86_64%20%28TEST%29/builds/2192/steps/shell_4/logs/console

@behlendorf commented (Contributor)
@avg-I these are all known unrelated failures. The FreeBSD receive-o-x_props_override test failures appear to have been caused by something that changed in the FreeBSD environment in the last few months. They haven't yet been seriously investigated to my knowledge, but we did add an exception for them in the master branch until they're resolved. The CentOS 8 Stream panic is also something we've seen before and is unrelated.

There's a good chance that if you add your signed-off-by and rebase this on the latest master you'll be able to get a clean CI run.

Only stop the search when the upward search cannot find a matching
block pointer in the requested direction from the current offset.
That definitely means that the search is exhausted.

But if the downward search cannot find a suitable block at the
requested level, then cycle back to the upward search for farther
offsets.

This also accounts for a special case, introduced with the large dnode
support, in the search for an unallocated dnode.  That search is a
level-0 search in a meta-dnode.  It can fail without trying a
subsequent L0 block if the current L0 block is full and at least one
dnode occupies more than one slot.  So, from the fill count perspective
the block appears to have some free space, but actually it does not have
any.

Also, do not ascend needlessly beyond the maximum level that covers all
possible offsets.  Going above that level is bound to be fruitless.
This change reworks a change to dnode_next_offset_level made in
commit 031d7c2.  After this change the span can never be too
large for a shift.

The described situation can happen with a (meta-)dnode like this:

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K   575K     512   608K    3.78  DMU dnode

That is, dn_datablkshift = 14, dn_indblkshift = 17, dn_nlevels = 6
and 64-bit offsets.  Level 5 already covers all possible disk offsets.
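
To make the level arithmetic concrete, here is a worked example in plain C (not ZFS source) using the span formula from dnode_next_offset_level and the shifts from the zdb line above; level 5 is the first whose block-pointer span exceeds 64 bits, so ascending above it gains nothing:

```c
/*
 * Worked example: span = (lvl - 1) * epbs + datablkshift, with
 * dn_datablkshift = 14 and epbs = dn_indblkshift - SPA_BLKPTRSHIFT
 * = 17 - 3 = 14, as in the dnode shown above.
 */
#include <stdio.h>

int main(void) {
	const int datablkshift = 14, epbs = 14;

	for (int lvl = 1; lvl <= 6; lvl++) {
		int span = (lvl - 1) * epbs + datablkshift;
		printf("level %d: one blkptr spans 2^%d bytes%s\n",
		    lvl, span,
		    span >= 64 ? " (covers any 64-bit offset)" : "");
	}
	return 0;
}
```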

In addition, explicitly handle too-far offsets when working at the top
level.  For lower levels we try to hold a dbuf at the correct level and
offset.  If the offset is too far, then the hold fails and we handle
that accordingly.  But at the top level we never checked that the
requested offset is within the range that can be covered by the dnode's
block pointers.

Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
@avg-I force-pushed the 11196-dnode_next_offset branch from 3a6eca8 to 7cb437a on June 24, 2021 20:41
@avg-I commented Jun 25, 2021 (Contributor, Author)

@behlendorf, thank you for the advice!
Just as you said.

@behlendorf left a comment (Contributor)

Very subtle indeed. Are there any test cases we should consider adding to make sure we're exercising these corner cases? Unfortunately, it seems like it would be fairly difficult to set up some of these scenarios.

@behlendorf added the "Status: Code Review Needed (Ready for review and testing)" label and removed the "Status: Work in Progress (Not yet ready for general review)" label on Jan 21, 2022
@behlendorf left a comment (Contributor)

@avg-I your analysis of this bug looks right to me, as does the proposed fix. But given how subtle this whole search function is, it would be nice if we could really verify it's right by adding some kind of correctness test for dnode_next_offset(). It seems like it'd be fairly straightforward to test the txg == 0 case with something like a SEEK_HOLE test case (which is something). But you're right, exercising the txg != 0 case is a lot trickier.
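
For the txg == 0 side, a minimal SEEK_HOLE exercise might look like this (a userland C sketch of a hypothetical test, not an existing ZTS test; the mountpoint and the 128K recordsize are assumptions):

```c
/*
 * Hypothetical SEEK_HOLE check.  Assumes a dataset mounted at
 * /testpool/fs with the default 128K recordsize.
 */
#define _GNU_SOURCE	/* SEEK_HOLE on Linux/glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
	static char buf[128 * 1024];	/* one full record */
	int fd = open("/testpool/fs/seek_hole_test",
	    O_CREAT | O_RDWR | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Write record 0 and record 2, leaving record 1 as a hole. */
	pwrite(fd, buf, sizeof (buf), 0);
	pwrite(fd, buf, sizeof (buf), 2 * (off_t)sizeof (buf));
	fsync(fd);	/* get the layout settled before seeking */

	/* lseek(SEEK_HOLE) exercises the txg == 0 search path. */
	off_t hole = lseek(fd, 0, SEEK_HOLE);
	printf("first hole at %lld, expected %lld\n",
	    (long long)hole, (long long)sizeof (buf));
	close(fd);
	return 0;
}
```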

```c
	if ((flags & DNODE_FIND_BACKWARDS) != 0) {
		int epbs, span;

		epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
```
nit: epbs is already declared and set correctly at the top of the function.

```c
		int epbs, span;

		epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
		span = (lvl - 1) * epbs + dn->dn_datablkshift;
```
Suggested change:
```diff
-		span = (lvl - 1) * epbs + dn->dn_datablkshift;
+		int span = (lvl - 1) * epbs + dn->dn_datablkshift;
```

Labels
Status: Code Review Needed (Ready for review and testing)

Development
Successfully merging this pull request may close issue #11196: subtle bug in dnode_next_offset() with txg > 0.

3 participants