
Fix race in dnode_check_slots_free() #7388

Merged 1 commit into openzfs:master from dnode_destroy_fix on Apr 10, 2018

Conversation

@tcaputi (Contributor) commented Apr 4, 2018

Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the user accounting code
is run, during which the type is set to DMU_OT_NONE but the dnode
is not yet evictable. This patch adds a check for whether
dn_dirty_link is active to determine if we are in this state.

This patch also corrects several instances where dn_dirty_link was
treated as a list_node_t when it is actually a multilist_node_t.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
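
For illustration, here is a minimal sketch of the check the description refers to. The helper name is hypothetical and ZFS internal headers are assumed; the actual patch wires this logic into dnode_check_slots_free().

	/*
	 * Sketch only: a dnode slot is treated as reclaimable if its type
	 * is DMU_OT_NONE and none of its per-txg dirty links are still
	 * active, which covers the window between dnode_free_sync() and
	 * the user accounting code running.
	 */
	static boolean_t
	dnode_slot_evictable(dnode_t *dn)
	{
		boolean_t evictable;

		mutex_enter(&dn->dn_mtx);
		evictable = (dn->dn_type == DMU_OT_NONE);
		mutex_exit(&dn->dn_mtx);

		for (int i = 0; i < TXG_SIZE && evictable; i++) {
			if (multilist_link_active(&dn->dn_dirty_link[i]))
				evictable = B_FALSE;
		}

		return (evictable);
	}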

How Has This Been Tested?

We are currently running a compile workload which usually triggers the problem within an hour. We will confirm that the problem does not occur overnight.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@nivedita76 (Contributor) commented Apr 5, 2018

This patch seems a little broken; I'm trying to fix it up. Not sure how it worked for you? There is only one os_synced_dnodes multilist, not one per TXG. I get an oops with the patch.

for (int i = 0; i < TXG_SIZE; i++) {
	boolean_t link_active;
	multilist_sublist_t *mls;
	multilist_t *synced_list = &dn->dn_objset->os_synced_dnodes[i];

dn->dn_objset->os_synced_dnodes isn't a per-TXG array and may be NULL.

@nivedita76 (Contributor)

This change seems to fix it up, though it's a bit ugly with the list_d2l.

diff --git a/usr/src/git/zfs/module/zfs/dnode.c b/dnode.c
index 962027231e8f..51adc49ee68c 100644
--- a/usr/src/git/zfs/module/zfs/dnode.c
+++ b/dnode.c
@@ -1085,6 +1085,7 @@ static boolean_t
 dnode_check_evictable(dnode_t *dn)
 {
        dmu_object_type_t type;
+       multilist_t *synced_list;
 
        mutex_enter(&dn->dn_mtx);
        type = dn->dn_type;
@@ -1099,13 +1100,9 @@ dnode_check_evictable(dnode_t *dn)
         * We check for this by determining if the dnode has any
         * active links into os_synced_dnodes.
         */
-       for (int i = 0; i < TXG_SIZE; i++) {
-               boolean_t link_active;
-               multilist_sublist_t *mls;
-               multilist_t *synced_list = &dn->dn_objset->os_synced_dnodes[i];
-
-               mls = multilist_sublist_lock_obj(synced_list, dn);
-               link_active = multilist_link_active(&dn->dn_dirty_link[i]);
+       if((synced_list = dn->dn_objset->os_synced_dnodes) != NULL) {
+               multilist_sublist_t *mls = multilist_sublist_lock_obj(synced_list, dn);
+               boolean_t link_active = list_link_active(list_d2l(&mls->mls_list, dn));
                multilist_sublist_unlock(mls);
 
                if (link_active)

@nivedita76 (Contributor)

Nope, not quite.

[ 817.941427] divide error: 0000 [#1] SMP PTI
[ 817.941501] Modules linked in: macvtap macvlan tap bridge devlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables nvidia_uvm(OE) 8021q garp mrp stp llc dm_mod crct10dif_pclmul crc32_pclmul snd_hda_codec_hdmi ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd vboxpci(OE) nvidia_drm(OE) intel_cstate nvidia_modeset(OE) vboxnetadp(OE) vboxnetflt(OE) intel_uncore snd_usb_audio nvidia(OE) vboxdrv(OE) intel_rapl_perf snd_usbmidi_lib snd_rawmidi snd_seq_device snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd soundcore
[ 817.943248] CPU: 70 PID: 81117 Comm: recordmcount Tainted: G W OE 4.16.0-rani-zfs+ #9
[ 817.943307] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D8 WS/Z10PE-D8 WS, BIOS 3407 03/10/2017
[ 817.943366] RIP: 0010:dnode_multilist_index_func+0x9c/0xb0
[ 817.943443] RSP: 0018:ffffab6ff3d3b828 EFLAGS: 00010246
[ 817.943543] RAX: 995896532df34bdb RBX: ffff961f5e3d5000 RCX: 0000000000000000
[ 817.943586] RDX: 0000000000000000 RSI: c26ea0dddc4b959a RDI: ffff961f5e3d5000
[ 817.943629] RBP: ffff961f5b2df878 R08: 0000000000000000 R09: 0000000000000001
[ 817.943677] R10: 0000000000000000 R11: 0000000000000000 R12: ffff961f5b2df7d8
[ 817.943747] R13: 0000000000000000 R14: ffff961f6f903c68 R15: ffff961f5b2df638
[ 817.943812] FS: 00007f54f4ce1500(0000) GS:ffff9627fd200000(0000) knlGS:0000000000000000
[ 817.943857] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 817.943924] CR2: 0000562c26d19c20 CR3: 000000153113c004 CR4: 00000000003626e0
[ 817.943989] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 817.944067] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 817.944137] Call Trace:
[ 817.944198] multilist_sublist_lock_obj+0x12/0x20
[ 817.944260] dnode_check_slots_free+0xfb/0x1b0
[ 817.944335] dnode_hold_impl+0x5d5/0xf30
[ 817.944482] dmu_object_alloc_dnsize+0x241/0x520
[ 817.944655] zfs_mknode+0x130/0xf70
[ 817.944946] ? _raw_spin_unlock+0x24/0x30
[ 817.945022] ? dmu_tx_assign+0x230/0x560
[ 817.945178] zfs_create+0x6b9/0x9e0
[ 817.945389] zpl_create+0xaa/0x160
[ 817.945521] lookup_open+0x5da/0x780
[ 817.945868] path_openat+0x32c/0xd10
[ 817.945941] ? filemap_map_pages+0x1de/0x380
[ 817.946057] do_filp_open+0xb4/0x130
[ 817.946381] ? do_sys_open+0x1c7/0x250
[ 817.946422] do_sys_open+0x1c7/0x250
[ 817.946508] do_syscall_64+0x66/0x120
[ 817.946568] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

@nivedita76 (Contributor)

Is it possible to do this by checking the dnode's refcount? There should be a hold on it until the userquota updates task is done with it.

@tcaputi (Contributor, author) commented Apr 5, 2018

Unfortunately, it is not possible to do this with refcounts. In both the cases that cause the crash and those that don't, the refcount is 1. I think I'm just missing a null check. The test crashed for me after a couple of hours as well. I'll see about fixing it tonight.

@nivedita76 (Contributor)

My suggested diff added the null check, but then I got a divide by zero after some time, which must mean I managed to race with multilist_destroy somehow?

@nivedita76 (Contributor)

I've added an rwlock to protect os_synced_dnodes and used it both where the list gets destroyed and in the evictable function. So far so good.

There should be an easier way of figuring out whether the dnode is involved in a user quota update, though; maybe just add another flag for it?

Also, would it not be better to split out the links for synced dnodes from dirty dnodes? The code is quite confusing, and the only advantage is saving two pointers. There are also a couple of other places that use an active dn_dirty_link to check whether something is on the dirty list, and it's unclear whether they can run concurrently with user quota updates and, if so, whether they expect to consider the synced list as dirty.

@tcaputi tcaputi force-pushed the dnode_destroy_fix branch 2 times, most recently from ad6d51f to da380d0 Compare April 6, 2018 20:07
@tcaputi (Contributor, author) commented Apr 6, 2018

@nivedita76 I think the latest push looks a lot like what you were suggesting. We have to add 64 bits to dnode_t to make it work, but Brian found some unnecessary bloat in there that we can remove in a later patch to make up for it.

@nivedita76 (Contributor)

[ 60.446136] perf: interrupt took too long (17218 > 7155), lowering kernel.perf_event_max_sample_rate to 11600
[ 490.061685] general protection fault: 0000 [#1] SMP PTI
[ 490.061708] Modules linked in: macvtap macvlan tap bridge devlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables nvidia_uvm(OE) 8021q garp mrp stp llc dm_mod snd_hda_codec_hdmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd vboxpci(OE) intel_cstate snd_usb_audio vboxnetadp(OE) vboxnetflt(OE) snd_usbmidi_lib intel_uncore snd_rawmidi vboxdrv(OE) snd_seq_device nvidia_drm(OE) nvidia_modeset(OE) intel_rapl_perf snd_hda_codec_realtek nvidia(OE) snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd soundcore
[ 490.061965] CPU: 48 PID: 1252 Comm: dp_sync_taskq Tainted: G W OE 4.16.0-rani-zfs+ #14
[ 490.061976] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D8 WS/Z10PE-D8 WS, BIOS 3407 03/10/2017
[ 490.061989] RIP: 0010:dbuf_sync_list+0x58/0x110
[ 490.062000] RSP: 0018:ffffbc418c76fc00 EFLAGS: 00010286
[ 490.062014] RAX: dead000000000100 RBX: 0000000000000000 RCX: 0000000000000000
[ 490.062026] RDX: ffffa1473a892600 RSI: 0000000000000000 RDI: ffffa1475c87c278
[ 490.062034] RBP: ffffa1475c87c278 R08: 0000000000000000 R09: 0000000000000001
[ 490.062044] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa1475c87c288
[ 490.062055] R13: dead000000000200 R14: dead000000000100 R15: dead000000000100
[ 490.062068] FS: 0000000000000000(0000) GS:ffffa1477da00000(0000) knlGS:0000000000000000
[ 490.062077] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 490.062089] CR2: 0000000801003598 CR3: 000000086a812003 CR4: 00000000003626e0
[ 490.062098] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 490.062105] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 490.062112] Call Trace:
[ 490.062132] dnode_sync+0x63a/0x1110
[ 490.062159] ? __mutex_unlock_slowpath+0x4b/0x2b0
[ 490.062197] sync_dnodes_task+0x7c/0xb0
[ 490.062211] taskq_thread+0x327/0x680
[ 490.062229] ? wake_up_q+0x80/0x80
[ 490.062288] ? taskq_thread_spawn+0x60/0x60
[ 490.062296] kthread+0x116/0x130
[ 490.062308] ? kthread_create_worker_on_cpu+0x70/0x70
[ 490.062324] ret_from_fork+0x35/0x40
[ 490.062354] Code: 48 83 ec 10 48 89 14 24 48 89 44 24 08 48 8b 45 10 49 39 c4 0f 84 aa 00 00 00 48 8b 5d 08 4c 8b 7d 10 49 29 df 0f 84 99 00 00 00 <49> 83 7f 18 00 0f 85 8e 00 00 00 49 8b 57 20 48 8b 42 40 48 83
[ 490.062830] RIP: dbuf_sync_list+0x58/0x110 RSP: ffffbc418c76fc00

@tcaputi tcaputi force-pushed the dnode_destroy_fix branch from da380d0 to 8ae0656 Compare April 6, 2018 23:06
@tcaputi (Contributor, author) commented Apr 6, 2018

@nivedita76 Sorry about that; it looks like I got the boolean logic backwards. I just pushed the corrected logic. Please try it again when you get the chance, and I will watch buildbot to see the results.

@@ -332,6 +332,7 @@ struct dnode {
kcondvar_t dn_notxholds;
enum dnode_dirtycontext dn_dirtyctx;
uint8_t *dn_dirtyctx_firstset; /* dbg: contents meaningless */
uint64_t dn_dirty_txg; /* txg dnode was last dirtied */

nit: I'd move this up under dn_assigned_txg.

can_evict = (dn->dn_type == DMU_OT_NONE && !DNODE_IS_DIRTY(dn));
mutex_exit(&dn->dn_mtx);
return (can_evict);
}

I'm not convinced it's worth adding a function for this. How about this change in dnode_check_slots_free()?

-			dmu_object_type_t type = dn->dn_type;
+			boolean_t free = (dn->dn_type == DMU_OT_NONE && !DNODE_IS_DIRTY(dn));
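
For reference, a sketch of how such a DNODE_IS_DIRTY() macro can be built from the new dn_dirty_txg field (the exact definition lives in dnode.h; this is an approximation):

	/*
	 * Sketch: the dnode still has sync work (including the userquota
	 * updates) outstanding if it was last dirtied in a txg that has
	 * not finished syncing yet.
	 */
	#define	DNODE_IS_DIRTY(_dn)	\
		((_dn)->dn_dirty_txg >= spa_syncing_txg((_dn)->dn_objset->os_spa))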

@nivedita76 (Contributor)

Running OK so far.

@tcaputi tcaputi force-pushed the dnode_destroy_fix branch from 8ae0656 to 5d4f3e5 Compare April 7, 2018 07:08
@tcaputi (Contributor, author) commented Apr 7, 2018

@behlendorf Fixed your comments and (hopefully) the bug from testing (dn_dirty_txg wasn't initialized to 0 upon construction).

@nivedita76 (Contributor)

Should it be set back to 0 in dnode_destroy() (and asserted in dnode_dest())? A sketch of that lifecycle handling follows below.

There's also some code in dnode_move() where you'd want it, though I guess that is not used on Linux?
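
A sketch of the suggested lifecycle handling, assuming the usual constructor/destructor pairing in dnode.c (the exact placement is an assumption):

	/* dnode_cons(): start clean */
		dn->dn_dirty_txg = 0;

	/* dnode_destroy(): reset before the dnode goes back to the cache */
		dn->dn_dirty_txg = 0;

	/* dnode_dest(): verify nothing leaked through */
		ASSERT0(dn->dn_dirty_txg);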

@tcaputi tcaputi force-pushed the dnode_destroy_fix branch from 5d4f3e5 to 61b60b0 Compare April 8, 2018 07:04
@tcaputi (Contributor, author) commented Apr 8, 2018

@nivedita76 Addressed your comments (and a few other places) and repushed.

@tcaputi tcaputi force-pushed the dnode_destroy_fix branch from 61b60b0 to d8083b1 Compare April 9, 2018 07:30
@behlendorf (Contributor) left a comment

This looks good, thanks @tcaputi. @nivedita76, if possible, could you please verify that the latest version of the PR fully resolves the race you were seeing?

Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the user accounting code
is run, during which the type is set to DMU_OT_NONE but the dnode
is not yet evictable, leading to crashes. This patch adds the
ability for dnodes to track which txg they were last dirtied in
and adds a check for this before performing the reclaim.

This patch also corrects several instances where dn_dirty_link was
treated as a list_node_t when it is actually a multilist_node_t.

Fixes: openzfs#7147

Signed-off-by: Tom Caputi <tcaputi@datto.com>
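
Assuming the txg is recorded wherever the dnode is dirtied (e.g. in dbuf_dirty() or dnode_setdirty(); the exact call site is an assumption here), the tracking side of the final approach is roughly:

	/*
	 * Sketch: remember the most recent txg the dnode was dirtied in.
	 * MAX() keeps the value monotonic if the dnode is dirtied in
	 * several open txgs.
	 */
		dn->dn_dirty_txg = MAX(dn->dn_dirty_txg, tx->tx_txg);

	/* dnode_check_slots_free() then refuses slots that are still dirty: */
		boolean_t can_free = (dn->dn_type == DMU_OT_NONE &&
		    !DNODE_IS_DIRTY(dn));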
@tcaputi tcaputi force-pushed the dnode_destroy_fix branch from d8083b1 to 63e17c4 Compare April 9, 2018 22:31
@nivedita76 (Contributor)

Yes, the list_del errors are gone.
I'm still getting a sporadic hang, but that looks like a different issue; still trying to get a good log of that.

@codecov (bot) commented Apr 10, 2018

Codecov Report

Merging #7388 into master will increase coverage by 0.07%.
The diff coverage is 95.23%.


@@            Coverage Diff             @@
##           master    #7388      +/-   ##
==========================================
+ Coverage   76.29%   76.36%   +0.07%     
==========================================
  Files         330      330              
  Lines      104235   104292      +57     
==========================================
+ Hits        79529    79647     +118     
+ Misses      24706    24645      -61
Flag      Coverage Δ
#kernel   76.41% <95.23%> (ø) ⬆️
#user     65.48% <85.71%> (-0.07%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4f30166...63e17c4.

@behlendorf behlendorf merged commit edc1e71 into openzfs:master Apr 10, 2018
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 10, 2018
tonyhutter pushed a commit to LLNL/zfs that referenced this pull request Oct 10, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Oct 31, 2018
tonyhutter pushed a commit that referenced this pull request Nov 13, 2018