
INFO: possible recursive locking detected, arc_reclaim #3701

Closed
kernelOfTruth opened this issue Aug 23, 2015 · 12 comments

@kernelOfTruth
Contributor

[   72.496518] =============================================
[   72.496518] [ INFO: possible recursive locking detected ]
[   72.496519] 4.1.6_dtop-VIII.4+BFS-VRQ-v0.5 #1 Tainted: G           O   
[   72.496519] ---------------------------------------------
[   72.496520] arc_reclaim/28092 is trying to acquire lock:
[   72.496545]  (&mls->mls_lock){+.+...}, at: [<ffffffffc054f551>] multilist_sublist_lock+0x21/0x40 [zfs]
[   72.496546] 
               but task is already holding lock:
[   72.496562]  (&mls->mls_lock){+.+...}, at: [<ffffffffc054f551>] multilist_sublist_lock+0x21/0x40 [zfs]
[   72.496563] 
               other info that might help us debug this:
[   72.496563]  Possible unsafe locking scenario:

[   72.496564]        CPU0
[   72.496564]        ----
[   72.496565]   lock(&mls->mls_lock);
[   72.496566]   lock(&mls->mls_lock);
[   72.496566] 
                *** DEADLOCK ***

[   72.496567]  May be due to missing lock nesting notation

[   72.496567] 1 lock held by arc_reclaim/28092:
[   72.496582]  #0:  (&mls->mls_lock){+.+...}, at: [<ffffffffc054f551>] multilist_sublist_lock+0x21/0x40 [zfs]
[   72.496583] 
               stack backtrace:
[   72.496585] CPU: 2 PID: 28092 Comm: arc_reclaim Tainted: G           O    4.1.6_dtop-VIII.4+BFS-VRQ-v0.5 #1
[   72.496586] Hardware name: ASUS All Series/P9D WS, BIOS 2104 11/26/2014
[   72.496588]  ffffffff832c1450 00000000b8f86fd6 ffff8807d506fb48 ffffffff81e73e4f
[   72.496590]  0000000000000000 ffffffff832c1450 ffff8807d506fbd8 ffffffff81134e43
[   72.496592]  ffff8807d506fba8 00000000000b2db0 ffff8807000005c5 ffff8807d70bcd08
[   72.496592] Call Trace:
[   72.496596]  [<ffffffff81e73e4f>] dump_stack+0x4c/0x6e
[   72.496599]  [<ffffffff81134e43>] __lock_acquire+0x1633/0x17a0
[   72.496600]  [<ffffffff81135876>] lock_acquire+0xd6/0x2c0
[   72.496613]  [<ffffffffc054f551>] ? multilist_sublist_lock+0x21/0x40 [zfs]
[   72.496616]  [<ffffffff81e7c1ec>] mutex_lock_nested+0x5c/0x5a0
[   72.496628]  [<ffffffffc054f551>] ? multilist_sublist_lock+0x21/0x40 [zfs]
[   72.496640]  [<ffffffffc054f551>] ? multilist_sublist_lock+0x21/0x40 [zfs]
[   72.496657]  [<ffffffffc056fd59>] ? spa_get_random+0x29/0x50 [zfs]
[   72.496660]  [<ffffffff817b13ae>] ? get_random_bytes+0x5e/0x210
[   72.496673]  [<ffffffffc054f551>] multilist_sublist_lock+0x21/0x40 [zfs]
[   72.496688]  [<ffffffffc04db70e>] arc_state_multilist_index_func+0xa5e/0xb30 [zfs]
[   72.496701]  [<ffffffffc04e1508>] arc_space_return+0x3868/0x3ea0 [zfs]
[   72.496714]  [<ffffffffc04e40a4>] arc_shrink+0x1b4/0x530 [zfs]
[   72.496725]  [<ffffffffc04e3fe0>] ? arc_shrink+0xf0/0x530 [zfs]
[   72.496728]  [<ffffffffc02439dc>] __thread_exit+0x8c/0xa0 [spl]
[   72.496730]  [<ffffffffc0243970>] ? __thread_exit+0x20/0xa0 [spl]
[   72.496732]  [<ffffffff811165a2>] kthread+0xf2/0x110
[   72.496734]  [<ffffffff8100e8f9>] ? sched_clock+0x9/0x10
[   72.496737]  [<ffffffff811164b0>] ? kthread_create_on_node+0x2f0/0x2f0
[   72.496738]  [<ffffffff81e82722>] ret_from_fork+0x42/0x70
[   72.496740]  [<ffffffff811164b0>] ? kthread_create_on_node+0x2f0/0x2f0
[   72.503109] hardirqs last  enabled at (311019): [<ffffffff81e7c5c5>] mutex_lock_nested+0x435/0x5a0
[   72.503408] hardirqs last disabled at (311018): [<ffffffff81e7c220>] mutex_lock_nested+0x90/0x5a0
[   72.503706] softirqs last  enabled at (0): [<ffffffff810eb569>] copy_process.part.7+0x579/0x20b0
[   72.504009] softirqs last disabled at (0): [<          (null)>]           (null)
[   72.504970] ZFS: Loaded module v0.6.4-1, ZFS pool version 5000, ZFS filesystem version 5
[   73.289614] SPL: using hostid 0x00000000

The kernel was loaded with threadirqs

and built with

CONFIG_SCHED_DEBUG
CONFIG_SCHEDSTATS
CONFIG_SCHED_STACK_END_CHECK
CONFIG_TIMER_STATS
CONFIG_PROVE_LOCKING
CONFIG_LOCK_STAT
CONFIG_DEBUG_LOCKDEP

and with a special variant of BFS (CPU scheduler), VRQ 0.5

kernelOfTruth changed the title from "INFO: possible recursive locking detected" to "INFO: possible recursive locking detected, arc_reclaim" on Aug 23, 2015
@dweeezil
Contributor

ZoL is not yet compatible with CONFIG_PROVE_LOCKING; nested locks need to be annotated properly. See openzfs/spl@79a0056 for a helper function which was added a while ago to help, and also https://github.com/torvalds/linux/blob/master/Documentation/locking/lockdep-design.txt#L130.
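
As an illustration of that annotation mechanism, here is a minimal sketch of how the same lock class can be held twice without triggering the report above. The struct and function names are hypothetical, not ZoL's actual code; mutex_lock_nested() and SINGLE_DEPTH_NESTING are the stock kernel facility described in lockdep-design.txt:

    /* Hypothetical sketch: two locks of the same class held at once. */
    #include <linux/mutex.h>
    #include <linux/lockdep.h>

    struct sublist {
            struct mutex sl_lock;
            /* ... per-sublist state ... */
    };

    static void sublist_move(struct sublist *src, struct sublist *dst)
    {
            mutex_lock(&src->sl_lock);
            /*
             * Same lock class as above: a plain mutex_lock() here is what
             * makes lockdep print "possible recursive locking detected".
             * The subclass annotation tells it this nesting is intentional.
             */
            mutex_lock_nested(&dst->sl_lock, SINGLE_DEPTH_NESTING);
            /* ... move entries from src to dst ... */
            mutex_unlock(&dst->sl_lock);
            mutex_unlock(&src->sl_lock);
    }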

@behlendorf
Contributor

@dweeezil this reminds me, do you have any related patches which need to be merged to ensure lock profiling works properly? As long as they're non-disruptive it would be nice to get them in the tree.

@dweeezil
Contributor

@behlendorf It's something I'm definitely interested in, but I don't use lockdep much in my current large-scale testing due to the huge overhead it creates, so this has percolated down my to-do list.

@behlendorf
Contributor

@dweeezil OK, I just suspected you might already have patches for this since I know you've been working in this area. No problem.

@kernelOfTruth
Contributor (Author)

Referencing: openzfs/spl#480 "add spin_lock_irqsave_nolockdep and mutex type MUTEX_NOLOCKDEP"
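
For reference, a rough sketch of how a lock would opt out of lockdep checking with the mutex type from that PR (the variable and function names here are made up; see the PR for the exact API). A MUTEX_NOLOCKDEP mutex behaves like MUTEX_DEFAULT but is invisible to lockdep, silencing the false positive at the cost of losing checking on that one lock:

    /* Sketch, assuming the SPL mutex_init(mp, name, type, ibc) signature. */
    #include <sys/mutex.h>

    static kmutex_t sl_lock;

    static void sl_lock_init(void)
    {
            /* Known-safe nested lock: hide it from lockdep entirely. */
            mutex_init(&sl_lock, NULL, MUTEX_NOLOCKDEP, NULL);
    }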

behlendorf added the Component: Test Suite and Bug - Minor labels on Oct 13, 2015
behlendorf added this to the 0.7.0 milestone on Oct 13, 2015
@xflou commented Jan 10, 2016

@behlendorf and @dweeezil

Sorry to hijack this thread, but I have some related general questions about addressing the ZFS/NFS issues that are very painful for those of us running ZoL on dedicated production NFS servers.

  1. How far down the list is correcting "ZFS/NFS" issues? I know there are some quick patches thrown at it. What's an approximate, realistic timeframe?
  2. Do either of you use NFS in your production and/or test environments for sharing files with your users?
  3. If not, what do you use instead to share data with your user population over the network?

I ask this because although I really like the stability and reliability of ZoL, the NFS issues are painful in a production environment, and I wanted to get a sense of how long we need to hang on for these NFS issues to be completely addressed, and also to find out how each of you handles this issue in your own production environments.

Thanks!

@behlendorf
Contributor

How far down the list is correcting "ZFS/NFS" issues? I know there are some quick patches thrown at it. What's an approximate, realistic timeframe?

Definitely near the top; this is one of the more common configurations for people using ZoL. In fact, the recently tagged 0.6.5.4 release should address the majority of the ZoL+NFS issues. What would be very helpful to us is a list of the NFS-related issues you're still seeing with this release.

https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.5.4

Do either of you use NFS in your production and/or test environments for sharing files with your users?

Absolutely; we depend on it in production 24/7. However, our specific user workloads and hardware configuration don't seem to trigger the reported issues as easily.

If not, what do you use instead to share data with your user population over the network?

Along with NFS we depend heavily on Lustre.

@xflou commented Jan 10, 2016

@behlendorf

Thanks for the quick reply. I will update to the 0.6.5.4 version and see if that resolves the issue with "arc_reclaim" using 99% CPU; previous to that we had the "arc_adapt" issue, which seems related. I was starting to lose faith and was considering the illumos commercial version (Nexenta), but am not familiar enough with it to know whether it has the same issues with NFS. Now that I know a bit more about the commitment, I will hang on. I would prefer to stay on ZoL using CentOS.

PS: Would you be able to share your current production configuration and how you use NFS?

Thanks again!

@dweeezil
Contributor

@xflou Although completely unrelated to this original issue, as I mentioned recently in another issue, certain metadata-heavy workloads can easily cause the arc_adapt thread to spin trying to free up metadata. I don't have a good handle on the exact causes yet, but any zfs send operations definitely exacerbate the situation. At this point, the only solution I'm aware of is to make sure there's enough RAM that the metadata never overshoots the limit. You might consider opening a separate issue which describes the problems you're seeing in an NFS environment.
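
For context, the metadata ceiling referred to here is tunable; a hedged example (assuming the zfs_arc_meta_limit module parameter from the 0.6.x tunables) that raises it to 16 GiB in the usual modprobe options file:

    # /etc/modprobe.d/zfs.conf -- illustrative value, not a recommendation
    options zfs zfs_arc_meta_limit=17179869184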

@xflou commented Jan 11, 2016

@dweeezil once I get 0.6.5.4 installed, I'll keep an eye on things and open a separate case. When you say "enough RAM", are you also referring to the ZFS RAM settings for arc_min and arc_max?

I have 256G of RAM on the system and during heavy loads I typically see about 126G being used.

Should my ARC settings be set higher than they currently are? Please see below:

options zfs zfs_arc_min=10737418240
options zfs zfs_arc_max=68719476736
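
(Those values decode as 10737418240 = 10 × 1024³ bytes = 10 GiB for zfs_arc_min and 68719476736 = 64 × 1024³ bytes = 64 GiB for zfs_arc_max, i.e. the ARC is capped at one quarter of the 256 GiB of RAM.)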

Thanks for the help!!

@behlendorf
Contributor

@xflou could you comment on whether the latest 0.6.5.4+ releases have improved your NFS/ZFS issues?

behlendorf modified the milestones: 0.8.0, 0.7.0 on Mar 26, 2016
@behlendorf
Contributor

Closing the originally reported issue as a duplicate of #3912.
