
Illumos #4101, #4102, #4103, #4105, #4106, #4730, #4756 #2488

Closed
wants to merge 4 commits into from

Conversation

prakashsurya
Member

Please see individual commits. This stack pulls in a number of patches, and aligns us more closely with the illumos code.

EDIT: Compiling and testing this set of patches requires an SPL change as well: openzfs/spl#363

@ryao
Contributor

ryao commented Jul 11, 2014

@prakashsurya attribution on the 4756 patch is incorrect.

@prakashsurya
Member Author

Ah, yes. Shows me as the author; that should be George Wilson, right?

@ryao
Contributor

ryao commented Jul 11, 2014

That is correct. It would be nice if you would correct the authorship dates as well, although that is not strictly necessary. I know that I do not always remember to do that.

@prakashsurya
Member Author

OK. I think I got the author and date attributed correctly now, but please double check to be sure.

@prakashsurya
Member Author

I'm getting a build failure on Debian 6:

/tmp/zfs-build--XeMnBdpX/BUILD/zfs-kmod-0.6.3/_kmod_build_2.6.32-5-amd64/module/zfs/../../../zfs-0.6.3/module/zfs/zfeature.c: In function 'spa_feature_get_refcount':
/tmp/zfs-build--XeMnBdpX/BUILD/zfs-kmod-0.6.3/_kmod_build_2.6.32-5-amd64/module/zfs/../../../zfs-0.6.3/module/zfs/zfeature.c:400: error: 'refcount' may be used uninitialized in this function

I'll update with a fix for this soon.

@prakashsurya
Member Author

The automated testing also triggered this:

[ 2070.324307] SPLError: 22718:0:(kmem.h:91:sanitize_flags()) FATAL allocation for task txg_sync (22718) which used GFP flags 0x290778bc with PF_NOFS set
[ 2070.341032] SPLError: 22718:0:(kmem.h:91:sanitize_flags()) SPL PANIC
[ 2070.349533] SPL: Showing stack for process 22718
[ 2070.349538] Pid: 22718, comm: txg_sync Tainted: P           2.6.32-61-server #124-Ubuntu
[ 2070.349541] Call Trace:
[ 2070.349562]  [<ffffffffa03b1b07>] spl_debug_dumpstack+0x27/0x40 [spl]
[ 2070.349570]  [<ffffffffa03b30f1>] spl_debug_bug+0x81/0xd0 [spl]
[ 2070.349579]  [<ffffffffa03babd8>] kmem_alloc_debug+0x4a8/0x530 [spl]
[ 2070.349638]  [<ffffffffa0537f43>] space_reftree_add_node+0x43/0x70 [zfs]
[ 2070.349675]  [<ffffffffa0537f9a>] space_reftree_add_seg+0x2a/0x50 [zfs]
[ 2070.349711]  [<ffffffffa053ed2c>] vdev_dtl_reassess+0x1cc/0x860 [zfs]
[ 2070.349748]  [<ffffffffa052e73c>] ? spa_history_log_sync+0x22c/0x710 [zfs]
[ 2070.349786]  [<ffffffffa0537c90>] ? space_reftree_compare+0x0/0x40 [zfs]
[ 2070.349825]  [<ffffffffa053ebdd>] vdev_dtl_reassess+0x7d/0x860 [zfs]
[ 2070.349889]  [<ffffffffa053266d>] ? spa_config_held+0xcd/0xf0 [zfs]
[ 2070.349928]  [<ffffffffa053ebdd>] vdev_dtl_reassess+0x7d/0x860 [zfs]
[ 2070.349984]  [<ffffffffa052f1eb>] ? spa_history_log_internal+0x8b/0x170 [zfs]
[ 2070.350022]  [<ffffffffa04e2d22>] ? dnode_free+0x1d2/0x370 [zfs]
[ 2070.350115]  [<ffffffffa04c8c74>] ? dmu_object_free+0x84/0x140 [zfs]
[ 2070.350154]  [<ffffffffa0501f2e>] dsl_scan_done+0x1be/0x2c0 [zfs]
[ 2070.350198]  [<ffffffffa055e3b1>] ? spa_feature_is_active+0x61/0xe0 [zfs]
[ 2070.350237]  [<ffffffffa050602a>] dsl_scan_sync+0x6fa/0x1000 [zfs]
[ 2070.350279]  [<ffffffffa05a2b9d>] ? zio_wait+0x24d/0x4c0 [zfs]
[ 2070.350319]  [<ffffffffa051beec>] spa_sync+0x4dc/0xd50 [zfs]
[ 2070.350332]  [<ffffffff810399cf>] ? pvclock_clocksource_read+0x4f/0xb0
[ 2070.350377]  [<ffffffffa053a65c>] txg_sync_thread+0x39c/0x700 [zfs]
[ 2070.350385]  [<ffffffff8105ec35>] ? set_user_nice+0xe5/0x160
[ 2070.350428]  [<ffffffffa053a2c0>] ? txg_sync_thread+0x0/0x700 [zfs]
[ 2070.350441]  [<ffffffffa03bc721>] thread_generic_wrapper+0x71/0xd0 [spl]
[ 2070.350453]  [<ffffffffa03bc6b0>] ? thread_generic_wrapper+0x0/0xd0 [spl]
[ 2070.350460]  [<ffffffff810863f6>] kthread+0x96/0xa0
[ 2070.350466]  [<ffffffff810141aa>] child_rip+0xa/0x20
[ 2070.350471]  [<ffffffff81086360>] ? kthread+0x0/0xa0
[ 2070.350477]  [<ffffffff810141a0>] ? child_rip+0x0/0x20

So I think we'll need to change the kmem_alloc call in space_reftree_add_node to use KM_PUSHPAGE.

@dweeezil
Contributor

@prakashsurya Yes, that appears to be correct. I fully expected a few of these to come to light during testing. I do usually scan these large ports for kmem_alloc() and taskq_dispatch() calls which might need updating but clearly missed this one.

@prakashsurya
Member Author

With that KM_PUSHPAGE fix in place, it's passed the usual regression tests. I think I'm going to put this on our Lustre testing resource next and see how it holds up there.

@prakashsurya
Member Author

I spoke too soon; it failed the xfstests regression suite on Ubuntu 10.04.3 amd64. Here's what the dmesg output looks like: http://bpaste.net/raw/469805/

@prakashsurya
Member Author

And I've just hit this ztest failure using the regression suite:

5 vdevs, 7 datasets, 23 threads, 600 seconds...
Pass   1,  SIGKILL,   0 ENOSPC,  7.1% of  238M used,   3% done,    9m43s to go
Pass   2,  SIGKILL,   0 ENOSPC, 15.3% of  238M used,  13% done,    8m44s to go
Pass   3,  SIGKILL,   0 ENOSPC, 19.7% of  238M used,  22% done,    7m50s to go
Pass   4, Complete,   0 ENOSPC, 12.9% of  476M used,  37% done,    6m18s to go
Pass   5,  SIGKILL,   0 ENOSPC, 12.9% of  508M used,  48% done,    5m14s to go
Pass   6,  SIGKILL,   0 ENOSPC, 13.3% of  508M used,  54% done,    4m38s to go
ztest: ../../module/zfs/dbuf.c:1400: Assertion `refcount_count(&db->db_holds) == db->db_dirtycnt (0x2 == 0x1)' failed.

@behlendorf
Contributor

The ztest failure is a long standing unrelated issue which we still need to run to ground. The xfstests failure on the other hand I've never seen before...

Prakash Surya and others added 4 commits July 18, 2014 13:09
This change moves the call to metaslab_group_alloc_update() into the
metaslab_sync_reassess() function. The original placement of the call
within metaslab_sync_done() appears to have been a simple mistake,
introduced by ac72fac.

This aligns us more closely to the upstream illumos code base.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.

This change modifies the space_map to store additional information
about its contiguous free space, so that we can use this information
to make a better decision about which space_map to load. This requires
reallocating all space_map objects to increase their bonus buffer
sizes enough to fit the new metadata.

The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram

In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is fixed at 4K, which has
certain implications including the following:

    * 4K sector devices will not see any compression benefit
    * large space_maps require more metadata on-disk
    * large space_maps require more time to load (typically random reads)

Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.

A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.

References:
  https://www.illumos.org/issues/4101
  https://www.illumos.org/issues/4102
  https://www.illumos.org/issues/4103
  https://www.illumos.org/issues/4105
  https://www.illumos.org/issues/4106
  illumos/illumos-gate@0713e23

Porting notes:

A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()

Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4730
  illumos/illumos-gate@be08211

Porting notes:

Under ZFS on Linux, one of the effects of not destroying the taskq is
that zdb would never exit (due to the SPL taskq implementation).
zdb would never exit (due to the SPL taskq implementation).

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
4756 metaslab_group_preload() could deadlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

The metaslab_group_preload() function grabs the mg_lock and then later
tries to grab the metaslab lock. This lock ordering may lead to a
deadlock since other consumers of the mg_lock will grab the metaslab
lock first.

References:
  https://www.illumos.org/issues/4756
  illumos/illumos-gate@30beaff

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
@prakashsurya
Member Author

OK. So I've rebased this stack of patches on latest master, and made a few small fixes to address the KM_PUSHPAGE issue and the Debian build failure. It's made a single pass through the regression suite with no failures, but that previous xfstests failure needs to be looked into before I'd feel safe merging. I need to verify whether or not that failure is due to this patch.

behlendorf added a commit to behlendorf/zfs that referenced this pull request Jul 21, 2014
As part of the write throttle & i/o scheduler performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2488
@behlendorf
Contributor

@prakashsurya The xfstests failure was hit at least once again in your weekend testing. However, after taking a look at it I don't believe it was introduced by this patch. Perhaps just aggravated.

It appears that the new write throttle code is starving zfs_trunc() leading to sluggish performance. This can happen because the dmu_tx_assign() in zfs_trunc() wasn't changed to use TXG_WAIT which would ensure that it cannot be starved and will be part of the next txg. This hunk was part of the original write throttle patch but appears to have been missed in the port.

I've proposed a fix in #2518.

@behlendorf behlendorf added this to the 0.6.4 milestone Jul 21, 2014
@behlendorf behlendorf added the Bug label Jul 21, 2014
@prakashsurya
Member Author

@behlendorf OK, cool. Thanks for digging into that issue. I'll look into putting this on one of our Lustre test filesystems since it's passed all of the regression tests so far.

behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 22, 2014
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()

Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4730
  illumos/illumos-gate@be08211

Porting notes:

Under ZFS on Linux, one of the effects of not destroying the taskq is
that zdb would never exit (due to the SPL taskq implementation).
zdb would never exit (due to the SPL taskq implementation).

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#2488
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 22, 2014
4756 metaslab_group_preload() could deadlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

The metaslab_group_preload() function grabs the mg_lock and then later
tries to grab the metaslab lock. This lock ordering may lead to a
deadlock since other consumers of the mg_lock will grab the metaslab
lock first.

References:
  https://www.illumos.org/issues/4756
  illumos/illumos-gate@30beaff

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#2488
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jul 22, 2014
As part of the write throttle & i/o scheduler performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes openzfs#2488
@prakashsurya prakashsurya deleted the illumos-4101 branch July 24, 2014 18:31
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jun 9, 2015
As part of the write throttle & i/o scheduler performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes openzfs#2488