
Illumos #4101, #4102, #4103, #4105, #4106, #4730, #4756 #2488

Closed
wants to merge 4 commits into from

Conversation

prakashsurya
Member

Please see individual commits. This stack pulls in a number of patches, and aligns us more closely with the illumos code.

EDIT: Compiling and testing this set of patches requires an SPL change as well: openzfs/spl#363

@ryao
Contributor

ryao commented Jul 11, 2014

@prakashsurya attribution on the 4756 patch is incorrect.

@prakashsurya
Member Author

Ah, yes. Shows me as the author; that should be George Wilson, right?

@ryao
Contributor

ryao commented Jul 11, 2014

That is correct. It would be nice if you would correct the authorship dates as well, although that is not strictly necessary. I know that I do not always remember to do that.

@prakashsurya
Member Author

OK. I think I got the author and date attributed correctly now, but please double check to be sure.

@prakashsurya
Member Author

I'm getting a build failure on Debian 6:

/tmp/zfs-build--XeMnBdpX/BUILD/zfs-kmod-0.6.3/_kmod_build_2.6.32-5-amd64/module/zfs/../../../zfs-0.6.3/module/zfs/zfeature.c: In function 'spa_feature_get_refcount':
/tmp/zfs-build--XeMnBdpX/BUILD/zfs-kmod-0.6.3/_kmod_build_2.6.32-5-amd64/module/zfs/../../../zfs-0.6.3/module/zfs/zfeature.c:400: error: 'refcount' may be used uninitialized in this function

I'll update with a fix for this soon.

@prakashsurya
Member Author

The automated testing also triggered this:

[ 2070.324307] SPLError: 22718:0:(kmem.h:91:sanitize_flags()) FATAL allocation for task txg_sync (22718) which used GFP flags 0x290778bc with PF_NOFS set
[ 2070.341032] SPLError: 22718:0:(kmem.h:91:sanitize_flags()) SPL PANIC
[ 2070.349533] SPL: Showing stack for process 22718
[ 2070.349538] Pid: 22718, comm: txg_sync Tainted: P           2.6.32-61-server #124-Ubuntu
[ 2070.349541] Call Trace:
[ 2070.349562]  [<ffffffffa03b1b07>] spl_debug_dumpstack+0x27/0x40 [spl]
[ 2070.349570]  [<ffffffffa03b30f1>] spl_debug_bug+0x81/0xd0 [spl]
[ 2070.349579]  [<ffffffffa03babd8>] kmem_alloc_debug+0x4a8/0x530 [spl]
[ 2070.349638]  [<ffffffffa0537f43>] space_reftree_add_node+0x43/0x70 [zfs]
[ 2070.349675]  [<ffffffffa0537f9a>] space_reftree_add_seg+0x2a/0x50 [zfs]
[ 2070.349711]  [<ffffffffa053ed2c>] vdev_dtl_reassess+0x1cc/0x860 [zfs]
[ 2070.349748]  [<ffffffffa052e73c>] ? spa_history_log_sync+0x22c/0x710 [zfs]
[ 2070.349786]  [<ffffffffa0537c90>] ? space_reftree_compare+0x0/0x40 [zfs]
[ 2070.349825]  [<ffffffffa053ebdd>] vdev_dtl_reassess+0x7d/0x860 [zfs]
[ 2070.349889]  [<ffffffffa053266d>] ? spa_config_held+0xcd/0xf0 [zfs]
[ 2070.349928]  [<ffffffffa053ebdd>] vdev_dtl_reassess+0x7d/0x860 [zfs]
[ 2070.349984]  [<ffffffffa052f1eb>] ? spa_history_log_internal+0x8b/0x170 [zfs]
[ 2070.350022]  [<ffffffffa04e2d22>] ? dnode_free+0x1d2/0x370 [zfs]
[ 2070.350115]  [<ffffffffa04c8c74>] ? dmu_object_free+0x84/0x140 [zfs]
[ 2070.350154]  [<ffffffffa0501f2e>] dsl_scan_done+0x1be/0x2c0 [zfs]
[ 2070.350198]  [<ffffffffa055e3b1>] ? spa_feature_is_active+0x61/0xe0 [zfs]
[ 2070.350237]  [<ffffffffa050602a>] dsl_scan_sync+0x6fa/0x1000 [zfs]
[ 2070.350279]  [<ffffffffa05a2b9d>] ? zio_wait+0x24d/0x4c0 [zfs]
[ 2070.350319]  [<ffffffffa051beec>] spa_sync+0x4dc/0xd50 [zfs]
[ 2070.350332]  [<ffffffff810399cf>] ? pvclock_clocksource_read+0x4f/0xb0
[ 2070.350377]  [<ffffffffa053a65c>] txg_sync_thread+0x39c/0x700 [zfs]
[ 2070.350385]  [<ffffffff8105ec35>] ? set_user_nice+0xe5/0x160
[ 2070.350428]  [<ffffffffa053a2c0>] ? txg_sync_thread+0x0/0x700 [zfs]
[ 2070.350441]  [<ffffffffa03bc721>] thread_generic_wrapper+0x71/0xd0 [spl]
[ 2070.350453]  [<ffffffffa03bc6b0>] ? thread_generic_wrapper+0x0/0xd0 [spl]
[ 2070.350460]  [<ffffffff810863f6>] kthread+0x96/0xa0
[ 2070.350466]  [<ffffffff810141aa>] child_rip+0xa/0x20
[ 2070.350471]  [<ffffffff81086360>] ? kthread+0x0/0xa0
[ 2070.350477]  [<ffffffff810141a0>] ? child_rip+0x0/0x20

So I think we'll need to change the kmem_alloc call in space_reftree_add_node to use KM_PUSHPAGE.

@dweeezil
Contributor

@prakashsurya Yes, that appears to be correct. I fully expected a few of these to come to light during testing. I do usually scan these large ports for kmem_alloc() and taskq_dispatch() calls which might need updating but clearly missed this one.

@prakashsurya
Member Author

With that KM_PUSHPAGE fix in place, it's passed the usual regression tests. I think I'm going to put this on our Lustre testing resource next and see how it holds up there.

@prakashsurya
Member Author

I spoke too soon; it failed the xfstests regression suite on Ubuntu 10.04.3 amd64. Here's what the dmesg output looks like: http://bpaste.net/raw/469805/

@prakashsurya
Member Author

And I've just hit this ztest failure using the regression suite:

5 vdevs, 7 datasets, 23 threads, 600 seconds...
Pass   1,  SIGKILL,   0 ENOSPC,  7.1% of  238M used,   3% done,    9m43s to go
Pass   2,  SIGKILL,   0 ENOSPC, 15.3% of  238M used,  13% done,    8m44s to go
Pass   3,  SIGKILL,   0 ENOSPC, 19.7% of  238M used,  22% done,    7m50s to go
Pass   4, Complete,   0 ENOSPC, 12.9% of  476M used,  37% done,    6m18s to go
Pass   5,  SIGKILL,   0 ENOSPC, 12.9% of  508M used,  48% done,    5m14s to go
Pass   6,  SIGKILL,   0 ENOSPC, 13.3% of  508M used,  54% done,    4m38s to go
ztest: ../../module/zfs/dbuf.c:1400: Assertion `refcount_count(&db->db_holds) == db->db_dirtycnt (0x2 == 0x1)' failed.

@behlendorf
Contributor

The ztest failure is a long standing unrelated issue which we still need to run to ground. The xfstests failure on the other hand I've never seen before...

Prakash Surya and others added 4 commits July 18, 2014 13:09
This change moves the call to metaslab_group_alloc_update() into the
metaslab_sync_reassess() function. The original placement of the call
within metaslab_sync_done() appears to have been a simple mistake,
introduced by ac72fac.

This aligns us more closely to the upstream illumos code base.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.

This change modifies the space_map to store additional information
about its contiguous free space, so that we can use this information
to make a better decision about which space_map to load. This requires
reallocating all space_map objects to increase their bonus buffer
sizes enough to fit the new metadata.

The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram

In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is fixed at 4K, which has
certain implications including the following:

    * 4K sector devices will not see any compression benefit
    * large space_maps require more metadata on-disk
    * large space_maps require more time to load (typically random reads)

Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.

A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.

References:
  https://www.illumos.org/issues/4101
  https://www.illumos.org/issues/4102
  https://www.illumos.org/issues/4103
  https://www.illumos.org/issues/4105
  https://www.illumos.org/issues/4106
  illumos/illumos-gate@0713e23

Porting notes:

A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()

Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4730
  illumos/illumos-gate@be08211

Porting notes:

Under ZFS on Linux, one of the effects of not destroying the taskq is
that zdb would never exit (due to the SPL taskq implementation).
zdb would never exit (due to the SPL taskq implementation).

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
4756 metaslab_group_preload() could deadlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

The metaslab_group_preload() function grabs the mg_lock and then later
tries to grab the metaslab lock. This lock ordering may lead to a
deadlock since other consumers of the mg_lock will grab the metaslab
lock first.

References:
  https://www.illumos.org/issues/4756
  illumos/illumos-gate@30beaff

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
@prakashsurya
Member Author

OK. So I've rebased this stack of patches on latest master, and made a few small fixes to address the KM_PUSHPAGE issue and the Debian build failure. It's made a single pass through the regression suite with no failures, but that previous xfstests failure needs to be looked into before I'd feel safe merging. I need to verify whether or not that failure is due to this patch.

behlendorf added a commit to behlendorf/zfs that referenced this pull request Jul 21, 2014
As part of the write throttle & i/o scheduler performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2488
@behlendorf
Contributor

@prakashsurya The xfstests failure was hit at least once again in your weekend testing. However, after taking a look at it I don't believe it was introduced by this patch. Perhaps just aggravated.

It appears that the new write throttle code is starving zfs_trunc() leading to sluggish performance. This can happen because the dmu_tx_assign() in zfs_trunc() wasn't changed to use TXG_WAIT which would ensure that it cannot be starved and will be part of the next txg. This hunk was part of the original write throttle patch but appears to have been missed in the port.

I've proposed a fix in #2518.

@behlendorf behlendorf added this to the 0.6.4 milestone Jul 21, 2014
@behlendorf behlendorf added the Bug label Jul 21, 2014
@prakashsurya
Member Author

@behlendorf OK, cool. Thanks for digging into that issue. I'll look into putting this on one of our Lustre test filesystems since it's passed all of the regression tests so far.

behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 22, 2014
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()

Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4730
  illumos/illumos-gate@be08211

Porting notes:

Under ZFS on Linux, one of the effects of not destroying the taskq is
that zdb would never exit (due to the SPL taskq implementation).
zdb would never exit (due to the SPL taskq implementation).

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#2488
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 22, 2014
4756 metaslab_group_preload() could deadlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

The metaslab_group_preload() function grabs the mg_lock and then later
tries to grab the metaslab lock. This lock ordering may lead to a
deadlock since other consumers of the mg_lock will grab the metaslab
lock first.

References:
  https://www.illumos.org/issues/4756
  illumos/illumos-gate@30beaff

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#2488
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jul 22, 2014
As part of the write throttle & i/o scheduler performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes openzfs#2488
@prakashsurya prakashsurya deleted the illumos-4101 branch July 24, 2014 18:31
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jun 9, 2015
As part of the write throttle & i/o scheduler performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes openzfs#2488