
ZFS 1MB Block Size #354

Closed
behlendorf opened this issue Aug 4, 2011 · 11 comments
Labels: Type: Feature (feature request or new feature)

@behlendorf
Contributor

This is a patch we are going to need to recreate to reduce the ZFS overhead and improve throughput.

--- From Andreas ---
At one time we were working with a ZFS patch that implemented a max 1MB blocksize, and this improved normal IO performance under certain workloads. In particular, it reduced lock contention and indirect block allocation in the indirect block tree, which allowed a single server to have a much higher peak throughput.

The way the patch was implemented was to allow RAID-Z(2) blocks to be up to 1MB in size, but the individual blocks on each VDEV were still limited to 128kB. That also avoided the overhead of fragmenting and repacking large writes into multiple small buffers for the VDEVs with standard 8+2 RAID-Z2 configurations. The default maximum blocksize was left at 128kB, and a pool property had to be set to use the 1MB blocksize, to maintain compatibility with standard ZFS.

Unfortunately, that patch was never released by Oracle. I never saw the code, so I don't know the underlying details of how it was implemented, but I don't think it was overly complex. There definitely is a single constant which controls the maximum blocksize, but the patch was certainly more than just changing that one constant.
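As a back-of-the-envelope sketch of the layout described above (hypothetical numbers, not the unreleased Oracle patch itself): a 1 MiB logical block on an 8+2 RAID-Z2 splits evenly across the data disks, so each per-VDEV column stays within the 128 kB limit without any fragmenting and repacking:

```shell
# Hypothetical example: 1 MiB logical block, 8 data + 2 parity disks.
BLOCK=$((1024 * 1024))          # 1 MiB logical blocksize
DATA_DISKS=8                    # 8+2 RAID-Z2: 8 data columns
PER_VDEV_LIMIT=$((128 * 1024))  # 128 kB per-VDEV block limit

COL=$((BLOCK / DATA_DISKS))          # bytes written per data disk
echo "per-disk column: ${COL} bytes" # 131072 bytes = 128 KiB
[ "$COL" -le "$PER_VDEV_LIMIT" ] && echo "fits the 128kB per-VDEV limit"
```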

@adilger
Contributor

adilger commented Jun 5, 2014

FYI, I recently found http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/178388 when looking at this issue. I don't know if that patch has since been superseded or a version has been ported to ZoL, but if not then it is a good starting point.

@behlendorf
Contributor Author

@adilger This work is actually still moving along. I proposed an initial patch in #2049 which was functional but not polished enough to be merged. It still needed documentation, integration with zfs send/receive, rigorous testing to make sure we didn't accidentally introduce an issue, etc.

After I posted that patch @ahrens picked it up and has pushed it farther along. He's added the zfs send support and better integrated it with feature flags. What I'd love to see happen is for Matt's updated patches to be rebased against ZoL so we can start testing them and getting some performance results. The latest large block changes can be found here:

https://github.com/ahrens/illumos/tree/largeblock

@dweeezil
Contributor

dweeezil commented Jun 6, 2014

I'll note that many of his newer changes require the extensible dataset facility and potentially some of the other newer bits of infrastructure that are part of #2149. FWIW, I recently rebased this stack and also have a bunch of other newer things ready to port on top of it.

@adilger
Contributor

adilger commented Jul 25, 2014

I thought it worthwhile to post here that Matt's latest commit ahrens/illumos@3d2b7c9 suggests large blocksize support is substantially complete and ready for testing. It appears that the patch allows blocksizes up to 16MB, though it isn't yet clear what kind of problems such large blocks might face on Linux due to VM issues.

For compatibility reasons, it keeps the default blocksize for each dataset at 128KB unless the "org.open-zfs:large_block" feature is explicitly enabled and the "recordsize" dataset property is set larger than 128KB.
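A sketch of how that opt-in might look from the command line (the feature and property names follow Matt's commit as described above; the final merged names and syntax may differ, and this requires a pool that supports the feature):

```shell
# Hypothetical usage; "tank/fs" is a placeholder dataset.
zpool set feature@large_blocks=enabled tank   # enable the pool feature
zfs set recordsize=1M tank/fs                 # opt this dataset in
zfs get recordsize tank/fs                    # verify the new blocksize
```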

@ahrens
Member

ahrens commented Jul 25, 2014

Note that my patch only allows blocksizes up to 1MB by default. The change is currently out for review for illumos. Comments are appreciated, including about how this might impact ZFS on Linux. The review has a more complete overview: https://reviews.csiden.org/r/51/

@DeHackEd
Contributor

A meta-issue for this review is at #2519, where I posted my list of concerns from the standpoint of a Linux user. One documentation issue goes with it that is not platform-specific.

@adilger
Contributor

adilger commented Nov 17, 2014

Should this be closed in light of #2865 or is there some value in keeping this issue open?

@behlendorf
Contributor Author

Typically we leave the original issue open until the pull request is merged. The pull request itself is functionally complete and ready to be merged, but we need to see some additional testing to determine whether it will be stable and perform well without further memory management improvements.

@adilger it would be great if we could get some independent verification of these changes.

behlendorf added a commit that referenced this issue May 11, 2015
Improve the large block feature test coverage by extending ztest
to frequently change the recordsize.  This is specifically designed
to catch corner cases which might otherwise go unnoticed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #354
@behlendorf
Contributor Author

Finally merged!

f1512ee Illumos 5027 - zfs large block support

dasjoe pushed a commit to dasjoe/zfs that referenced this issue May 24, 2015
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5027
  illumos/illumos-gate@b515258

Porting Notes:

* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.

* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes.  Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.

* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize.  This value may be safely increased up to
16M, which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.

* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX.  This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing.  Therefore, we've set DMU_MAX_ACCESS to 64M.

* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space.  We should be able to relax
this once the ABD patches are merged.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#354
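For reference, the zfs_max_recordsize limit described in the porting notes above is a module option, so on Linux it can be inspected and raised at runtime; a sketch, assuming the standard module-parameter layout under /sys/module:

```shell
# Read the current limit (1M = 1048576 by default).
cat /sys/module/zfs/parameters/zfs_max_recordsize

# Raise it to 16M, the largest block size the on-disk format supports;
# per the porting note, treat anything beyond 1M as experimental.
echo $((16 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_max_recordsize
```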
dasjoe pushed a commit to dasjoe/zfs that referenced this issue May 24, 2015
Improve the large block feature test coverage by extending ztest
to frequently change the recordsize.  This is specifically designed
to catch corner cases which might otherwise go unnoticed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#354
@sentania

I recently patched to a level that included this fix. I did not upgrade my pool, but was able to successfully create a zvol with a 1MB block size, which immediately triggered the following panic:

[ 2739.227921] VERIFY3(size <= spa_maxblocksize(spa)) failed (1048576 <= 131072)
[ 2739.228023] PANIC at arc.c:1547:arc_buf_alloc()
[ 2739.228073] Showing stack for process 3480
[ 2739.228078] CPU: 10 PID: 3480 Comm: zvol Tainted: P O 3.18.16-gentoo #1
[ 2739.228080] Hardware name: Supermicro H8DM8-2/H8DM8-2, BIOS 080014 10/22/2009
[ 2739.228082] 000000000000060b ffff8807f42079e8 ffffffff814a2b18 00000000000000fb
[ 2739.228091] ffffffffa1fcaf05 ffff8807f42079f8 ffffffffa063eeb4 ffff8807f4207b78
[ 2739.228094] ffffffffa063ef5d ffff8807f4207a78 ffffffff00000030 ffff8807f4207b88
[ 2739.228097] Call Trace:
[ 2739.228108] [] dump_stack+0x46/0x58
[ 2739.228118] [] spl_dumpstack+0x3d/0x3f [spl]
[ 2739.228124] [] spl_panic+0xa7/0xda [spl]
[ 2739.228129] [] ? spl_kmem_cache_alloc+0x649/0x668 [spl]
[ 2739.228135] [] ? mutex_unlock+0x9/0xb
[ 2739.228158] [] ? dbuf_rele_and_unlock+0x2bf/0x3bb [zfs]
[ 2739.228161] [] ? mutex_unlock+0x9/0xb
[ 2739.228173] [] ? dbuf_rm_spill+0x749/0x786 [zfs]
[ 2739.228185] [] arc_buf_alloc+0x53/0x165 [zfs]
[ 2739.228198] [] dbuf_read+0x2cf/0x71a [zfs]
[ 2739.228214] [] dmu_buf_rele_array+0x242/0x415 [zfs]
[ 2739.228229] [] dmu_buf_hold_array_by_bonus+0xc9/0xe8 [zfs]
[ 2739.228244] [] dmu_read_req+0x49/0xe9 [zfs]
[ 2739.228258] [] zrl_is_locked+0x1267/0x1593 [zfs]
[ 2739.228266] [] taskq_create+0x4e3/0x648 [spl]
[ 2739.228272] [] ? wake_up_process+0x34/0x34
[ 2739.228277] [] ? taskq_create+0x276/0x648 [spl]
[ 2739.228282] [] kthread+0xcd/0xd5
[ 2739.228285] [] ? kthread_create_on_node+0x16c/0x16c
[ 2739.228289] [] ret_from_fork+0x58/0x90
[ 2739.228292] [] ? kthread_create_on_node+0x16c/0x16c

I would imagine that if the pool isn't upgraded, we shouldn't allow the creation of blocks larger than 128k.
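A sketch of the guard being suggested, as it might look from userspace (the pool name, volume name, and large_blocks feature name are assumptions; the real fix belongs in the kernel module, which should reject the request instead of panicking):

```shell
# Hypothetical guard: only allow volblocksize > 128k when the pool's
# large_blocks feature is enabled or active.
STATE=$(zpool get -H -o value feature@large_blocks tank 2>/dev/null)
case "$STATE" in
  enabled|active)
    zfs create -V 10G -o volblocksize=1M tank/vol ;;
  *)
    echo "large_blocks not available; refusing volblocksize > 128k" >&2 ;;
esac
```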

@behlendorf
Contributor Author

@sentania we definitely don't. I've filed this crash as #3591 so we can track it. Is it possible you were running with a new version of the ZFS utilities and an old version of the kernel modules when this happened?

ahrens added a commit to ahrens/zfs that referenced this issue Jul 7, 2021