
ZFS 1MB Block Size #354

Closed
behlendorf opened this issue Aug 4, 2011 · 11 comments
Labels: Type: Feature (feature request or new feature)

@behlendorf
Contributor

This is a patch we are going to need to recreate to reduce the ZFS overhead and improve throughput.

--- From Andreas ---
At one time we were working with a ZFS patch that implemented a max 1MB blocksize, and this improved normal IO performance under certain workloads. In particular, it reduced lock contention and indirect block allocation in the indirect block tree, which allowed a single server to have a much higher peak throughput.

The way the patch was implemented was to allow RAID-Z(2) blocks to be up to 1MB in size, but the individual blocks on each VDEV were still limited to 128kB. That also avoided the overhead of fragmenting and repacking large writes into multiple small buffers for the VDEVs with standard 8+2 RAID-Z2 configurations. The default maximum blocksize was left at 128kB, and a pool property had to be set to use the 1MB blocksize, to maintain compatibility with standard ZFS.

Unfortunately, that patch was never released by Oracle. I never saw the code, so I don't know the underlying details of how it was implemented, but I don't think it was overly complex. There definitely is a single constant which controls the maximum blocksize, but the patch was certainly more than just changing that one constant.
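As a back-of-the-envelope sketch of the layout described above (hypothetical numbers, not the unreleased Oracle patch itself): a 1 MiB logical block on an 8+2 RAID-Z2 splits evenly across the data disks, so each per-VDEV column stays within the 128 kB limit without any fragmenting and repacking:

```shell
# Hypothetical example: 1 MiB logical block, 8 data + 2 parity disks.
BLOCK=$((1024 * 1024))          # 1 MiB logical blocksize
DATA_DISKS=8                    # 8+2 RAID-Z2: 8 data columns
PER_VDEV_LIMIT=$((128 * 1024))  # 128 kB per-VDEV block limit

COL=$((BLOCK / DATA_DISKS))          # bytes written per data disk
echo "per-disk column: ${COL} bytes" # 131072 bytes = 128 KiB
[ "$COL" -le "$PER_VDEV_LIMIT" ] && echo "fits the 128kB per-VDEV limit"
```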

@adilger
Contributor

adilger commented Jun 5, 2014

FYI, I recently found http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/178388 when looking at this issue. I don't know if that patch has since been superseded or a version has been ported to ZoL, but if not then it is a good starting point.

@behlendorf
Contributor Author

@adilger This work is actually still moving along. I proposed an initial patch in #2049 which was functional but not polished enough to be merged. It still needed documentation, integration with zfs send/receive, rigorous testing to make sure we didn't accidentally introduce an issue, etc.

After I posted that patch @ahrens picked it up and has pushed it farther along. He's added the zfs send support and better integrated it with feature flags. What I'd love to see happen is for Matt's updated patches to be rebased against ZoL so we can start testing them and getting some performance results. The latest large block changes can be found here:

https://github.com/ahrens/illumos/tree/largeblock

@dweeezil
Contributor

dweeezil commented Jun 6, 2014

I'll note that many of his newer changes require the extensible dataset facility and potentially some of the other newer bits of infrastructure that are part of #2149. FWIW, I recently rebased this stack and also have a bunch of other newer things ready to port on top of it.

@adilger
Contributor

adilger commented Jul 25, 2014

I thought it worthwhile to post here that Matt's latest commit ahrens/illumos@3d2b7c9 suggests large blocksize support is substantially complete and ready for testing. It appears that the patch allows blocksizes up to 16MB, though it isn't yet clear what kind of problems such large blocks might face on Linux due to VM issues.

For compatibility reasons, it keeps the default blocksize for each dataset at 128KB unless the "org.open-zfs:large_block" feature is explicitly enabled and the "recordsize" dataset property is set larger than 128KB.
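A sketch of how that opt-in might look from the command line (the feature and property names follow Matt's commit as described above; the final merged names and syntax may differ, and this requires a pool that supports the feature):

```shell
# Hypothetical usage; "tank/fs" is a placeholder dataset.
zpool set feature@large_blocks=enabled tank   # enable the pool feature
zfs set recordsize=1M tank/fs                 # opt this dataset in
zfs get recordsize tank/fs                    # verify the new blocksize
```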

@ahrens
Member

ahrens commented Jul 25, 2014

Note that my patch only allows blocksizes up to 1MB by default. The change is currently out for review for illumos. Comments are appreciated, including about how this might impact ZFS on Linux. The review has a more complete overview: https://reviews.csiden.org/r/51/

@DeHackEd
Contributor

A meta-issue for this review is at #2519, where I posted my list of concerns from the standpoint of a Linux user. One documentation issue goes with it that is not platform-specific.

@adilger
Contributor

adilger commented Nov 17, 2014

Should this be closed in light of #2865 or is there some value in keeping this issue open?

@behlendorf
Contributor Author

Typically we leave the original issue open until the pull request is merged. The pull request itself is functionally complete and ready to be merged, but we need to see some additional testing to determine whether it will be stable and perform well without further memory management improvements.

@adilger it would be great if we could get some independent verification of these changes.

behlendorf added a commit that referenced this issue May 11, 2015
Improve the large block feature test coverage by extending ztest
to frequently change the recordsize.  This is specifically designed
to catch corner cases which might otherwise go unnoticed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #354
@behlendorf
Contributor Author

Finally merged!

f1512ee Illumos 5027 - zfs large block support

dasjoe pushed a commit to dasjoe/zfs that referenced this issue May 24, 2015
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5027
  illumos/illumos-gate@b515258

Porting Notes:

* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.

* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes.  Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.

* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize.  This value may be safely increased up to
16M, which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.

* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX.  This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing.  Therefore, we've set DMU_MAX_ACCESS to 64M.

* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space.  We should be able to relax
this once the ABD patches are merged.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#354
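For reference, the zfs_max_recordsize limit described in the porting notes above is a module option, so on Linux it can be inspected and raised at runtime; a sketch, assuming the standard module-parameter layout under /sys/module:

```shell
# Read the current limit (1M = 1048576 by default).
cat /sys/module/zfs/parameters/zfs_max_recordsize

# Raise it to 16M, the largest block size the on-disk format supports;
# per the porting note, treat anything beyond 1M as experimental.
echo $((16 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_max_recordsize
```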
dasjoe pushed a commit to dasjoe/zfs that referenced this issue May 24, 2015
Improve the large block feature test coverage by extending ztest
to frequently change the recordsize.  This is specifically designed
to catch corner cases which might otherwise go unnoticed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#354
@sentania

I recently patched to a level that included this fix. I did not upgrade my pool, but was able to successfully create a zvol with a 1MB block size, which immediately triggered the following panic:

[ 2739.227921] VERIFY3(size <= spa_maxblocksize(spa)) failed (1048576 <= 131072)
[ 2739.228023] PANIC at arc.c:1547:arc_buf_alloc()
[ 2739.228073] Showing stack for process 3480
[ 2739.228078] CPU: 10 PID: 3480 Comm: zvol Tainted: P O 3.18.16-gentoo #1
[ 2739.228080] Hardware name: Supermicro H8DM8-2/H8DM8-2, BIOS 080014 10/22/2009
[ 2739.228082] 000000000000060b ffff8807f42079e8 ffffffff814a2b18 00000000000000fb
[ 2739.228091] ffffffffa1fcaf05 ffff8807f42079f8 ffffffffa063eeb4 ffff8807f4207b78
[ 2739.228094] ffffffffa063ef5d ffff8807f4207a78 ffffffff00000030 ffff8807f4207b88
[ 2739.228097] Call Trace:
[ 2739.228108] [] dump_stack+0x46/0x58
[ 2739.228118] [] spl_dumpstack+0x3d/0x3f [spl]
[ 2739.228124] [] spl_panic+0xa7/0xda [spl]
[ 2739.228129] [] ? spl_kmem_cache_alloc+0x649/0x668 [spl]
[ 2739.228135] [] ? mutex_unlock+0x9/0xb
[ 2739.228158] [] ? dbuf_rele_and_unlock+0x2bf/0x3bb [zfs]
[ 2739.228161] [] ? mutex_unlock+0x9/0xb
[ 2739.228173] [] ? dbuf_rm_spill+0x749/0x786 [zfs]
[ 2739.228185] [] arc_buf_alloc+0x53/0x165 [zfs]
[ 2739.228198] [] dbuf_read+0x2cf/0x71a [zfs]
[ 2739.228214] [] dmu_buf_rele_array+0x242/0x415 [zfs]
[ 2739.228229] [] dmu_buf_hold_array_by_bonus+0xc9/0xe8 [zfs]
[ 2739.228244] [] dmu_read_req+0x49/0xe9 [zfs]
[ 2739.228258] [] zrl_is_locked+0x1267/0x1593 [zfs]
[ 2739.228266] [] taskq_create+0x4e3/0x648 [spl]
[ 2739.228272] [] ? wake_up_process+0x34/0x34
[ 2739.228277] [] ? taskq_create+0x276/0x648 [spl]
[ 2739.228282] [] kthread+0xcd/0xd5
[ 2739.228285] [] ? kthread_create_on_node+0x16c/0x16c
[ 2739.228289] [] ret_from_fork+0x58/0x90
[ 2739.228292] [] ? kthread_create_on_node+0x16c/0x16c

I would imagine that if the pool isn't upgraded, we shouldn't allow the creation of blocks larger than 128k.
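A sketch of the guard being suggested, as it might look from userspace (the pool name, volume name, and large_blocks feature name are assumptions; the real fix belongs in the kernel module, which should reject the request instead of panicking):

```shell
# Hypothetical guard: only allow volblocksize > 128k when the pool's
# large_blocks feature is enabled or active.
STATE=$(zpool get -H -o value feature@large_blocks tank 2>/dev/null)
case "$STATE" in
  enabled|active)
    zfs create -V 10G -o volblocksize=1M tank/vol ;;
  *)
    echo "large_blocks not available; refusing volblocksize > 128k" >&2 ;;
esac
```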

@behlendorf
Contributor Author

@sentania we definitely don't. I've filed this crash as #3591 so we can track it. Is it possible you were running with a new version of the ZFS utilities and an old version of the kernel modules when this happened?

ahrens added a commit to ahrens/zfs that referenced this issue Jul 7, 2021