ZFS 1MB Block Size #354
FYI, I recently found http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/178388 when looking at this issue. I don't know if that patch has since been superseded or a version has been ported to ZoL, but if not then it is a good starting point.
@adilger This work is actually still moving along. I proposed an initial patch in #2049 which was functional but not polished enough to be merged. It still needed documentation, integration with zfs send/receive, and rigorous testing to make sure we didn't accidentally introduce an issue. After I posted that patch @ahrens picked it up and has pushed it farther along.
I'll note that many of his newer changes require the extensible dataset facility and potentially some of the other newer bits of infrastructure that are part of #2149. FWIW, I did a recent rebase of this stack and also have a bunch of other newer things ready to port on top of it.
I thought it worthwhile to post here that Matt's latest commit ahrens/illumos@3d2b7c9 suggests large blocksize support is substantially complete and is looking for testing. It appears that the patch allows blocksizes up to 16MB, though it isn't yet clear what kind of problems such large blocks might face on Linux due to VM issues. For compatibility reasons, it keeps the default blocksize for each dataset at 128KB unless the "org.open-zfs:large_block" feature is explicitly enabled and the "recordsize" dataset property is set larger than 128KB.
Note that my patch only allows blocksizes up to 1MB by default. The change is currently out for review for illumos. Comments appreciated, including about how this might impact ZFS on Linux. The review has more overview description: https://reviews.csiden.org/r/51/
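For readers trying this out, enabling large blocks is a per-pool feature flag plus a per-dataset property. A minimal sketch of the commands, assuming a pool named `tank` and a build where the flag is exposed as `feature@large_blocks` (the pool name is hypothetical, and the exact feature name may differ on your build):

```shell
# Enable the large-block feature flag on the pool
# (it becomes "active", and irreversible, once a large block is written)
zpool set feature@large_blocks=enabled tank

# Raise the record size on a dataset; blocks above 128K activate the feature
zfs set recordsize=1M tank/data

# Confirm both settings
zpool get feature@large_blocks tank
zfs get recordsize tank/data
```

This is a configuration sketch, not output from the patch under review.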
A meta-issue for this review is at #2519, where I posted my concern list from the standpoint of a Linux user. One documentation issue that goes along with it would not be platform-specific.
Should this be closed in light of #2865 or is there some value in keeping this issue open?
Typically we leave the original issue open until the pull request is merged. The pull request itself is functionally complete and ready to be merged. But we need to see some additional testing in order to determine if it's going to be stable and perform well without any additional memory management improvements. @adilger it would be great if we could get some independent verification of these changes.
Improve the large block feature test coverage by extending ztest to frequently change the recordsize. This is specifically designed to catch corner cases which might otherwise go unnoticed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #354
Finally merged! f1512ee Illumos 5027 - zfs large block support
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5027
illumos/illumos-gate@b515258
Porting Notes:
* Included in this patch is a tiny ISP2() cleanup in zio_init() from Illumos 5255.
* Unlike the upstream Illumos commit, this patch does not impose an arbitrary 128K block size limit on volumes. Volumes, like filesystems, are limited by the zfs_max_recordsize=1M module option.
* By default the maximum record size is limited to 1M by the module option zfs_max_recordsize. This value may be safely increased up to 16M, which is the largest block size supported by the on-disk format. At the moment, 1M blocks clearly offer a significant performance improvement, but the benefits of going beyond this for the majority of workloads are less clear.
* The illumos version of this patch increased DMU_MAX_ACCESS to 32M. This was determined not to be large enough when using 16M blocks, because the zfs_make_xattrdir() function will fail (EFBIG) when assigning a TX. This was immediately observed under Linux because all newly created files must have a security xattr created, and that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M.
* On 32-bit platforms a hard limit of 1M is set for blocks due to the limited virtual address space. We should be able to relax this once the ABD patches are merged.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#354
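The `zfs_max_recordsize` cap mentioned in the porting notes is a module parameter, so it can be inspected or raised without rebuilding. A hedged sketch, assuming a standard ZFS on Linux install (the sysfs path is the usual location for ZoL module parameters):

```shell
# Check the current cap in bytes; 1048576 (1M) is the shipped default
cat /sys/module/zfs/parameters/zfs_max_recordsize

# Raise the cap to 16M, the on-disk format maximum, at runtime
echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize

# Or make the change persist across module reloads
echo "options zfs zfs_max_recordsize=16777216" >> /etc/modprobe.d/zfs.conf
```

As the porting notes say, values above 1M are format-legal but their performance benefit for most workloads is unproven.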
I recently updated to a build that includes this fix. I did not upgrade my pool, but was able to successfully create a zvol with a 1MB block size, which immediately triggered the following panic:

[ 2739.227921] VERIFY3(size <= spa_maxblocksize(spa)) failed (1048576 <= 131072)

I would imagine that if the pool isn't upgraded we don't want to allow the creation of a larger than 128k blocksize.
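The panic indicates a 1M block was accepted against a pool whose `spa_maxblocksize` is still 128K. Until the missing front-end check lands, a workaround is to confirm the pool feature before creating the volume. A sketch, with `tank` and `tank/vol1` as hypothetical names:

```shell
# A pool that has not been upgraded reports the feature as unsupported
# or "disabled"; it must be at least "enabled" before using 1M blocks
zpool get feature@large_blocks tank

# Upgrade the pool so newly added feature flags become available
zpool upgrade tank
zpool set feature@large_blocks=enabled tank

# Only then create a zvol with a 1M volume block size
zfs create -V 10G -o volblocksize=1M tank/vol1
```

This is an operator-side guard, not the code fix the comment is asking for.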
This is a patch we are going to need to recreate to reduce the ZFS overhead and improve throughput.
--- From Andreas ---
At one time we were working with a ZFS patch that implemented a max 1MB blocksize, and this improved normal IO performance under certain workloads. In particular, it reduced lock contention and indirect block allocation in the indirect block tree, which allowed a single server to have a much higher peak throughput.
The way the patch was implemented was to allow RAID-Z(2) blocks to be up to 1MB in size, but the individual blocks on each VDEV were still limited to 128kB. That also avoided the overhead of fragmenting and repacking large writes into multiple small buffers for the VDEVs with standard 8+2 RAID-Z2 configurations. The default maximum blocksize was left at 128kB, and a pool property had to be set to use the 1MB blocksize, to maintain compatibility with standard ZFS.
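The per-VDEV limit described above falls out of simple arithmetic: a 1MB logical block striped across the 8 data disks of an 8+2 RAID-Z2 vdev leaves each disk writing a 128kB column, the same size as before. A quick sketch of that calculation:

```shell
# 1 MiB logical block over an 8+2 RAID-Z2 vdev (8 data + 2 parity disks)
logical=$((1024 * 1024))          # 1048576 bytes
data_disks=8
per_disk=$((logical / data_disks))
echo "$per_disk bytes per data disk"   # 131072 bytes = 128 KiB
```

So the large-block patch raises throughput by batching at the top of the tree while each individual disk still sees its familiar I/O size.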
Unfortunately, that patch was never released by Oracle. I never saw the code, so I don't know the underlying details of how it was implemented, but I don't think it was overly complex. There definitely IS a single constant which controls the maximum blocksize, but the patch was definitely more than just changing that one constant.