-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[superseded by #3451][RFC, l2arc accounting & checksum issue] l2arc-write-target-size.diff , cumulative patch #3433
Conversation
Thanks as always @kernelOfTruth. |
adapted to openzfs#3216, adaption to openzfs#2129 in @ l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr) /* * Compression succeeded, we'll keep the cdata around for * writing and release it afterwards. */ + if (rounded > csize) { + bzero((char *)cdata + csize, rounded - csize); + csize = rounded; + } to /* * Compression succeeded, we'll keep the cdata around for * writing and release it afterwards. */ if (rounded > csize) { abd_zero_off(cdata, rounded - csize, csize); csize = rounded; } ZFSonLinux: openzfs#3114 openzfs#3400 openzfs#3433
@sempervictus pushed to #3216 edit: also pushed to https://github.com/kernelOfTruth/zfs/commits/tuxoko_zfs/abd_next_19.05.2015%2B3433 only one patch should be on top compared to current abd_next: Take the buildbot results with a grain of salt since they, according to @behlendorf , currently don't cover L2ARC testing So it is advised to give this some preliminary non-production beating & test-runs, perhaps you could also do a checksum check (zdb & sha256sum) of pretty large files (e.g. larger than the ARC size - if available) - at the beginning of the tests when the files are on the zpool & vdevs, then after the extensive testing - just to be sure that the files are intact. The FreeBSD, NetBSD, FreeNAS bug-tracker & mailing-list entries talk about checksum errors on the L2ARC after all - that false data should be dropped from the transfers - but still: better safe ... |
adapted to abd_next (May 19th 2015) /* * Compression succeeded, we'll keep the cdata around for * writing and release it afterwards. */ + if (rounded > csize) { + bzero((char *)cdata + csize, rounded - csize); + csize = rounded; + } to /* * Compression succeeded, we'll keep the cdata around for * writing and release it afterwards. */ if (rounded > csize) { abd_zero_off(cdata, rounded - csize, csize); csize = rounded; } ZFSonLinux: zfsonlinux#3114 zfsonlinux#3400 zfsonlinux#3433
Running the following stack on my test host presently:
I've wrapped around L2ARC once already, and its working properly. |
Looks like the L2ARC is holding up. However, ARC itself isn't doing so well:
Looks like the L2 headers are overrunning the ARC cache limits and OOMing the VMs running from the local ZVOLs. Not ideal, but at least the L2ARCs aren't showing insane capacities anymore. EDIT: exporting the zpool results in arc_adapt pegging a core while freeing memory for ~20s. I'm assuming this is due to the L2 headers being purged. |
@sempervictus thanks a lot for your report ! Okay, so now that L2ARC is working as intended ™ (I can't see where that regression might have come from related to that patchset - will have to look again later, when preparing those patches bit-by-bit), it's time to fix everything up - then ZFS should be ready to take off =) That huge memory consumption was supposed to be fixed by "Illumos - 5408 managing ZFS cache devices requires lots of RAM", I'll take a look at Illumos-gate whether I can find more fixes Alright according to #3433 (comment) that patch stack, which you ran for this test, doesn't include the above mentioned fix from Illumos 5408. @sempervictus Could you please give the patch stack from #3216 a test ? That supposedly would show much more promising results related to the OOMing and ARC cache tripping over limits and gives the confirmation that this patchset from @avg-I (thanks a lot !) in all cases is needed for the huge upcoming changes (ABD + #3115 ) and milestone (0.6.5 ? 0.7.0 ?) to have optimum stability and performance. In any case I'll prepare a pull request with broken-out commits, patches ... Following your EDIT: I'm sure it's related - in the past when my L2ARC was full (due to a test close to 100 GiB l2arc with 16 GiB RAM) and needed to be purged there also was some big load on the CPU Thanks ! |
Thanks @kernelOfTruth, this is starting to come together. I'll build out #3216, but with large block having been adopted one some hosts, i can't test without destroying their pools. I noticed @dweeezil pushed more commits to #3115, hopefully we can get that rebased against the large block feature and abd_next to create a stable stack which can be tested across current ZFS hosts. Thanks again, will post more results as i get 'em. |
Built a basic stack off of #3216:
Results are promising:
Will take a while to wrap around the L2ARC. I'll push a deep |
I think that the changes to l2arc_write_buffers() might not be entirely correct. The problem is that elsewhere throughout the ARC code those statistics are updated based on Changing To summarize: the current code in this pull request updates |
@avg-I Thank you very much for your detailed explanation ! I've pushed a pull-request to let the build-bots crunch on it & to have the commit available for testing Now waiting to hear back from the upstream review ... via @avg-I :
CC: @skiselkov , @ahrens , @buffyg |
The problem is that since OpenSolaris commit illumos/illumos-gate@e14bb32 l2ad_hand is kept aligned based on ashift (which is derived from the cache device's logical and physical block sizes). So, the hand could be advanced by more than the sum of b_asize-s of the written L2ARC buffers. This is because b_asize is a misnomer at the moment as it does not always represent the allocated size: if a buffer is compressed, then the compressed size is properly rounded, but if the compression fails or it is not applied, then the original size is kept and it could be smaller than what ashift requires. For the same reasons arcstat_l2_asize and the reported used space on the cache device could be smaller than the actual allocated size if ashift > 9. That problem is not fixed by this change. This change only ensures that l2ad_hand is not advanced by more than target_sz. Otherwise we would overwrite active (unevicted) L2ARC buffers. That problem is manifested as growing l2_cksum_bad and l2_io_error counters. This change also changes 'p' prefix to 'a' prefix in a few places where variables represent allocated rather than physical size. The resolved problem may also result in the reported allocated size being greater than the cache device's capacity, because of the overwritten buffers (more than one buffer claiming the same disk space). PR: 198242 PR: 195746 (possibly related) Porting notes: Rather difficult to track changes related to: Illumos 5369 - arc flags should be an enum and Illumos 5408 - managing ZFS cache devices requires lots of RAM hdr->b_l2hdr = l2hdr; changed to hdr->b_flags |= ARC_FLAG_HAS_L2HDR; list_insert_head(dev->l2ad_buflist, hdr); changed to list_insert_head(&dev->l2ad_buflist, hdr); References: https://reviews.freebsd.org/D2764 openzfs#3400 openzfs#3433 openzfs#3451 Ported by: kernelOfTruth <kerneloftruth@gmail.com>
The problem is that since OpenSolaris commit illumos/illumos-gate@e14bb32 l2ad_hand is kept aligned based on ashift (which is derived from the cache device's logical and physical block sizes). So, the hand could be advanced by more than the sum of b_asize-s of the written L2ARC buffers. This is because b_asize is a misnomer at the moment as it does not always represent the allocated size: if a buffer is compressed, then the compressed size is properly rounded, but if the compression fails or it is not applied, then the original size is kept and it could be smaller than what ashift requires. For the same reasons arcstat_l2_asize and the reported used space on the cache device could be smaller than the actual allocated size if ashift > 9. That problem is not fixed by this change. This change only ensures that l2ad_hand is not advanced by more than target_sz. Otherwise we would overwrite active (unevicted) L2ARC buffers. That problem is manifested as growing l2_cksum_bad and l2_io_error counters. This change also changes 'p' prefix to 'a' prefix in a few places where variables represent allocated rather than physical size. The resolved problem may also result in the reported allocated size being greater than the cache device's capacity, because of the overwritten buffers (more than one buffer claiming the same disk space). PR: 198242 PR: 195746 (possibly related) Porting notes: Rather difficult to track changes related to: Illumos 5369 - arc flags should be an enum and Illumos 5408 - managing ZFS cache devices requires lots of RAM hdr->b_l2hdr = l2hdr; changed to hdr->b_flags |= ARC_FLAG_HAS_L2HDR; list_insert_head(dev->l2ad_buflist, hdr); changed to list_insert_head(&dev->l2ad_buflist, hdr); References: https://reviews.freebsd.org/D2764 openzfs#3400 openzfs#3433 openzfs#3451 Ported by: kernelOfTruth <kerneloftruth@gmail.com>
The problem is that since OpenSolaris commit illumos/illumos-gate@e14bb32 l2ad_hand is kept aligned based on ashift (which is derived from the cache device's logical and physical block sizes). So, the hand could be advanced by more than the sum of b_asize-s of the written L2ARC buffers. This is because b_asize is a misnomer at the moment as it does not always represent the allocated size: if a buffer is compressed, then the compressed size is properly rounded, but if the compression fails or it is not applied, then the original size is kept and it could be smaller than what ashift requires. For the same reasons arcstat_l2_asize and the reported used space on the cache device could be smaller than the actual allocated size if ashift > 9. That problem is not fixed by this change. This change only ensures that l2ad_hand is not advanced by more than target_sz. Otherwise we would overwrite active (unevicted) L2ARC buffers. That problem is manifested as growing l2_cksum_bad and l2_io_error counters. This change also changes 'p' prefix to 'a' prefix in a few places where variables represent allocated rather than physical size. The resolved problem may also result in the reported allocated size being greater than the cache device's capacity, because of the overwritten buffers (more than one buffer claiming the same disk space). PR: 198242 PR: 195746 (possibly related) Porting notes: Rather difficult to track changes related to: Illumos 5369 - arc flags should be an enum and Illumos 5408 - managing ZFS cache devices requires lots of RAM hdr->b_l2hdr = l2hdr; changed to hdr->b_flags |= ARC_FLAG_HAS_L2HDR; list_insert_head(dev->l2ad_buflist, hdr); changed to list_insert_head(&dev->l2ad_buflist, hdr); References: https://reviews.freebsd.org/D2764 openzfs#3400 openzfs#3433 openzfs#3451 Ported by: kernelOfTruth <kerneloftruth@gmail.com>
The problem is that since OpenSolaris commit illumos/illumos-gate@e14bb32 l2ad_hand is kept aligned based on ashift (which is derived from the cache device's logical and physical block sizes). So, the hand could be advanced by more than the sum of b_asize-s of the written L2ARC buffers. This is because b_asize is a misnomer at the moment as it does not always represent the allocated size: if a buffer is compressed, then the compressed size is properly rounded, but if the compression fails or it is not applied, then the original size is kept and it could be smaller than what ashift requires. For the same reasons arcstat_l2_asize and the reported used space on the cache device could be smaller than the actual allocated size if ashift > 9. That problem is not fixed by this change. This change only ensures that l2ad_hand is not advanced by more than target_sz. Otherwise we would overwrite active (unevicted) L2ARC buffers. That problem is manifested as growing l2_cksum_bad and l2_io_error counters. This change also changes 'p' prefix to 'a' prefix in a few places where variables represent allocated rather than physical size. The resolved problem may also result in the reported allocated size being greater than the cache device's capacity, because of the overwritten buffers (more than one buffer claiming the same disk space). PR: 198242 PR: 195746 (possibly related) Porting notes: Rather difficult to track changes related to: Illumos 5369 - arc flags should be an enum and Illumos 5408 - managing ZFS cache devices requires lots of RAM hdr->b_l2hdr = l2hdr; changed to hdr->b_flags |= ARC_FLAG_HAS_L2HDR; list_insert_head(dev->l2ad_buflist, hdr); changed to list_insert_head(&dev->l2ad_buflist, hdr); Account for the error message: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] uint64_t stats_size = 0; References: https://reviews.freebsd.org/D2764 openzfs#3400 openzfs#3433 openzfs#3451 Ported by: kernelOfTruth <kerneloftruth@gmail.com>
The problem is that since OpenSolaris commit illumos/illumos-gate@e14bb32 l2ad_hand is kept aligned based on ashift (which is derived from the cache device's logical and physical block sizes). So, the hand could be advanced by more than the sum of b_asize-s of the written L2ARC buffers. This is because b_asize is a misnomer at the moment as it does not always represent the allocated size: if a buffer is compressed, then the compressed size is properly rounded, but if the compression fails or it is not applied, then the original size is kept and it could be smaller than what ashift requires. For the same reasons arcstat_l2_asize and the reported used space on the cache device could be smaller than the actual allocated size if ashift > 9. That problem is not fixed by this change. This change only ensures that l2ad_hand is not advanced by more than target_sz. Otherwise we would overwrite active (unevicted) L2ARC buffers. That problem is manifested as growing l2_cksum_bad and l2_io_error counters. This change also changes 'p' prefix to 'a' prefix in a few places where variables represent allocated rather than physical size. The resolved problem may also result in the reported allocated size being greater than the cache device's capacity, because of the overwritten buffers (more than one buffer claiming the same disk space). PR: 198242 PR: 195746 (possibly related) Porting notes: Rather difficult to track changes related to: Illumos 5369 - arc flags should be an enum and Illumos 5408 - managing ZFS cache devices requires lots of RAM hdr->b_l2hdr = l2hdr; changed to hdr->b_flags |= ARC_FLAG_HAS_L2HDR; list_insert_head(dev->l2ad_buflist, hdr); changed to list_insert_head(&dev->l2ad_buflist, hdr); Account for the error message: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] uint64_t stats_size = 0; References: https://reviews.freebsd.org/D2764 openzfs#3400 openzfs#3433 openzfs#3451 Ported by: kernelOfTruth <kerneloftruth@gmail.com>
The problem is that since OpenSolaris commit illumos/illumos-gate@e14bb32 l2ad_hand is kept aligned based on ashift (which is derived from the cache device's logical and physical block sizes). So, the hand could be advanced by more than the sum of b_asize-s of the written L2ARC buffers. This is because b_asize is a misnomer at the moment as it does not always represent the allocated size: if a buffer is compressed, then the compressed size is properly rounded, but if the compression fails or it is not applied, then the original size is kept and it could be smaller than what ashift requires. For the same reasons arcstat_l2_asize and the reported used space on the cache device could be smaller than the actual allocated size if ashift > 9. That problem is not fixed by this change. This change only ensures that l2ad_hand is not advanced by more than target_sz. Otherwise we would overwrite active (unevicted) L2ARC buffers. That problem is manifested as growing l2_cksum_bad and l2_io_error counters. This change also changes 'p' prefix to 'a' prefix in a few places where variables represent allocated rather than physical size. The resolved problem may also result in the reported allocated size being greater than the cache device's capacity, because of the overwritten buffers (more than one buffer claiming the same disk space). PR: 198242 PR: 195746 (possibly related) Porting notes: Rather difficult to track changes related to: Illumos 5369 - arc flags should be an enum and Illumos 5408 - managing ZFS cache devices requires lots of RAM hdr->b_l2hdr = l2hdr; changed to hdr->b_flags |= ARC_FLAG_HAS_L2HDR; list_insert_head(dev->l2ad_buflist, hdr); changed to list_insert_head(&dev->l2ad_buflist, hdr); Account for the error message: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] uint64_t stats_size = 0; References: https://reviews.freebsd.org/D2764 openzfs#3400 openzfs#3433 openzfs#3451 Ported by: kernelOfTruth <kerneloftruth@gmail.com>
If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit 3a17a7a. The change that introduced l2ad_hand alignment was almost correct as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32. Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of the compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to l2_asize kstat. That is wrong, but this commit does not fix that. The accounting problem was introduced partially in commit 3a17a7a and partially in 3038a2b (accounting became consistent but in favour of the wrong size). Porting Notes: Reworked to be C90 compatible and the 'write_psize' variable was removed because it is now unused. References: https://reviews.csiden.org/r/229/ https://reviews.freebsd.org/D2764 Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#3400 Issue openzfs#3433 Issue openzfs#3451
If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit 3a17a7a. The change that introduced l2ad_hand alignment was almost correct as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32. Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of the compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to l2_asize kstat. That is wrong, but this commit does not fix that. The accounting problem was introduced partially in commit 3a17a7a and partially in 3038a2b (accounting became consistent but in favour of the wrong size). Porting Notes: Reworked to be C90 compatible and the 'write_psize' variable was removed because it is now unused. References: https://reviews.csiden.org/r/229/ https://reviews.freebsd.org/D2764 Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#3400 Closes openzfs#3433 Closes openzfs#3451
If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit 3a17a7a. The change that introduced l2ad_hand alignment was almost correct as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32. Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of the compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to l2_asize kstat. That is wrong, but this commit does not fix that. The accounting problem was introduced partially in commit 3a17a7a and partially in 3038a2b (accounting became consistent but in favour of the wrong size). Porting Notes: Reworked to be C90 compatible and the 'write_psize' variable was removed because it is now unused. References: https://reviews.csiden.org/r/229/ https://reviews.freebsd.org/D2764 Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#3400 Closes openzfs#3433 Closes openzfs#3451 Cherry-pick ported by: Tim Chase <tim@chase2k.com> Cherry-picked from ef56b07
If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit 3a17a7a. The change that introduced l2ad_hand alignment was almost correct as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32. Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of the compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to l2_asize kstat. That is wrong, but this commit does not fix that. The accounting problem was introduced partially in commit 3a17a7a and partially in 3038a2b (accounting became consistent but in favour of the wrong size). Porting Notes: Reworked to be C90 compatible and the 'write_psize' variable was removed because it is now unused. References: https://reviews.csiden.org/r/229/ https://reviews.freebsd.org/D2764 Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#3400 Closes openzfs#3433 Closes openzfs#3451 Ported-to-0.6.4.2-by: Tim Chase <tim@chase2k.com>
l2arc-write-target-size.diff
suggested by Andriy Gapon ( @avg-I ) on the FreeBSD mailing list as a fix
which is a cumulation of the following patches:
consistently use asize for allocated size
l2arc_compress_buf: zero buffer tail only if compression succeeds
l2arc_compress_buf: remove redundant check of an unsigned value
l2arc: restore correct rounding up of asize of compressed data
Revert "Transform the I/O when vdev_physical_ashift is greater than"
account for ashift when choosing buffers to be written to l2arc device
References:
freebsd/freebsd-src@master...avg-I:review/l2arc-write-target-size
http://lists.freebsd.org/pipermail/freebsd-fs/2014-October/020242.html
[needs a more extensive list]
Ported-by:
kernelOfTruth kernelOfTruth@gmail.com
Fixes (supposedly):
zfsonlinux#3114
zfsonlinux#3400
[a potentially several others]