-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arc_read()/arc_access() refactoring and cleanup. #14123
Conversation
fc20a76
to
4cef946
Compare
As this patch refactors arc_read() and arc_access(), any clue if it would affect #14120 ? |
I saw that report of yours, but I am not certain about its causes. I don't think this patch should change MRU/MFU distribution (arc_p), especially after you disabled prefetch, that is mostly affected by this PR. So I don't think it would, but its hard to say for sure, its a quite tangled area. |
@mmaybee @ahrens Could you please take a look on this PR? And in particular, could you explain me why prefetched read in arc_mru_ghost and especially arc_mfu_ghost states promoted buffer to arc_mru instead of arc_mfu? Don't you remember what thinking was behind it 16 years ago? Unfortunately message in the illumos commit 13506d1eefb is not descriptive. |
I think a prefetched read in arc_mru_ghost means that the block was read by a predictive prefetch, and then evicted before being demand-accessed. In that case it seems reasonable for a subsequent access to move it to arc_mru, since it was only demand-accessed once. However, I see that if we get a hit on a prefetched block in arc_mru, we move it to mfu. It seems odd to me that we would have a block that was only requested once by the user (but happens to have been correctly predictively prefetched) to be in the MFU. It definitely make sense that a hit on a prefetched block in arc_mru_ghost or arc_mru should go to the same state, so the current logic seems wrong. But I would think it should go to arc_mru, not arc_mfu, because it was only "hit" once. |
Thanks, @ahrens. I should have noticed that part myself. Fixed.
Neither my nor I think the existing code promotes previously prefetched block from MRU to MFU on demand hit. Only on following access, whether it will be another demand or another prefetch after the current demand. It may look like promotes only on demand requests could be right, but from anon or ghost we promote to MRU on any request, including prefetch, so the current logic is at least consistent. |
404312f
to
ef585e3
Compare
As promised during the Developer Summit, I am working on uncached prefetch implementation. Since those two patches are going to overlap in arc.c, I'd appreciate some reviews for this one meanwhile. Thanks in advance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactoring and cleanup and test changes look good. I have not so much expertise in the area to comment on the behavioral changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks pretty reasonable and it's held up well in my local stress testing. I'll keep throwing interesting workloads at it to be certain, but right now this LGTM.
With the addition of the new iostat
kstat entries we'll want to update the arcstat
and arc_summary
commands. Otherwise, we'll end up reporting slightly skewed results due to the ignored "iohit" counters.
a0083da
to
f9c462b
Compare
Added sync_pool to couple more unstable tests. |
Removed hash_lock argument from arc_evict_hdr(), noticed by @ryao. |
Slightly refactor beginning of arc_access() as proposed by @ryao, and fix minor stats issues I spotted while at it. |
module/zfs/arc.c
Outdated
* its logic to balance MRU/MFU based on the original state. | ||
*/ | ||
arc_adapt(arc_hdr_size(hdr), hdr->b_l1hdr.b_state); | ||
add_reference(hdr, hdr); /* For IO_IN_PROGRESS. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to add an additional reference here? Maybe a comment indicating why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To stop arc_access() from putting this header without any buffers and so other references but obviously nonevictable onto the evictable list of MRU or MFU state. Added comment.
ASSERT(HDR_HAS_L1HDR(hdr)); | ||
ASSERT(!HDR_IO_IN_PROGRESS(hdr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we guarantee that an I/O is not in progress?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because header having reference will never appear on evictable headers list of any state. That is why I added references for in-progress headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation.
include/sys/arc_impl.h
Outdated
@@ -174,6 +178,7 @@ typedef struct l1arc_buf_hdr { | |||
zfs_refcount_t b_refcnt; | |||
|
|||
arc_callback_t *b_acb; | |||
arc_callback_t **b_acbp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need two accesses to the callback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To efficiently implement FIFO to keep original order. First pointer points to the first element, second -- to the acb_next of the last element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the ordering critical here? Just wondering about preserving the ordering vs the additional memory requirement for each l1hdr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be. In many benchmarks I saw that requests arriving in order from different threads and blocked on the same indirect block are reaching the data in wrong order, confusing speculative prefetcher and potentially I/O scheduler and disks. Some time ago I made prefetcher to do speculations for read before reading indirect blocks, that significantly helped reads, but for writes it appeared difficult. I hope this change should if not fix the problem, but may be improve it a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. It is probably more efficient to make arc_callback_t
have an extra pointer for the previous element so that we can process the callback list in reverse order in arc_read_done
. This would mean that the memory penalty would only apply when we have a read in progress and not to every l1hdr and gives you the ordering you need. Would that suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be cheaper for the header, indeed, but still could add one more traverse. I'll think about it closer tomorrow morning. But if we care so much about this structure, we could probably save much more by hiding b_freeze_* under ZFS_DEBUG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented. First traversal left reversed to not traverse extra time. Hope there will be no significant downsides. The callbacks though should be called in right order, that should be more important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I also like the idea of putting b_freeze_*
behind a debug flag to reduce the memory.
86d1af2
to
17ccf06
Compare
ARC code was many times significantly modified over the years, that created significant amount of tangled and potentially broken code. This should make arc_access()/arc_read() code some more readable. - Decouple prefetch status tracking from b_refcnt. It made sense originally, but became highly cryptic over the years. Move all the logic into arc_access(). While there, clean up and comment state transitions in arc_access(). Some transitions were weird IMO. - Unify arc_access() calls to arc_read() instead of sometimes calling it from arc_read_done(). To avoid extra state changes and checks add one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS. - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with the same callback mechanism to not falsely account them as hits. Count those as "iohits", an intermediate between "hits" and "misses". While there, call read callbacks in original request order, that should be good for fairness and random speculations/allocations/aggregations. - Introduce additional statistic counters for prefetch, accounting predictive vs prescient and hits vs iohits vs misses. - Remove hash_lock argument from functions not needing it. - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places. - Fix few false positive tests found in the process. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.
Recent ARC commits added new statistic counters, such as iohits, uncached state, etc. Represent those. Also some of previously reported numbers were confusing or even made no sense. Cleanup and restructure existing reports. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Issue #14115 Issue #14123 Issue #14243 Closes #14320
ARC code was many times significantly modified over the years, that created significant amount of tangled and potentially broken code. This should make arc_access()/arc_read() code some more readable. - Decouple prefetch status tracking from b_refcnt. It made sense originally, but became highly cryptic over the years. Move all the logic into arc_access(). While there, clean up and comment state transitions in arc_access(). Some transitions were weird IMO. - Unify arc_access() calls to arc_read() instead of sometimes calling it from arc_read_done(). To avoid extra state changes and checks add one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS. - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with the same callback mechanism to not falsely account them as hits. Count those as "iohits", an intermediate between "hits" and "misses". While there, call read callbacks in original request order, that should be good for fairness and random speculations/allocations/aggregations. - Introduce additional statistic counters for prefetch, accounting predictive vs prescient and hits vs iohits vs misses. - Remove hash_lock argument from functions not needing it. - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places. - Fix few false positive tests found in the process. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#14123
ARC code was many times significantly modified over the years, that created significant amount of tangled and potentially broken code. This should make arc_access()/arc_read() code some more readable. - Decouple prefetch status tracking from b_refcnt. It made sense originally, but became highly cryptic over the years. Move all the logic into arc_access(). While there, clean up and comment state transitions in arc_access(). Some transitions were weird IMO. - Unify arc_access() calls to arc_read() instead of sometimes calling it from arc_read_done(). To avoid extra state changes and checks add one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS. - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with the same callback mechanism to not falsely account them as hits. Count those as "iohits", an intermediate between "hits" and "misses". While there, call read callbacks in original request order, that should be good for fairness and random speculations/allocations/aggregations. - Introduce additional statistic counters for prefetch, accounting predictive vs prescient and hits vs iohits vs misses. - Remove hash_lock argument from functions not needing it. - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places. - Fix few false positive tests found in the process. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#14123
Recent ARC commits added new statistic counters, such as iohits, uncached state, etc. Represent those. Also some of previously reported numbers were confusing or even made no sense. Cleanup and restructure existing reports. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Issue openzfs#14115 Issue openzfs#14123 Issue openzfs#14243 Closes openzfs#14320
Earlier as part of openzfs#14123 I've removed one use of b_cv. This patch reuses the same approach to remove the other one from much more rare code path. This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from 200 to 184 bytes) and seems even more on Linux. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.
Earlier as part of #14123 I've removed one use of b_cv. This patch reuses the same approach to remove the other one from much more rare code path. This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from 200 to 184 bytes) and seems even more on Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15340
Earlier as part of openzfs#14123 I've removed one use of b_cv. This patch reuses the same approach to remove the other one from much more rare code path. This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from 200 to 184 bytes) and seems even more on Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#15340
Earlier as part of #14123 I've removed one use of b_cv. This patch reuses the same approach to remove the other one from much more rare code path. This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from 200 to 184 bytes) and seems even more on Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15340
New Features - Block cloning (#13392) - Linux container support (#14070, #14097, #12263) - Scrub error log (#12812, #12355) - BLAKE3 checksums (#12918) - Corrective "zfs receive" - Vdev and zpool user properties Performance - Fully adaptive ARC (#14359) - SHA2 checksums (#13741) - Edon-R checksums (#13618) - Zstd early abort (#13244) - Prefetch improvements (#14603, #14516, #14402, #14243, #13452) - General optimization (#14121, #14123, #14039, #13680, #13613, #13606, #13576, #13553, #12789, #14925, #14948) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Earlier as part of openzfs#14123 I've removed one use of b_cv. This patch reuses the same approach to remove the other one from much more rare code path. This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from 200 to 184 bytes) and seems even more on Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#15340
ARC code was many times significantly modified over the years, that created significant amount of tangled and potentially broken code. This should make arc_access()/arc_read() code some more readable.
How Has This Been Tested?
Lots of code reading, basic read/write tests with ZFS_DEBUG and lots of ZTS runs.
Types of changes
Checklist:
Signed-off-by
.