-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[buildbot, proof-of-concept (*2)/testing, #2129 + new (tracepoint multilist) #3115], <feedback, comments> #3216
Conversation
Starting from linux-2.6.37, {kmap,kunmap}_atomic takes 1 argument instead of 2. Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
zfsolinux currently uses vmalloc backed slab for ARC buffers. There are some major problems with this approach. One is that 32-bit system have only a handful of vmalloc space. Another is that the fragmentation in slab will easily trigger OOM in busy system. With ABD, we use scatterlist to allocate data buffers. In this approach we can allocate in HIGHMEM, which alleviates vmalloc space pressure on 32-bit. Also, we don't have to rely on slab, so there's no fragmentation issue. But for metadata buffers, we still uses linear buffer from slab. The reason for this is that there are a lot of *_phys pointers directly point to metadata buffers. So it's kind of impractical to change all those code. Currently, ABD is not enabled and its API will treat them as normal buffers. We will enable it once all relevant code is modified to use the API. Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Modify/Add incremental fletcher function prototype to match abd_iterate_rfunc callback type. Also, reduce duplicated code a bit in zfs_fletcher.c. Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
1. Use abd_t in arc_buf_t->b_data, dmu_buf_t->db_data, zio_t->io_data and zio_transform_t->zt_orig_data 2. zio_* function take abd_t for data Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
1. Add checksum function for abd_t 2. Use abd_t version checksum function in zio_checksum_table 3. Make zio_checksum_compute and zio_checksum_error handle abd_t Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
…d zil.c Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Use ABD API on related pointers and functions.(b_data, db_data, zio_*(), etc.) Suggested-by: DHE <git@dehacked.net> Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
bb93410
to
e19172f
Compare
pad[PAGE_SIZE] to pad[4096] (how about archs other than x86, x86_64 ?)
This reverts commit 037763e. XXX - expand this comment as to why we're reverting it. conflicts due to openzfs#2129, in: static void arc_buf_free_on_write (abd_t) static void arc_buf_l2_cdata_free (related to arc_buf_free_on_write(l2hdr->b_tmp_cdata, hdr->b_size, abd_free); ) l2arc_release_cdata_buf (abd_free instead of zio_data_buf_free)
This reverts commit ecf3d9b. Conflicts: module/zfs/arc.c module/zfs/ddt.c module/zfs/spa_misc.c
cd43e4a
to
115a8c2
Compare
5369 arc flags should be an enum 5370 consistent arc_buf_hdr_t naming scheme Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Richard Lowe <richlowe@richlowe.net> conflicts due to openzfs#2129, in: arc.c static void arc_evict_ghost : (with additional argument "arc_buf_contents_t type" since Revert "Allow arc_evict_ghost() to only evict meta data" is left out) l2arc_release_cdata_buf(arc_buf_hdr_t *ab) : zio_data_buf_free(l2hdr->b_tmp_cdata, ab->b_size); is changed to abd_free(l2hdr->b_tmp_cdata, hdr->b_size); fix commit 989fd51 "Change ASSERT(!"...") to cmn_err(CE_PANIC, ...)"
2bd686e
to
af0684c
Compare
5408 managing ZFS cache devices requires lots of RAM Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <dev.fs.zfs@gmail.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> Ported by: Tim Chase <tim@chase2k.com> Porting notes: Due to the restructuring of the ARC-related structures, this patch conflicts with at least the following existing ZoL commits: 6e1d727 Fix inaccurate arcstat_l2_hdr_size calculations The ARC_SPACE_HDRS constant no longer exists and has been somewhat equivalently replaced by HDR_L2ONLY_SIZE. e0b0ca9 Add visibility in to cached dbufs The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr struct requires additional structure member names to be used when referencing the inner items. Also, the presence of L1 or L2 inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR macros. conflicts due to openzfs#2129 solely and/or in combination with these changes, in: cmd/ztest/ztest.c @ztest_dmu_read_write_zcopy abd_copy_from_buf(bigbuf_arcbufs[j]->b_data, (caddr_t)bigbuf + (off - bigoff), chunksize); stays, only if-statement above is expanded to: if (i != 5 || chunksize < (SPA_MINBLOCKSIZE * 2)) { module/zfs/arc.c @l2arc_decompress_zio bzero(hdr->b_buf->b_data, hdr->b_size); to abd_zero(hdr->b_buf->b_data, hdr->b_size); moved: struct l2arc_buf_hdr { to typedef struct l2arc_buf_hdr { keeping in mind for later: /* temporary buffer holder for in-flight compressed data */ abd_t *b_tmp_cdata; @l2arc_release_cdata_buf(arc_buf_hdr_t *hdr) abd_free(l2hdr->b_tmp_cdata, hdr->b_size); to zio_data_buf_free(l2hdr->b_tmp_cdata, hdr->b_size); to abd_free(hdr->b_l1hdr.b_tmp_cdata, hdr->b_size); @l2arc_decompress_zio(zio_t *zio, arc_buf_hdr_t *hdr, enum zio_compress c) bzero(hdr->b_buf->b_data, hdr->b_size); to bzero(hdr->b_l1hdr.b_buf->b_data, hdr->b_size) abd_zero(hdr->b_buf->b_data, hdr->b_size); to abd_zero(hdr->b_l1hdr.b_buf->b_data, hdr->b_size); line below that: zio->io_data = zio->io_orig_data = hdr->b_buf->b_data; to zio->io_data = zio->io_orig_data = hdr->b_l1hdr.b_buf->b_data; merge conflict in @arc_cksum_equal(arc_buf_t *buf) fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc); to abd_fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc); (no change) @arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock) cmn_err(CE_PANIC, "invalid arc state 0x%p", hdr->b_state); to cmn_err(CE_PANIC, "invalid arc state 0x%p", hdr->b_l1hdr.b_state); + fixing >80 lines problem in include/sys/arc_impl.h void *b_tmp_cdata; to abd_t *b_tmp_cdata;
@kernelOfTruth Please see my recent comments in #3115. I don't think you should have 2130841 (from 7168285 or wherever) in your testing stack for the time being. That said, I do hope to get the "lock contention on arcs_mtx" patch in working order very soon. |
@dweeezil thank you very much for the heads up 👍 I should have made my explanation in the first comment more clearer: bc88866 (Fix arc_adjust_meta() behavior) - I've made sure that that commit isn't included - only the handling to do it is different from #3189 & #3190
so everything should be in order |
Threw this @ SCST-operated iSCSI hosts and the crashes are upon us:
With a side of:
Context: 8 disk Z2 VDEV with two 250GB SSDs partitioned @ 80/20 with the 20's being a SLOG mirror and the 80's being an L2ARC span. On the bright side, the system showed IOWait @ 50-70% prior after a few days of load due to the L2ARC evictions, now it works just fine before the crash. |
…3115_WIP_clean integrated changes from the (old, archived) openzfs#2129 pull-request into "Fix misuse of input argument in traverse_visitbp" ecfb0b5
Merging the "zfs_master_28.04.2015" branch - so now that this pull-request is up to the 0.6.4 (or inofficial 0.6.4.1) tag state with stability and performance enhancements, adding one additional commit from #3349 which should address a hardlock ( superseded by #3367 - only the commit message got updated ). Those, that are using this pull-request for heavy testing, throughput, production, etc. please wait until the buildbots give green light & then give it a try. During the next days/weeks I might integrate the current outstanding commits from upstream. Since the latest ABD patchstack integrates support for metadata in addition to data and the changes from #3115 also add kind of similar changes that makes things more complex for me - so I'm somewhat stuck at the moment at the new pull-request ( #3360 ) - therefore focusing more on this one. Thanks |
… devices Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> additional changes & notes for ZFSOnLinux: changes related to @struct l2arc_dev { are in include/sys/arc_impl.h changes for module/zfs/arc.c removed (dupe) superfluous comment from @l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all) /* * Tell ARC this no longer exists in L2ARC. */ @arc_hdr_destroy(arc_buf_hdr_t *hdr) arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); arc_space_return functions are also existent in dweeezil@ab37b29 (openzfs#3115 pull-request) thus neither uncommenting or removing since it might hamper functionality or data integrity comments ?
hm,
|
I saw the same call to IMPLY when going over the patch on their side, nice catch finding it. Should this be ported as well, or redacted in the patch? EDIT - by the way, in order to test this with recent changes in master, you might want to merge tuxoko's abd_next branch which apparently works with large blocks now (i just built with it instead of 2129 or this PR). |
@sempervictus I'd say that's up to @behlendorf and the other devs in charge of ZFSOnLinux to decide if this needs to be introduced for debug purposes this support goes way back until 2010 (partly even 2009 and before) and illumos/illumos-gate@56f3320 - which would be quite some changes The current question is - if it's safe to comment out that line - according to the comments in the code, it appears so, but better safe than sorry |
it compiles - but:
not really sure how to handle that right now - since it implies that it would need an change in the current code ... http://stackoverflow.com/questions/26411482/switch-condition-has-boolean-value it appears to be more of a benign warning and related to coding style rather than something serious edit: Pushing the commit which comments out the talked about line Let's see how it fares with xfstest and zfsstress |
@arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr) line 1912 // IMPLY(l2hdr->b_daddr == L2ARC_ADDR_UNSET, HDR_L2_WRITING(hdr));
@sempervictus thanks for the heads-up - I will use his abd-next probably within the next few days 👍 Like written in #3360 I'm rather stuck with the two metadata-handling ways of ABD & the changes from 3115 - so I'll leave that to the experts for now (also due to personal time constraints) |
4a1ec31
to
d7e1fd0
Compare
Now contains the fix for "l2arc space accounting mismatch"
|
e3036cd
to
84029b2
Compare
adapted to openzfs#3216, adaption to openzfs#2129 in @ l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr) /* * Compression succeeded, we'll keep the cdata around for * writing and release it afterwards. */ + if (rounded > csize) { + bzero((char *)cdata + csize, rounded - csize); + csize = rounded; + } to /* * Compression succeeded, we'll keep the cdata around for * writing and release it afterwards. */ if (rounded > csize) { abd_zero_off(cdata, rounded - csize, csize); csize = rounded; } ZFSonLinux: openzfs#3114 openzfs#3400 openzfs#3433
Now contains the patch from #3433 l2arc-write-target-size.diff |
84029b2
to
db01007
Compare
db01007
to
a7683f5
Compare
…to be written to l2arc device If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit e14bb3258d05c1b1077e2db7cf77088924e56919 Also, consistently use asize / a_sz for the allocated size, psize / p_sz for the physical size. Where the latter accounts for possible size reduction because of compression, whereas the former accounts for possible size expansion because of alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. modified to account for the changes of openzfs#2129 (ABD) , openzfs#3115 (Illumos - 5408 managing ZFS cache devices requires lots of RAM) and Illumos - 5701 zpool list reports incorrect "alloc" value for cache devices
The goal of this pull-request compared to ( #3189 , #3190 - working tests, so not doing too much changes to them)
is to get an up-to-date base (zfs master from March 20th, latest),
and clean, logical-split up commits to make any problems in this combination/porting as easily and clearly visible right from the start to prevent any long-standing issues or regressions introduced from this
Overview/contents:
zfs master
ABD: linear/scatter dual typed buffer for ARC, ABD: https://github.com/kernelOfTruth/zfs/tree/tuxoko_zfs/abd_17.02.2015 (current state)
Illumos - 5497 lock contention, Illumos - 5408 cache devices requiring lots of RAM, Illumos 5369 - arc flags enum and some additional minor changes: https://github.com/kernelOfTruth/zfs/tree/dweeezil_zfs/lock-contention-on-arcs_mtx_23.03.2015 (current state, just updated)
I'm pushing this early to get the buildbots running and to get an early insight if fixes or changes work
edit:
personal porting notes in HTML, exported from tomboy:
http://pastebin.com/A1sMCGXG
The difference in this approach from #3189 and #3190 was that
Revert "Allow arc_evict_ghost() to only evict meta data"
dweeezil@730d690
was omitted
and that "Fix arc_adjust_meta() behavior"
bc88866
thus the reverts of both practically where done in commit "5497 lock contention on arcs_mtx" kernelOfTruth@2130841
Previously I had assumed that there would be at least some changes that would survive in the new code - but was wrong - like pointed out in #3115 (comment) .
Obviously: longer observation beforehand == less work later, less clutter
this made porting more difficult but still realizable.
The pull-requests are done on a per-commit basis which should give a better overview instead of a huge cumulative single "blob"
Not sure how "mergeable" the code is - the single commits should be logical, working units by themselves, like pointed out by @behlendorf
Each commit has some details in merge conflicts (due to #2119 and/or #3115 ) added
There's still the warning from #3115 , dweeezil@7168285
The buildbot issues with undefined PAGE_SHIFT have been dealt with in kernelOfTruth@e19172f , will have to test it in the other proof-of-concept (*1) to see if that gets the green lights for the other failing buildbots
Not sure why here some commits are still "queued for automatic testing" despite long time having passed
edit:
quick link to the tree: https://github.com/kernelOfTruth/zfs/commits/zfs_master_20.03.2015_2129+3115_WIP_clean