
smsc/xpmem: Refactor reg-cache to use tree's find() instead of iterate() #11358

Merged
2 commits merged into open-mpi:main from smsc_rcache_find on Jan 30, 2024

Conversation

gkatev
Contributor

@gkatev gkatev commented Jan 30, 2023

Hi, I'd like to propose some changes to the registration cache in smsc/xpmem. Currently, looking up a registration iterates over the whole tree (or until a match is found). This is necessary because the tree stores attachments to different procs/endpoints, but it can get slow and scale poorly. To address this, this PR creates one regcache/tree per endpoint/peer and uses the find() method (O(log n) instead of O(n)).
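
A minimal sketch of the before/after lookup, using simplified hypothetical types (reg_t, node_t) rather than the actual opal rcache/interval-tree API, just to illustrate the complexity difference:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified registration record; illustration only. */
typedef struct reg {
    uintptr_t base;   /* first byte of the attached range */
    uintptr_t bound;  /* last valid byte, inclusive */
    void     *ep;     /* owning endpoint (needed only with a shared tree) */
} reg_t;

/* Old scheme: one shared store, lookups scan everything and filter by
 * endpoint -- O(n) in the total number of registrations. */
static reg_t *lookup_iterate(reg_t *regs, size_t n, void *ep,
                             uintptr_t base, uintptr_t bound)
{
    for (size_t i = 0; i < n; i++)
        if (regs[i].ep == ep && regs[i].base <= base && bound <= regs[i].bound)
            return &regs[i];
    return NULL;
}

/* New scheme: one tree per endpoint, so a balanced find() costs O(log n)
 * in that peer's registrations only.  Sketched as a BST keyed on base,
 * assuming per-endpoint registrations do not overlap each other. */
typedef struct node { reg_t reg; struct node *left, *right; } node_t;

static reg_t *lookup_find(node_t *root, uintptr_t base, uintptr_t bound)
{
    while (NULL != root) {
        if (base < root->reg.base)
            root = root->left;
        else if (base > root->reg.bound)
            root = root->right;
        else
            return (bound <= root->reg.bound) ? &root->reg : NULL;
    }
    return NULL;
}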

This PR supersedes #11130.
FYI @devreal

  • The first commit implements the juicy part of this PR (find() vs iterate()).
  • The second, smaller commit takes the opportunity to add support for the MCA_RCACHE_FLAGS_PERSIST and MCA_RCACHE_FLAGS_CACHE_BYPASS flags.
    • Would this be better in its own PR instead of bundled here? If so, I can split it out into a new one.

Note that to put this together I took "bound" to mean the last valid byte in the range, inclusive, and fixed some instances where this was mishandled. Please let me know if this is not accurate/desired.
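
In concrete terms, the convention assumed here (illustrative helpers, not code from the patch):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* "bound" = last valid byte of a range, inclusive. */
static inline uintptr_t range_bound(uintptr_t base, size_t size)
{
    return base + size - 1;
}

/* A registration [reg_base, reg_bound] fully covers a request [base, bound]
 * exactly when: */
static inline bool reg_covers(uintptr_t reg_base, uintptr_t reg_bound,
                              uintptr_t base, uintptr_t bound)
{
    return reg_base <= base && bound <= reg_bound;
}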


The benefit of this change will vary depending on the number of standing registrations and the message size. I expect the relative gain to be most relevant for small-to-medium messages (but larger than any copy-in-copy-out-to-single-copy thresholds). On a single node with 64 ranks, using microbenchmarks, tuned's alltoall @ 1K improves from ~143 us to ~124 us. Using my component (xhc), the hierarchical allreduce @ 16K improves from ~39 us to ~31 us, while the (generally worse) non-hierarchical allreduce (more attachments/regs) @ 4K improves from 100 us to 42 us.

While using a separate tree for each endpoint results in more trees, the total number of tree nodes does not increase.
A planned future enhancement that caches peer-process-to-rcache associations will improve this further.


Signed-off-by: George Katevenis <gkatev@ics.forth.gr>

@ompiteam-bot

Can one of the admins verify this patch?

Member

@bosilca bosilca left a comment


ok to test.

return OPAL_SUCCESS;
endpoint->vma_module = mca_rcache_base_vma_module_alloc();
if (OPAL_UNLIKELY(NULL == endpoint->vma_module)) {
    OBJ_RELEASE(endpoint);
Member


All your return paths are polluted with the release of the endpoint object, which was allocated too early. If you delay the creation of the endpoint until you have gathered all the necessary information, the code will be more readable.

Contributor Author


I moved the modex operation earlier to avoid one such instance. But I'm not sure I can elegantly avoid the release(s) in the other 2 fail blocks, as all 3 operations (ep, vma_module, xpmem_get) may fail and all need cleanup.
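
For illustration, the shape of the code being discussed, with placeholder names (ep_t, vma_alloc, seg_attach) rather than the actual smsc/xpmem functions; each of the three steps can fail and has to unwind whatever came before it:

#include <stdlib.h>

/* Placeholder types/prototypes, purely illustrative. */
typedef struct { void *vma_tree; int seg; } ep_t;
void *vma_alloc(void);
void  vma_release(void *vma);
int   seg_attach(ep_t *ep);   /* stands in for the xpmem_get()-like step */

int endpoint_create(ep_t **ep_out)
{
    ep_t *ep = calloc(1, sizeof(*ep));      /* step 1: endpoint object   */
    if (NULL == ep)
        return -1;

    ep->vma_tree = vma_alloc();             /* step 2: per-endpoint tree */
    if (NULL == ep->vma_tree) {
        free(ep);
        return -1;
    }

    if (seg_attach(ep) < 0) {               /* step 3: attach to peer    */
        vma_release(ep->vma_tree);
        free(ep);
        return -1;
    }

    *ep_out = ep;
    return 0;
}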

        reg = NULL;
    }
} else {
    /* If there is a registration that overlaps with the requested range, but
Member


It is possible that multiple regions overlap with the current one (imagine two prior registrations with a gap in the middle, a gap that is then covered by this registration). If I understand the code correctly, you expect to merge these 3 registrations, but the current code can only merge two (so it will merge just one of the earlier registrations with the current one).

Contributor Author


Indeed, I see. Apparently the old code also only merged with the first overlapping area found. I updated the PR; the loop now does not stop at the first overlapping area it finds, but keeps going and includes potentially all 4 surrounding areas. Please review again.
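
Roughly the behavior the updated loop aims for, as a standalone sketch over a flat array of hypothetical reg_t entries (the PR does the equivalent with queries on the per-endpoint tree); it assumes existing registrations do not overlap each other:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t base, bound; bool valid; } reg_t;

/* Expand the requested range [*base, *bound] so it also covers every
 * existing registration it overlaps with, and mark those registrations
 * as superseded; a single new registration then replaces them all. */
static void merge_overlaps(reg_t *regs, size_t n,
                           uintptr_t *base, uintptr_t *bound)
{
    for (size_t i = 0; i < n; i++) {
        if (!regs[i].valid)
            continue;
        if (regs[i].base > *bound || regs[i].bound < *base)
            continue;                       /* no overlap with this one */
        if (regs[i].base  < *base)  *base  = regs[i].base;
        if (regs[i].bound > *bound) *bound = regs[i].bound;
        regs[i].valid = false;              /* old reg will be replaced */
    }
}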

Member


Better, but still suboptimal. There are two issues I noticed:

  1. Each find operation takes and releases a lock on the underlying opal_interval_tree_t object. If the merge operation is lowered into the interval object itself, we can do the search and the merge with only one lock and one unlock.
  2. Having the search outside the opal_interval_tree_t also triggers a tree traversal for each find operation. If we move the merge down into this object, we can do all the merges with a single tree traversal (because the nodes are ordered).
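
As a rough sketch of what lowering the merge into the tree could look like (a hypothetical sorted container stands in for opal_interval_tree_t; this is not its actual interface), the whole merge-insert happens under one lock and one ordered pass:

#include <pthread.h>
#include <stdint.h>

/* Hypothetical ordered store of [base, bound] entries plus a lock;
 * a sorted singly-linked list keeps the sketch short. */
typedef struct entry { uintptr_t base, bound; struct entry *next; } entry_t;
typedef struct { pthread_mutex_t lock; entry_t *head; } store_t;

/* Single lock, single ordered traversal: unlink every entry overlapping
 * [base, bound], widening the range as we go, then link in one merged
 * entry.  Assumes stored entries never overlap each other. */
void store_merge_insert(store_t *s, entry_t *merged,
                        uintptr_t base, uintptr_t bound)
{
    pthread_mutex_lock(&s->lock);

    entry_t **link = &s->head;
    while (NULL != *link && (*link)->base <= bound) {
        entry_t *e = *link;
        if (e->bound >= base) {             /* overlaps: absorb it       */
            if (e->base  < base)  base  = e->base;
            if (e->bound > bound) bound = e->bound;
            *link = e->next;                /* unlink the old entry      */
            /* (the old attachment would be released outside the lock)   */
        } else {
            link = &e->next;                /* entirely before the range */
        }
    }
    merged->base  = base;
    merged->bound = bound;
    merged->next  = *link;
    *link = merged;

    pthread_mutex_unlock(&s->lock);
}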

Contributor Author


I agree it is suboptimal; we need multiple operations because find() does not support partial matching. There are also some thread-safety concerns (in this spot and in some other ones). My intuition is that, to address these concerns, the atomicity point needs to be moved inside the tree instead of all the way out in smsc.

Moving the merge operation inside the tree indeed sounds good (and could also improve the atomicity situation). Even better if we had an XPMEM reg immediately ready to take the place of the old ones as soon as we invalidate them.

Anyway, beyond the rambling above, these problems exist in the old iterate() implementation and they remain in the find() one suggested here. I don't currently have a concrete plan to tackle them, so I would suggest we merge the find() improvement, which is ready and allegedly does not introduce new shortcomings. We could create an issue to track the fixes and improvements to be made in the future.

@gkatev gkatev force-pushed the smsc_rcache_find branch 3 times, most recently from 7a3a825 to 462ae7f Compare February 2, 2023 12:17
@gkatev gkatev requested a review from bosilca February 8, 2023 12:31
@@ -67,10 +71,6 @@ typedef struct mca_smsc_xpmem_component_t mca_smsc_xpmem_component_t;

struct mca_smsc_xpmem_module_t {
    mca_smsc_module_t super;

Member


I am skeptical about the value of this particular change. If we assume the communication library is only used for point-to-point communications without buffer reuse, then this change might make sense. But I don't think that is the most common pattern in which MPI is used, and as soon as we add collective communications to the mix, sharing registrations is essential.

Let's imagine an MPI_Allreduce on a very fat node. The temporary buffer used in the collective will be registered for most peers (depending on the collective algorithm, but more than once) instead of being registered once. I understand this is a balance between search time (which with your approach is now logarithmic) and the total number of registrations.

It would be interesting to see the impact of the proposed change in different scenarios, where shared and individual registrations can be assessed. Can you run a benchmark (preferably IMB) for the different registration and usage patterns of the user and temporary buffers? MPI_Allreduce, MPI_Allgather, and MPI_Alltoall should give a good picture of the impact.

Contributor Author


I agree that we want sharing of registrations between components in the future. I have in mind implementing a structure that caches associations of processes to smsc/xpmem endpoints, so that when a component requests an endpoint, we return an already existing one if there is one, achieving sharing of xpmem attachments to a specific process between different components.
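
Purely as an illustrative sketch of that idea (hypothetical names, not an existing Open MPI structure): a small proc-to-endpoint cache that hands back an existing endpoint on repeated requests, so different components end up sharing one set of attachments per peer.

#include <stddef.h>

#define PROC_EP_CACHE_MAX 256

/* Hypothetical proc-to-endpoint cache; illustration only. */
typedef struct { const void *proc; void *endpoint; } proc_ep_entry_t;

typedef struct {
    proc_ep_entry_t entries[PROC_EP_CACHE_MAX];
    size_t          count;
} proc_ep_cache_t;

void *endpoint_create_for(const void *proc);   /* placeholder constructor */

/* Return the endpoint already associated with `proc`, creating and caching
 * it only on first use, so every component shares the same attachments. */
void *get_endpoint(proc_ep_cache_t *cache, const void *proc)
{
    for (size_t i = 0; i < cache->count; i++)
        if (cache->entries[i].proc == proc)
            return cache->entries[i].endpoint;

    void *ep = endpoint_create_for(proc);
    if (NULL != ep && cache->count < PROC_EP_CACHE_MAX) {
        cache->entries[cache->count].proc     = proc;
        cache->entries[cache->count].endpoint = ep;
        cache->count++;
    }
    return ep;
}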

If I understood your point correctly, we shouldn't have more total registrations with find() compared to iterate(). And the previous code did not achieve sharing between different components either: even though it kept all registrations in the same structure, it filtered them by smsc endpoint. Now the registrations are spread across more trees, but the number of regs per endpoint remains the same.

I've run some numbers to demonstrate the potential of this change; some of them are attached in the original comment above. For example, in my collectives component (which uses smsc/xpmem, and which I actually plan to submit as a PR in the coming days) there is a non-hierarchical allreduce version in which, for small messages, the root attaches to every other rank's buffer and performs the reduction itself. With 64 procs and iterate(), there are 63 regs in its cache, plus a couple more resulting from the supporting MPI operations in the microbenchmark. With find(), there are 1 to 3 regs per rank (besides the search time, the regs are also spread out across multiple trees). As a result, in this scenario with 64 procs @ 4K message size, the average allreduce latency improves from 100 us to 42 us.

I can also upload some graphs for these measurements to make examination easier.

Member


Please upload the graphs. Thanks.

Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
@gkatev
Contributor Author

gkatev commented Feb 9, 2023

@bosilca, below are the graphs from some performance measurements I had done, comparing the old iterate() approach with the new find() one. They are from the OSU suite, on a node with 2x Epyc 7501 (64 procs total). In most cases the difference is noticeable at relatively small sizes (but above the single-copy threshold), which is why I show results up to 8 KB. The gap is also expected to widen in potentially busier, non-microbenchmark scenarios with more active registrations.

[Graphs: tuned_allgather, tuned_allreduce, tuned_alltoall, tuned_gather, tuned_reduce, xhc_allreduce, xhc_flat_allreduce]

@janjust
Contributor

janjust commented Jan 30, 2024

@bosilca Out of curiosity, why wasn't this merged??

@bosilca bosilca merged commit 58e54b5 into open-mpi:main Jan 30, 2024