
smsc/xpmem: Refactor reg-cache to use tree's find() instead of iterate() #11358

Merged
2 commits merged into open-mpi:main from smsc_rcache_find on Jan 30, 2024

Conversation

gkatev
Contributor

@gkatev gkatev commented Jan 30, 2023

Hi, I'd like to propose some changes to the registration cache in smsc/xpmem. Currently, looking up a registration iterates over the whole tree (or until a match is found). This is necessary because the tree stores attachments to different procs/endpoints, but it can get slow and scale poorly. To address this, this PR creates one regcache/tree per endpoint/peer and uses the find() method (O(log n) instead of O(n)).
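
A minimal sketch of the before/after lookup, using simplified hypothetical types (reg_t, node_t) rather than the actual opal rcache/interval-tree API, just to illustrate the complexity difference:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified registration record; illustration only. */
typedef struct reg {
    uintptr_t base;   /* first byte of the attached range */
    uintptr_t bound;  /* last valid byte, inclusive */
    void     *ep;     /* owning endpoint (needed only with a shared tree) */
} reg_t;

/* Old scheme: one shared store, lookups scan everything and filter by
 * endpoint -- O(n) in the total number of registrations. */
static reg_t *lookup_iterate(reg_t *regs, size_t n, void *ep,
                             uintptr_t base, uintptr_t bound)
{
    for (size_t i = 0; i < n; i++)
        if (regs[i].ep == ep && regs[i].base <= base && bound <= regs[i].bound)
            return &regs[i];
    return NULL;
}

/* New scheme: one tree per endpoint, so a balanced find() costs O(log n)
 * in that peer's registrations only.  Sketched as a BST keyed on base,
 * assuming per-endpoint registrations do not overlap each other. */
typedef struct node { reg_t reg; struct node *left, *right; } node_t;

static reg_t *lookup_find(node_t *root, uintptr_t base, uintptr_t bound)
{
    while (NULL != root) {
        if (base < root->reg.base)
            root = root->left;
        else if (base > root->reg.bound)
            root = root->right;
        else
            return (bound <= root->reg.bound) ? &root->reg : NULL;
    }
    return NULL;
}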

This PR supersedes #11130.
FYI @devreal

  • The first commit implements the juicy part of this PR (find() vs iterate()).
  • The second, smaller commit takes the opportunity to add support for the MCA_RCACHE_FLAGS_PERSIST and MCA_RCACHE_FLAGS_CACHE_BYPASS flags.
    • Would this be better in its own PR instead of bundled here? If so, I can split it out into a new one.

Note that to put this together I took "bound" to mean the last valid byte in the range, inclusive, and fixed some instances where this was mishandled. Please let me know if this is not accurate/desired.
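
In concrete terms, the convention assumed here (illustrative helpers, not code from the patch):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* "bound" = last valid byte of a range, inclusive. */
static inline uintptr_t range_bound(uintptr_t base, size_t size)
{
    return base + size - 1;
}

/* A registration [reg_base, reg_bound] fully covers a request [base, bound]
 * exactly when: */
static inline bool reg_covers(uintptr_t reg_base, uintptr_t reg_bound,
                              uintptr_t base, uintptr_t bound)
{
    return reg_base <= base && bound <= reg_bound;
}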


The benefit of this change will vary depending on the number of standing registrations and the message size. I expect the relative gain to be most relevant for small-to-medium messages (but larger than any copy-in-copy-out-to-single-copy thresholds). On a single node with 64 ranks, using microbenchmarks, tuned's alltoall @ 1K improves from ~143 us to ~124 us. Using my component (xhc), the hierarchical allreduce @ 16K improves from ~39 us to ~31 us, while the (generally worse) non-hierarchical allreduce (more attachments/regs) @ 4K improves from 100 us to 42 us.

While using a separate tree for each endpoint results in more trees, the total number of tree nodes does not increase.
A planned future enhancement that caches peer-process-to-rcache associations will improve this further.


Signed-off-by: George Katevenis <gkatev@ics.forth.gr>

@ompiteam-bot

Can one of the admins verify this patch?

Member

@bosilca bosilca left a comment


ok to test.

return OPAL_SUCCESS;
endpoint->vma_module = mca_rcache_base_vma_module_alloc();
if (OPAL_UNLIKELY(NULL == endpoint->vma_module)) {
    OBJ_RELEASE(endpoint);
Member


All your return paths are polluted with the release of the endpoint object, which was allocated too early. If you delay the creation of the endpoint until you have gathered all the necessary information, the code will be more readable.

Contributor Author


I moved the modex operation earlier to avoid one such instance. But I'm not sure I can elegantly avoid the release(s) in the other 2 fail blocks, as all 3 operations (ep, vma_module, xpmem_get) may fail and all need cleanup.
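
For illustration, the shape of the code being discussed, with placeholder names (ep_t, vma_alloc, seg_attach) rather than the actual smsc/xpmem functions; each of the three steps can fail and has to unwind whatever came before it:

#include <stdlib.h>

/* Placeholder types/prototypes, purely illustrative. */
typedef struct { void *vma_tree; int seg; } ep_t;
void *vma_alloc(void);
void  vma_release(void *vma);
int   seg_attach(ep_t *ep);   /* stands in for the xpmem_get()-like step */

int endpoint_create(ep_t **ep_out)
{
    ep_t *ep = calloc(1, sizeof(*ep));      /* step 1: endpoint object   */
    if (NULL == ep)
        return -1;

    ep->vma_tree = vma_alloc();             /* step 2: per-endpoint tree */
    if (NULL == ep->vma_tree) {
        free(ep);
        return -1;
    }

    if (seg_attach(ep) < 0) {               /* step 3: attach to peer    */
        vma_release(ep->vma_tree);
        free(ep);
        return -1;
    }

    *ep_out = ep;
    return 0;
}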

        reg = NULL;
    }
} else {
    /* If there is a registration that overlaps with the requested range, but
Member


It is possible that multiple regions overlap with the current one (imagine two prior registrations with a gap in the middle, a gap that is then covered by this registration). If I understand the code correctly, you expect to merge these 3 registrations, but the current code can only merge two (so it will merge just one of the earlier registrations with the current one).

Contributor Author


Indeed, I see. Apparently the old code also only merged with the first overlapping area found. I updated the PR; the loop now does not stop at the first overlapping area it finds, but keeps going and includes potentially all 4 surrounding areas. Please review again.
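
Roughly the behavior the updated loop aims for, as a standalone sketch over a flat array of hypothetical reg_t entries (the PR does the equivalent with queries on the per-endpoint tree); it assumes existing registrations do not overlap each other:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t base, bound; bool valid; } reg_t;

/* Expand the requested range [*base, *bound] so it also covers every
 * existing registration it overlaps with, and mark those registrations
 * as superseded; a single new registration then replaces them all. */
static void merge_overlaps(reg_t *regs, size_t n,
                           uintptr_t *base, uintptr_t *bound)
{
    for (size_t i = 0; i < n; i++) {
        if (!regs[i].valid)
            continue;
        if (regs[i].base > *bound || regs[i].bound < *base)
            continue;                       /* no overlap with this one */
        if (regs[i].base  < *base)  *base  = regs[i].base;
        if (regs[i].bound > *bound) *bound = regs[i].bound;
        regs[i].valid = false;              /* old reg will be replaced */
    }
}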

Member


Better, but still suboptimal. There are two issues I noticed:

  1. Each find operation takes and releases a lock on the underlying opal_interval_tree_t object. If the merge operation is lowered into the interval object itself, we can do the search and the merge with only one lock and one unlock.
  2. Having the search outside the opal_interval_tree_t also triggers a tree traversal for each find operation. If we move the merge down into this object, we can do all the merges with a single tree traversal (because the nodes are ordered).
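
As a rough sketch of what lowering the merge into the tree could look like (a hypothetical sorted container stands in for opal_interval_tree_t; this is not its actual interface), the whole merge-insert happens under one lock and one ordered pass:

#include <pthread.h>
#include <stdint.h>

/* Hypothetical ordered store of [base, bound] entries plus a lock;
 * a sorted singly-linked list keeps the sketch short. */
typedef struct entry { uintptr_t base, bound; struct entry *next; } entry_t;
typedef struct { pthread_mutex_t lock; entry_t *head; } store_t;

/* Single lock, single ordered traversal: unlink every entry overlapping
 * [base, bound], widening the range as we go, then link in one merged
 * entry.  Assumes stored entries never overlap each other. */
void store_merge_insert(store_t *s, entry_t *merged,
                        uintptr_t base, uintptr_t bound)
{
    pthread_mutex_lock(&s->lock);

    entry_t **link = &s->head;
    while (NULL != *link && (*link)->base <= bound) {
        entry_t *e = *link;
        if (e->bound >= base) {             /* overlaps: absorb it       */
            if (e->base  < base)  base  = e->base;
            if (e->bound > bound) bound = e->bound;
            *link = e->next;                /* unlink the old entry      */
            /* (the old attachment would be released outside the lock)   */
        } else {
            link = &e->next;                /* entirely before the range */
        }
    }
    merged->base  = base;
    merged->bound = bound;
    merged->next  = *link;
    *link = merged;

    pthread_mutex_unlock(&s->lock);
}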

Contributor Author


I agree it is suboptimal; we need multiple operations because find() does not support partial matching. There are also some thread-safety concerns (in this spot and in some other ones). My intuition is that, to address these concerns, the atomicity point needs to be moved inside the tree instead of all the way out in smsc.

Moving the merge operation inside the tree indeed sounds good (and could also improve the atomicity situation). Even better if we had an XPMEM reg immediately ready to take the place of the old ones as soon as we invalidate them.

Anyway, beyond the rambling above, these problems exist in the old iterate() implementation and they remain in the find() one suggested here. I don't currently have a concrete plan to tackle them, so I would suggest we merge the find() improvement, which is ready and allegedly does not introduce new shortcomings. We could create an issue to track the fixes and improvements to be made in the future.

@gkatev gkatev force-pushed the smsc_rcache_find branch 3 times, most recently from 7a3a825 to 462ae7f Compare February 2, 2023 12:17
@gkatev gkatev requested a review from bosilca February 8, 2023 12:31
@@ -67,10 +71,6 @@ typedef struct mca_smsc_xpmem_component_t mca_smsc_xpmem_component_t;

struct mca_smsc_xpmem_module_t {
    mca_smsc_module_t super;

Member


I am skeptical about the value of this particular change. If we assume the communication library is only used for point-to-point communications without buffer reuse, then this change might make sense. But I don't think that is the most common pattern in which MPI is used, and as soon as we add collective communications to the mix, sharing registrations is essential.

Let's imagine an MPI_Allreduce on a very fat node. The temporary buffer used in the collective will be registered for most peers (depending on the collective algorithm, but more than once) instead of being registered once. I understand this is a balance between search time (which with your approach is now logarithmic) and the total number of registrations.

It would be interesting to see the impact of the proposed change in different scenarios, where shared and individual registrations can be assessed. Can you run a benchmark (preferably IMB) for the different registration and usage patterns of the user and temporary buffers? MPI_Allreduce, MPI_Allgather, and MPI_Alltoall should give a good picture of the impact.

Contributor Author


I agree that we want sharing of registrations between components in the future. I have in mind implementing a structure that caches associations of processes to smsc/xpmem endpoints, so that when a component requests an endpoint, we return an already existing one if there is one, achieving sharing of xpmem attachments to a specific process between different components.
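
Purely as an illustrative sketch of that idea (hypothetical names, not an existing Open MPI structure): a small proc-to-endpoint cache that hands back an existing endpoint on repeated requests, so different components end up sharing one set of attachments per peer.

#include <stddef.h>

#define PROC_EP_CACHE_MAX 256

/* Hypothetical proc-to-endpoint cache; illustration only. */
typedef struct { const void *proc; void *endpoint; } proc_ep_entry_t;

typedef struct {
    proc_ep_entry_t entries[PROC_EP_CACHE_MAX];
    size_t          count;
} proc_ep_cache_t;

void *endpoint_create_for(const void *proc);   /* placeholder constructor */

/* Return the endpoint already associated with `proc`, creating and caching
 * it only on first use, so every component shares the same attachments. */
void *get_endpoint(proc_ep_cache_t *cache, const void *proc)
{
    for (size_t i = 0; i < cache->count; i++)
        if (cache->entries[i].proc == proc)
            return cache->entries[i].endpoint;

    void *ep = endpoint_create_for(proc);
    if (NULL != ep && cache->count < PROC_EP_CACHE_MAX) {
        cache->entries[cache->count].proc     = proc;
        cache->entries[cache->count].endpoint = ep;
        cache->count++;
    }
    return ep;
}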

If I understood your point correctly, we shouldn't have more total registrations with find() compared to iterate(). And the previous code did not achieve sharing between different components either: even though it kept all registrations in the same structure, it filtered them by smsc endpoint. Now the registrations are spread across more trees, but the number of regs per endpoint remains the same.

I've run some numbers to demonstrate the potential of this change; some of them are attached in the original comment above. For example, in my collectives component (which uses smsc/xpmem, and which I actually plan to submit as a PR in the coming days) there is a non-hierarchical allreduce version in which, for small messages, the root attaches to every other rank's buffer and performs the reduction itself. With 64 procs and iterate(), there are 63 regs in its cache, plus a couple more resulting from the supporting MPI operations in the microbenchmark. With find(), there are 1 to 3 regs per rank (besides the search time, the regs are also spread out across multiple trees). As a result, in this scenario with 64 procs @ 4K message size, the average allreduce latency improves from 100 us to 42 us.

I can also upload some graphs for these measurements to make examination easier.

Member


Please upload the graphs. Thanks.

Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
@gkatev
Contributor Author

gkatev commented Feb 9, 2023

@bosilca, below are the graphs from some performance measurements I had done, comparing the old iterate() approach with the new find() one. They are from the OSU suite, on a node with 2x Epyc 7501 (64 procs total). In most cases the difference is noticeable at relatively small sizes (but above the single-copy threshold), which is why I show results up to 8 KB. The gap is also expected to widen in potentially busier, non-microbenchmark scenarios with more active registrations.

[Graphs: tuned_allgather, tuned_allreduce, tuned_alltoall, tuned_gather, tuned_reduce, xhc_allreduce, xhc_flat_allreduce]

@janjust
Contributor

janjust commented Jan 30, 2024

@bosilca Out of curiosity, why wasn't this merged??

@bosilca bosilca merged commit 58e54b5 into open-mpi:main Jan 30, 2024