smsc/xpmem rcache: refactor to use tree find() instead of iterate() #11130

Closed
wants to merge 2 commits

Conversation

@gkatev (Contributor) commented Nov 30, 2022

Hi, I'd like to propose some changes to the registration cache in smsc/xpmem. Currently, looking up a registration iterates over the whole tree (until a match is found). This is necessary because the single tree stores attachments to different procs/endpoints, but it can get slow and scales poorly. I suggest creating one regcache/tree per peer proc and using the find() (O(log n)) method instead.
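To illustrate the shape of the change, here's a simplified sketch (plain C, not the actual smsc/xpmem or opal_interval_tree code; the names and the non-overlapping-range assumption are only for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a cached XPMEM attachment; the real code keeps
 * these in an interval tree inside the rcache. */
typedef struct reg {
    int peer;                    /* owning peer proc/endpoint */
    uintptr_t base, bound;       /* attached range, bound inclusive */
    struct reg *left, *right;    /* tree children, keyed by base */
} reg_t;

/* Current shape: one tree for all peers, so a lookup has to visit the
 * stored registrations (shown here as a flat array for simplicity) until
 * it finds one with the right peer AND a covering range -- O(n). */
static reg_t *lookup_iterate(reg_t *regs, size_t n_regs, int peer,
                             uintptr_t base, uintptr_t bound)
{
    for (size_t i = 0; i < n_regs; i++)
        if (regs[i].peer == peer && regs[i].base <= base
                && bound <= regs[i].bound)
            return &regs[i];
    return NULL;
}

/* Proposed shape: one tree per peer proc, so the peer check is implicit
 * and find() can simply descend the tree -- O(log n) per lookup
 * (assuming non-overlapping cached ranges, ordered by base). */
static reg_t *lookup_find(reg_t *peer_tree, uintptr_t base, uintptr_t bound)
{
    for (reg_t *node = peer_tree; node != NULL;
         node = (base < node->base) ? node->left : node->right)
        if (node->base <= base && bound <= node->bound)
            return node;
    return NULL;
}
```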

The benefit will vary depending on the number of standing registrations and the message size. I expect the relative gain to be most relevant for small-to-medium messages (but larger than any copy-in-copy-out-to-single-copy thresholds). On a single node with 64 ranks, using microbenchmarks, tuned's alltoall @ 1K improves from ~143 us to ~124 us. Using my component (xhc), the hierarchical allreduce @ 16K improves from ~39 us to ~31 us, while the (generally worse) non-hierarchical allreduce (more attachments/regs) @ 4K improves from 100 us to 42 us.

This change will result in more trees, but the same total number of tree nodes. I considered alternatives like augmenting the base/bound with the endpoint/proc ID in the high/low values, but I think this approach is simpler. I also want to look into going from one tree per smsc endpoint to one tree per peer process, which would reduce duplication and allow sharing of xpmem mappings between different components (it would require caching reg caches in a peer-proc-indexed structure, and creating/returning them in get_endpoint).
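To make the one-endpoint-per-peer-process idea concrete, here's a rough hypothetical sketch (made-up names; a plain array stands in for the peer-proc-indexed structure, and the xpmem/regcache details are omitted):

```c
#include <stdlib.h>

/* Hypothetical sketch only -- not the real smsc/xpmem API. */
typedef struct endpoint {
    int peer_rank;
    /* xpmem segment/apid, per-peer regcache, ... (omitted) */
} endpoint_t;

typedef struct peer_cache {
    endpoint_t **eps;            /* indexed by peer rank, NULL until used */
    int nprocs;
} peer_cache_t;

/* get_endpoint() consults the peer-proc-indexed cache first and only
 * creates a new endpoint (attachment + regcache) on a miss, so every
 * component talking to the same peer shares one set of mappings. */
static endpoint_t *get_endpoint(peer_cache_t *cache, int peer_rank)
{
    if (NULL == cache->eps[peer_rank]) {
        endpoint_t *ep = malloc(sizeof(*ep));  /* real code: xpmem_get(), regcache setup, error handling */
        ep->peer_rank = peer_rank;
        cache->eps[peer_rank] = ep;
    }
    return cache->eps[peer_rank];
}
```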

I wanted to get some input and general comments before proceeding with this PR.


I also took the chance to:

  1. Add support for (lack of) MCA_RCACHE_FLAGS_PERSIST and MCA_RCACHE_FLAGS_CACHE_BYPASS (though they won't be used anywhere atm :-) )
  2. Adjust/fix the bounds for the attachments and registrations (all the -1 and +1 here and there)

Regarding (2), I took "bound" to mean the last valid byte in the range, inclusive (is this accurate?). Thus, the last valid byte (a.k.a. the bound) of an XPMEM registration starting at base with length x is at base + x - 1. I adjusted the code following this rationale. But there are still a number of rogue ±1s in rcache_base and opal_interval_tree which I'm not sure are 100% correct (though I'm not as comfortable with that code, or with the code that would be affected by changing them).
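In code terms, the convention I followed looks like this (a tiny illustrative snippet with made-up helper names, not code from the PR):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive-bound convention: for a region of `len` bytes starting at
 * `base`, the bound is the address of the last valid byte. */
static inline uintptr_t region_bound(uintptr_t base, size_t len)
{
    return base + len - 1;       /* assumes len > 0 */
}

/* A cached registration [reg_base, reg_bound] covers a requested range
 * [req_base, req_bound] iff both inclusive endpoints fall inside it. */
static inline bool reg_covers(uintptr_t reg_base, uintptr_t reg_bound,
                              uintptr_t req_base, uintptr_t req_bound)
{
    return reg_base <= req_base && req_bound <= reg_bound;
}
```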

Also, while getting familiar with the code, I had some concerns regarding the atomic operations and thread safety; I'm not sure whether they are actual problems, or how they depend on the various MPI thread support levels. I've left two TODO comments about this inside smsc_xpmem_module.c.

Signed-off-by: George Katevenis gkatev@ics.forth.gr

FYI @devreal @hjelmn

@ompiteam-bot

Can one of the admins verify this patch?

@gkatev changed the title from "xpmem rcache: refactor to use tree find() instead of iterate()" to "smsc/xpmem rcache: refactor to use tree find() instead of iterate()" on Nov 30, 2022
@devreal (Contributor) commented Jan 18, 2023

ok to test

@devreal (Contributor) commented Jan 18, 2023

@gkatev What is the state of this? I see debug print statements in the code. Is there anything in particular we can help with?

@gkatev (Contributor, Author) commented Jan 18, 2023

Hi @devreal, I plan to work on it again soon. The find()-instead-of-iterate() part is functional, but I want to verify one last time that it all works correctly (hence the debug prints). I still want to look into making endpoints one-per-process. I also had thread-safety concerns, but those might be out of scope for this PR.

Any feedback is welcome. For example:

I took "bound" to mean the last valid byte in the range, inclusive

Would you say this is accurate/desired?

Or, if you meant code contributions, one could maybe look into the ±1 handling in the rcache/interval_tree code, or into the one-endpoint-per-process part (creating a proc-to-endpoint structure, so that all components use the same endpoint/apid/regcache).

@devreal (Contributor) commented Jan 18, 2023

I suggest splitting this PR into multiple pieces that are related (the find() part, the flags, and the endpoints). That makes reviewing easier, and we can discuss the different points in separate issues. Right now it's a bit fuzzy. Splitting would also avoid the situation where fixes that are ready are held up by work in progress.

@gkatev (Contributor, Author) commented Jan 18, 2023

That makes sense. I will look into organizing/splitting up the features and creating separate commits/PRs accordingly.

@devreal (Contributor) commented Jan 19, 2023

Thanks! Having different PRs would make life easier here.

@gkatev (Contributor, Author) commented Jan 30, 2023

I created #11358 to replace this PR. It includes the implementation of find() instead of iterate(). It also contains the implementation of the extra rcache flags in a separate commit -- if we'd like a different PR for that too, they are easy to separate.

Besides the changes in #11358, the remaining enhancement is having one rcache per process (instead of per endpoint), which is currently unimplemented and not contained in any PR.

There are also some atomicity concerns, but they are likely outside the scope of these changes, and I'm not currently looking into them.

@gkatev closed this Feb 17, 2023