SST metadata aggregation does not scale above 2GB (PIConGPU: more than 7k nodes on Frontier) #3846
Comments
No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.
Hey Greg, thank you for the fast reply. Is this something that can already be tested today by setting some hidden flag?
Also, does it make a difference that I'm using branch #3588 on Frontier? (I need that branch for a scalability fix of the MPI DP)
Unfortunately no, not yet. In BP5Writer.cpp there's code that starts with the comment "Two-step metadata aggregation" that implements this for BP5, but it hasn't been done yet for SST. Here, we're exploiting some characteristics of BP5 metadata: in particular, multiple ranks often have identical meta-metadata, and we can discard the duplicates, keeping only one unique copy. This reduces overall metadata size dramatically, at the cost of having to do aggregation in multiple stages. Norbert implemented a fix for this in the BP5 writer, but it should probably be reworked so that it can be shared between engines that use BP5 serialization. Doing that right (so that we use a simple approach at small scale and only go to more complex measures when necessary) isn't wildly hard, but it's non-trivial (and something I probably can't get to this week).
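To make the deduplication idea concrete, here is a minimal single-process sketch. It is not the code from BP5Writer.cpp and it does not use MPI; `DedupResult`, `deduplicate` and `totalBytes` are hypothetical names, and each rank's meta-metadata is modeled as a plain string. A real implementation would, presumably, first exchange something small (sizes or identifiers) and only then gather the unique blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical result type: one copy of each distinct block, plus a
// per-rank index telling which unique block that rank contributed.
struct DedupResult
{
    std::vector<std::string> uniqueBlocks;
    std::vector<std::size_t> blockOfRank;
};

// Keep only the first copy of each distinct meta-metadata block.
DedupResult deduplicate(const std::vector<std::string> &perRankBlocks)
{
    DedupResult result;
    std::map<std::string, std::size_t> seen; // block content -> index in uniqueBlocks
    for (const std::string &block : perRankBlocks)
    {
        auto it = seen.find(block);
        if (it == seen.end())
        {
            // New content: its index is the current number of unique blocks.
            it = seen.emplace(block, result.uniqueBlocks.size()).first;
            result.uniqueBlocks.push_back(block);
        }
        result.blockOfRank.push_back(it->second);
    }
    assert(result.blockOfRank.size() == perRankBlocks.size());
    return result;
}

// Sum of the sizes of all blocks, i.e. what a gather would have to move.
std::size_t totalBytes(const std::vector<std::string> &blocks)
{
    std::size_t sum = 0;
    for (const std::string &b : blocks)
        sum += b.size();
    return sum;
}
```

Comparing `totalBytes(perRankBlocks)` with `totalBytes(result.uniqueBlocks)` shows how much the second aggregation stage would save when most ranks share the same block.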
No, this should be independent of those changes.
In the meantime, I'll check whether using this commit (853ff0d) as a workaround helps. It should fix the GatherV call at the cost of slightly higher latency, but I don't know whether any 32-bit indexing later on will break things again.
I'd think that that would function as a workaround. As far as I know there's no 32-bit indexing, only the limits of MPI. Longer-term I'd like to implement something smarter, but if this gets you through, let me know.
I can't look at this right now, but note that the two-level aggregation did
not help with the attributes, only with meta-metadata. That is, if an
attribute is defined on all processes, that blows up the aggregation size.
If that is the reason you reach the limit, two-level aggregation does not
decrease it.
That is absolutely true...
The job now ran through without crashing at 7168 nodes. I'll now try going full scale.
At some point we were thinking about optimizing parallel attribute writes, e.g. by simply disabling them on every rank except rank 0. It looks like we should do this.
Update: I've successfully run SST full-scale for the first time on Frontier with this (9126 nodes, i.e. quasi full-scale)
Excellent... Adding an issue (#3852) to address these things across the board.
Yes, you should absolutely do this. At least currently, all attributes from all ranks are stored and installed by the reader, with duplicates doing nothing. Setting the same attributes on all nodes just adds overhead.
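As a rough illustration of that overhead, the following sketch models a reader that installs every attribute it receives and ignores duplicates. It is a toy model, not ADIOS2 code: `Attribute`, `installAttributes` and `attributesPerRank` are hypothetical names, and "bytes" simply means the summed lengths of names and values.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Attribute = std::pair<std::string, std::string>; // (name, value)

// Model of the reader side: every received attribute is counted towards the
// payload, but a name that is already installed is left untouched.
std::size_t installAttributes(const std::vector<std::vector<Attribute>> &perRank,
                              std::map<std::string, std::string> &installed)
{
    std::size_t receivedBytes = 0;
    for (const auto &rankAttrs : perRank)
    {
        for (const Attribute &attr : rankAttrs)
        {
            receivedBytes += attr.first.size() + attr.second.size();
            installed.emplace(attr.first, attr.second); // no-op for duplicates
        }
    }
    return receivedBytes;
}

// Model of the writer side: either every rank contributes the same
// attributes, or only rank 0 does.
std::vector<std::vector<Attribute>> attributesPerRank(std::size_t numRanks,
                                                      const std::vector<Attribute> &attrs,
                                                      bool rankZeroOnly)
{
    assert(numRanks > 0);
    std::vector<std::vector<Attribute>> perRank(numRanks);
    for (std::size_t rank = 0; rank < numRanks; ++rank)
    {
        if (!rankZeroOnly || rank == 0)
            perRank[rank] = attrs;
    }
    return perRank;
}
```

In this model the installed attributes are identical in both modes, while the received payload grows linearly with the number of ranks when every rank writes them.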
* ADIOS2: Optionally write attributes only from given ranks
  Ref. ornladios/ADIOS2#3846 (comment)
* ADIOS2 < v2.9 compatibility in tests
* Documentation
Describe the bug
CP_consolidateDataToRankZero() in source/adios2/toolkit/sst/cp/cp_common.c collects the metadata to rank 0 upon EndStep. In PIConGPU, a single rank's contribution is ~38948 bytes. On 7000 Frontier nodes with 8 GPUs per node:
38948 B * 7000 * 8 = 2,181,088,000 B ≈ 2080 MiB
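The arithmetic can be checked with a small standalone sketch. It assumes that every rank contributes the same number of bytes and that the gathered total must fit into a 32-bit `int`, which is how MPI's gather calls express counts and displacements. The helper names are hypothetical and the check is deliberately simplified.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Total bytes arriving on rank 0 if every rank contributes bytesPerRank.
// 64-bit arithmetic keeps the product itself from overflowing.
std::int64_t totalGatheredBytes(std::int64_t bytesPerRank, std::int64_t nodes,
                                std::int64_t ranksPerNode)
{
    assert(bytesPerRank >= 0 && nodes >= 0 && ranksPerNode >= 0);
    return bytesPerRank * nodes * ranksPerNode;
}

// Simplified check: can the gathered total still be addressed with int offsets?
bool fitsInIntDisplacements(std::int64_t totalBytes)
{
    return totalBytes <= static_cast<std::int64_t>(INT_MAX);
}

// Largest number of equally sized contributions that stays within INT_MAX bytes.
std::int64_t maxRanksWithinIntLimit(std::int64_t bytesPerRank)
{
    assert(bytesPerRank > 0);
    return static_cast<std::int64_t>(INT_MAX) / bytesPerRank;
}
```

For example, `fitsInIntDisplacements(totalGatheredBytes(38948, 7000, 8))` is false, since 2,181,088,000 exceeds INT_MAX (2,147,483,647 where `int` is 32 bits wide).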
Looking into CP_consolidateDataToRankZero(): since Displs is a vector of int, the maximum supported destination buffer size for this method is 2 GiB.
To Reproduce
-- no reproducer --
Expected behavior
Some method to handle SST metadata aggregation at large scale
Desktop (please complete the following information):
Additional context
I'm setting MarshalMethod = bp5 in SST.
Following up
Was the issue fixed? Please report back.