The scheme of our restart reading is inefficient, at least in terms of memory high-water mark (HWM).
We first read per-cell block counts and distribute cells so that the counts are spread roughly evenly, but the load balancer then reshuffles everything according to the LB_WEIGHT that is read in during the second stage. That leads to a massive rearrangement of the MPI domains and a significant peak in HWM. I assume this grew organically, but it would seem more logical to read LB_WEIGHT first, balance according to that, and only then read in the bulk data; that should reduce the initial memory peak seen in current investigations.
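For concreteness, a rough sketch of the order I have in mind (all names below are placeholders for the relevant steps, not actual Vlasiator/DCCRG/VLSV calls):

```cpp
#include <vector>

struct RestartFile;  // placeholder for the restart file reader
struct Grid;         // placeholder for the simulation grid

std::vector<double> readLbWeights(RestartFile&);                  // small: one scalar per cell
void applyWeightsAndBalance(Grid&, const std::vector<double>&);   // moves only empty cells
void readBlockDataIntoFinalOwners(Grid&, RestartFile&);           // heavy read, already balanced

void restartReadProposed(Grid& grid, RestartFile& file) {
   // 1) read only LB_WEIGHT, which is cheap compared to the block data
   const std::vector<double> weights = readLbWeights(file);

   // 2) run the load balance while cells hold no block data, so the partition
   //    change shuffles metadata only and the HWM stays close to the final footprint
   applyWeightsAndBalance(grid, weights);

   // 3) each rank reads block data straight into the cells it finally owns;
   //    no post-read migration, hence no second memory peak
   readBlockDataIntoFinalOwners(grid, file);
}
```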
It's not impossible that I will file a patch on this soonish, but I have a certain manuscript waiting for me... If anyone picks this up, I'll be grateful. :)
As @markusbattarbee pointed out, though, this would turn the current sequential reads into a tricky mesh of small reads, so it may not be worth bothering with right now. And if a run can't afford the reshuffle at restart, it probably won't fit in memory at runtime either.
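That said, if someone does want to try it, the scattered reads would not necessarily have to be issued one by one. A minimal sketch (not Vlasiator code, and assuming fixed-size cell records for simplicity) of gathering a rank's non-contiguous cell records with a single collective MPI-IO read:

```cpp
#include <mpi.h>
#include <vector>

// Each rank knows the file offsets of the cells it will own after the
// weight-based balance; variable-length cells would need per-cell lengths.
void readOwnedCells(MPI_File fh,
                    const std::vector<MPI_Offset>& cellOffsets,  // offsets of owned cells
                    int bytesPerCell,
                    std::vector<char>& buffer) {
   const int n = static_cast<int>(cellOffsets.size());
   std::vector<int> blockLens(n, bytesPerCell);
   std::vector<MPI_Aint> displs(cellOffsets.begin(), cellOffsets.end());

   // Describe the scattered cell records as one indexed filetype...
   MPI_Datatype filetype;
   MPI_Type_create_hindexed(n, blockLens.data(), displs.data(), MPI_BYTE, &filetype);
   MPI_Type_commit(&filetype);

   // ...and pull them in with a single collective read instead of many small ones.
   MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);
   buffer.resize(static_cast<size_t>(n) * bytesPerCell);
   MPI_File_read_all(fh, buffer.data(), static_cast<int>(buffer.size()), MPI_BYTE,
                     MPI_STATUS_IGNORE);

   MPI_Type_free(&filetype);
}
```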