[FEA] Stress test UCX shuffle after Host memory limits are in place #8901
Labels
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
test
Only impacts tests
Is your feature request related to a problem? Please describe.
I don't think we need to do anything special for UCX shuffle to respect host memory limits. The shuffle data should really only live on the GPU, and touch host memory only when it spills. Even after a spill, CPU data transfers go through bounce buffers. The problem here really comes down to memory pressure and thundering herds, and shuffle is by definition a thundering herd. Suppose executor A has a very large shuffle buffer that it needs to send to executor B, and B has a similarly large one to send to A. Both A and B could have already read their buffer into host memory so that they can send it to the other, and then both try to allocate a device buffer to receive the incoming data. This causes the spill to kick in, but we cannot spill because A and B are each holding onto too much host memory for the spill to allocate the full buffer. This could result in a deadlock. There are two ways to work around this, and I think we need to implement both of them.
First, we want to be sure that any receiving allocation happens before the data to send is brought into memory. When the send data is brought into memory, it is copied to the bounce buffer and then made spillable again. That should be enough to prevent a deadlock. It would be great to also add some better APIs so that we don’t necessarily have to read all of the data back in from disk when it is only going to a bounce buffer and we are not likely to touch some of it ever again. But that is an optimization.
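To make the ordering argument concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not the plugin's code: `Executor`, `send_first`, `receive_first`, and `make_pair` are hypothetical names, host memory is modeled as a single byte counter, and a spill is assumed to need a host staging allocation the size of the spilled buffer that is released once the data has moved on to disk.

```python
from dataclasses import dataclass


@dataclass
class Executor:
    """Toy executor with a fixed host memory budget (hypothetical model)."""
    name: str
    host_limit: int
    host_used: int = 0

    def try_host_alloc(self, nbytes: int) -> bool:
        # Fail instead of blocking so a stuck state is easy to observe.
        if self.host_used + nbytes > self.host_limit:
            return False
        self.host_used += nbytes
        return True

    def host_free(self, nbytes: int) -> None:
        self.host_used -= nbytes


def make_pair(host_limit: int) -> list:
    """Two executors, A and B, that each want to send to the other."""
    return [Executor("A", host_limit), Executor("B", host_limit)]


def send_first(executors, send_bytes: int, spill_bytes: int) -> bool:
    """Problem ordering: load the send data, then allocate the receive side."""
    # Every executor reads its whole send buffer into host memory and holds it.
    for ex in executors:
        if not ex.try_host_alloc(send_bytes):
            return False
    # Allocating the device receive buffer now forces a spill, which needs host
    # memory that is still pinned by the send buffer. No peer can receive, so
    # no send buffer is ever released: this models the deadlock.
    return all(ex.try_host_alloc(spill_bytes) for ex in executors)


def receive_first(executors, send_bytes: int, spill_bytes: int) -> bool:
    """Proposed ordering: allocate the receive side before loading send data."""
    for ex in executors:
        # Assumption: the spill stages through host memory, then frees it.
        if not ex.try_host_alloc(spill_bytes):
            return False
        ex.host_free(spill_bytes)
    for ex in executors:
        # Bring the send data in, copy it to the bounce buffer, and make it
        # spillable again (modeled here as releasing the host memory).
        if not ex.try_host_alloc(send_bytes):
            return False
        ex.host_free(send_bytes)
    return True


if __name__ == "__main__":
    print(send_first(make_pair(100), send_bytes=80, spill_bytes=80))     # False
    print(receive_first(make_pair(100), send_bytes=80, spill_bytes=80))  # True
```

With a 100-byte host budget and 80-byte buffers, the send-first ordering gets stuck while the receive-first ordering completes. A real stress test would need to drive the actual allocator and UCX transport rather than a counter like this.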
I think in practice UCX and spill already follow this order of operations today, but we need to be sure of that, and we need to document that it is not optional.
The goal of this issue is to really stress test UCX shuffle and confirm that we cannot end up in a case where it deadlocks.