[FEA] Add retry to Multi-threaded shuffle reader for host memory allocations #8900
Labels
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
task
Work required that improves the product but is not user facing
Is your feature request related to a problem? Please describe.
#9862 should ensure that the total amount of off-heap host memory we use is limited, but we still need to update the code so that it works properly under that limit and retries failed allocations.
The multi-threaded shuffle reader reads the shuffle data in a thread pool, so we need to update the thread pool to make it clear when a thread might be blocked on the pool and which tasks the pool is working on. The ideal case, where there is enough host memory, should go as quickly as it does today, but the reader also needs to work in all other cases. As such, I think the deserialization code will need to act a lot like the input format multi-threaded reader code. We also want to make sure that the allocations are in retry blocks.
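To make the intent concrete, here is a rough sketch of the pattern in plain Java. It is not the plugin's code and does not use the plugin's retry framework: `RetryAllocSketch`, `HostBudget`, `allocateWithRetry`, and `readAll` are hypothetical names, and a simple byte counter stands in for the real host memory limit. Each pool thread reserves memory inside a retry loop, the budget tracks how many threads are currently blocked, and when memory is plentiful the loop body never runs, so the fast path is unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class RetryAllocSketch {
    /** Hypothetical host memory budget; a stand-in, not a spark-rapids class. */
    static final class HostBudget {
        private final long limit;
        private long used = 0;
        // Exposes how many pool threads are currently waiting for memory.
        final AtomicInteger blockedThreads = new AtomicInteger();

        HostBudget(long limit) { this.limit = limit; }

        /** Reserves bytes, waiting and retrying after each release if the budget is full. */
        synchronized void allocateWithRetry(long bytes) throws InterruptedException {
            if (bytes > limit) {
                throw new IllegalArgumentException("request can never fit in the budget");
            }
            while (used + bytes > limit) {   // skipped entirely when memory is available
                blockedThreads.incrementAndGet();
                try {
                    wait();                  // woken by release(), then the check is retried
                } finally {
                    blockedThreads.decrementAndGet();
                }
            }
            used += bytes;
        }

        synchronized void release(long bytes) {
            used -= bytes;
            notifyAll();                     // let blocked threads retry their allocation
        }
    }

    /** Reads numBlocks blocks on a thread pool and returns how many completed. */
    static int readAll(int numBlocks, long blockBytes, long limit, int threads) {
        HostBudget budget = new HostBudget(limit);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < numBlocks; i++) {
                final int id = i;
                futures.add(pool.submit(() -> {
                    budget.allocateWithRetry(blockBytes);
                    try {
                        return id;           // stand-in for deserializing one shuffle block
                    } finally {
                        budget.release(blockBytes);
                    }
                }));
            }
            int done = 0;
            for (Future<Integer> f : futures) {
                f.get();
                done++;
            }
            return done;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, `RetryAllocSketch.readAll(8, 100, 250, 4)` runs eight 100-byte reads on four threads with room for only two at a time; the other threads block in `allocateWithRetry` until a release wakes them. Because each task holds a single reservation and always releases it, blocked threads cannot deadlock in this sketch; a real reader that holds several buffers at once would need more care.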