Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add retry to Multi-threaded shuffle reader for host memory allocations #8900

Open
revans2 opened this issue Aug 1, 2023 · 1 comment
Labels
reliability Features to improve reliability or bugs that severly impact the reliability of the plugin task Work required that improves the product but is not user facing

Comments

@revans2
Copy link
Collaborator

revans2 commented Aug 1, 2023

Is your feature request related to a problem? Please describe.

#9862 should make sure that we have limited the total amount of off heap host memory that we use, but we still need to update the code so it works properly and retries failed allocations.

The multi-threaded shuffle reader reads the shuffle data in a thread pool which means we need to update the thread pool to make it clear when a thread might be blocked on the pool, and what tasks the pool is working on. We want the ideal case, where there is enough host memory, to go as quickly as it does today, but we also want it to work in all cases. As such I think the deserialization code will need to act a lot like the input format multi-threaded reader code. We also want to make sure that the allocations are in retry blocks.

@revans2 revans2 added ? - Needs Triage Need team to review and classify task Work required that improves the product but is not user facing reliability Features to improve reliability or bugs that severly impact the reliability of the plugin labels Aug 1, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Aug 8, 2023
@abellina
Copy link
Collaborator

This issue is a manifestation of not having good limits in the MT shuffle: #9153

As part of #8900, we should work out the pool of memory that we are allowing to be inflight. In terms of what I am finding in #9153, it seems 128MB is a better default per thread, so I'm going to PR that. That said, we need to work out if this is going to be a static amount dedicated for shuffle in #8900, or if we are going to allow the maxBytesInFlight to "fluctuate" as host memory usage is in demand.

@revans2 revans2 changed the title [FEA] Update Multi-threaded shuffle reader to limit host memory usage [FEA] Add retry to Multi-threaded shuffle reader for host memory allocations Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reliability Features to improve reliability or bugs that severly impact the reliability of the plugin task Work required that improves the product but is not user facing
Projects
None yet
Development

No branches or pull requests

3 participants