Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have host spill use the new HostAlloc API #9257

Merged
merged 3 commits into from
Sep 21, 2023

Conversation

revans2
Copy link
Collaborator

@revans2 revans2 commented Sep 19, 2023

This also includes a small memory leak fix for the RapidsDiskStore that was exposed when I started to test with spill this way.

This does not test the code path when we enable host memory limits instead of the host spill store limits. That will be covered by #8883, which should be the next thing I work on.

This fixes #8881

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
@revans2 revans2 self-assigned this Sep 19, 2023
@revans2
Copy link
Collaborator Author

revans2 commented Sep 19, 2023

build

@@ -192,7 +192,7 @@ class RapidsDiskStore(diskBlockManager: RapidsDiskBlockManager)
val path = id.getDiskPath(diskBlockManager)
withResource(new FileInputStream(path)) { fis =>
val (header, hostBuffer) = SerializedHostTableUtils.readTableHeaderAndBuffer(fis)
val hostCols = closeOnExcept(hostBuffer) { _ =>
val hostCols = withResource(hostBuffer) { _ =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch

abellina
abellina previously approved these changes Sep 19, 2023
// TODO: this is disabled for now since subsequent work will tie this into
// our host allocator apis.
if (false && !wouldFit) {
if (!wouldFit) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@revans2 this will change the behavior in 23.10. Before this, we would still to host. Now we are going to skip to disk if we can't fit given the limit. Just making sure that was intended.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think that is what we want, but if anyone has a different opinion I am all ears. We want to get to the point where we have a hard limit on host memory. The reason we don't want to make the changes piecemeal is to reduce the pain a customer might see when running a job. This change is only going to show up when a customer wants to spill from GPU memory, and it is larger than the spill store size. I think that is fairly rare, so I am willing to take the hit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default the host store is configured to 1GB + pinnedPoolSize, and pinnedPoolSize is defaulted to 0. I think if we raised it to 2GB I'd agree, otherwise it seems fairly common to bump up batchSizeBytes to 2GB and those would spill to disk.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, with such a small spill store, we are bound to spill to disk very often anyway... so I don't know if we are saving too much.

@sameerz sameerz added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Sep 19, 2023
@revans2
Copy link
Collaborator Author

revans2 commented Sep 20, 2023

@jlowe can you please take a look?

@revans2
Copy link
Collaborator Author

revans2 commented Sep 21, 2023

build

@revans2
Copy link
Collaborator Author

revans2 commented Sep 21, 2023

build

@revans2 revans2 merged commit 2a8518e into NVIDIA:branch-23.10 Sep 21, 2023
@revans2 revans2 deleted the memory_limit_spill_uses_new_API branch September 21, 2023 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Trigger host memory spilling when more host memory is needed
4 participants