[BUG] dask gpu memory leaking #7092

Closed
lmeyerov opened this issue Jan 7, 2021 · 6 comments
Labels: bug, dask, Python

Comments


lmeyerov commented Jan 7, 2021

Describe the bug

Normally, if a Dask task raises an exception, downstream tasks should also fail with that exception... except it seems GPU partitions leak in this situation with map_partitions / persist, which, as far as I can tell, is a normal pattern for ingest phases while dtypes get figured out.

Steps/Code to reproduce bug

Increase M below to use more memory based on your GPU, and watch nvidia-smi in a separate window. Run the test cell a few times until you use up all of your GPU memory and your browser crashes (if running on a local GPU).

Setup:

import cudf, dask, dask_cuda, dask_cudf
import dask.distributed  # explicit submodule import so dask.distributed.Client resolves

client = dask.distributed.Client('dask-scheduler:8786') # managed pool starting at 32MB

# set up some tiny partitions (negligible GPU memory)
N = 20
gdf = cudf.DataFrame({'x': [1 for x in range(N)]})
dgdf = dask_cudf.from_cudf(gdf, npartitions=N)
dgdf
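
For context, a rough sketch of the kind of dask-cuda cluster configuration the "managed pool starting at 32MB" comment refers to, if it were started locally instead of connecting to an existing scheduler (the sizes and spill limit below are illustrative guesses, not the actual testbed settings):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# hypothetical local equivalent of the remote cluster above:
# an RMM managed-memory pool starting at 32 MB, plus device-to-host spilling
cluster = LocalCUDACluster(
    rmm_managed_memory=True,
    rmm_pool_size='32MB',
    device_memory_limit='1GB',  # start spilling to host memory past this point
)
client = Client(cluster)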

Notebook cell:

# bigger gdf alloc per partition
M = 10 * 1024 * 1024

step = 0
def expand(partition):
    global step  # single worker / one GPU in this repro, so a plain global counter is ok
    step = step + 1
    if step == 5:
        raise RuntimeError('i should trigger exns and gc')
    return cudf.DataFrame({'x': [step for x in range(M)]})

for i in range(10):
    print('===============', i, '===============')
    dgdf2 = {}
    dgdf3 = {}
    try:
        dgdf2 = dgdf.map_partitions(expand, meta=dgdf)
        dgdf3 = dgdf2.persist() # force; also interesting - should the exn be here or next line?
        dgdf3.sample(frac=0.0001).compute() # force; collect tiny from all
    except Exception:
        pass  # swallow the expected RuntimeError so the loop keeps going
    del dgdf2
    del dgdf3
    !nvidia-smi

Expected behavior

The above loop can be run indefinitely without running out of GPU memory.

Environment overview (please complete the following information)

Ubuntu 18 / CUDA 10.2 -> Docker (Ubuntu 18, CUDA 10.2) -> Conda RAPIDS 0.17

Smaller Pascal GPU (2 GB)

lmeyerov added the Needs Triage and bug labels on Jan 7, 2021
kkraus14 added the Python and dask labels and removed Needs Triage on Jan 8, 2021

quasiben (Member) commented Jan 8, 2021

Dask will eventually clean things up. Can you try:

client.cancel(dgdf2)

xref: https://distributed.dask.org/en/latest/memory.html#aggressively-clearing-data
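
A minimal sketch of how that suggestion could be wired into the repro loop above, reusing the names from the original snippet (untested here; whether the cancel is better placed before or after the del is a judgment call):

for i in range(10):
    dgdf2 = dgdf.map_partitions(expand, meta=dgdf)
    dgdf3 = dgdf2.persist()
    try:
        dgdf3.sample(frac=0.0001).compute()
    except Exception:
        pass  # the RuntimeError raised in expand() typically surfaces here
    # explicitly release the (partially) persisted partitions so the scheduler
    # can free the GPU memory instead of holding on to them
    client.cancel(dgdf2)
    client.cancel(dgdf3)
    del dgdf2, dgdf3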

lmeyerov (Author) commented Jan 8, 2021

@quasiben Cool, confirming that explicitly canceling dgdf2 + dgdf3 works between outer-loop iterations.

I'm still stuck on: How would that look within one iteration of dgdf.map_partitions(...).persist()?

AFAICT, CPU-oriented Dask preserves the successfully computed partitions so users can retry/inspect/etc. That's safe on CPUs because of swap/virtual memory. For GPU kernels, even with CPU<->GPU memory mapping (as was enabled in our testbed here), that seems much more prone to halting the system. Maybe there's a mode/flag to free partitions on failure, or to otherwise halt?
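
To illustrate what "within one iteration" is getting at, a manual approximation would be cancelling as soon as the failure surfaces, inside the except block of a single iteration (a sketch only; whether Dask or dask-cuda offer a built-in mode that does this automatically is exactly the open question here):

dgdf2 = dgdf.map_partitions(expand, meta=dgdf)
dgdf3 = dgdf2.persist()
try:
    dgdf3.sample(frac=0.0001).compute()
except Exception:
    # free whatever partitions did persist as soon as the failure is seen,
    # instead of keeping them around for retry/inspection
    client.cancel(dgdf3)
    client.cancel(dgdf2)
    raise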

github-actions bot commented

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions bot added the stale label on Feb 16, 2021

github-actions bot commented

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

harrism (Member) commented Jul 20, 2021

@quasiben can this be closed, or can you answer the most recent question from @lmeyerov?

quasiben (Member) commented

@lmeyerov sorry I didn't answer the last question. Many things have changed since you first opened this issue. Recently, @madsbk has been exploring heterogeneous computing, and I would suggest we move the conversation to this Dask issue:
dask/distributed#5201
