[BUG] dask gpu memory leaking #7092

Closed
lmeyerov opened this issue Jan 7, 2021 · 6 comments
Labels: bug, dask, Python

Comments


lmeyerov commented Jan 7, 2021

Describe the bug

Normally, if a Dask task raises an exception, downstream tasks should also fail with that exception... except it seems GPU partitions leak in this situation with map_partitions / persist, which, as far as I can tell, is a normal pattern for ingest phases while dtypes get figured out.

Steps/Code to reproduce bug

Increase M below to use more memory based on your GPU, and watch nvidia-smi in a separate window. Run the test cell a few times until you use up all of your GPU memory and your browser crashes (if running on a local GPU).

Setup:

import cudf, dask, dask_cuda, dask_cudf
import dask.distributed  # explicit submodule import so dask.distributed.Client resolves

client = dask.distributed.Client('dask-scheduler:8786') # managed pool starting at 32MB

# set up some tiny partitions (negligible GPU memory)
N = 20
gdf = cudf.DataFrame({'x': [1 for x in range(N)]})
dgdf = dask_cudf.from_cudf(gdf, npartitions=N)
dgdf
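
For context, a rough sketch of the kind of dask-cuda cluster configuration the "managed pool starting at 32MB" comment refers to, if it were started locally instead of connecting to an existing scheduler (the sizes and spill limit below are illustrative guesses, not the actual testbed settings):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# hypothetical local equivalent of the remote cluster above:
# an RMM managed-memory pool starting at 32 MB, plus device-to-host spilling
cluster = LocalCUDACluster(
    rmm_managed_memory=True,
    rmm_pool_size='32MB',
    device_memory_limit='1GB',  # start spilling to host memory past this point
)
client = Client(cluster)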

Notebook cell:

# bigger gdf alloc per partition
M = 10 * 1024 * 1024

step = 0
def expand(partition):
    global step  # single worker / one GPU in this repro, so a plain global counter is ok
    step = step + 1
    if step == 5:
        raise RuntimeError('i should trigger exns and gc')
    return cudf.DataFrame({'x': [step for x in range(M)]})

for i in range(10):
    print('===============', i, '===============')
    dgdf2 = {}
    dgdf3 = {}
    try:
        dgdf2 = dgdf.map_partitions(expand, meta=dgdf)
        dgdf3 = dgdf2.persist() # force; also interesting - should the exn be here or next line?
        dgdf3.sample(frac=0.0001).compute() # force; collect tiny from all
    except Exception:
        pass  # swallow the expected RuntimeError so the loop keeps going
    del dgdf2
    del dgdf3
    !nvidia-smi

Expected behavior

The above loop can be run indefinitely without running out of GPU memory.

Environment overview (please complete the following information)

Ubuntu 18 / CUDA 10.2 -> Docker (Ubuntu 18, CUDA 10.2) -> Conda RAPIDS 0.17

Smaller Pascal GPU (2 GB)

lmeyerov added the Needs Triage and bug labels on Jan 7, 2021
kkraus14 added the Python and dask labels and removed Needs Triage on Jan 8, 2021

quasiben (Member) commented Jan 8, 2021

Dask will eventually clean things up. Can you try:

client.cancel(dgdf2)

xref: https://distributed.dask.org/en/latest/memory.html#aggressively-clearing-data
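
A minimal sketch of how that suggestion could be wired into the repro loop above, reusing the names from the original snippet (untested here; whether the cancel is better placed before or after the del is a judgment call):

for i in range(10):
    dgdf2 = dgdf.map_partitions(expand, meta=dgdf)
    dgdf3 = dgdf2.persist()
    try:
        dgdf3.sample(frac=0.0001).compute()
    except Exception:
        pass  # the RuntimeError raised in expand() typically surfaces here
    # explicitly release the (partially) persisted partitions so the scheduler
    # can free the GPU memory instead of holding on to them
    client.cancel(dgdf2)
    client.cancel(dgdf3)
    del dgdf2, dgdf3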

lmeyerov (Author) commented Jan 8, 2021

@quasiben Cool, confirming that explicitly canceling dgdf2 + dgdf3 works between outer-loop iterations.

I'm still stuck on: How would that look within one iteration of dgdf.map_partitions(...).persist()?

AFAICT, CPU-oriented Dask preserves the successfully computed partitions so users can retry/inspect/etc. That's safe on CPUs because of swap/virtual memory. For GPU kernels, even with CPU<->GPU memory mapping (as was enabled in our testbed here), that seems much more prone to halting the system. Maybe there's a mode/flag to free partitions on failure, or to otherwise halt?
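
To illustrate what "within one iteration" is getting at, a manual approximation would be cancelling as soon as the failure surfaces, inside the except block of a single iteration (a sketch only; whether Dask or dask-cuda offer a built-in mode that does this automatically is exactly the open question here):

dgdf2 = dgdf.map_partitions(expand, meta=dgdf)
dgdf3 = dgdf2.persist()
try:
    dgdf3.sample(frac=0.0001).compute()
except Exception:
    # free whatever partitions did persist as soon as the failure is seen,
    # instead of keeping them around for retry/inspection
    client.cancel(dgdf3)
    client.cancel(dgdf2)
    raise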

github-actions bot commented

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions bot added the stale label on Feb 16, 2021

github-actions bot commented

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

harrism (Member) commented Jul 20, 2021

@quasiben can this be closed, or can you answer the most recent question from @lmeyerov?

quasiben (Member) commented

@lmeyerov sorry I didn't answer the last question. Many things have changed since you first opened this issue. Recently, @madsbk has been exploring heterogeneous computing, and I would suggest we move the conversation to this Dask issue:
dask/distributed#5201
