[BUG] Rebalance with dask-cuda does not rebalance effectively #698

Open
VibhuJawa opened this issue Aug 4, 2021 · 5 comments

Comments

VibhuJawa commented Aug 4, 2021

Problem: Rebalance with dask-cuda does not rebalance effectively

Problem Context

We cannot seem to rebalance effectively with dask-cuda, which leads to imbalanced GPU memory usage.

This becomes a problem across many workflows, especially those that involve a mix of ETL and machine learning.

Many machine learning algorithms (XGBoost, cuml.dask.knn, cuGraph, etc.) work today by running a local portion of the algorithm on the data held by each GPU and then performing an all-reduce-like operation. If the data is imbalanced, the GPU holding more than its share becomes memory limited during the local step, which leads to out-of-memory failures. If the data were balanced equally, we would not have these issues.
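
For concreteness, here is a rough sketch of that pattern (illustrative only, not how any of those libraries are implemented internally), using the dask_ar array from the reproducer below: each chunk is reduced locally on the GPU that holds it, and the per-chunk partials are then combined. The GPU holding a disproportionate share of the chunks is the first to hit its memory limit during the local step.

# Schematic sketch of the "local compute + all-reduce" pattern described above
# (illustrative only; dask_ar is the array defined in the reproducer below).
partials = dask_ar.map_blocks(
    lambda x: x.sum(axis=0, keepdims=True),  # local, per-GPU work on each chunk
    chunks=(1, 512),
)
result = partials.sum(axis=0).compute()      # all-reduce-style combination of the partials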

Minimal Example:

Start Cluster
import numpy as np
import cupy as cp
import dask.array as da  # needed for da.from_array below
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from distributed import progress, wait


cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=[0,1])
client = Client(cluster)

Create Imbalanced Data on workers
ar = np.ones(shape=(10_000_000,512),dtype=np.float32)
dask_ar = da.from_array(ar,chunks=400_000).map_blocks(cp.asarray)

dask_ar = dask_ar.persist()
wait(dask_ar);

Try and Fail with Rebalancing
client.has_what()
wait(client.rebalance(dask_ar))
client.has_what()
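
To see the imbalance numerically rather than eyeballing the output (small helpers written for this issue, not part of the original reproducer):

# Number of persisted keys per worker, from the scheduler's view.
key_counts = {worker: len(keys) for worker, keys in client.has_what().items()}
print(key_counts)

# Bytes held in each worker's data store, summed over stored objects.
def worker_bytes(dask_worker):
    return sum(getattr(v, "nbytes", 0) for v in dask_worker.data.values())

print(client.run(worker_bytes))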

[screenshot of client.has_what() output omitted]

GPU Usage for context:

[screenshot of GPU usage omitted]

Notebook Example:

https://gist.github.com/VibhuJawa/eb2d25c0c6fddeebf0e104b82eb8ef3e

CC: @randerzander, @quasiben, @ayushdg.

quasiben commented Aug 4, 2021

@VibhuJawa apologies for not responding earlier. When I tried this, the workflow took 45+ minutes, at which point I killed it. Is this your experience?

VibhuJawa commented Aug 4, 2021

@VibhuJawa apologies for not responding earlier. When I tried this, the workflow took 45+ minutes, at which point I killed it. Is this your experience?

Nope, it takes 1 min 34.9 s for me. I just tried it again.

It also reproduces at 1/10th the scale; that takes 7 s locally for me.

ar = np.ones(shape=(1_000_000,512),dtype=np.float32)
dask_ar = da.from_array(ar,chunks=40_000).map_blocks(cp.asarray)

dask_ar = dask_ar.persist()
progress(dask_ar)
client.has_what()
wait(client.rebalance(dask_ar))
client.has_what()

@quasiben

@VibhuJawa sorry again for the delay. The current rebalancing scheme only takes into account memory consumption on the host: https://github.com/dask/distributed/blob/ac55f25d230c7144ed618d1d4374f254303a4e0a/distributed/scheduler.py#L6104-L6113

It will take some time to think through how to rebalance while taking into account current GPU memory consumption and any device memory limit settings. Device memory limits are set in dask-cuda, while all the rebalancing occurs in the scheduler.
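
As a rough illustration of the kind of signal a GPU-aware rebalance would need (just a sketch of collecting the data with existing APIs, not a proposal for how the scheduler should consume it), free and used device memory can be polled on every worker:

# Poll device memory on each worker via client.run (illustration only; a real
# fix would need this information inside the scheduler's rebalance logic,
# alongside dask-cuda's device_memory_limit setting).
def device_memory():
    import cupy as cp
    free, total = cp.cuda.runtime.memGetInfo()
    return {"used": total - free, "free": free, "total": total}

print(client.run(device_memory))  # {worker_address: {"used": ..., "free": ..., "total": ...}, ...}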

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
