[BUG] Rebalance with dask-cuda does not rebalance effectively #698
Comments
@VibhuJawa apologies for not responding earlier. When I tried this, the workflow took 45+ minutes, at which point I killed it. Is this your experience?
Nope, it does not take that long for me. It will also reproduce with:

```python
import cupy as cp
import numpy as np
import dask.array as da
from dask.distributed import progress

ar = np.ones(shape=(1_000_000, 512), dtype=np.float32)
dask_ar = da.from_array(ar, chunks=40_000).map_blocks(cp.asarray)
dask_ar = dask_ar.persist()
progress(dask_ar)
```
@VibhuJawa sorry again for the delay. The current rebalancing scheme only takes into account memory consumption on the host: https://github.com/dask/distributed/blob/ac55f25d230c7144ed618d1d4374f254303a4e0a/distributed/scheduler.py#L6104-L6113. It will take some time to think through how to rebalance while taking into account current GPU consumption and any device memory limit settings. Device memory limits are set in dask-cuda, while all the rebalancing occurs in the scheduler.
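One way to observe the imbalance that host-memory-based rebalancing misses is to query device memory on each worker directly. This is a minimal sketch, assuming one GPU per worker (dask-cuda's default) and that pynvml is available; it is not part of any rebalancing API:

```python
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def gpu_memory_used():
    # Device memory in use on this worker's GPU; index 0 is correct when
    # dask-cuda restricts each worker to a single visible device.
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    return nvmlDeviceGetMemoryInfo(handle).used

# Returns {worker_address: bytes_used}; these numbers stay skewed after
# client.rebalance() because the scheduler only weighs host memory.
per_worker = client.run(gpu_memory_used)
```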
Problem: Rebalance with dask-cuda does not rebalance effectively
Problem Context
We cannot seem to rebalance with `dask-cuda` effectively, which leads to imbalanced GPU usage. This becomes a problem across a bunch of workflows, especially those that involve a mix of ETL and machine learning.
The way a lot of machine learning algorithms (XGBoost, cuml.dask.knn, cuGraph, etc.) work today is that they run a local portion of the algorithm on the data on each GPU and then do an all-reduce-like operation. If you have imbalanced data, the GPU holding the extra data becomes memory limited, which leads to out-of-memory failures. If the data were balanced equally, we would not have these issues.
Minimal Example:
Start Cluster
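A minimal sketch of starting a dask-cuda cluster, assuming the defaults (one worker per visible GPU); the original snippet was not preserved:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()  # one worker per visible GPU
client = Client(cluster)
```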
Create Imbalanced Data on workers
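A sketch of one way to create the imbalance, reusing the reproducer array from the comment above and pinning every chunk to a single worker; the worker-selection line is an illustration, not the original code:

```python
import cupy as cp
import numpy as np
import dask.array as da

# Same array as in the reproducer above: ~2 GB of float32 on the host.
ar = np.ones(shape=(1_000_000, 512), dtype=np.float32)
dask_ar = da.from_array(ar, chunks=40_000).map_blocks(cp.asarray)

# Pin all chunks to the first worker so one GPU ends up holding everything.
first_worker = sorted(client.scheduler_info()["workers"])[0]
dask_ar = client.persist(dask_ar, workers=first_worker)
```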
Try and Fail with Rebalancing
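A sketch of the failing step, assuming `client` and `dask_ar` from above: `client.rebalance()` returns without error, but since it only weighs host memory, the device memory stays concentrated on one GPU:

```python
from dask.distributed import wait

wait(dask_ar)       # ensure all chunks are materialized on the GPUs
client.rebalance()  # completes, but device memory remains skewed toward
                    # the worker that originally held the chunks
```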
GPU Usage for context:
Notebook Example:
https://gist.github.com/VibhuJawa/eb2d25c0c6fddeebf0e104b82eb8ef3e
CC: @randerzander, @quasiben, @ayushdg.