
dask-jobqueue scaling and adaptive scaling do not work well with slurm/oar #261

Closed
sjdv1982 opened this issue Nov 28, 2024 · 2 comments

Labels: bug, dependencies, high priority

Comments

sjdv1982 (Owner):
With adaptive scaling, a lot of jobs are cancelled before doing any work. Still, a job that is cancelled this way counts towards the error count of a task, causing tasks to be rejected after 4 errors.

With .scale, finished jobs are never restarted, at least on OAR.
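For context, a minimal sketch of the kind of dask-jobqueue setup where this shows up; the cluster parameters below are placeholders, not the actual seamless configuration:

```python
from dask.distributed import Client
from dask_jobqueue import OARCluster  # SLURMCluster shows the same behaviour

# Placeholder resources per clusterjob; the real values depend on the deployment.
cluster = OARCluster(cores=4, memory="8GB", walltime="01:00:00")

# Adaptive scaling: jobs get cancelled before picking up any work,
# and each cancellation still counts towards the task's error count.
cluster.adapt(minimum_jobs=0, maximum_jobs=10)

# Fixed scaling: finished jobs are never resubmitted (at least on OAR).
# cluster.scale(jobs=10)

client = Client(cluster)
```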

sjdv1982 added the bug, dependencies, and high priority labels on Nov 28, 2024
sjdv1982 (Owner, Author) commented Nov 28, 2024

I suspect that these scaling methods were developed for the cloud, where they work rather well.
This means that we probably have to cook up our own solution, in combination with resources.

  1. Allow seamless jobs (i.e. tasks) to give a memory requirement (Implement "memory" resource #240), as well as a walltime estimate, on top of the current ncores requirement.
  2. Using that, we can have the job scheduler calculate how many tasks can run simultaneously, and estimate how many clusterjob-hours are needed (do a bit of knapsacking if needed).
  3. At any time, we know how many tasks the currently running clusterjobs can finish. Call that X.
  4. If X is smaller than the number of submitted tasks (count young tasks with a lesser weight, e.g. 0 for tasks submitted < 10s ago, 0.1 for 10-30s ago, 0.5 for 30-60s ago), increase the number of clusterjobs (see the sketch after this list).
  5. If X is greater than the number of submitted tasks, retire clusterjobs. Start with the ones that are still waiting in the queue, then mark others as "no more new tasks => retire" (can Dask schedulers do that?). Probably take some margin to account for tasks taking longer than expected, tasks being restarted because of errors, etc.
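Purely as an illustration of steps 3-5, a rough sketch of the decision logic; weighted_submitted_tasks and decide_scaling are hypothetical helpers (they exist neither in seamless nor in Dask), and the weights are the ones given above:

```python
import time

def weighted_submitted_tasks(submit_times, now=None):
    """Count submitted-but-unfinished tasks, weighting young tasks less (step 4)."""
    now = time.time() if now is None else now
    total = 0.0
    for t in submit_times:
        age = now - t
        if age < 10:
            total += 0.0    # submitted < 10s ago
        elif age < 30:
            total += 0.1    # submitted 10-30s ago
        elif age < 60:
            total += 0.5    # submitted 30-60s ago
        else:
            total += 1.0
    return total

def decide_scaling(X, submit_times, margin=1.2):
    """Decide whether to add or retire clusterjobs (steps 3-5).

    X is the number of tasks the currently running clusterjobs can still
    finish. The margin leaves headroom for tasks that run longer than
    expected or get restarted after errors.
    """
    demand = weighted_submitted_tasks(submit_times)
    if X < demand:
        return "increase"   # submit more clusterjobs
    elif X > demand * margin:
        return "retire"     # start with clusterjobs still waiting in the queue
    return "keep"
```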

sjdv1982 (Owner, Author) commented Dec 5, 2024

Things go fine when you increase distributed.scheduler.unknown-task-duration to 1m (the default is 500ms).
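For example (one way to set it; a ~/.config/dask/distributed.yaml entry or the corresponding DASK_ environment variable should work as well):

```python
import dask

# Must be set in the scheduler's process before the cluster is created;
# raises the assumed duration of tasks the scheduler has never timed,
# so adaptive scaling does not tear workers down before they pick up work.
dask.config.set({"distributed.scheduler.unknown-task-duration": "1m"})
```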

sjdv1982 closed this as completed on Dec 5, 2024