
dask-jobqueue scaling and adaptive scaling do not work well with slurm/oar #261

Closed
sjdv1982 opened this issue Nov 28, 2024 · 2 comments

Labels: bug, dependencies, high priority

Comments

sjdv1982 (Owner):
With adaptive scaling, a lot of jobs are cancelled before doing any work. Still, a job that is cancelled this way counts towards the error count of a task, causing tasks to be rejected after 4 errors.

With .scale, finished jobs are never restarted, at least on OAR.
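For context, a minimal sketch of the kind of dask-jobqueue setup where this shows up; the cluster parameters below are placeholders, not the actual seamless configuration:

```python
from dask.distributed import Client
from dask_jobqueue import OARCluster  # SLURMCluster shows the same behaviour

# Placeholder resources per clusterjob; the real values depend on the deployment.
cluster = OARCluster(cores=4, memory="8GB", walltime="01:00:00")

# Adaptive scaling: jobs get cancelled before picking up any work,
# and each cancellation still counts towards the task's error count.
cluster.adapt(minimum_jobs=0, maximum_jobs=10)

# Fixed scaling: finished jobs are never resubmitted (at least on OAR).
# cluster.scale(jobs=10)

client = Client(cluster)
```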

sjdv1982 added the bug, dependencies, and high priority labels on Nov 28, 2024
sjdv1982 (Owner, Author) commented Nov 28, 2024

I suspect that these scaling methods were developed for the cloud, where they work rather well.
This means that we probably have to cook up our own solution, in combination with resources.

  1. Allow seamless jobs (i.e. tasks) to give a memory requirement (Implement "memory" resource #240), as well as a walltime estimate, on top of the current ncores requirement.
  2. Using that, we can have the job scheduler calculate how many tasks can run simultaneously, and estimate how many clusterjob-hours are needed (do a bit of knapsacking if needed).
  3. At any time, we know how many tasks the currently running clusterjobs can finish. Call that X.
  4. If X is smaller than the number of submitted tasks (count young tasks with a lesser weight, e.g. 0 for tasks submitted < 10s ago, 0.1 for 10-30s ago, 0.5 for 30-60s ago), increase the number of clusterjobs (see the sketch after this list).
  5. If X is greater than the number of submitted tasks, retire clusterjobs. Start with the ones that are still waiting in the queue, then mark others as "no more new tasks => retire" (can Dask schedulers do that?). Probably take some margin to account for tasks taking longer than expected, tasks being restarted because of errors, etc.
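Purely as an illustration of steps 3-5, a rough sketch of the decision logic; weighted_submitted_tasks and decide_scaling are hypothetical helpers (they exist neither in seamless nor in Dask), and the weights are the ones given above:

```python
import time

def weighted_submitted_tasks(submit_times, now=None):
    """Count submitted-but-unfinished tasks, weighting young tasks less (step 4)."""
    now = time.time() if now is None else now
    total = 0.0
    for t in submit_times:
        age = now - t
        if age < 10:
            total += 0.0    # submitted < 10s ago
        elif age < 30:
            total += 0.1    # submitted 10-30s ago
        elif age < 60:
            total += 0.5    # submitted 30-60s ago
        else:
            total += 1.0
    return total

def decide_scaling(X, submit_times, margin=1.2):
    """Decide whether to add or retire clusterjobs (steps 3-5).

    X is the number of tasks the currently running clusterjobs can still
    finish. The margin leaves headroom for tasks that run longer than
    expected or get restarted after errors.
    """
    demand = weighted_submitted_tasks(submit_times)
    if X < demand:
        return "increase"   # submit more clusterjobs
    elif X > demand * margin:
        return "retire"     # start with clusterjobs still waiting in the queue
    return "keep"
```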

sjdv1982 (Owner, Author) commented Dec 5, 2024

Things go fine when you increase distributed.scheduler.unknown-task-duration to 1m (the default is 500ms).
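For example (one way to set it; a ~/.config/dask/distributed.yaml entry or the corresponding DASK_ environment variable should work as well):

```python
import dask

# Must be set in the scheduler's process before the cluster is created;
# raises the assumed duration of tasks the scheduler has never timed,
# so adaptive scaling does not tear workers down before they pick up work.
dask.config.set({"distributed.scheduler.unknown-task-duration": "1m"})
```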

sjdv1982 closed this as completed on Dec 5, 2024