With adaptive scaling, a lot of jobs are cancelled before doing any work. Still, a job that is cancelled in this way counts toward the error count of a task, causing tasks to be rejected after 4 errors.
With .scale, finished jobs are never restarted, at least on OAR.
I suspect that these scaling methods were developed for the cloud and work rather well for those cases.
This means that we probably have to cook up our own solution, in combination with resources.
Allow seamless jobs (i.e. tasks) to declare a memory requirement (Implement "memory" resource #240), as well as a walltime estimate, on top of the current ncores requirement.
Using that, we can have the jobscheduler calculate how many tasks can run simultaneously, and estimate how many clusterjob-hours are needed (doing a bit of knapsacking if needed).
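As a rough illustration of that calculation (plain Python, not existing Seamless or dask-jobqueue API; `TaskSpec` and `JobSpec` are made-up names), packing tasks into clusterjobs and estimating clusterjob-hours could look something like this:

```python
# Rough sketch only: TaskSpec / JobSpec are hypothetical, not Seamless or
# dask-jobqueue API. Illustrates the "bit of knapsacking" mentioned above.
from dataclasses import dataclass


@dataclass
class TaskSpec:
    ncores: int       # current requirement
    memory: float     # proposed: memory requirement, in GB
    walltime: float   # proposed: walltime estimate, in hours


@dataclass
class JobSpec:
    ncores: int       # cores per clusterjob
    memory: float     # memory per clusterjob, in GB
    walltime: float   # clusterjob walltime, in hours


def tasks_per_clusterjob(task: TaskSpec, job: JobSpec) -> int:
    """How many copies of `task` fit simultaneously into one clusterjob."""
    by_cores = job.ncores // task.ncores
    by_memory = int(job.memory // task.memory)
    return max(0, min(by_cores, by_memory))


def estimate_clusterjob_hours(tasks: list, job: JobSpec) -> float:
    """Very rough estimate of clusterjob-hours needed for a batch of tasks.

    Assumes reasonably homogeneous tasks; a real implementation would do
    proper bin packing (e.g. first-fit decreasing).
    """
    total_core_hours = sum(t.ncores * t.walltime for t in tasks)
    total_mem_hours = sum(t.memory * t.walltime for t in tasks)
    # Whichever resource is the bottleneck dominates the estimate.
    return max(total_core_hours / job.ncores, total_mem_hours / job.memory)
```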
At any time, we know how many tasks the currently running clusterjobs can finish. Call that X.
If X is smaller than the number of submitted tasks (count young tasks with a lower weight, e.g. 0 for tasks submitted < 10s ago, 0.1 for 10-30s ago, 0.5 for 30-60s ago), increase the number of clusterjobs.
If X is greater than the number of submitted tasks, retire clusterjobs. Start with the ones that are waiting in the queue, then mark others as "no more new tasks => retire" (can Dask schedulers do that?). Probably take some margin to account for: tasks taking longer than expected, tasks being restarted because of errors, etc.
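A minimal sketch of that decision logic, assuming we can query X and the tasks' submit times (all function names are made up for illustration; the weights follow the example above):

```python
# Sketch of the scale-up/down decision; illustrative only, not existing
# Seamless or dask-jobqueue API.
import math
import time


def weighted_submitted_tasks(submit_times, now=None):
    """Count submitted tasks, weighting young tasks less:
    0 for < 10s old, 0.1 for 10-30s, 0.5 for 30-60s, 1.0 for older."""
    now = time.time() if now is None else now
    total = 0.0
    for t in submit_times:
        age = now - t
        if age < 10:
            total += 0.0
        elif age < 30:
            total += 0.1
        elif age < 60:
            total += 0.5
        else:
            total += 1.0
    return total


def scaling_decision(capacity_x, submit_times, tasks_per_clusterjob, margin=1.2):
    """Return a positive number of clusterjobs to add, a negative number to
    retire, or 0.

    capacity_x is X: how many tasks the currently running clusterjobs can
    still finish. margin is headroom for tasks that run longer than expected
    or get restarted after errors.
    """
    demand = weighted_submitted_tasks(submit_times) * margin
    if demand > capacity_x:
        # Scale up: enough new clusterjobs to cover the deficit.
        return math.ceil((demand - capacity_x) / tasks_per_clusterjob)
    surplus = capacity_x - demand
    if surplus >= tasks_per_clusterjob:
        # Scale down: cancel queued clusterjobs first, then mark running ones
        # as "no more new tasks => retire".
        return -int(surplus // tasks_per_clusterjob)
    return 0
```

Only retiring once at least a full clusterjob's worth of capacity is surplus gives some hysteresis, so we don't flip-flop between scaling up and down.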