Scheduler/Dispatcher does not start Jobs with >600 pending Jobs #7655
Probably this also affects running jobs. In #7659 we've described how we spawn 10-30 jobs per minute. In the logs I can see that jobs are being processed less and less as the number of pending jobs increases. Our cluster has now reached 5900 pending jobs & 75 running jobs. Processes of ansible-playbook on the instances are still running (note the CPU time):
Edit 1:
Although sometimes they're at 300+ seconds, as initially described, and some queries now run for quite a long time:
The database server is mostly idle:
Connections are also fine:
Latency between nodes & db is as good as it gets:
Edit 2:
Strace of an ansible-playbook process on the first instance in the cluster:
So it looks like the awx-manage process opens a lot of transactions and executes just one "SELECT" in each.
Edit 3:
This could be caused by the number of parallel calls to sq_inq from running our test playbooks against the same hosts. |
@ryanpetrello please take a look at this. |
@psuriset of interest to scale team |
@fosterseth if you've got some available time next week, can you take a peek at this? |
@fosterseth I'm interested in this issue so let me know if you want help testing |
In your test setup, are the jobs you are launching all from the same job template? Do you experience the same issues if you have concurrent jobs disabled in the job template (i.e. Enable Concurrent Jobs set to False)? This would mean only 1 job would run per instance at a time. The scheduler is very optimized in this case, and I would expect it to behave well even with thousands of pending jobs. I am still looking into the other case -- jobs that can run concurrently. |
@fosterseth We did not test with non-concurrent job templates as in production we will have single job templates that may be run for a lot of organizations & inventories. Would test results for the following two tests help you?
2.:
|
I think we need to find the simplest test scenario that also replicates the issue.
example.sh
If the above goes okay, then swap the above JT with one you use and retest. If it does not work (jobs don't go from pending to running), then try it again, but this time with Enable Concurrent Jobs set to False. The above test will help me determine whether the problem is task manager related. |
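For reference, a rough Python sketch of the procedure above (this is not the attached example.sh; the URL, credentials, job template ID and job count are placeholders, and the endpoints used are the standard AWX v2 REST API):

```python
# Build up a large pending-job backlog: disable all instances, launch the
# same job template many times, then re-enable the instances and watch
# whether the task manager works through the backlog.
import requests

AWX_URL = "https://awx.example.com"   # placeholder
AUTH = ("admin", "password")          # placeholder credentials
JOB_TEMPLATE_ID = 42                  # the simple test JT
NUM_JOBS = 1000                       # size of the pending backlog

session = requests.Session()
session.auth = AUTH
session.verify = False                # lab setup only

# 1. Disable every instance so launched jobs stay in "pending".
instances = session.get(f"{AWX_URL}/api/v2/instances/").json()["results"]
for inst in instances:
    session.patch(f"{AWX_URL}/api/v2/instances/{inst['id']}/", json={"enabled": False})

# 2. Launch the job template until the desired backlog exists.
for _ in range(NUM_JOBS):
    session.post(f"{AWX_URL}/api/v2/job_templates/{JOB_TEMPLATE_ID}/launch/")

# 3. Re-enable the instances; this is the point at which the scheduler
#    should start draining the backlog.
for inst in instances:
    session.patch(f"{AWX_URL}/api/v2/instances/{inst['id']}/", json={"enabled": True})
```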
I've followed your instructions, but used 15 instances instead of 10.
First test with concurrent jobs enabled
1001 pending jobs in instance group "tower" and all instances disabled:
The scheduler takes around 37 seconds to skip all 1001 pending jobs:
After enabling all instances via API (finished at 08:44:49), the node that was running the scheduler task did not detect available instances. The scheduler ran several times with the following output, until 08:47:55:
No jobs were started in this period. Following this, the scheduler was started on another node - this node then started scheduling correctly:
No jobs were started at this time; they were just assigned to instances. But this scheduler run never finished correctly - it was killed:
And the scheduler started from the beginning:
Second test with concurrent jobs disabled
Output is nearly the same: the scheduler does submit jobs, but is killed:
Why does the scheduler work sequentially? This seems to me like the wrong concept for clustered active/active environments. Could instances not take jobs themselves from a queue and process them at will - why does the scheduler have to control all processing? |
Any updates? We are still experiencing this with 14.0.0. |
@moonrail was just talking to @fosterseth about this... I think we should have an update early next week. |
@moonrail thanks for running the tests and the detailed response. A while back we had an issue with a CPython bug that was causing task manager hangs, so we put some code in place to reap the task manager after it runs for 5 minutes. In your case the task manager is not hanging; it's just that the loop for starting jobs is taking > 5 minutes. Our transactions are atomic, so reaping the task manager doesn't commit the transaction, which is why jobs aren't actually starting (they don't run) and the next task manager run starts over with the first job. Some possible solutions:
@ryanpetrello do you think it's safe to remove the task manager reaper that is in place? |
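To illustrate the interaction being described, here is a minimal sketch assuming Django's transaction.atomic semantics (this is not the real task manager code; assign_to_instance and submit_to_dispatcher are hypothetical stand-ins):

```python
from django.db import transaction

def assign_to_instance(job):
    """Hypothetical stand-in: pick an instance with free capacity."""

def submit_to_dispatcher(job):
    """Hypothetical stand-in: hand the job off for execution."""

def schedule_pending_jobs(pending_jobs):
    # The whole scheduling pass runs inside one atomic transaction.
    with transaction.atomic():
        for job in pending_jobs:
            assign_to_instance(job)
            job.status = "waiting"
            job.save(update_fields=["status"])
            submit_to_dispatcher(job)
    # The transaction commits only when the atomic block exits normally.
    # If the reaper kills the worker at the 5-minute mark, everything above
    # rolls back: no job actually starts, and the next task manager run
    # begins again with the very first pending job.
```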
I'm still a little nervous about removing that - if the task manager takes 5+ minutes to start jobs, it makes me wonder when/if it would ever recover. I do like the failsafe that it gives us to prevent deadlocks. |
One quick fix could be to limit the number of jobs the task manager can start on a given run - when it reaches that limit, it won't start any more. A quick implementation: if this limit is reached, we can immediately schedule another task manager run right after the current cycle ends; that way there isn't a delay (which is around 20 seconds or so) between runs. I'd say a limit of around 150 jobs seems low enough that even at scale (5k jobs) the task manager should get through all pending jobs within a couple of minutes. |
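Something along these lines, as a sketch of the idea (this is not the actual patch; the name START_TASK_LIMIT matches the setting discussed further down, and schedule_task_manager() is just a placeholder here):

```python
START_TASK_LIMIT = 150  # suggested cap on jobs started per task manager run

def schedule_task_manager():
    """Placeholder: would queue another task manager run right away."""

class TaskManager:
    def __init__(self):
        self.start_task_limit = START_TASK_LIMIT

    def start_task(self, task, instance):
        if self.start_task_limit <= 0:
            return                      # cap already reached in this run
        self.start_task_limit -= 1
        if self.start_task_limit == 0:
            # Don't wait ~20s for the next periodic run; queue another pass
            # as soon as the current cycle (and its transaction) finishes.
            schedule_task_manager()
        # ... existing job submission logic would follow here unchanged ...
```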
@fosterseth yea I've thought on it some, and I think I like this idea - could you turn it into a PR? |
@mcharanrm re-created this issue on a Tower 3.7.2 and then upgraded to a build of devel that included this patch. At the beginning of the timescale we see that no jobs were running on the 3.7.2 instance and that over 1k jobs were pending. The blank part is where we were running the upgrade. Then, after the upgrade, it started scheduling 100 jobs at a time and running them to completion. While something bad could happen that could still cause the task manager to time out while scheduling 100 jobs, if this were the case the user could adjust START_TASK_LIMIT to meet their needs. I'm going to say this is verified and will be released in the next release of AWX. |
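If that situation did come up, overriding the limit would presumably look like any other AWX settings override (the file path below is an assumption about a Tower-style install, not part of the patch):

```python
# e.g. /etc/tower/conf.d/task_manager.py (deployment-specific; adjust to
# however your installation injects extra settings)
START_TASK_LIMIT = 50   # cap the number of jobs started per task manager run
```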
Sorry for being late on testing. After testing in combination with But even with
Now I'm beginning to wonder if the default START_TASK_LIMIT should be lower. The example playbook is fairly short (30s + 7s overhead), so it should provide a solid baseline for job latency. Few ansible-playbooks will be shorter, but many will be longer and consume instance capacity over a longer time. If AWX nodes are being added/enabled, or if "big" capacity jobs finish after the Task Manager is started, it will not use their capacity in this run. So higher START_TASK_LIMIT values will also lead to a longer delay in using nodes' available capacity. Why not reduce the default value of START_TASK_LIMIT from 100 to a way more responsive 10? This should improve the user experience by quite a margin and make it look less like "AWX is frozen/deadlocked", as the time to wait until jobs are being executed/assigned is significantly lowered. From my understanding this should also make more use of available capacity and therefore scale a bit better. |
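To put rough numbers on that trade-off (a sketch only; the 1-2 s per-job start time is the figure quoted in the next comment, taken as ~1.5 s here):

```python
# Approximate time a full "start up to N jobs" cycle blocks the task manager
# before newly enabled/freed capacity is looked at again.
per_job_start_s = 1.5  # observed ~1-2 s per job before the fit-time fix

for limit in (10, 100):
    cycle_s = limit * per_job_start_s
    print(f"START_TASK_LIMIT={limit:>3}: ~{cycle_s:.0f}s per start cycle")
```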
If you get a chance @moonrail, please give this branch a try: https://github.com/ansible/awx/compare/devel...chrismeyersfsu:fix-tm_slow_fit?expand=1 This reduces the time to start an individual job from 1-2 seconds down to .05 seconds, so you should see a 20x+ speedup. |
ISSUE TYPE
SUMMARY
With a lot of pending jobs (e.g. if instances were disabled/down for maintenance, the cluster was restarted, a failover occurred, ...), the job scheduler/dispatcher task "run_task_manager" will not finish in under 5 minutes and is killed by: https://github.com/ansible/awx/blob/devel/awx/main/dispatch/pool.py#L385
From our observation, jobs will only be executed once the scheduler/dispatcher has run through all pending/new jobs and dispatched/skipped them.
Therefore no jobs will be executed, and "run_callback_receiver" will never be run.
We've tried it with the following setups:
ENVIRONMENT
STEPS TO REPRODUCE
First method:
Second method:
EXPECTED RESULTS
Either:
Or:
ACTUAL RESULTS
ADDITIONAL INFORMATION
Example log output of the Scheduler/Dispatcher being killed: