
remove multiprocessing.Queue usage from the callback receiver #8191

Conversation


@ryanpetrello ryanpetrello commented Sep 21, 2020

instead, just have each worker connect directly to redis
this has a few benefits:

  • it's simpler to explain and debug
  • back pressure on the queue keeps messages in redis (which is
    observable, and survives the restart of Python processes)
  • it's likely more performant at high loads than a single consumer reading from redis and distributing to workers via per-worker IPC pipes (a rough sketch of the per-worker loop follows below)
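A minimal sketch of that per-worker consume loop, assuming one redis list per queue and JSON-encoded event payloads; the class name, backoff values, and `process_task` body here are illustrative, not the exact AWX implementation:

```python
import json
import time

import redis


class CallbackWorker:
    def __init__(self, broker_url, queues):
        # each worker process holds its own connection to redis
        self.redis = redis.Redis.from_url(broker_url)
        self.queues = queues

    def run(self):
        time_to_sleep = 1
        while True:
            try:
                # block until an event arrives on any of the queues;
                # unprocessed events simply accumulate (observably) in redis
                queue, payload = self.redis.blpop(self.queues)
                time_to_sleep = 1
                self.process_task(json.loads(payload))
            except redis.exceptions.ConnectionError:
                # back off and retry if redis is temporarily unreachable
                time.sleep(time_to_sleep)
                time_to_sleep = min(time_to_sleep * 2, 30)

    def process_task(self, body):
        ...  # buffer the event and periodically flush with bulk_create()
```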

@softwarefactory-project-zuul

Build failed.

@ryanpetrello ryanpetrello changed the title from "WIP: remove multiprocessing.Queue usage from the callback receiver" to "remove multiprocessing.Queue usage from the callback receiver" on Sep 22, 2020
@ryanpetrello ryanpetrello force-pushed the callback-directly-to-redis branch from c6ef967 to 4bce21b on September 22, 2020 15:51
@softwarefactory-project-zuul

Build succeeded.

@ryanpetrello ryanpetrello force-pushed the callback-directly-to-redis branch from a962d9f to ad8b955 on September 22, 2020 22:05
@@ -123,6 +163,8 @@ def perform_work(self, body):
job_identifier = body[key]
break

self.last_event = f'\n\t- {cls.__name__} for #{job_identifier} ({body.get("event", "")} {body.get("uuid", "")})' # noqa

Member

This leaves me wanting more. Having a snapshot of the last job event processed is interesting but I suppose I want a log of all events processed. Not saying this is the place for that. But maybe this could be more of a summary. Last 10-20 processed?
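A hypothetical way to keep such a rolling summary, assuming the worker tracks its own in-memory history (the names and the window size of 20 are made up for illustration):

```python
from collections import deque

# bounded history of recent events instead of only the single last one
recent_events = deque(maxlen=20)

def note_event(cls, job_identifier, body):
    summary = f'{cls.__name__} for #{job_identifier} ({body.get("event", "")} {body.get("uuid", "")})'
    recent_events.append(summary)  # oldest entries fall off automatically
    return '\n\t- ' + '\n\t- '.join(recent_events)  # multi-line status output
```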

@softwarefactory-project-zuul

Build succeeded.

res = self.redis.blpop(self.queues)
time_to_sleep = 1
res = json.loads(res[1])
self.process_task(res)
Member

After this diff, process_task still exists and is being used, apparently by the postgres consumer. This method writes to the multiprocessing queue.

So let me echo my understanding - you are removing use of the multiprocessing queue only for the redis-connected callback receiver. However, other messages, such as "delete this inventory" or "start running this job" are sent through the postgres messaging system, and still go through the dispatcher's multiprocessing queue. So one node in the cluster may have 4 or 8 redis connections, but 1 connection to the postgres message bus.

I'm just trying to state facts, and hopefully get them right. I don't recall any evidence that the main dispatcher ever had significant buildup of messages in the queue. Perhaps this is because autoscaling is enabled for the ordinary dispatcher but not for the callback receiver (going by my memory, it may be the other way around).

Contributor Author
@ryanpetrello ryanpetrello Sep 23, 2020

Yep, you've got it right.

So under this new model, process_task (and the pattern of using a multiprocessing.Queue to dispatch messages to the worker processes) is really only used by the dispatcher. The callback receiver still has a "main process", but it only exists to fork its children and hang out until they exit.

Contributor Author

> I don't recall any evidence that the main dispatcher ever had significant buildup of messages in the queue. Perhaps this is because autoscaling is enabled for the ordinary dispatcher but not for the callback receiver.

This is mostly because the volume is way lower - you might be running hundreds (or maybe thousands) of parallel jobs/tasks, but it's not uncommon for the callback receiver to be dealing with far more events than that in parallel.

Member

The greatest volume we had there was the computed fields task. That did fail things in fun ways. I think those tended to be related to the autoscaling, memory, database connection limit, and so on. So filling up that IPC queue didn't tend to come up much, and computed fields should be muzzled now anyway. So I agree with your assessment about volume.

Contributor Author

Yea, the dispatcher, unlike the callback receiver, can detect when all the workers are busy, and can autoscale up new workers to avoid a backlog in the IPC queues. There is a limit to this, of course, but on well-provisioned hardware and clusters, it's difficult to hit (we're talking hundreds to thousands of tasks).

def record_statistics(self):
    if time.time() - self.last_stats > 1:  # buffer stat recording to once per second
        try:
            self.redis.set(f'awx_callback_receiver_statistics_{os.getpid()}', self.debug())
Member

This seems a little non-ideal. The fraction of the time that a human is actively watching the --status command is minuscule. I don't think it's particularly expensive, maybe like 2e-4 seconds per second, but it's adding to the overall background noise for messages that will never be read. I don't have an alternative idea that's not ugly, so it's just a passing thought.

Contributor Author

Yep, it does have a cost. That said, I think it's a drop in the bucket compared to other bottlenecks in event processing (for example, the overhead of the Django ORM).

Member

How often are workers recycled, and do the keys get cleared? Should this line pass kwarg ex=~1?

https://redis-py.readthedocs.io/en/stable/_modules/redis/client.html
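For reference, redis-py's set() does accept an expiry via ex=. A hedged sketch of what that could look like here, with an illustrative TTL that would need to comfortably exceed the refresh interval:

```python
import os

def record_statistics(self):
    # refresh the per-worker stats key; ex= gives it a TTL so a key left
    # behind by a dead worker eventually expires instead of lingering
    self.redis.set(
        f'awx_callback_receiver_statistics_{os.getpid()}',
        self.debug(),
        ex=5,
    )
```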

Contributor Author
@ryanpetrello ryanpetrello Sep 23, 2020

Callback workers don't get recycled; they exist until the process exits.

At process startup time, I unset all of these keys to wipe out the stats from the previous workers:

https://github.com/ansible/awx/pull/8191/files#diff-f37b92a11438678a6a32ac23a7790f05R54

@ryanpetrello ryanpetrello force-pushed the callback-directly-to-redis branch from ad8b955 to 0247c69 on September 23, 2020 15:16
@softwarefactory-project-zuul

Build succeeded.

@ryanpetrello ryanpetrello force-pushed the callback-directly-to-redis branch from 0247c69 to 4143064 on September 23, 2020 16:03
@softwarefactory-project-zuul

Build succeeded.

@ryanpetrello ryanpetrello commented Sep 23, 2020

Some more metrics on the current version of this PR, single node install, m5.16xlarge, processing a queue backlog of 5M+ small events, external logging disabled:

[root@ip-10-0-15-214 ec2-user]# awx-manage print_settings JOB_EVENT_WORKERS
JOB_EVENT_WORKERS                        = 16
[root@ip-10-0-15-214 ec2-user]# awx-manage print_settings UI_LIVE_UPDATES_ENABLED
UI_LIVE_UPDATES_ENABLED                  = False
[root@ip-10-0-15-214 ec2-user]# awx-manage callback_stats
main_jobevent
↳  last minute 1835721
main_inventoryupdateevent
↳  last minute 0
main_projectupdateevent
↳  last minute 0
main_adhoccommandevent
↳  last minute 0

Obviously your mileage will vary depending on the actual size of your stdout and other parallel work, and things like database IOPS; these numbers represent "best case" performance on a single-node system with no other load or CPU contention.

Turning on external logging generally adds a 20-30% cost due to all of the string munging/formatting we do in Python and the overhead of cpython's logging module for record generation, including the overhead of our various log handlers and filters. There might be another 5-10% optimization we could squeeze out of this if we really wanted to.

This could probably be made slightly faster by throwing more IOPS at a separate, dedicated database VM, but at this point we're coming up against the overhead of the Django ORM and its bulk_create implementation. I suspect that rewriting the insertion in vanilla psycopg2 with copy_from is one of the last remaining drastic improvements we could make to this model in Python.
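A rough sketch of what that psycopg2/copy_from approach could look like, bypassing the ORM entirely; the column subset and dict keys are assumptions for illustration, not the real main_jobevent schema:

```python
import io

import psycopg2


def copy_events(dsn, events):
    # stream rows in with COPY instead of issuing per-row INSERTs via the ORM
    buf = io.StringIO()
    for e in events:
        buf.write(f"{e['job_id']}\t{e['event']}\t{e['counter']}\n")
    buf.seek(0)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.copy_from(buf, 'main_jobevent', columns=('job_id', 'event', 'counter'))
```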

@softwarefactory-project-zuul

Build succeeded.

@@ -26,7 +30,7 @@

# the number of seconds to buffer events in memory before flushing
# using JobEvent.objects.bulk_create()
-BUFFER_SECONDS = .1
+BUFFER_SECONDS = 1
Member

put in settings/defaults.py

Contributor Author

👍
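A hedged sketch of what moving that into settings could look like (the setting name JOB_EVENT_BUFFER_SECONDS is an assumption for illustration):

```python
from django.conf import settings

# flush interval for the in-memory event buffer, overridable in settings/defaults.py
BUFFER_SECONDS = getattr(settings, 'JOB_EVENT_BUFFER_SECONDS', 1)
```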

self.pid = os.getpid()
self.redis = redis.Redis.from_url(settings.BROKER_URL)
for key in self.redis.keys('awx_callback_receiver_statistics_*'):
    self.redis.delete(key)
Member
@chrismeyersfsu chrismeyersfsu Sep 24, 2020

This could race other callback workers on init and cause a traceback, and the worker would then get re-spawned? We should at least try/except this.

Contributor Author

A key miss on delete doesn't raise a traceback:

In [4]: r.delete('foo')
Out[4]: 0
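If extra defensiveness were still wanted, a sketch along the lines of the suggestion above, wrapping the startup cleanup (same key prefix as the diff; only redis-level errors are caught):

```python
import redis

try:
    # best-effort cleanup of stale per-worker stats keys; a concurrent
    # delete from another worker just returns 0 rather than raising
    for key in self.redis.keys('awx_callback_receiver_statistics_*'):
        self.redis.delete(key)
except redis.exceptions.RedisError:
    # don't let a transient redis hiccup at startup kill the worker
    pass
```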

    return {'event': 'FLUSH'}

def record_statistics(self):
    if time.time() - self.last_stats > 5:  # buffer stat recording to once per 5s
Member

make the 5 a setting in settings/defaults.py

Contributor Author

👍

@property
def mb(self):
    return '{:0.3f}'.format(
        psutil.Process(os.getpid()).memory_info().rss / 1024.0 / 1024.0
Member

replace os.getpid() with self.pid.

Contributor Author

👍

instead, just have each worker connect directly to redis
this has a few benefits:

- it's simpler to explain and debug
- back pressure on the queue keeps messages around in redis (which is
  observable, and survives the restart of Python processes)
- it's likely notably more performant at high loads
@ryanpetrello ryanpetrello force-pushed the callback-directly-to-redis branch from 06c2055 to cd0b9de on September 24, 2020 17:54
@softwarefactory-project-zuul

Build succeeded.

@softwarefactory-project-zuul

Build succeeded (gate pipeline).

@softwarefactory-project-zuul softwarefactory-project-zuul bot merged commit ce65ed0 into ansible:devel Sep 24, 2020