
Improve event processing performance by passing host_id through event data #5514

Closed
AlanCoding opened this issue Dec 16, 2019 · 6 comments

@AlanCoding
Member

ISSUE TYPE
  • Feature Idea
SUMMARY

Proposal to eliminate the database query that currently runs on every job event save to obtain the id of the host the event operates on.

This logic would be replaced:

    if self.job.inventory.kind == 'smart':
        # optimization to avoid calling inventory.hosts, which
        # can take a long time to run under some circumstances
        from awx.main.models.inventory import SmartInventoryMembership
        membership = SmartInventoryMembership.objects.filter(
            inventory=self.job.inventory, host__name=self.host_name
        ).first()
        if membership:
            host_id = membership.host_id
        else:
            host_id = None
    else:
        host_qs = self.job.inventory.hosts.filter(name=self.host_name)
        host_id = host_qs.only('id').values_list('id', flat=True).first()

The host id can be obtained from the inventory script data when these parameters are used: `/api/v2/inventories/45/script/?towervars=1&hostvars=1`. Example:

    "_meta": {
        "hostvars": {
            "change_of_vars": {
                "dynamic_2019_10_29_15_23_48.194654": "host_static_value",
                "static_key": "host_dynamic_2019_10_29_15_28_08.940019",
                "dynamic_2019_10_29_15_24_26.098715": "host_static_value",
                "dynamic_2019_10_29_15_28_08.940019": "host_static_value",
                "remote_tower_enabled": "true",
                "remote_tower_id": 370
            },
            "moover-2019_10_29_15_28_08.940019": {
                "static_var": "static_value",
                "remote_tower_enabled": "true",
                "remote_tower_id": 373
            }
        }
    }
}

This proposal is to apply those parameters to the normal pre-job-run inventory script generation. Then, when saving events, obtain `remote_tower_id` from the hostvars. This will have to be done in ansible-runner: the value will need to be added into the event data, then read in the event-saving code and used in place of the query that obtains the host id.
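As a rough sketch of what the replacement lookup could look like (the helper names and the assumption that the raw script data is available at event-save time are illustrative, not the actual AWX implementation):

```python
def build_host_id_map(script_data):
    """Build a host-name -> database-id map from inventory script output
    generated with ?hostvars=1&towervars=1. Runs once per job, not per event."""
    host_map = {}
    for name, hostvars in script_data.get('_meta', {}).get('hostvars', {}).items():
        host_id = hostvars.get('remote_tower_id')
        if host_id is not None:
            host_map[name] = host_id
    return host_map


def host_id_for_event(event_data, host_map):
    """Replace the per-event database query with a dict lookup."""
    return host_map.get(event_data.get('host'))
```

The point is that only the O(1) dict lookup happens on every event save, while the inventory script (and thus the map) is produced once per job run.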

Related to or maybe part of #4209 #5443. This can be done and tested separately before the main feature work. Ping @chrismeyersfsu

@chrismeyersfsu
Member

`remote_tower_id` sounds too much like a variable intended for the multiple-Tower inventory feature. How about `tower_host_id`?

@ryanpetrello ryanpetrello self-assigned this Jan 9, 2020
ryanpetrello added a commit to ryanpetrello/awx that referenced this issue Jan 9, 2020
we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow

see: ansible#5514
ryanpetrello added a commit to ryanpetrello/awx that referenced this issue Jan 13, 2020
additionally, optimize away several per-event host lookups and
changed/failed propagation lookups

we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow

this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`

see: ansible#5514
@kdelee
Member

kdelee commented Jan 21, 2020

Could this change have caused this traceback in logs while saving events?

2020-01-21 17:56:54,831 ERROR    awx.main.commands.run_callback_receiver Database Error Saving Job Event
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/base/base.py", line 240, in _commit
    return self.connection.commit()
psycopg2.errors.ForeignKeyViolation: insert or update on table "main_inventoryupdateevent" violates foreign key constraint "main_inventoryupdate_inventory_update_id_8974f1f7_fk_main_inve"
DETAIL:  Key (inventory_update_id)=(2038) is not present in table "main_inventoryupdate".


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/worker/callback.py", line 57, in flush
    cls.objects.bulk_create(events)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 480, in bulk_create
    obj_without_pk._state.db = self.db
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/transaction.py", line 240, in __exit__
    connection.commit()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/base/base.py", line 262, in commit
    self._commit()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/base/base.py", line 240, in _commit
    return self.connection.commit()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/base/base.py", line 240, in _commit
    return self.connection.commit()
django.db.utils.IntegrityError: insert or update on table "main_inventoryupdateevent" violates foreign key constraint "main_inventoryupdate_inventory_update_id_8974f1f7_fk_main_inve"
DETAIL:  Key (inventory_update_id)=(2038) is not present in table "main_inventoryupdate".


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.ForeignKeyViolation: insert or update on table "main_inventoryupdateevent" violates foreign key constraint "main_inventoryupdate_inventory_update_id_8974f1f7_fk_main_inve"
DETAIL:  Key (inventory_update_id)=(2038) is not present in table "main_inventoryupdate".


..... lots of tracebacks ... eventually stops and things go on working....

@AlanCoding
Member Author

That looks more related to teardown code. Inventory update events don't link to a host (only job and ad hoc command events do). What you're seeing sounds like someone deleted a running inventory update, or its related inventory or organization. We have logic to prevent that, but it might not be airtight.

Or it could be that the inventory update finished and was deleted, but not all of its events had finished processing. In that case, the tracebacks are not necessarily bad.
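The failure mode itself is generic: inserting a child row whose parent row was already deleted violates the foreign key. A minimal standalone illustration with sqlite3 (the table names only mimic the traceback; this is not AWX code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE main_inventoryupdate (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE main_inventoryupdateevent ("
    " id INTEGER PRIMARY KEY,"
    " inventory_update_id INTEGER NOT NULL"
    "     REFERENCES main_inventoryupdate(id))"
)
conn.execute("INSERT INTO main_inventoryupdate (id) VALUES (2038)")
# ... the update finishes and is deleted before all its events are saved ...
conn.execute("DELETE FROM main_inventoryupdate WHERE id = 2038")
try:
    # a later save of leftover events then fails with an integrity error
    conn.execute(
        "INSERT INTO main_inventoryupdateevent (inventory_update_id) VALUES (2038)"
    )
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```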

@ryanpetrello
Contributor

ryanpetrello commented Jan 21, 2020

@kdelee that error is what @AlanCoding suggested, and isn't unique to this change - it occurs if the callback receiver tries to save an inventory update event that's related to an inventory update that no longer exists

you can reproduce this pretty easily by:

  1. Stopping the callback receiver
  2. Launching an inventory update
  3. After state='successful' (though no stdout will show), delete the inventory update
  4. Start the callback receiver and observe these errors

as long as things recover and stdout still flows for new jobs after this, errors in the logs for this situation are fine/expected

@kdelee
Member

kdelee commented Jan 21, 2020

ok cool

@elyezer elyezer self-assigned this Feb 12, 2020
@elyezer
Member

elyezer commented Feb 12, 2020

Did the following to verify this:

  1. Created an inventory with about 4 hosts
  2. Ran a job template that interacted with all the hosts
  3. Checked that the runner-related events had the host information present in summary_fields, and that the event fields host and host_name were set accordingly.

See below an example event for one of the hosts:

        {
            "id": 559,
            "type": "job_event",
            "url": "/api/v2/job_events/559/",
            "related": {
                "job": "/api/v2/jobs/34/",
                "children": "/api/v2/job_events/559/children/",
                "host": "/api/v2/hosts/1/"
            },
            "summary_fields": {
                "host": {
                    "id": 1,
                    "name": "localhost",
                    "description": ""
                },
                "job": {
                    "id": 34,
                    "name": "Demo Job Template",
                    "description": "",
                    "status": "successful",
                    "failed": false,
                    "elapsed": 3.594,
                    "type": "job",
                    "job_template_id": 7,
                    "job_template_name": "Demo Job Template"
                },
                "role": {}
            },
            "created": "2020-02-12T16:12:21.842052Z",
            "modified": "2020-02-12T16:12:21.946018Z",
            "job": 34,
            "event": "runner_on_ok",
            "counter": 8,
            "event_display": "Host OK",
            "event_data": {
                "playbook": "hello_world.yml",
                "playbook_uuid": "dfb499d2-c070-425e-9d01-882e606180e7",
                "play": "Hello World Sample",
                "play_uuid": "0242ac12-0005-6c79-06b4-000000000006",
                "play_pattern": "all",
                "task": "Hello Message",
                "task_uuid": "0242ac12-0005-6c79-06b4-000000000008",
                "task_action": "debug",
                "task_args": "",
                "task_path": "/tmp/awx_34_ezqkhwpa/project/hello_world.yml:4",
                "host": "localhost",
                "remote_addr": "localhost",
                "res": {
                    "msg": "Hello World!",
                    "_ansible_verbose_always": true,
                    "_ansible_no_log": false,
                    "changed": false
                },
                "event_loop": null,
                "uuid": "c681c3d4-83ce-423d-8a11-46bd2c616e37"
            },
            "event_level": 3,
            "failed": false,
            "changed": false,
            "uuid": "c681c3d4-83ce-423d-8a11-46bd2c616e37",
            "parent_uuid": "0242ac12-0005-6c79-06b4-000000000008",
            "host": 1,
            "host_name": "localhost",
            "playbook": "hello_world.yml",
            "play": "Hello World Sample",
            "task": "Hello Message",
            "role": "",
            "stdout": "\u001b[0;32mok: [localhost] => {\u001b[0m\r\n\u001b[0;32m    \"msg\": \"Hello World!\"\u001b[0m\r\n\u001b[0;32m}\u001b[0m",
            "start_line": 7,
            "end_line": 10,
            "verbosity": 0
        }

With this we can consider this issue as verified.
