Register system again if deleted by another pod #12494
    self.instance_name = Instance.objects.me().hostname
except Exception as e:
    self.instance_name = settings.CLUSTER_HOST_ID
    logger.info(f'Instance {self.instance_name} seems to be unregistered, error: {e}')
@fosterseth help me out here please. This subsystem metrics code is front-running a million other scenarios by throwing an exception before we get deeper into the service details. For example, we have this log, which is beautifully constructed:
tools_awx_1 | 2022-07-08 17:40:57,095 INFO [-] awx.main.wsbroadcast Unable to return currently active instance: No instance found with the current cluster host id, retry in 5s...
tools_awx_1 | make[1]: Leaving directory '/awx_devel'
tools_awx_1 | 2022-07-08 17:41:02,473 INFO exited: awx-wsbroadcast (exit status 0; expected)
tools_awx_1 | 2022-07-08 17:41:03,477 INFO spawned: 'awx-wsbroadcast' with pid 27242
tools_awx_1 | make[1]: Entering directory '/awx_devel'
tools_awx_1 | 2022-07-08 17:41:04,498 INFO success: awx-wsbroadcast entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
tools_awx_1 | 2022-07-08 17:41:05,510 INFO [-] awx.main.wsbroadcast Unable to return currently active instance: No instance found with the current cluster host id, retry in 5s...
tools_awx_1 | make[1]: Leaving directory '/awx_devel'
tools_awx_1 | 2022-07-08 17:41:10,892 INFO exited: awx-wsbroadcast (exit status 0; expected)
tools_awx_1 | 2022-07-08 17:41:10,894 INFO spawned: 'awx-wsbroadcast' with pid 27275
tools_awx_1 | make[1]: Entering directory '/awx_devel'
tools_awx_1 | 2022-07-08 17:41:11,903 INFO success: awx-wsbroadcast entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
tools_awx_1 | 2022-07-08 17:41:12,996 INFO [-] awx.main.wsbroadcast Unable to return currently active instance: No instance found with the current cluster host id, retry in 5s...
I can get this log if I suppress errors from this code, as I'm trying to do here.
But this raises the obvious question: why do this at all? Why did we not start out referencing settings.CLUSTER_HOST_ID? We don't need the model, just the name. Did we get ourselves into this situation by cargo-culting the Instance.objects.me() call?
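To make the two options concrete, here is a minimal, self-contained sketch. It is not the AWX code: `FakeSettings`, `lookup_hostname`, `name_with_fallback`, and `name_from_settings` are hypothetical stand-ins, and it assumes that a registered instance's hostname equals the cluster host id, which is what "just the name" suggests.

```python
import logging

logger = logging.getLogger("metrics_sketch")


class FakeSettings:
    # Hypothetical stand-in for Django settings; only the attribute used here.
    CLUSTER_HOST_ID = "awx_1"


def lookup_hostname(registered_hosts, host_id):
    """Stand-in for Instance.objects.me().hostname: fails if the host is not registered."""
    if host_id not in registered_hosts:
        raise LookupError("No instance found with the current cluster host id")
    return host_id


def name_with_fallback(registered_hosts, settings):
    """Shape of the change in this diff: try the lookup, fall back to the setting."""
    try:
        return lookup_hostname(registered_hosts, settings.CLUSTER_HOST_ID)
    except Exception as e:
        name = settings.CLUSTER_HOST_ID
        logger.info(f"Instance {name} seems to be unregistered, error: {e}")
        return name


def name_from_settings(settings):
    """Alternative raised above: read the setting directly, with no lookup at all."""
    return settings.CLUSTER_HOST_ID


# With no registered hosts, both approaches end up with the same name.
unregistered = set()
fallback_name = name_with_fallback(unregistered, FakeSettings)
direct_name = name_from_settings(FakeSettings)
```

Under that assumption the two functions return the same value whether or not the instance is registered, so the direct read is simpler and cannot raise during startup.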
Avoid cases where a missing instance would throw an error on startup; this gives the heartbeat time to register it.
SUMMARY
Solution for #12471
As of opening this, I can confirm it works, but causes churn in some other services.
ISSUE TYPE
COMPONENT NAME