No instance found with the current cluster host id #7100

Closed
rchaud opened this issue May 20, 2020 · 10 comments

Comments

@rchaud
Contributor

rchaud commented May 20, 2020

ISSUE TYPE
  • Bug Report
SUMMARY

I am currently running AWX 9.3.0 without issues. I have tried to install 10.0.0, 11.0.0, 11.1.0, and 11.2.0, but they all fail with the same error message: "No instance found with the current cluster host id".

There are many reports of this error without a solid solution. Is there a fix for this error in AWX versions 10.0.0, 11.0.0, 11.1.0, and 11.2.0?

Does the Redis version matter at this point for the broker deployment?

ENVIRONMENT
  • AWX version: 10.0.0, 11.0.0, 11.1.0, 11.2.0
  • AWX install method: Kubernetes YAML, using Docker hub images.
@ryanpetrello
Contributor

@chinochao do you have any logs you can share here?

@ryanpetrello
Contributor

I wonder if you're encountering some version of this:

#7000

@Seb0042

Seb0042 commented Aug 3, 2020

I have the same issue, though not in the web container: the web UI is fine.
But the awx-task container no longer works.
I have the same environment: Kubernetes YAML, using Docker Hub images.
Here is the log from the awx-task container:

2020-08-03 15:16:54,656 WARNING awx.main.dispatch.periodic periodic beat started
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/__init__.py", line 154, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 55, in handle
    reaper.reap()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/reaper.py", line 38, in reap
    (changed, me) = Instance.objects.get_or_register()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 158, in get_or_register
    return (False, self.me())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 116, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

When I look at the "main_instance" table, I see that a new instance is registered each time I stop and start the pod.
I migrated from 9.3 to 10, then to 11.0, then 11.2, then 12. The issue started with version 12.0.
I tried to delete the old instances with "awx-manage deprovision_instance --hostname=[name of the pods]", but it is still not working.

How can I fix this so that AWX works again?
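
For readers following the traceback, here is a minimal sketch of the kind of lookup that appears to fail. It is a simplified stand-in, not AWX's actual code: the function `me`, the list of hostnames, and the value of `CLUSTER_HOST_ID` are assumptions chosen for illustration. Only the error message is taken from the log above.

```python
# Hypothetical stand-in for the lookup seen in the traceback; not AWX's real code.
CLUSTER_HOST_ID = "awx"  # assumed configured value


def me(registered_hostnames, cluster_host_id=CLUSTER_HOST_ID):
    """Return the registered hostname equal to cluster_host_id, or raise."""
    for hostname in registered_hostnames:
        if hostname == cluster_host_id:
            return hostname
    raise RuntimeError("No instance found with the current cluster host id")


# One row per pod restart (placeholder names), none equal to the configured id.
rows = ["awx-online-xxxx-yyyy", "awx-online-aaaa-bbbb"]

try:
    me(rows)
    error = None
except RuntimeError as exc:
    error = str(exc)

print(error)
```

Under these assumptions, adding more rows with pod-derived names never helps; the lookup only succeeds once some row matches the configured id.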

@rchaud
Contributor Author

rchaud commented Aug 3, 2020

@Seb0042 I still haven't found a solution for this. I was forced to stay on version 9.3 for the moment. Hopefully the AWX developers can see what the issue is. My logs are the same as the ones you provided.

@Seb0042

Seb0042 commented Aug 4, 2020

I found a workaround. The original name of the deployment is awx. I had renamed it awx-online for my needs. I tried again with the original name and it works.
I will keep it like this until we have a solution.

@rchaud
Contributor Author

rchaud commented Aug 4, 2020

> I found a workaround. The original name of the deployment is awx. I was renaming it awx-online for my needs. I tried with the original name and it works again.
> I will stay like this until we have a solution.

Where did you make this change exactly? settings.py or another file?

@Seb0042

Seb0042 commented Aug 4, 2020

No, I'm talking about the deployment's name in Kubernetes. My pods were named awx-online-xxxx-yyyy, but it seems that for the moment they have to be named awx-xxxx-yyyy. So I reverted the changes I had made in the deployment.yml file (I deleted the deployment and then recreated it).

@rchaud
Contributor Author

rchaud commented Aug 4, 2020

> No, I'm talking about the deployment's name in Kubernetes. My pods were named awx-online-xxxx-yyyy but it seems that for the moment they have to be named awx-xxxx-yyyy. So I reversed the changes I made in the deployment.yml file (I deleted the deployment and then recreated it).

Cool, I will see if that works for me. My deployment is named awx-web-xxxx-yyyy and awx-task-xxxx-yyyy.

@timhaak

timhaak commented Oct 27, 2020

For anyone else who wants to rename the deployment:

The hostname needs to match CLUSTER_HOST_ID in /etc/tower/awx_settings.py (the default is 'awx').

If you have already started everything and it is now broken, connect to the Postgres DB and change hostname in public.main_instance to the hostname you have chosen, which must match the setting above.
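
The database fix described above can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite table as a stand-in for the real Postgres table, the table is reduced to two columns, and the stale hostname is a placeholder. On a real deployment you would run the equivalent UPDATE against Postgres, ideally after taking a backup.

```python
import sqlite3

CLUSTER_HOST_ID = "awx"  # default value according to the comment above

# In-memory stand-in for public.main_instance (the real table has more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_instance (id INTEGER PRIMARY KEY, hostname TEXT)")
conn.execute(
    "INSERT INTO main_instance (hostname) VALUES (?)",
    ("awx-online-xxxx-yyyy",),  # placeholder for a hostname registered by a renamed pod
)


def matching_rows(connection, host_id):
    """Count the rows whose hostname equals the configured cluster host id."""
    cur = connection.execute(
        "SELECT COUNT(*) FROM main_instance WHERE hostname = ?", (host_id,)
    )
    return cur.fetchone()[0]


before = matching_rows(conn, CLUSTER_HOST_ID)

# The fix: make the stored hostname match CLUSTER_HOST_ID.
conn.execute(
    "UPDATE main_instance SET hostname = ? WHERE hostname = ?",
    (CLUSTER_HOST_ID, "awx-online-xxxx-yyyy"),
)
conn.commit()

after = matching_rows(conn, CLUSTER_HOST_ID)
print(before, after)
```

The parameterized queries are a deliberate choice so that hostnames containing unusual characters cannot alter the SQL statement.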

@minsis

minsis commented Feb 1, 2021

I ran into a similar issue today. Looking at the code, the Instance object is by default supposed to use settings.CLUSTER_HOST_ID if no hostname argument is passed in. But for some reason, somewhere along the line, when a fresh container starts up the registration process registers the task container's hostname instead.

I needed to change the hostname because of DNS issues. The task container has the same hostname as the host machine. In a clustered environment the hostnames need to be valid so that websocket communication can happen. However, I have a playbook that needs to connect to each cluster server, and inside the task container the hostname resolves to the container's IP when it tries to connect to its own machine.

So I needed to change the hostname, but I could not, because the registration process registers the task container's hostname in the instances table (and awx_web tries to connect to this hostname) instead of the CLUSTER_HOST_ID from settings.py. With clustered instances the hostname should come from settings.py. That way you can set the proper hostname for websocket communication to work, and the hostname of the container doesn't, and shouldn't, matter.
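
The difference between the expected and the reported registration behavior can be sketched like this. The helper `resolve_hostname` and both hostnames are hypothetical and only model the description in this comment; they are not taken from the AWX code base.

```python
CLUSTER_HOST_ID = "awx-node-1.example.com"  # assumed value from settings.py


def resolve_hostname(hostname=None, cluster_host_id=CLUSTER_HOST_ID):
    """Expected behavior: an explicit hostname wins, otherwise use the configured id."""
    return hostname if hostname is not None else cluster_host_id


# Expected: no hostname is passed in, so the configured id is registered.
expected = resolve_hostname()

# Reported: the container's own hostname ends up being registered instead.
container_hostname = "task-container-1234"  # placeholder
observed = resolve_hostname(container_hostname)

# A later lookup by CLUSTER_HOST_ID only succeeds if the two values agree.
lookup_succeeds = observed == CLUSTER_HOST_ID
print(expected, observed, lookup_succeeds)
```

In this model the mismatch arises purely from which value is handed to the registration step, which is why setting the hostname in one place (settings.py) would remove the dependence on the container's name.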


9 participants