Task container unreachable from the web container #7404

Closed
zigaSRC opened this issue Jun 22, 2020 · 16 comments

@zigaSRC

zigaSRC commented Jun 22, 2020

ISSUE TYPE
  • Bug Report
SUMMARY

After an upgrade from 11.2.0 to 12.0.0 the awx_web container can no longer reach the awx_task container. The consequence is that there are plenty of errors in the UI and some things just don't work.

ENVIRONMENT
  • AWX version: 12.0.0
  • AWX install method: docker on linux
  • Ansible version: 2.9.9
  • Operating System: CentOS 7.7
  • Web Browser: Independent/All (Chrome, firefox, edge, ...)
STEPS TO REPRODUCE

Upgrade from 11.2.0 to 12.0.0 by running the newer install playbook with appropriate inventory.

EXPECTED RESULTS

The upgrade succeeds and the containers are able to talk to each other.

ACTUAL RESULTS

UI not fully working with lots of GET: -1 errors.
Errors in awx_web logs:

2020-06-20 00:14:05,041 WARNING  awx.main.wsbroadcast Connection from awx to awxtask failed: 'Cannot connect to host awxtask:443 ssl:False [Name or service not known]'.
2020-06-20 00:14:05,043 DEBUG    awx.main.wsbroadcast Connection from awx to awxtask attempt number 10885.
ADDITIONAL INFORMATION

Manually adding an entry in the /etc/hosts file for the awxtask container results in the following error:

2020-06-22 12:46:04,456 WARNING  awx.main.wsbroadcast Connection from awx to awxtask failed: 'Cannot connect to host awxtask:443 ssl:False [Connect call failed ('172.18.0.5', 443)]'.
2020-06-22 12:46:04,458 DEBUG    awx.main.wsbroadcast Connection from awx to awxtask attempt number 54201.

The task container is reachable via awx_task since that's the actual name of the container, but that's beside the point, since renaming it to awxtask won't solve the problem, as seen above.
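
The two log messages above point at different failures: `Name or service not known` is a failed name lookup, while `Connect call failed` means the name resolved but nothing accepted the TCP connection on that port. A minimal sketch for telling the two apart from inside the web container, assuming Python 3 is available there; the `probe` helper is hypothetical and not part of AWX:

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify reachability of host:port as 'dns_failure', 'connect_failed' or 'ok'."""
    try:
        # Name resolution only; raises socket.gaierror if the name is unknown.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    try:
        # Full TCP handshake; raises OSError if refused, timed out or unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "connect_failed"
```

Calling `probe("awxtask", 443)` with the host name and port from the log lines would separate a missing DNS/hosts entry from a service that is simply not listening on that port.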

@ryanpetrello
Contributor

I'm not totally sure what's up here @zigaSRC, but I'm not able to reproduce this in my vanilla 12.0.0 and 13.0.0 AWX local Docker installs. These containers definitely must be able to reach each other to broadcast stdout messages.

Can you share your inventory?

@zigaSRC
Author

zigaSRC commented Jun 29, 2020

Sadly I can't share our inventories but if there's anything you would like me to check just let me know. Besides I don't think it's related to that, since it's not tied to running any playbook or inventory sync, but you probably know best.

BTW I upgraded to 13.0.0 since you mentioned it, but that didn't help the issue:

Using /etc/ansible/ansible.cfg as config file
127.0.0.1 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "elapsed": 0,
    "match_groupdict": {},
    "match_groups": [],
    "path": null,
    "port": 5432,
    "search_regex": null,
    "state": "started"
}
Using /etc/ansible/ansible.cfg as config file
127.0.0.1 | SUCCESS => {
    "ansible_facts": {
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 83)
spawned uWSGI worker 1 (pid: 87, cores: 1)
spawned uWSGI worker 2 (pid: 88, cores: 1)
spawned uWSGI worker 3 (pid: 89, cores: 1)
spawned uWSGI worker 4 (pid: 90, cores: 1)
spawned uWSGI worker 5 (pid: 91, cores: 1)
WSGI app 0 (mountpoint='') ready in 17 seconds on interpreter 0x1df9530 pid: 87 (default app)
[pid: 87|app: 0|req: 1/1] 172.25.7.90 () {56 vars in 993 bytes} [Mon Jun 29 06:15:15 2020] GET / => generated 11460 bytes in 4402 msecs (HTTP/1.1 200) 5 headers in 169 bytes (1 switches on core 0)
WSGI app 0 (mountpoint='') ready in 21 seconds on interpreter 0x1df9530 pid: 88 (default app)
WSGI app 0 (mountpoint='') ready in 22 seconds on interpreter 0x1df9530 pid: 91 (default app)
2020-06-29 06:15:22,077 INFO     daphne.cli Starting server at tcp:port=8051:interface=127.0.0.1
2020-06-29 06:15:22,077 INFO     Starting server at tcp:port=8051:interface=127.0.0.1
2020-06-29 06:15:22,083 INFO     daphne.server HTTP/2 support not enabled (install the http2 and tls Twisted extras)
2020-06-29 06:15:22,083 INFO     HTTP/2 support not enabled (install the http2 and tls Twisted extras)
2020-06-29 06:15:22,083 INFO     daphne.server Configuring endpoint tcp:port=8051:interface=127.0.0.1
2020-06-29 06:15:22,083 INFO     Configuring endpoint tcp:port=8051:interface=127.0.0.1
2020-06-29 06:15:22,102 INFO     daphne.server Listening on TCP address 127.0.0.1:8051
2020-06-29 06:15:22,102 INFO     Listening on TCP address 127.0.0.1:8051
WSGI app 0 (mountpoint='') ready in 24 seconds on interpreter 0x1df9530 pid: 89 (default app)
WSGI app 0 (mountpoint='') ready in 25 seconds on interpreter 0x1df9530 pid: 90 (default app)
[pid: 91|app: 0|req: 1/2] 172.25.7.90 () {54 vars in 860 bytes} [Mon Jun 29 06:15:21 2020] GET /api/ => generated 21204 bytes in 2847 msecs (HTTP/1.1 200) 11 headers in 400 bytes (1 switches on core 0)
2020-06-29 06:15:25,387 WARNING  awx.main.wsbroadcast Adding {'awxtask'} to websocket broadcast list
2020-06-29 06:15:25,394 DEBUG    awx.main.wsbroadcast Connection from awx to awxtask attempt number 0.
2020-06-29 06:15:25,730 WARNING  awx.main.wsbroadcast Connection from awx to awxtask failed: 'Cannot connect to host awxtask:443 ssl:False [Name or service not known]'.
2020-06-29 06:15:25,734 DEBUG    awx.main.wsbroadcast Connection from awx to awxtask attempt number 1.
[pid: 89|app: 0|req: 1/3] 172.25.7.90 () {54 vars in 906 bytes} [Mon Jun 29 06:15:24 2020] GET /api/v2/auth/ => generated 2 bytes in 1751 msecs (HTTP/1.1 200) 10 headers in 285 bytes (1 switches on core 0)
[pid: 90|app: 0|req: 1/4] 172.25.7.90 () {56 vars in 912 bytes} [Mon Jun 29 06:15:24 2020] GET /api/ => generated 21204 bytes in 2314 msecs (HTTP/1.1 200) 11 headers in 400 bytes (1 switches on core 0)
[pid: 89|app: 0|req: 2/5] 172.25.7.90 () {56 vars in 918 bytes} [Mon Jun 29 06:15:26 2020] GET /api/v2/ => generated 1688 bytes in 570 msecs (HTTP/1.1 200) 10 headers in 288 bytes (1 switches on core 0)
2020-06-29 06:15:30,750 WARNING  awx.main.wsbroadcast Connection from awx to awxtask failed: 'Cannot connect to host awxtask:443 ssl:False [Name or service not known]'.
2020-06-29 06:15:30,752 DEBUG    awx.main.wsbroadcast Connection from awx to awxtask attempt number 2.

Edit: We could talk on IRC to see if we can figure something out. Just let me know where I can reach you.

@ryanpetrello
Contributor

Have you customized your container names or hostnames in some way? With a vanilla 13.0.0 install, I'm unable to reproduce this. Where is awxtask coming from?

~/dev/awx/installer docker ps -a
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS                  NAMES
55e39345ced2        ansible/awx:13.0.0   "tini -- /usr/bin/la…"   4 minutes ago       Up 4 minutes        8052/tcp               awx_task
8eea8e20de10        ansible/awx:13.0.0   "tini -- /bin/sh -c …"   4 minutes ago       Up 4 minutes        0.0.0.0:80->8052/tcp   awx_web
3faf2fe40ff6        postgres:10          "docker-entrypoint.s…"   4 minutes ago       Up 4 minutes        5432/tcp               awx_postgres
429c8448aa12        redis                "docker-entrypoint.s…"   4 minutes ago       Up 4 minutes        6379/tcp               awx_redis
~/dev/awx/installer docker exec -it 55e39345ced2 hostname
awx
~/dev/awx/installer docker exec -it 8eea8e20de10 hostname
awxweb
~/dev/awx/installer docker exec -it 55e39345ced2 bash
bash-4.4# awx-manage dbshell
now exiting InteractiveConsole...
bash-4.4# awx-manage dbshell
psql (10.6, server 10.13 (Debian 10.13-1.pgdg90+1))
Type "help" for help.

awx=# SELECT hostname FROM main_instance;
 hostname
----------
 awx
(1 row)

@zigaSRC
Author

zigaSRC commented Jun 30, 2020

Yes, but just through the install playbook or rather inventory. It contains the following changes to hostnames:

awx_task_hostname=awx
awx_web_hostname=ansible.example.com

As such the awx_web hostname is set to ansible. The task container has its default awx hostname.

I have not changed the container names in any way and have not touched anything outside the install inventory file (except after the errors for debugging and trying to solve the connectivity problems).

awx=# SELECT hostname FROM main_instance;
 hostname
----------
 awxtask
 awx
(2 rows)

I have no idea where the awxtask is coming from...

@ryanpetrello
Contributor

Can you share this?

SELECT * FROM main_instance;

@zigaSRC
Author

zigaSRC commented Jul 3, 2020

Sorry for the late response:

awx=# SELECT * FROM main_instance;
 id |                 uuid                 | hostname |            created            |           modified            | capacity | version | last_isolated_check | capacity_adjustment | cpu |   memory   | cpu_capacity | mem_capacity | enabled | managed_by_policy | ip_address
----+--------------------------------------+----------+-------------------------------+-------------------------------+----------+---------+---------------------+---------------------+-----+------------+--------------+--------------+---------+-------------------+------------
  1 | 00000000-0000-0000-0000-000000000000 | awxtask  | 2020-01-13 10:21:13.472856+00 | 2020-01-13 10:21:13.472944+00 |        0 |         |                     |                1.00 |   0 |          0 |            0 |            0 | t       | t                 |
  2 | 00000000-0000-0000-0000-000000000000 | awx      | 2020-01-13 10:38:43.953606+00 | 2020-07-03 05:36:28.104388+00 |       17 | 13.0.0  |                     |                1.00 |   2 | 3973791744 |            8 |           17 | t       | t                 |
(2 rows)

@ryanpetrello
Contributor

ryanpetrello commented Jul 6, 2020

@zigaSRC,

I'm not really sure why you've got that second instance in there called awxtask, but I'd just delete that row entirely; the modified time shows that it hasn't actually had a successful heartbeat since Jan 13th, e.g.,

DELETE FROM main_instance WHERE hostname='awxtask'
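
The reasoning above (a row whose `modified` timestamp stopped advancing is no longer sending heartbeats) can be sketched as a small check. This is an illustration only: `find_stale_instances` and the one-hour threshold are assumptions, not AWX code, and the rows are plain dicts standing in for `main_instance` query results.

```python
from datetime import datetime, timedelta, timezone


def find_stale_instances(rows, now, max_age=timedelta(hours=1)):
    """Return hostnames whose last modification is older than max_age."""
    return [row["hostname"] for row in rows if now - row["modified"] > max_age]


# Values copied from the SELECT output in this thread (truncated to seconds).
rows = [
    {"hostname": "awxtask", "modified": datetime(2020, 1, 13, 10, 21, 13, tzinfo=timezone.utc)},
    {"hostname": "awx", "modified": datetime(2020, 7, 3, 5, 36, 28, tzinfo=timezone.utc)},
]
# Hypothetical "current time", shortly after the last heartbeat of "awx".
now = datetime(2020, 7, 3, 6, 0, 0, tzinfo=timezone.utc)
print(find_stale_instances(rows, now))  # ['awxtask']
```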

@zigaSRC
Author

zigaSRC commented Jul 6, 2020

Can't really do that since it would violate foreign key constraints in the database.

Using awx-manage deprovision_instance --hostname=awxtask did work though and the other instance is gone. There are no more errors in the logs which is great. Hopefully this fixes the issues we've been having after the upgrade. We'll just have to wait and see...
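
For anyone scripting this step, a minimal sketch of building the same command for a Docker install. The container name `awx_task` is taken from the `docker ps` output earlier in the thread; `deprovision_command` is a hypothetical helper, and actually running the command is left commented out.

```python
import shlex


def deprovision_command(hostname, container="awx_task"):
    """Build the argv for deprovisioning an instance via docker exec."""
    return [
        "docker", "exec", container,
        "awx-manage", "deprovision_instance", f"--hostname={hostname}",
    ]


cmd = deprovision_command("awxtask")
print(shlex.join(cmd))
# To actually run it on the Docker host:
# import subprocess; subprocess.run(cmd, check=True)
```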

For now: Thanks for the help!

@ryanpetrello
Contributor

ryanpetrello commented Jul 6, 2020

@zigaSRC yep, deprovision_instance works. I expect your errors to go away - you had an old instance lying around that didn't reflect reality. I'm unsure how it got there, but I expect it was just something we goofed up in a prior/older version of AWX that's now lost to the sands of time 😄.

Let me know if you spot any other issues.

@zigaSRC
Author

zigaSRC commented Jul 9, 2020

The problems we were having are still there so that wasn't the root cause sadly. I created another issue since it apparently doesn't relate to the connectivity issue discussed here. I will just include the link here in case it has any relevance.

#7588

@ryanpetrello
Contributor

ryanpetrello commented Jul 9, 2020

Hey @zigaSRC I get those pretty regularly, too, and I'm not sure what causes them (@mabashian, @jakemcdermott or @marshmalien might know?)

That said, I think they're unrelated to whatever issues you're still having.

@nicoherbigde

Hi, I have just updated our Ansible AWX instance from 9.1.1 to 13.0.0 and I am also getting the warning message that the task container cannot be reached by the web container.

awx.main.wsbroadcast Connection from awx to awx-task failed: 'Cannot connect to host awx-task:443 ssl:False [Connect call failed ('172.19.0.5', 443)]'.

I also renamed the containers using the options available in the installer's inventory.

Contrary to the warning message, however, running the Ansible Playbooks etc. works for me.

The strange thing for me is that the web container tries to contact the task container on port 443 with SSL off. Shouldn't it rather try the request on port 80? Maybe this is where the error lies.

In my opinion the ticket should be opened again, because the logs are filled with these messages.

@zigaSRC
Author

zigaSRC commented Jul 20, 2020

Did you check that a container with that name exists? It's most likely just a remnant from before the update/s.

Check the date when it was last modified in the DB and delete it if it's before the update (follow the steps we went through to resolve it).

@nicoherbigde

nicoherbigde commented Jul 20, 2020

Hi @zigaSRC, thanks for your response. I never changed the names of the docker containers during the last upgrades. The name of the web container is awx-web and the name of the task container is awx-task. I changed the names of the containers a long time ago, when updating from version 3.0 to 5.0.

I also checked via docker container ls if the containers are named accordingly and yes they are.

eef3327e0a04   ansible/awx:13.0.0   awx-task
9f624d48d474   ansible/awx:13.0.0   awx-web
26b9c667a45c   postgres:10          awx-postgres
76bc29bf94c6   redis                awx-redis

When I execute the command SELECT * FROM main_instance; in the database I get these two entries:

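 id |                 uuid                 | hostname |          created           |          modified          | capacity | version | last_isolated_check | capacity_adjustment | cpu |   memory   | cpu_capacity | mem_capacity | enabled | managed_by_policy | ip_address
----+--------------------------------------+----------+----------------------------+----------------------------+----------+---------+---------------------+---------------------+-----+------------+--------------+--------------+---------+-------------------+------------
  1 | 00000000-0000-0000-0000-000000000000 | awx      | 2019-03-04 10:08:45.432112 | 2020-07-20 08:30:41.105110 |       18 | 13.0.0  | NULL                |                1.00 |   2 | 4136701952 |            8 |           18 | true    | true              | NULL
  2 | 00000000-0000-0000-0000-000000000000 | awx-task | 2019-06-25 10:10:44.397751 | 2019-06-25 10:10:44.397804 |        0 |         | NULL                |                1.00 |   0 |          0 |            0 |            0 | true    | true              | NULL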

After I executed the command awx-manage deprovision_instance --hostname=awx-task, the database entry with ID = 2 was successfully deleted. But after a restart of all the Docker containers, a new entry with the hostname awx-task was created again, which again had no values for CPU, memory etc. So the same error message appeared in the logs again, saying that the container with the name awx-task could not be reached.

It is interesting that Ansible AWX executes the playbooks under the hostname awx, although a container with the name awx does not exist.

Thank you very much.

@nicoherbigde

nicoherbigde commented Jul 20, 2020

Now, I have also executed the command awx-manage deprovision_instance --hostname=awx to delete the instance with the name awx. But then I get the following error message: "No instance found with the current cluster host id".

There are already tickets #3959 and #7100 for this error message, and it seems that the name of the task container is hard coded in some configuration files. It looks like it is not a good idea to rename the containers, although the installation routine allows it.

How should we proceed from here?

@nicoherbigde

So, now I have renamed the Docker hostname (not the Docker service name) from the task container to awx and executed the command awx-manage deprovision_instance --hostname=awx-task again. Now the system seems to run stable and only one instance appears.

For me the problem is solved, but in my opinion the configuration option should be removed in the installation routine or fixed.
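
The situation described in the last few comments (a `main_instance` row whose hostname matches no running task container hostname) can be checked mechanically. A minimal sketch, assuming the registered hostnames and the task container's actual hostname have already been collected, for example from `SELECT hostname FROM main_instance;` and from running `hostname` inside the task container; `orphaned_instances` is a hypothetical helper, not part of AWX:

```python
def orphaned_instances(registered, actual_hostname):
    """Return registered hostnames that do not match the running task container."""
    return sorted(name for name in set(registered) if name != actual_hostname)


# Hostnames from the second report, after the task container's hostname was set to "awx".
print(orphaned_instances(["awx", "awx-task"], "awx"))  # ['awx-task']
```

Any hostname this returns would be a candidate for `awx-manage deprovision_instance`, as was done above.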
