Modules not available for some proxy minions #35443
ping @cro could I get your input on this issue? Any advice as to why this might be occurring? Or maybe somewhere he can look to track this issue down? My only suggestion for right now is to check the debug logs when this occurs and see if there are any relevant entries. Thanks
Hi @Ch3LL, Being in a production environment, the log level was set to warning, therefore the logs did not catch anything suspicious. I decreased it to debug level and will wait until this happens again, when I will provide some more data for this ticket. Thanks,
Sounds great, thanks @mirceaulinic
This issue happened again today. Please find the logs section below:
Is there any other information I can provide?
I confirm that's all the information provided in the logs. Executing:
# salt edge01.lax01 transit.depref dummy
edge01.lax01:
'transit.depref' is not available.
ERROR: Minions returned with non-zero exit code
Logs:
Although the minion replied to other requests, such as:
# salt edge01.lax01 net.ping 8.8.8.8 count=1
edge01.lax01:
----------
comment:
out:
----------
success:
----------
packet_loss:
0
probes_sent:
1
results:
|_
----------
ip:
8.8.8.8
rtt:
0.576
rtt_avg:
0.576
rtt_max:
0.576
rtt_min:
0.576
rtt_stddev:
0.0
result:
True
@mirceaulinic I will look into this tomorrow.
Thanks @cro!
@mirceaulinic I've been looking over this, and I know that pre-#35178 proxies cache all the modules in the same place, but I suspect that #35178 might alleviate this. Is there any way for you to try that? (As an aside, thank you for the excellent bug reports.)
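For readers skimming the thread, the cache-layout difference is easiest to see as plain path construction - a minimal sketch, assuming the directory layout shown in the listings below (Salt's actual fileclient logic is more involved):

import os

CACHEDIR = '/var/cache/salt/proxy'  # matches the listings in this thread

def shared_module_cache(saltenv='base'):
    # Pre-#35178 layout: every proxy process on the host extracted its
    # custom modules into one shared tree, which this thread suspects let
    # proxies interfere with each other's cached modules.
    return os.path.join(CACHEDIR, 'files', saltenv, '_modules')

def per_minion_module_cache(minion_id, saltenv='base'):
    # Post-#35178 layout: the cache is keyed by minion id, giving each
    # proxy its own isolated tree.
    return os.path.join(CACHEDIR, minion_id, 'files', saltenv, '_modules')

print(shared_module_cache())                    # /var/cache/salt/proxy/files/base/_modules
print(per_minion_module_cache('edge01.sfo04'))  # /var/cache/salt/proxy/edge01.sfo04/files/base/_modules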
As a follow-up, I installed the code on a testing platform and I confirm it works as expected:
# ls -la /var/cache/salt/proxy/
total 8
drwxr-xr-x 7 root root 200 Aug 23 14:55 .
drwxr-xr-x 8 root root 160 Apr 29 16:42 ..
drwxr-xr-x 4 root root 100 Aug 23 14:50 edge01.sfo04
drwxr-xr-x 4 root root 100 Aug 23 14:55 edge01.sjc01
One directory per minion - nice! Tomorrow I will push it into production.
Hi @cro, Today I upgraded our Salt production instance - please find the versions report below:
$ sudo salt --versions-report
Salt Version:
Salt: 2016.3.2-104-g962e493
Dependency Versions:
cffi: 1.7.0
cherrypy: Not Installed
dateutil: 2.2
gitdb: 0.6.4
gitpython: 2.0.5
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: 0.21.1
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: 1.2.5
pycparser: 2.14
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.9 (default, Mar 1 2015, 12:57:24)
python-gnupg: 0.3.8
PyYAML: 3.11
PyZMQ: 15.4.0
RAET: Not Installed
smmap: 0.9.0
timelib: Not Installed
Tornado: 4.4.1
ZMQ: 4.1.5
System Versions:
dist: debian 8.5
machine: x86_64
release: 4.1.3-cloudflare
system: Linux
version: debian 8.5
As on the testing server, the changes from #35178 created separate caching directories for each minion - which confirms yet again that it works as expected (truncated output):
# ls -la /var/cache/salt/proxy/
total 3652
drwxr-xr-x 82 root root 4096 Aug 24 09:17 .
drwxr-xr-x 6 root root 63 Feb 25 09:56 ..
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.akl01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.ams01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.arn01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.atl01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.bkk01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.bom01
drwxr-xr-x 4 root root 50 Aug 24 09:17 edge01.bos01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.bru01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.cai01
Unfortunately, just a few minutes ago, a colleague pasted me the following:
$ sudo salt edge01.iad02 transit.disable gtt
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.iad02:
'transit.disable' is not available.
ERROR: Minions returned with non-zero exit code
I do not have any log entries (as the log level has been raised back to warning), but I assume they would be similar to the above. What would you suggest doing now, in order to help with debugging? Also, after upgrading to this version, one can notice the DeprecationWarning now printed with every command (visible in the output above). Thank you!
Hi @mirceaulinic, Quick idea as I dive into this a little more -- is the source for these modules kept under a custom module_dirs location?
The source modules are located under:
module_dirs:
  - /etc/salt/proxy-master
I will expand a bit more on the steps I performed earlier today:
I did not run sync_modules, but I was able to execute the modules immediately after restarting the proxy processes. Please let me know if that was correct. If not, what is the right way to sync_modules? I would like to remind you that this box is also a straight minion itself. Thank you!
Thank you for clarifying this @cro! I synced them all now.
Did that fix the original problem you had with modules being reported missing?
Yes, right now they are all available, but usually they become unavailable after a couple of hours/days.
Looking into the caching directory, I notice that for all proxies the last change was yesterday, at the very first run after installing the new Salt version (truncated output):
# ls -la /var/cache/salt/proxy
total 3652
drwxr-xr-x 82 root root 4096 Aug 24 09:17 .
drwxr-xr-x 6 root root 63 Feb 25 09:56 ..
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.akl01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.ams01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.arn01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.atl01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.bkk01
drwxr-xr-x 4 root root 50 Aug 24 09:09 edge01.bom01
Hope this information is helpful.
@cro About 14 hours after I ran the sync:
# salt edge01.ewr01 pop.disabled_elements
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:
'pop.disabled_elements' is not available.
Although the minion is up and running:
# salt edge01.ewr01 net.connected
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:
----------
out:
True
# salt edge01.ewr01 ntp.servers
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:
----------
comment:
out:
- 1.1.1.1
- 2.2.2.2
result:
True
(Again, I obfuscated the real IP addresses of the NTP servers.) And the module was available for other minions:
# salt edge01.jnb01 pop.disabled_elements
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.jnb01:
----------
comment:
out:
- False
|_
|_
|_
|_
- 2
|_
- 2
result:
True
After running:
# salt edge01.ewr01 saltutil.sync_modules
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:
I was able to run it correctly:
# salt edge01.ewr01 pop.disabled_elements
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:
----------
comment:
out:
- False
|_
- TELIA|4
- TELIA|6
|_
|_
|_
- TELIA|4
- TELIA|6
- 10
|_
- 10
result:
True
Please give me some more ideas on how I could identify the steps to reproduce this issue. Also, it is probably worth mentioning that we did not see this very often a couple of months ago. It is also true that we had a cron restarting the proxies every hour (I know it sounds terrible, but we had to, for the reasons explained in #32918; meanwhile we found a solution for this - probably not the best, but it works well and I am going to raise a PR for it). The caching directory for one of the proxies currently looks like this:
# ls -la /var/cache/salt/proxy/edge01.sjc01/files/base
total 8
drwx------ 8 root root 108 Aug 25 20:37 .
drwx------ 3 root root 17 Aug 24 09:09 ..
drwx------ 2 root root 29 Aug 24 09:09 _engines
drwx------ 2 root root 30 Aug 24 09:09 _grains
drwx------ 2 root root 4096 Aug 24 09:09 _modules
drwx------ 2 root root 22 Aug 24 09:09 _proxy
drwx------ 2 root root 26 Aug 24 09:09 _returners
drwx------ 2 root root 83 Aug 24 09:09 _states
-rw------- 1 root root 34 Aug 25 20:37 top.sls
# ls -la /var/cache/salt/proxy/edge01.sjc01/files/base/_modules
total 96
drwx------ 2 root root 4096 Aug 24 09:09 .
drwx------ 8 root root 108 Aug 25 20:37 ..
-rw------- 1 root root 3699 Aug 24 09:09 napalm_anycast.py
-rw------- 1 root root 7079 Aug 24 09:09 napalm_bgp.py
-rw------- 1 root root 12117 Aug 24 09:09 napalm_cfnet.py
-rw------- 1 root root 28109 Aug 24 09:09 napalm_network.py
-rw------- 1 root root 5541 Aug 24 09:09 napalm_ntp.py
-rw------- 1 root root 2775 Aug 24 09:09 napalm_pop.py
-rw------- 1 root root 2509 Aug 24 09:09 napalm_probes.py
-rw------- 1 root root 1217 Aug 24 09:09 napalm_route.py
-rw------- 1 root root 3878 Aug 24 09:09 napalm_snmp.py
-rw------- 1 root root 10908 Aug 24 09:09 napalm_transit.py
-rw------- 1 root root 2972 Aug 24 09:09 napalm_users.py
Is this relevant in this case? Thank you!
Drat. I am off work today with almost no cell service. I will dig into this when I am back on the grid on Sunday.
No worries @cro! Thank you very much for your interest!
One more piece of information I can provide, which I hope is helpful: this happens for other types of modules too, not only execution modules:
# salt -L 'edge01.mrs01,edge01.dus01,edge01.yyz01' state.sls router.probes
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.dus01:
----------
ID: cf_probes
Function: probes.managed
Result: False
Comment: State 'probes.managed' was not found in SLS 'router.probes'
Reason: 'probes.managed' is not available.
Started:
Duration:
Changes:
Summary for edge01.dus01
------------
Succeeded: 0
Failed: 1
------------
After sync_all it worked properly.
We've implemented a workaround for this until the potential bug is fixed:
It's not the most elegant, but it seems pretty efficient so far.
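The snippet itself is missing from the comment above, so the following is only a hypothetical reconstruction, consistent with the cron-based approach mentioned earlier in the thread - a periodic re-sync of custom modules from the master; the target pattern and timeout are assumptions:

#!/usr/bin/env python
# Hypothetical reconstruction -- the actual workaround was not captured here.
# Run periodically (e.g. from cron on the master) to push custom modules,
# grains, states, etc. back out to every proxy minion, so a module that has
# silently dropped out of a proxy's loader becomes available again.
import salt.client

def resync_proxies(target='edge01.*'):  # target pattern is an assumption
    client = salt.client.LocalClient()
    # saltutil.sync_all syncs _modules, _states, _grains, _returners, ...
    return client.cmd(target, 'saltutil.sync_all', timeout=120)

if __name__ == '__main__':
    for minion, synced in sorted(resync_proxies().items()):
        print('{0}: {1}'.format(minion, synced))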
@cro would you mind having a look at this thread, please?
@mirceaulinic I think this is now fixed in the carbon branch. The PR for that only got merged yesterday. Any way you can try that out and see if it's still broken there?
That's great news @cro! Yes, for sure I would like to test it - most probably next week I should be able to install it on the test platform.
I raised a separate issue, as this may or may not be related to massive CPU and memory consumption: #38990.
While investigating your other ticket, I realized that this might actually be a problem with having
Hi @cro - I am still seeing this issue. I really do want to help identify the root cause, but I don't know where to start:
mircea@salt-master:~$ sudo salt -L 'edge01.bjm01, edge01.sjc01' transit.test
edge01.bjm01:
'transit.test' is not available.
edge01.sjc01:
True
The module's __virtual__() function is:
def __virtual__():
    """
    NAPALM library must be installed for this module to work.
    Also, the key proxymodule must be set in the __opts__ dictionary.
    """
    if HAS_NAPALM and 'proxy' in __opts__:
        return __virtualname__
    else:
        return (False, 'The module CloudFlare transit cannot be loaded: \
NAPALM lib or proxy could not be loaded.')
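For context on what this gating means: the loader calls __virtual__() once when it builds a minion's function table, and the module is only exposed under its virtual name if the call returns truthy; a falsy return is exactly what produces the "'transit.test' is not available." error. A minimal sketch of the same pattern (the file name and test() function are hypothetical):

# _modules/example_transit.py -- hypothetical module illustrating the same
# __virtual__ gating pattern as the CloudFlare transit module above.
try:
    import napalm  # noqa: F401  (imported only to check availability)
    HAS_NAPALM = True
except ImportError:
    HAS_NAPALM = False

__virtualname__ = 'transit'

def __virtual__():
    # Called by the loader when the module is (re)loaded. __opts__ is
    # injected by the loader, as in the real module above. Returning
    # (False, reason) makes every call like `salt <id> transit.test`
    # fail with "'transit.test' is not available."
    if HAS_NAPALM and 'proxy' in __opts__:
        return __virtualname__
    return (False, 'NAPALM lib or proxy could not be loaded.')

def test():
    '''Trivial availability probe: `salt <id> transit.test`.'''
    return True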
In the master log I only see:
When doing a full sync:
mircea@salt-master:~$ sudo salt edge01.bjm01 saltutil.sync_all
edge01.bjm01:
----------
beacons:
engines:
- engines.http_logstash
grains:
- grains.cfgrains
- grains.network_device
log_handlers:
modules:
- modules.napalm_anycast
- modules.napalm_bgp
- modules.napalm_cfnet
- modules.napalm_macdb
- modules.napalm_network
- modules.napalm_ntp
- modules.napalm_pop
- modules.napalm_prefixlist
- modules.napalm_probes
- modules.napalm_route
- modules.napalm_snmp
- modules.napalm_transit
- modules.napalm_users
- modules.statuspage
output:
- output.table_out
proxymodules:
- proxy.napalm
renderers:
returners:
- returners.traceroute
sdb:
states:
- states.aggroutes
- states.bgp
- states.netconfig
- states.netntp
- states.netusers
- states.prefixlist
- states.probes
- states.snmp
- states.statuspage
utils:
Everything was synced, as if the modules never existed before - which makes me think that your explanation above makes sense. Afterwards:
mircea@salt-master:~$ sudo salt edge01.sin01 transit.test
edge01.sin01:
'transit.test' is not available.
ERROR: Minions returned with non-zero exit code
mircea@salt-master:~$ sudo salt edge01.sin01 saltutil.sync_all
edge01.sin01:
----------
beacons:
engines:
grains:
log_handlers:
modules:
output:
proxymodules:
renderers:
returners:
sdb:
states:
utils:
mircea@salt-master:~$ sudo salt edge01.sin01 transit.test
edge01.sin01:
True
mircea@salt-master:~$
Please, let's have a closer look at this, and let me know what else I should test. Thanks,
Just in case someone would like to look into this problem at some point: when this happened today, inspecting the caching dir revealed that the modules are all there, cached correctly.
EDIT: forgot to mention (although self-explanatory given the long story above) - all good after another sync.
@cro After the discussion from #41024, I am wondering if this may by any chance be related to the way we are defining the module_dirs.
But in that case, why does
As a follow-up (obviously to myself): I checked when this happens, and the
Is this issue still scheduled to be solved in Nitrogen?
This Pull Request #39948 might finally solve this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
Description of Issue/Question
From time to time, when executing a specific function against a group of network devices managed through proxy minions using the NAPALM proxy, we encounter the following error:
In the example above, I have executed the function peers from the napalm_ntp module, having the virtual name ntp: https://github.com/saltstack/salt/blob/develop/salt/modules/napalm_ntp.py#L77
Executing the same function against a different target:
# salt edge01.akl01 ntp.peers
edge01.akl01:
----------
comment:
out:
- 172.17.17.1
- 172.17.17.2
result:
True
(I have obfuscated the IP addresses in the previous output; otherwise nothing was changed.)
But executing a different function (e.g. snmp.config - soon to be published) against the same minion as previously:
Setup
The setup is the same as described in #34446, without the changes from #35178.
Steps to Reproduce Issue
Unfortunately I am not sure how to reproduce this issue. Given that I am still not familiar with the deeper Salt internals, I would need your guidance to identify the steps.
Initially I thought it was a caching issue - but the modules are cached in only one place for all proxies, and they become unavailable only for some of them.
Restarting the proxy process fixes the problem, but this is not a comfortable solution, since these events are not deterministic (or, at least, I was not able to identify a pattern).
Sometimes we get the "* is not available" message a couple of hours after the proxy is started, sometimes after a couple of days.
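One way to narrow the window would be a small probe that loads the modules the way a minion-side loader does and timestamps which functions are visible - a minimal sketch, where the config path and the watched function names are assumptions (note that modules gated on 'proxy' in __opts__ will only load if the configuration carries a proxy entry):

# Diagnostic sketch -- run periodically and compare output over time to
# pinpoint when a module drops out of the loader.
import time

import salt.config
import salt.loader

opts = salt.config.minion_config('/etc/salt/proxy')  # path is an assumption
mods = salt.loader.minion_mods(opts)

for func in ('ntp.peers', 'transit.depref'):  # watched functions are assumptions
    status = 'available' if func in mods else 'NOT available'
    print('{0} {1}: {2}'.format(time.strftime('%Y-%m-%d %H:%M:%S'), func, status))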
Versions Report
# salt --versions-report
Salt Version:
Salt: 2016.3.1
Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 2.2
gitdb: 0.6.4
gitpython: 2.0.5
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: 0.21.1
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.7
mysql-python: 1.2.5
pycparser: 2.14
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.9 (default, Mar 1 2015, 12:57:24)
python-gnupg: 0.3.8
PyYAML: 3.11
PyZMQ: 15.2.0
RAET: Not Installed
smmap: 0.9.0
timelib: Not Installed
Tornado: 4.3
ZMQ: 4.0.5
System Versions:
dist: debian 8.5
machine: x86_64
release: 4.1.3-cloudflare
system: Linux
version: debian 8.5