
Modules not available for some proxy minions #35443

Closed
mirceaulinic opened this issue Aug 15, 2016 · 51 comments
Labels: Bug (broken, incorrect, or confusing behavior), P1 (Priority 1), Proxy-Minion, severity-medium (3rd level: incorrect or bad functionality, confusing, lacks a workaround), stale

mirceaulinic commented Aug 15, 2016

Description of Issue/Question

From time to time, when executing a specific function against a group of network devices managed through proxy minions using the NAPALM proxy, we encounter the following error:

# salt edge01.yyz01 ntp.peers
edge01.yyz01:
    'ntp.peers' is not available.
ERROR: Minions returned with non-zero exit code

In the example above, I executed the function peers from the napalm_ntp module, which is exposed under the virtual name ntp: https://github.com/saltstack/salt/blob/develop/salt/modules/napalm_ntp.py#L77
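For context, the loader's virtual-name mechanism works roughly like this (a minimal sketch of the pattern, not the actual napalm_ntp source):

# The file is named napalm_ntp.py, but the loader registers it
# under the virtual name declared below.
__virtualname__ = 'ntp'


def __virtual__():
    # Returning __virtualname__ makes the loader expose this module
    # as `ntp`, so `salt <tgt> ntp.peers` dispatches to peers() here.
    return __virtualname__


def peers(*args, **kwargs):
    ...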

Executing the same function against a different target:

# salt edge01.akl01 ntp.peers
edge01.akl01:
    ----------
    comment:
    out:
        - 172.17.17.1
        - 172.17.17.2
    result:
        True

(I obfuscated the IP addresses in the previous output; nothing else was changed.)

But executing a different function (e.g. snmp.config, soon to be published) against the same minion fails:

# salt edge01.akl01 snmp.config
edge01.akl01:
    'snmp.config' is not available.

Setup

The setup is the same as described in #34446, without the changes from #35178.

Steps to Reproduce Issue

Unfortunately, I am not sure how to reproduce this issue. Given that I am not yet familiar with the deeper Salt internals, I would need your guidance to identify the steps.
Initially I thought it was a caching issue, but the modules are cached in a single place for all proxies, and they become unavailable only for some of them.
Restarting the proxy process fixes the problem, but this is not a comfortable solution, since these events are not deterministic (or, at least, I was not able to identify a pattern).
Sometimes we get the '* is not available' message a couple of hours after the proxy is started, sometimes after a couple of days.

Versions Report

# salt --versions-report
Salt Version:
           Salt: 2016.3.1

Dependency Versions:
           cffi: 1.6.0
       cherrypy: Not Installed
       dateutil: 2.2
          gitdb: 0.6.4
      gitpython: 2.0.5
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.7
   mysql-python: 1.2.5
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.3
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.5
        machine: x86_64
        release: 4.1.3-cloudflare
         system: Linux
        version: debian 8.5
Ch3LL commented Aug 15, 2016

ping @cro, could I get your input on this issue? Any advice as to why this might be occurring, or maybe somewhere we can look to track this issue down?

My only suggestion for right now is to check the debug logs when this occurs and see if there are any relevant entries. Thanks
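For reference, the proxy's log verbosity is controlled from the proxy config file (typically /etc/salt/proxy; the path is an assumption, adjust to your setup):

log_level: debug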

Ch3LL added the Pending-Discussion label Aug 15, 2016
Ch3LL added this to the Blocked milestone Aug 15, 2016
@mirceaulinic

Hi @Ch3LL,

Being in a production environment, the log level was set to warning, so the logs did not catch anything suspicious. I have decreased it to debug and will wait until this happens again, at which point I will provide more data for this ticket.

Thanks,
Mircea

Ch3LL commented Aug 16, 2016

Sounds great thanks @mirceaulinic

@mirceaulinic

This issue happened again today.

Please find the relevant log section below:

2016-08-19 17:56:03,636 [salt.minion      ][INFO    ][19680] User sudo_mircea Executing command cfnet.anycast_route with jid 20160819175603626696
2016-08-19 17:56:03,636 [salt.minion      ][DEBUG   ][19680] Command details {'tgt_type': 'glob', 'jid': '20160819175603626696', 'tgt': 'edge01.sjc01', 'ret': '', 'user': 'sudo_mircea', 'arg': ['del', '172.17.17.1/32', {'test': True, '__kwarg__': True}], 'fun': 'cfnet.anycast_route'}
2016-08-19 17:56:03,676 [salt.minion      ][INFO    ][19680] Starting a new job with PID 19680
2016-08-19 17:56:03,679 [salt.utils.lazy  ][DEBUG   ][19680] Could not LazyLoad cfnet.anycast_route
2016-08-19 17:56:03,680 [salt.minion      ][DEBUG   ][19680] Minion return retry timer set to 8 seconds (randomized)
2016-08-19 17:56:03,680 [salt.minion      ][INFO    ][19680] Returning information for job: 20160819175603626696
2016-08-19 17:56:03,680 [salt.transport.zeromq][DEBUG   ][19680] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/proxy', 'edge01.sjc01', 'tcp://127.0.0.1:4506', 'aes')
2016-08-19 17:56:03,680 [salt.crypt       ][DEBUG   ][19680] Initializing new AsyncAuth for ('/etc/salt/pki/proxy', 'edge01.sjc01', 'tcp://127.0.0.1:4506')

Is there any other information I can provide?

mirceaulinic commented Aug 22, 2016

@Ch3LL @cro

I can confirm that is all the information provided in the logs:

Executing:

# salt edge01.lax01 transit.depref dummy
edge01.lax01:
    'transit.depref' is not available.
ERROR: Minions returned with non-zero exit code

Logs:

2016-08-22 00:13:10,164 [salt.minion      ][INFO    ][19950] User sudo_mircea Executing command transit.depref with jid 20160822001310157711
2016-08-22 00:13:10,165 [salt.minion      ][DEBUG   ][19950] Command details {'tgt_type': 'glob', 'jid': '20160822001310157711', 'tgt': 'edge01.lax01', 'ret': '', 'user': 'sudo_mircea', 'arg': ['dummy'], 'fun': 'transit.depref'}
2016-08-22 00:13:10,207 [salt.minion      ][INFO    ][19950] Starting a new job with PID 19950
2016-08-22 00:13:10,210 [salt.utils.lazy  ][DEBUG   ][19950] Could not LazyLoad transit.depref
2016-08-22 00:13:10,211 [salt.minion      ][DEBUG   ][19950] Minion return retry timer set to 10 seconds (randomized)
2016-08-22 00:13:10,211 [salt.minion      ][INFO    ][19950] Returning information for job: 20160822001310157711
2016-08-22 00:13:10,211 [salt.transport.zeromq][DEBUG   ][19950] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/proxy', 'edge01.lax01', 'tcp://127.0.0.1:4506', 'aes')
2016-08-22 00:13:10,212 [salt.crypt       ][DEBUG   ][19950] Initializing new AsyncAuth for ('/etc/salt/pki/proxy', 'edge01.lax01', 'tcp://127.0.0.1:4506')

However, the minion replied to other requests, such as:

# salt edge01.lax01 net.ping 8.8.8.8 count=1
edge01.lax01:
    ----------
    comment:
    out:
        ----------
        success:
            ----------
            packet_loss:
                0
            probes_sent:
                1
            results:
                |_
                  ----------
                  ip:
                      8.8.8.8
                  rtt:
                      0.576
            rtt_avg:
                0.576
            rtt_max:
                0.576
            rtt_min:
                0.576
            rtt_stddev:
                0.0
    result:
        True

cro commented Aug 22, 2016

@mirceaulinic I will look into this tomorrow.

@mirceaulinic

Thanks @cro!

cro commented Aug 22, 2016

@mirceaulinic I've been looking over this. I know that pre-#35178 proxies cache all the modules in the same place, and I suspect that #35178 might alleviate this. Is there any way for you to try that?

(As an aside, thank you for the excellent bug reports.)

@mirceaulinic

Great! I was already planning to deploy the code from #35178 to production this week, most probably on Wednesday. I will then come back with feedback. Thank you!

@cro thanks for the appreciation! I am glad I can provide useful information :)

@mirceaulinic

As a follow-up, I installed the code on a testing platform and can confirm it works as expected:

# ls -la /var/cache/salt/proxy/
total 8
drwxr-xr-x 7 root root 200 Aug 23 14:55 .
drwxr-xr-x 8 root root 160 Apr 29 16:42 ..
drwxr-xr-x 4 root root 100 Aug 23 14:50 edge01.sfo04
drwxr-xr-x 4 root root 100 Aug 23 14:55 edge01.sjc01

One directory per minion, nice!

Tomorrow I will push it into production.

mirceaulinic commented Aug 24, 2016

Hi @cro,

Today I upgraded our Salt production instance; please find the versions report below:

$ sudo salt --versions-report
Salt Version:
           Salt: 2016.3.2-104-g962e493

Dependency Versions:
           cffi: 1.7.0
       cherrypy: Not Installed
       dateutil: 2.2
          gitdb: 0.6.4
      gitpython: 2.0.5
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: 1.2.5
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 15.4.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.4.1
            ZMQ: 4.1.5

System Versions:
           dist: debian 8.5
        machine: x86_64
        release: 4.1.3-cloudflare
         system: Linux
        version: debian 8.5

As on the testing server, the changes from #35178 created a separate cache directory for each minion, which confirms again that it works as expected (truncated output):

# ls -la /var/cache/salt/proxy/
total 3652
drwxr-xr-x 82 root root    4096 Aug 24 09:17 .
drwxr-xr-x  6 root root      63 Feb 25 09:56 ..
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.akl01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.ams01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.arn01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.atl01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.bkk01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.bom01
drwxr-xr-x  4 root root      50 Aug 24 09:17 edge01.bos01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.bru01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.cai01

Unfortunately, just a few minutes ago, a colleague sent me the following:

$  sudo salt edge01.iad02 transit.disable gtt
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.iad02:
    'transit.disable' is not available.
ERROR: Minions returned with non-zero exit code

I do not have any log entries (as the log level has been raised back to warning), but I assume they would be similar to the ones above.

What would you suggest I do now, in order to help with debugging?

Also, after upgrading to this version, one can notice that a DeprecationWarning is raised whenever a command is executed. Is there a way to turn it off?

Thank you!

cro commented Aug 24, 2016

Hi @mirceaulinic,

Quick idea as I dive into this a little more: is the transit module located in /srv/salt/_modules? If so, did you do a sync_modules?

mirceaulinic commented Aug 24, 2016

The source modules are located under /etc/salt/proxy-master/, which is specified in the master config file as:

module_dirs:
  - /etc/salt/proxy-master

I will expand a bit more on the steps I performed earlier today:

  • stop salt-proxy processes
  • stop salt-master
  • upgrade salt using pip install -U git+https://github.com/saltstack/salt.git@2016.3#egg=salt inside the Salt venv
  • start salt-master
  • start proxies

I did not run sync_modules, but I was able to execute the modules immediately after restarting the proxy processes.

Please let me know if that was correct. If not, what is the right way to sync_modules? I would like to remind you that this box is also a regular minion itself, so salt-call saltutil.sync_modules would synchronise its own modules instead of the proxy modules.
Sometimes I use salt <proxy_name> saltutil.sync_modules, but my understanding was that this syncs only for <proxy_name>. I am not sure about this though; it would actually be quite good news if I am wrong :)

Thank you!

cro commented Aug 25, 2016

salt <proxy_name> saltutil.sync_modules is the correct way to make sure the modules are synced from master to minion. <proxy_name> can be any valid Salt target, so if you want to sync for all proxies or some subset, you can probably find a good target to get what you need.
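For example (assuming the proxy IDs share the edge01. prefix used throughout this thread):

# salt 'edge01.*' saltutil.sync_modules
# salt 'edge01.*' saltutil.sync_all

The second form also syncs grains, states, returners, and the other extension types in one go.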

@mirceaulinic

Thank you for clarifying this @cro! I sync'ed them all now.

cro commented Aug 25, 2016

Did that fix the original problem you had with modules being reported missing?

mirceaulinic commented Aug 25, 2016

Yes, right now they are all available, but usually they become unavailable after a couple of hours/days.

mirceaulinic commented Aug 25, 2016

Looking into the cache directory, I notice that, for all proxies, the last change dates from yesterday, at the very first run after installing the new Salt version (truncated output):

# ls -la /var/cache/salt/proxy
total 3652
drwxr-xr-x 82 root root    4096 Aug 24 09:17 .
drwxr-xr-x  6 root root      63 Feb 25 09:56 ..
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.akl01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.ams01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.arn01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.atl01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.bkk01
drwxr-xr-x  4 root root      50 Aug 24 09:09 edge01.bom01

Hope this information is helpful.

@mirceaulinic

@cro About 14 hours after I ran sync_modules, it happened again:

# salt edge01.ewr01 pop.disabled_elements
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:
    'pop.disabled_elements' is not available.

Although the minion is up and running:

# salt edge01.ewr01 net.connected
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:
    ----------
    out:
        True
# salt edge01.ewr01 ntp.servers
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:
    ----------
    comment:
    out:
        - 1.1.1.1
        - 2.2.2.2
    result:
        True

(Again, I obfuscated the real IP addresses of the NTP servers.)

And the module was available for other minions:

# salt edge01.jnb01 pop.disabled_elements
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.jnb01:
    ----------
    comment:
    out:
        - False
        |_
        |_
        |_
        |_
        - 2
        |_
        - 2
    result:
        True

After running sync_modules:

# salt edge01.ewr01 saltutil.sync_modules
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:

I was able to run the function correctly:

# salt edge01.ewr01 pop.disabled_elements
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.ewr01:
    ----------
    comment:
    out:
        - False
        |_
          - TELIA|4
          - TELIA|6
        |_
        |_
        |_
          - TELIA|4
          - TELIA|6
        - 10
        |_
        - 10
    result:
        True

Please give me some more ideas on how I could identify the steps to reproduce this issue.

Also, it is probably worth mentioning that we did not see this very often a couple of months ago. It is also true that we had a cron job restarting the proxies every hour (I know it sounds terrible, but we had to for the reasons explained in #32918; meanwhile we found a solution for this, probably not the best, but it works well and I am going to raise a PR for it).
We have quite a few custom modules:

# ls -la /var/cache/salt/proxy/edge01.sjc01/files/base
total 8
drwx------ 8 root root  108 Aug 25 20:37 .
drwx------ 3 root root   17 Aug 24 09:09 ..
drwx------ 2 root root   29 Aug 24 09:09 _engines
drwx------ 2 root root   30 Aug 24 09:09 _grains
drwx------ 2 root root 4096 Aug 24 09:09 _modules
drwx------ 2 root root   22 Aug 24 09:09 _proxy
drwx------ 2 root root   26 Aug 24 09:09 _returners
drwx------ 2 root root   83 Aug 24 09:09 _states
-rw------- 1 root root   34 Aug 25 20:37 top.sls
# ls -la /var/cache/salt/proxy/edge01.sjc01/files/base/_modules
total 96
drwx------ 2 root root  4096 Aug 24 09:09 .
drwx------ 8 root root   108 Aug 25 20:37 ..
-rw------- 1 root root  3699 Aug 24 09:09 napalm_anycast.py
-rw------- 1 root root  7079 Aug 24 09:09 napalm_bgp.py
-rw------- 1 root root 12117 Aug 24 09:09 napalm_cfnet.py
-rw------- 1 root root 28109 Aug 24 09:09 napalm_network.py
-rw------- 1 root root  5541 Aug 24 09:09 napalm_ntp.py
-rw------- 1 root root  2775 Aug 24 09:09 napalm_pop.py
-rw------- 1 root root  2509 Aug 24 09:09 napalm_probes.py
-rw------- 1 root root  1217 Aug 24 09:09 napalm_route.py
-rw------- 1 root root  3878 Aug 24 09:09 napalm_snmp.py
-rw------- 1 root root 10908 Aug 24 09:09 napalm_transit.py
-rw------- 1 root root  2972 Aug 24 09:09 napalm_users.py

Is this relevant in this case?

Thank you!

cro commented Aug 26, 2016

Drat. I am off work today with almost no cell service. I will dig into this when I am back on the grid on Sunday.

@mirceaulinic

No worries @cro! Thank you very much for your interest!

@mirceaulinic

One more piece of information I can provide, and I hope it is helpful: this happens for other module types too, not only execution modules:

# salt -L 'edge01.mrs01,edge01.dus01,edge01.yyz01' state.sls router.probes
/usr/local/salt/virtualenv/lib/python2.7/site-packages/salt/grains/core.py:1493: DeprecationWarning: The "osmajorrelease" will be a type of an integer.
edge01.dus01:
----------
          ID: cf_probes
    Function: probes.managed
      Result: False
     Comment: State 'probes.managed' was not found in SLS 'router.probes'
              Reason: 'probes.managed' is not available.
     Started:
    Duration:
     Changes:

Summary for edge01.dus01
------------
Succeeded: 0
Failed:    1
------------

After sync_all it worked properly.

@mirceaulinic

We've implemented a workaround for this until the potential bug is fixed (see the sketch after this list):

  • added a function called test to each custom module, returning True
  • a runner executes this function from each module; if at least one minion replies with '* is not available', it executes sync_all
  • the runner is scheduled to run every 15 minutes

It's not the most elegant solution, but it seems pretty effective so far.
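Roughly, the runner looks like this (a simplified sketch rather than the exact code we run; the module list, target glob, and function name are placeholders, and the file lives in a directory listed under runner_dirs on the master):

import salt.client

CUSTOM_MODULES = ['transit', 'pop', 'cfnet']  # hypothetical subset


def check_modules(tgt='edge01.*'):
    '''
    Call <module>.test on every proxy matching tgt; if a proxy
    reports a module as unavailable, force a saltutil.sync_all there.
    '''
    client = salt.client.LocalClient()
    resynced = []
    for mod in CUSTOM_MODULES:
        ret = client.cmd(tgt, '{0}.test'.format(mod))
        for minion, result in ret.items():
            # An unavailable module returns the plain string
            # "'<mod>.test' is not available."
            if 'is not available' in str(result):
                client.cmd(minion, 'saltutil.sync_all')
                resynced.append(minion)
    return {'resynced': sorted(set(resynced))}

The 15-minute cadence then comes from the master's schedule option (or cron).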

@mirceaulinic

@cro would you mind having a look at this thread, please?

cro commented Oct 12, 2016

@mirceaulinic I think this is now fixed in the carbon branch. The PR for that only got merged yesterday. Is there any way you can try that out and see if it's still broken there?

cro self-assigned this Oct 12, 2016
@mirceaulinic

That's great news @cro! Yes, for sure I would like to test it; most probably I will be able to install it on the test platform next week.
Can you please point out what the issue was in this case?
Thanks!

@mirceaulinic

I raised a separate issue, as this may or may not be related to massive CPU and memory consumption: #38990.

cro commented Feb 2, 2017

While investigating your other ticket, I realized that this might actually be a problem with having multiprocessing: False. Since all module execution on the minion occurs in the same process, and it never forks like minions with multiprocessing: True, there is a much greater chance that the proxy will step on itself. I need to think about this some more and devise a better way of ferreting out what exactly is going on.
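For reference, this is the setting in question, as it would appear in the proxy config (illustration only; whether the napalm proxy tolerates flipping it would need testing):

# /etc/salt/proxy
# True forks a new process per job; False, as in this setup,
# runs every job inside the long-lived proxy process itself.
multiprocessing: False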

meggiebot modified the milestones: Nitrogen 4, Nitrogen 3 Feb 6, 2017
mirceaulinic commented Feb 20, 2017

Hi @cro, I am still seeing this issue. I really do want to help identify the root cause, but I don't know where to start:

mircea@salt-master:~$ sudo salt -L 'edge01.bjm01, edge01.sjc01' transit.test
edge01.bjm01:
    'transit.test' is not available.
edge01.sjc01:
    True

The transit execution module is available for some minions, but not for a few others, e.g. edge01.bjm01.

The __virtual__ function is:

def __virtual__():
    """
    The NAPALM library must be installed for this module to work.
    Also, the key 'proxy' must be set in the __opts__ dictionary.
    """
    if HAS_NAPALM and 'proxy' in __opts__:
        return __virtualname__
    else:
        return (False, 'The module CloudFlare transit cannot be loaded: '
                       'NAPALM lib or proxy could not be loaded.')
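To narrow down which precondition fails when the module disappears, the function could be instrumented along these lines (a sketch; the log calls are an addition on top of the module's existing HAS_NAPALM and __virtualname__ globals):

import logging

log = logging.getLogger(__name__)


def __virtual__():
    # Same checks as above, but logging which branch rejected the
    # load, so the proxy log records why the module was skipped.
    if not HAS_NAPALM:
        log.warning('transit: the NAPALM library could not be imported')
        return (False, 'The module CloudFlare transit cannot be loaded: '
                       'NAPALM lib could not be imported.')
    if 'proxy' not in __opts__:
        log.warning('transit: no proxy key in __opts__; keys: %s',
                    ', '.join(sorted(__opts__)))
        return (False, 'The module CloudFlare transit cannot be loaded: '
                       'proxy opts not present.')
    return __virtualname__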

In the master log I only see:

[DEBUG   ] Sending event: tag = salt/job/20170220103610258548/new; data = {'tgt_type': 'list', 'jid': '20170220103610258548', 'tgt': ['edge01.bjm01', 'edge01.sjc01'], '_stamp': '2017-02-20T10:36:10.259801', 'user': 'sudo_mircea', 'arg': [], 'fun': 'transit.test', 'minions': ['edge01.bjm01', 'edge01.sjc01']}
[DEBUG   ] Adding minions for job 20170220103610258548: ['edge01.bjm01', 'edge01.sjc01']
[INFO    ] User sudo_mircea Published command transit.test with jid 20170220103610258548
[DEBUG   ] Gathering reactors for tag 20170220103610258548
[DEBUG   ] Published command details {'tgt_type': 'list', 'jid': '20170220103610258548', 'tgt': ['edge01.bjm01', 'edge01.sjc01'], 'ret': '', 'user': 'sudo_mircea', 'arg': [], 'fun': 'transit.test'}
[DEBUG   ] Gathering reactors for tag salt/job/20170220103610258548/new
[INFO    ] Got return from edge01.bjm01 for job 20170220103610258548
[DEBUG   ] Sending event: tag = salt/job/20170220103610258548/ret/edge01.bjm01; data = {'fun_args': [], 'jid': '20170220103610258548', 'return': "'transit.test' is not available.", 'retcode': 254, 'success': False, 'cmd': '_return', '_stamp': '2017-02-20T10:36:10.392261', 'fun': 'transit.test', 'id': 'edge01.bjm01', 'out': 'nested'}
[DEBUG   ] Gathering reactors for tag salt/job/20170220103610258548/ret/edge01.bjm01
[INFO    ] Got return from edge01.sjc01 for job 20170220103610258548
[DEBUG   ] Compiling reactions for tag salt/job/20170220103610258548/ret/edge01.bjm01

When doing sync_all:

mircea@salt-master:~$ sudo salt edge01.bjm01 saltutil.sync_all
edge01.bjm01:
    ----------
    beacons:
    engines:
        - engines.http_logstash
    grains:
        - grains.cfgrains
        - grains.network_device
    log_handlers:
    modules:
        - modules.napalm_anycast
        - modules.napalm_bgp
        - modules.napalm_cfnet
        - modules.napalm_macdb
        - modules.napalm_network
        - modules.napalm_ntp
        - modules.napalm_pop
        - modules.napalm_prefixlist
        - modules.napalm_probes
        - modules.napalm_route
        - modules.napalm_snmp
        - modules.napalm_transit
        - modules.napalm_users
        - modules.statuspage
    output:
        - output.table_out
    proxymodules:
        - proxy.napalm
    renderers:
    returners:
        - returners.traceroute
    sdb:
    states:
        - states.aggroutes
        - states.bgp
        - states.netconfig
        - states.netntp
        - states.netusers
        - states.prefixlist
        - states.probes
        - states.snmp
        - states.statuspage
    utils:

as if the module never existed before, which makes me think that your explanation above makes sense.
Also, as you can see, the list of extension modules is pretty long. Would you suspect this may somehow affect the caching?

After the sync_all against edge01.bjm01, the modules are still not available for a few other minions, e.g. edge01.sin01. In that case, sync_all says it had nothing to sync, and yet the module works afterwards, which is bizarre:

mircea@salt-master:~$ sudo salt edge01.sin01 transit.test
edge01.sin01:
    'transit.test' is not available.
ERROR: Minions returned with non-zero exit code
mircea@salt-master:~$ sudo salt edge01.sin01 saltutil.sync_all
edge01.sin01:
    ----------
    beacons:
    engines:
    grains:
    log_handlers:
    modules:
    output:
    proxymodules:
    renderers:
    returners:
    sdb:
    states:
    utils:
mircea@salt-master:~$ sudo salt edge01.sin01 transit.test
edge01.sin01:
    True
mircea@salt-master:~$

Please let's have a closer look at this and let me know what else I should test.

Thanks,
Mircea

Ch3LL modified the milestones: Nitrogen 5, Nitrogen 4 Feb 27, 2017
mirceaulinic commented Mar 14, 2017

Just in case someone would like to look into this problem at some point:

When it happened today, I checked: the modules are present under the cache dir, correctly, and sync_all doesn't actually do anything in this direction:

1. The modules disappeared:
root@salt-master:~# salt edge01.bjm01 transits.test
edge01.bjm01:
    'transits.test' is not available.
ERROR: Minions returned with non-zero exit code
root@salt-master:~# salt edge01.bjm01 transit.test
edge01.bjm01:
    'transit.test' is not available.
ERROR: Minions returned with non-zero exit code
root@salt-master:~# salt edge01.bjm01 net.connected
edge01.bjm01:
    'net.connected' is not available.
ERROR: Minions returned with non-zero exit code
2. The cache dir content:
root@salt-master:~# ls -la /var/cache/salt/proxy/edge01.bjm01/files/base
total 12
drwx------ 11 root root  146 Mar 10 12:15 .
drwx------  3 root root   17 Nov  2 10:51 ..
drwx------  2 root root   29 Nov 30 10:21 _engines
drwx------  2 root root   48 Nov 30 10:21 _grains
drwx------  2 root root 4096 Feb 22 01:00 _modules
drwx------  2 root root   25 Nov 30 10:21 _output
drwx------  2 root root   22 Nov 30 10:21 _proxy
drwx------  2 root root   26 Nov 30 10:21 _returners
drwx------  2 root root 4096 Mar  6 13:00 _states
drwx------  2 root root   82 Mar 10 12:14 router
-rw-r--r--  1 root root   54 Feb 10 13:31 top.sls
drwx------  3 root root   22 Mar 10 12:15 util
root@salt-master:~# ls -la /var/cache/salt/proxy/edge01.bjm01/files/base/_modules/
total 216
drwx------  2 root root  4096 Feb 22 01:00 .
drwx------ 11 root root   146 Mar 10 12:15 ..
-rw-r--r--  1 root root 29921 Mar  9 13:00 capirca_acl.py
-rw-r--r--  1 root root 24381 Mar  9 13:00 napalm_acl.py
-rw-r--r--  1 root root  5948 Jan  7 01:00 napalm_anycast.py
-rw-r--r--  1 root root  7110 Nov 30 10:18 napalm_bgp.py
-rw-r--r--  1 root root 17040 Feb 23 01:00 napalm_cfnet.py
-rw-r--r--  1 root root  1515 Jan 19 11:52 napalm_macdb.py
-rw-r--r--  1 root root 44624 Nov 30 10:18 napalm_network.py
-rw-r--r--  1 root root  6369 Nov 30 10:18 napalm_ntp.py
-rw-r--r--  1 root root  2985 Nov 30 10:18 napalm_pop.py
-rw-r--r--  1 root root  6343 Nov 30 10:18 napalm_prefixlist.py
-rw-r--r--  1 root root  2782 Nov 30 10:19 napalm_probes.py
-rw-r--r--  1 root root  1247 Nov 30 10:19 napalm_route.py
-rw-r--r--  1 root root  4199 Nov 30 10:19 napalm_snmp.py
-rw-r--r--  1 root root 12128 Jan  3 13:45 napalm_transit.py
-rw-r--r--  1 root root  3007 Nov 30 10:20 napalm_users.py
-rw-r--r--  1 root root  3934 Feb 22 01:00 provisiondb.py
-rw-r--r--  1 root root 16052 Jan  6 13:00 statuspage.py
3. sync_all says that many things have been re-synced:
root@salt-master:~# salt edge01.bjm01 saltutil.sync_all
edge01.bjm01:
    ----------
    beacons:
    engines:
        - engines.http_logstash
    grains:
        - grains.cfgrains
        - grains.network_device
    log_handlers:
    modules:
        - modules.capirca_acl
        - modules.napalm_acl
        - modules.napalm_anycast
        - modules.napalm_bgp
        - modules.napalm_cfnet
        - modules.napalm_macdb
        - modules.napalm_network
        - modules.napalm_ntp
        - modules.napalm_pop
        - modules.napalm_prefixlist
        - modules.napalm_probes
        - modules.napalm_route
        - modules.napalm_snmp
        - modules.napalm_transit
        - modules.napalm_users
        - modules.provisiondb
        - modules.statuspage
    output:
        - output.table_out
    proxymodules:
        - proxy.napalm
    renderers:
    returners:
        - returners.traceroute
    sdb:
    states:
        - states.aggroutes
        - states.bgp
        - states.netacl
        - states.netconfig
        - states.netntp
        - states.netusers
        - states.prefixlist
        - states.probes
        - states.snmp
        - states.statuspage
    utils:
4. The cache dir is actually the same:
root@salt-master:~# ls -la /var/cache/salt/proxy/edge01.bjm01/files/base/_modules/
total 216
drwx------  2 root root  4096 Feb 22 01:00 .
drwx------ 11 root root   146 Mar 10 12:15 ..
-rw-r--r--  1 root root 29921 Mar  9 13:00 capirca_acl.py
-rw-r--r--  1 root root 24381 Mar  9 13:00 napalm_acl.py
-rw-r--r--  1 root root  5948 Jan  7 01:00 napalm_anycast.py
-rw-r--r--  1 root root  7110 Nov 30 10:18 napalm_bgp.py
-rw-r--r--  1 root root 17040 Feb 23 01:00 napalm_cfnet.py
-rw-r--r--  1 root root  1515 Jan 19 11:52 napalm_macdb.py
-rw-r--r--  1 root root 44624 Nov 30 10:18 napalm_network.py
-rw-r--r--  1 root root  6369 Nov 30 10:18 napalm_ntp.py
-rw-r--r--  1 root root  2985 Nov 30 10:18 napalm_pop.py
-rw-r--r--  1 root root  6343 Nov 30 10:18 napalm_prefixlist.py
-rw-r--r--  1 root root  2782 Nov 30 10:19 napalm_probes.py
-rw-r--r--  1 root root  1247 Nov 30 10:19 napalm_route.py
-rw-r--r--  1 root root  4199 Nov 30 10:19 napalm_snmp.py
-rw-r--r--  1 root root 12128 Jan  3 13:45 napalm_transit.py
-rw-r--r--  1 root root  3007 Nov 30 10:20 napalm_users.py
-rw-r--r--  1 root root  3934 Feb 22 01:00 provisiondb.py
-rw-r--r--  1 root root 16052 Jan  6 13:00 statuspage.py
root@salt-master:~# ls -la /var/cache/salt/proxy/edge01.bjm01/files/base
total 12
drwx------ 11 root root  146 Mar 10 12:15 .
drwx------  3 root root   17 Nov  2 10:51 ..
drwx------  2 root root   29 Nov 30 10:21 _engines
drwx------  2 root root   48 Nov 30 10:21 _grains
drwx------  2 root root 4096 Feb 22 01:00 _modules
drwx------  2 root root   25 Nov 30 10:21 _output
drwx------  2 root root   22 Nov 30 10:21 _proxy
drwx------  2 root root   26 Nov 30 10:21 _returners
drwx------  2 root root 4096 Mar  6 13:00 _states
drwx------  2 root root   82 Mar 10 12:14 router
-rw-r--r--  1 root root   54 Feb 10 13:31 top.sls
drwx------  3 root root   22 Mar 10 12:15 util

EDIT: I forgot to mention (although it is self-explanatory given the long story above): all good after sync_all.

@mirceaulinic

BTW @Ch3LL @cro, I would find the "Bug" and "High Severity" labels more appropriate; maybe that way this will get more visibility, given that after 7 months we're pretty much in the same place.

cro added the Bug, severity-medium, P1 (Priority 1), and Proxy-Minion labels, and removed the Pending-Discussion label Mar 14, 2017
mirceaulinic commented May 26, 2017

@cro After the discussion in #41024, I am wondering if this may by any chance be related to the way we define the __virtual__ function. Do you think there is a slight possibility that:

  • napalm suddenly becomes unavailable for import (I believe this chance is very small)
  • the proxy key is no longer available in __opts__ (the proxy key is defined in the pillar)? Would you assume it is possible that, while the proxy process is alive, it loses the pillar data merged into the opts?

But in that case, why does sync_all reload the modules, as in my comment above: #35443 (comment)?

mirceaulinic commented Jun 7, 2017

As a follow-up (obviously to myself): I checked when this happens, and the proxy key still exists in __opts__ (verified using test.get_opts).

@mirceaulinic

Is this issue still scheduled to be solved in Nitrogen?
Once again, I am happy to help, but I really don't know where to start, and I am awaiting an expert's input on this.

rallytime modified the milestones: Nitrogen 5, Approved Aug 30, 2017
@mirceaulinic

This Pull Request #39948 might finally solve this issue.

stale bot commented Mar 25, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

stale bot added the stale label Mar 25, 2019
stale bot closed this as completed Apr 1, 2019