
status.master is slow, CPU intensive with high number of TCP connections #53580

Closed
cifvts opened this issue Jun 24, 2019 · 7 comments · Fixed by #55501
Labels: Bug (broken, incorrect, or confusing behavior), memory-leak

Comments

cifvts commented Jun 24, 2019

Description of Issue

When the number of TCP connections gets high, salt.utils.network._netlink_tool_remote_on becomes really slow and CPU intensive.
The function is used by status.master, which is run as a scheduled job to test Salt Master availability.
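For context, a helper like this typically shells out to `ss` and scans every connection line looking for peers on a given local port, so its cost grows linearly with the total connection count. A minimal, hypothetical sketch of that parsing step (the function name, column layout, and sample output below are illustrative, not Salt's actual code):

```python
# Hypothetical sketch of the kind of work _netlink_tool_remote_on does:
# walk the full `ss -ant` output and collect remote addresses whose local
# peer port matches. With ~100k connections, every call scans ~100k lines.

def remote_addrs_on_port(ss_output, port):
    """Return the set of remote IPs connected to the given local port."""
    remotes = set()
    for line in ss_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 5:
            continue
        local, remote = fields[3], fields[4]
        # rsplit on ':' once so IPv6 addresses keep their inner colons
        if local.rsplit(':', 1)[-1] == str(port):
            remotes.add(remote.rsplit(':', 1)[0])
    return remotes

sample = """State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      10.0.0.1:4505       10.0.0.2:51514
ESTAB  0      0      10.0.0.1:22         10.0.0.3:40000
"""
print(remote_addrs_on_port(sample, 4505))  # {'10.0.0.2'}
```

The scan itself is cheap per line, but at 100k lines per scheduled run the aggregate CPU cost adds up quickly.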

Setup

We noticed the issue on servers with ~100k TCP connections (counted with ss -ant | wc -l).

Versions Report

Salt Version:
           Salt: 2018.3.0
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.9.4
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.24.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: 1.3.7
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.13 (default, Sep 26 2018, 18:42:22)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.2.1
 
System Versions:
           dist: debian 9.4 
         locale: UTF-8
        machine: x86_64
        release: 4.14.105-mptcp-kp2-3f01458b
         system: Linux
        version: debian 9.4 
cmcmarrow (Contributor)

We have a TCP leak right now. Thank you for bringing this up. I will try to fix the leak in status.master and bring down the TCP connection count, which should also bring down the CPU workload.

cmcmarrow self-assigned this Jun 24, 2019
cmcmarrow added the Bug (broken, incorrect, or confusing behavior) and memory-leak labels Jun 24, 2019

cmcmarrow commented Jun 24, 2019

#53581 seems to be only part of the fix

cmcmarrow added this to the Approved milestone Jun 24, 2019
cmcmarrow (Contributor)

Once #53581 is merged I will fix the leak.

cifvts (Author) commented Jun 25, 2019

@cmcmarrow is there an issue open for the TCP leak? I would be interested in looking at it to find out whether we hit the problem in our setup.

cmcmarrow (Contributor)

@cifvts TCP connections do leak. We are currently trying to find and fix the leaks, but we have not opened an issue for it yet. If you run the Salt test suite on Windows you can see the TCP connection count climb. Right now we are trying to find all the TCP leaks and patch them. The other goal is to lower the number of TCP connections made, by reusing them when it is safe. If we do this in salt.utils.network._netlink_tool_remote_on, I believe it would also help lower your CPU workload.

cifvts (Author) commented Jun 28, 2019

I see what you mean. With the patch, the CPU workload is completely gone. As for the TCP leaks, I will be happy to look at them, since we have thousands of connections in our production environment.

cmcmarrow (Contributor)

A fix should come out in the next update! I'll link it once I have the fix.

cifvts pushed a commit to cifvts/salt that referenced this issue Sep 25, 2019
Parsing the output of all TCP connections can be really slow when the
number of connections is high. This patch addresses the problem by using
the filtering provided by `ss` to reduce the number of lines returned by
the command.

Fixes saltstack#53580
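The idea behind the fix: instead of reading back every TCP socket and filtering in Python, let `ss` do the filtering itself with its state/port filter expressions, so only the relevant handful of lines is ever produced. A hedged sketch of building such a command line (the exact filter expression used by the merged patch may differ; `build_ss_cmd` and the port argument are illustrative names):

```python
def build_ss_cmd(port, filtered=True):
    """Build an `ss` argument list for listing TCP connections.

    With filtered=True, ss itself discards connections on other ports,
    so even with ~100k sockets only a few matching lines reach the parser.
    """
    if not filtered:
        # Unfiltered: every TCP socket, numeric output, no listening sockets
        return ['ss', '-ant']
    # ss filter syntax: state FILTER '( sport = :PORT or dport = :PORT )'
    return ['ss', '-ant', 'state', 'established',
            '(', 'sport', '=', ':%d' % port,
            'or', 'dport', '=', ':%d' % port, ')']

print(' '.join(build_ss_cmd(4505)))
# → ss -ant state established ( sport = :4505 or dport = :4505 )
```

The win is that the kernel-side netlink dump plus `ss`'s own matching replaces a Python loop over ~100k lines of text, which is what made the scheduled status.master runs CPU intensive.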
cifvts pushed a commit to cifvts/salt that referenced this issue Sep 26, 2019
cifvts pushed a commit to cifvts/salt that referenced this issue Oct 4, 2019
cifvts pushed a commit to cifvts/salt that referenced this issue Dec 3, 2019