Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Catalina domain specification to tomcat bean includes #7054

Merged
merged 1 commit into from
Jul 10, 2020

Conversation

binaryphile
Copy link
Contributor

What does this PR do?

It adds the "domain" specifier to conf.yaml.example and metrics.yaml in the tomcat integration. The domain is "Catalina", in keeping with what tomcat publishes its JMX metrics under.

Motivation

We encountered an issue where JMX data would stop being reported by some of the agents on our production instances. This would be accompanied by a spike in CPU usage. The jmxfetch process under top showed an accumulation of hours of CPU time and continuous 60%+ CPU usage for hours on end. Once JMX reporting failed in this manner, it would not come back without restarting both the agent as well as the tomcat instance it was monitoring. A case was opened with datadog support, but it went unresolved.

By running "datadog-agent jmx collect", it was determined that jmxfetch was scanning hundreds of thousands of attributes. Unable to finish, it would never reach the portion of the log where it should say that it was done scanning attributes and start applying the metric filters. By examining the tomcat instance with jconsole, it was found that there were hundreds of thousands of attributes under the kafka.producer domain. These were primarily the attributes causing jmxfetch to run for so long.

However, the tomcat integration should not have been scanning those attributes in the first place. While jmxfetch has a builtin configuration to limit the number of metrics to report (not the issue here), so it will not use up arbitrary resources. However, It does not have such a hard limit on the attributes it will scan prior to applying the include filters in the integration. Its only protection from scanning an arbitrary number of attributes is by determining the proper beans to scan by examining the "bean" and "domain" specifiers in the includes. jmxfetch has a routine which determines the minimal covering set of beans to scan by examining those specifiers (I don't recall the name of the method, but it is easily searchable).

If it is not able to define those beans (say, because an include doesn't define either of those specifiers), it then resorts to scanning attributes under all beans. You can see this exhibited in the jmx collect output when it says, "Querying bean names on scope: :".

In our scenario, we had no use for the problematic kafka.producer bean and would not be scanning it. So we were surprised to find it being scanned by the default datadog tomcat integration. We found out that this is because the integration is missing any "domain" specifier on the includes. By adding "domain: Catalina" to each of the includes, we were able to verify that the log no longer output "Querying bean names on scope: :". jmxfetch then no longer scanned the problematic kafka.producer bean and stopped going off into la-la-land with CPU, and jmx reporting has been working for us since.

Additional Notes

I inspected the rest of the integration configurations. The following also leave out domain or bean specifiers, and would exhibit the same issue:

  • activemq
  • solr

jboss_wildfly relies on "bean_regex", which i haven't verified, but I would imagine is ok if it contained the domain. Still, I'm not sure the routine in jmxfetch which determines the minimal covering list of beans is smart enough to extract that from a regex, so it bears verification.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • PR title must be written as a CHANGELOG entry (see why)
  • Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached

@hithwen
Copy link
Contributor

hithwen commented Jul 9, 2020

Hallo @binaryphile Thanks a lot for your contribution. It seems like your change was not created from the master branch though as the conf section on instances has not been there since april (8aa6d4e)

You only need to edit metrics.yaml now.

@binaryphile
Copy link
Contributor Author

binaryphile commented Jul 9, 2020 via email

@binaryphile
Copy link
Contributor Author

Don't know how I got on the wrong branch in the first place but this update should tidy things up now.

@hithwen hithwen merged commit 3d9e6eb into DataDog:master Jul 10, 2020
@binaryphile binaryphile deleted the tomcat-includes branch July 10, 2020 13:35
This was referenced Jul 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants