add Catalina domain specification to tomcat bean includes #7054
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What does this PR do?
It adds the "domain" specifier to conf.yaml.example and metrics.yaml in the tomcat integration. The domain is "Catalina", in keeping with what tomcat publishes its JMX metrics under.
Motivation
We encountered an issue where JMX data would stop being reported by some of the agents on our production instances. This would be accompanied by a spike in CPU usage. The jmxfetch process under top showed an accumulation of hours of CPU time and continuous 60%+ CPU usage for hours on end. Once JMX reporting failed in this manner, it would not come back without restarting both the agent as well as the tomcat instance it was monitoring. A case was opened with datadog support, but it went unresolved.
By running "datadog-agent jmx collect", it was determined that jmxfetch was scanning hundreds of thousands of attributes. Unable to finish, it would never reach the portion of the log where it should say that it was done scanning attributes and start applying the metric filters. By examining the tomcat instance with jconsole, it was found that there were hundreds of thousands of attributes under the kafka.producer domain. These were primarily the attributes causing jmxfetch to run for so long.
However, the tomcat integration should not have been scanning those attributes in the first place. While jmxfetch has a builtin configuration to limit the number of metrics to report (not the issue here), so it will not use up arbitrary resources. However, It does not have such a hard limit on the attributes it will scan prior to applying the include filters in the integration. Its only protection from scanning an arbitrary number of attributes is by determining the proper beans to scan by examining the "bean" and "domain" specifiers in the includes. jmxfetch has a routine which determines the minimal covering set of beans to scan by examining those specifiers (I don't recall the name of the method, but it is easily searchable).
If it is not able to define those beans (say, because an include doesn't define either of those specifiers), it then resorts to scanning attributes under all beans. You can see this exhibited in the jmx collect output when it says, "Querying bean names on scope: :".
In our scenario, we had no use for the problematic kafka.producer bean and would not be scanning it. So we were surprised to find it being scanned by the default datadog tomcat integration. We found out that this is because the integration is missing any "domain" specifier on the includes. By adding "domain: Catalina" to each of the includes, we were able to verify that the log no longer output "Querying bean names on scope: :". jmxfetch then no longer scanned the problematic kafka.producer bean and stopped going off into la-la-land with CPU, and jmx reporting has been working for us since.
Additional Notes
I inspected the rest of the integration configurations. The following also leave out domain or bean specifiers, and would exhibit the same issue:
jboss_wildfly relies on "bean_regex", which i haven't verified, but I would imagine is ok if it contained the domain. Still, I'm not sure the routine in jmxfetch which determines the minimal covering list of beans is smart enough to extract that from a regex, so it bears verification.
Review checklist (to be filled by reviewers)
changelog/
andintegration/
labels attached