
Add metrics-monitoring beats to resource monitoring #4326

Merged

Conversation

fearful-symmetry (Contributor)

What does this PR do?

Closes #4082

This PR is currently a draft, as the test borrows the integration setup code from #4150. Once that PR is merged, we can refactor this to use the same code to set up and install integrations.

Why is it important?

This PR changes the behavior of the monitoring beats so that they also monitor and report metrics on themselves. This fixes an issue where the CPU and memory usage that agent reports to fleet can be deceptive, as the reported values don't include all the beats running under agent.

I also did a light refactor of the monitoring setup so we use constants for the monitoring beat IDs.
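The constants refactor described above can be sketched as follows. Only `monitoringFilesUnitsID` is a name actually cited in the PR discussion; the other two identifiers are illustrative guesses, not necessarily the names used in v1_monitor.go:

```go
package main

import "fmt"

// Unit IDs for the agent self-monitoring components.
// monitoringFilesUnitsID is the constant name mentioned in review;
// the other two names are assumptions for illustration.
const (
	monitoringFilesUnitsID = "filestream-monitoring"
	monitoringBeatsUnitID  = "beat/metrics-monitoring" // assumed name
	monitoringHTTPUnitID   = "http/metrics-monitoring" // assumed name
)

func main() {
	// Centralizing the IDs means the monitoring config and the tests
	// can reference the same values instead of repeating string literals.
	fmt.Println(monitoringFilesUnitsID, monitoringBeatsUnitID, monitoringHTTPUnitID)
}
```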

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test


mergify bot commented Feb 22, 2024

This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit.

NOTE: backport-skip has been added to this pull request.

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent Label for the Agent team label Feb 22, 2024
cmacknz (Member) commented Feb 23, 2024

Do the beat/metrics and http/metrics inputs report the resource usage and stats of the Beat they themselves are running in? Not just the other monitoring beats, as in beat/metrics is reporting the CPU usage of beat/metrics-monitoring, http/metrics-monitoring and filestream-monitoring and not just the latter two?

fearful-symmetry (Contributor, Author) replied:

> Do the beat/metrics and http/metrics inputs report the resource usage and stats of the Beat they themselves are running in? Not just the other monitoring beats, as in beat/metrics is reporting the CPU usage of beat/metrics-monitoring, http/metrics-monitoring and filestream-monitoring and not just the latter two?

That's the goal of this PR, yeah. If you want, you can verify the negative: comment out the mapping lines in v1_monitor.go, and the resulting integration test will fail.

@fearful-symmetry fearful-symmetry marked this pull request as ready for review March 7, 2024 14:53
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner March 7, 2024 14:53
elasticmachine (Contributor)

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz (Member) left a comment:

How can we verify that the Fleet memory and CPU usage calculations are actually using the fields from the new documents?

If I install an agent built from this branch, the CPU and memory in Fleet are exactly the same as one built from main. Possibly this is because they aren't using enough resources for there to be a detectable difference, but I would have expected memory to be higher at a minimum.

[Screenshot: Fleet agent CPU and memory usage, 2024-03-08]

if binary != "filebeat" && binary != "metricbeat" {
t.Errorf("expected monitoring component to be metricbeat or filebeat, got %s", binary)
}
if componentID != "filestream-monitoring" && componentID != "beat/metrics-monitoring" && componentID != "http/metrics-monitoring" {
Reviewer (Contributor):

Optional: I'm wondering if using the const monitoringFilesUnitsID might be better than the string "filestream-monitoring".
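The suggested change could look like this minimal sketch; only the constant name `monitoringFilesUnitsID` is confirmed by the discussion, and the helper function is an invention for illustration:

```go
package main

import "fmt"

// monitoringFilesUnitsID is the constant name cited in the review; using it
// avoids scattering the raw "filestream-monitoring" string through the tests.
const monitoringFilesUnitsID = "filestream-monitoring"

// isFilestreamMonitoring reports whether componentID names the filestream
// self-monitoring component, comparing against the constant rather than
// a string literal. Hypothetical helper, not from the PR.
func isFilestreamMonitoring(componentID string) bool {
	return componentID == monitoringFilesUnitsID
}

func main() {
	fmt.Println(isFilestreamMonitoring("filestream-monitoring"))
	fmt.Println(isFilestreamMonitoring("beat/metrics-monitoring"))
}
```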

res, err := estools.PerformQueryForRawQuery(ctx, query, "metrics-elastic_agent*", runner.info.ESClient)
require.NoError(runner.T(), err)
runner.T().Logf("Fetched metrics for %s, got %d hits", cid, res.Hits.Total.Value)
if res.Hits.Total.Value < 5 {
Reviewer (Member):

Why < 5? Doesn't any amount of hits mean there was at least one document matching the query?

fearful-symmetry (Author):

Yeah, my concern was that we could end up in some freak accident where a test misconfiguration causes us to reuse an agent install, and thus an agent ID, leading to some overlap of results. Not sure if this is realistic, though.

Reviewer (Member):

The agent ID you get on enrollment is unique within Fleet, so shouldn't be possible unless Fleet is completely broken.

This isn't protecting against that properly anyway IMO, it is just a magic number.

A way that should actually work would be:

  1. Install and enroll an agent with monitoring disabled.
  2. Wait one full metrics collection cycle.
  3. Ensure there are no hits in the metrics-* indices for that agent.
  4. Turn monitoring on.
  5. Ensure there are metrics now.

I'm not sure if all of that is actually worth it though.
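The five-step plan above can be sketched as a runnable simulation. Here the agent and Fleet are replaced by an in-memory stub; `fakeAgent` and `runScenario` are inventions for illustration, not real integration-test harness APIs:

```go
package main

import "fmt"

// fakeAgent stands in for a real enrolled agent: it emits one metrics
// document per collection cycle only while self-monitoring is enabled.
type fakeAgent struct {
	monitoring bool
	docs       int
}

func (a *fakeAgent) collectCycle() {
	if a.monitoring {
		a.docs++
	}
}

// runScenario walks the five steps and returns the document counts
// observed with monitoring off and then on.
func runScenario() (offDocs, onDocs int) {
	agent := &fakeAgent{monitoring: false} // 1. enroll with monitoring disabled
	agent.collectCycle()                   // 2. wait one full collection cycle
	offDocs = agent.docs                   // 3. expect zero hits for this agent
	agent.monitoring = true                // 4. turn monitoring on
	agent.collectCycle()                   // 5. expect metrics now
	onDocs = agent.docs
	return offDocs, onDocs
}

func main() {
	off, on := runScenario()
	fmt.Printf("monitoring off: %d docs, monitoring on: %d docs\n", off, on)
}
```

The point of the stub is that any nonzero hit count with monitoring off would indicate leakage from another agent, which is the failure mode the magic number `5` cannot catch.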

},
{
"exists": map[string]interface{}{
"field": "system.process.cpu.total.value", // make sure we fetch documents that have the metric field used by fleet monitoring
Reviewer (Member):

You could also check for the memory metric.
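Extending the query fragment above per that suggestion might look like the sketch below. The CPU field name comes from the snippet; the memory field name `system.process.memory.size` is an assumption based on Metricbeat's system/process fields and is not confirmed by this PR:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildMonitoringQuery returns a bool query requiring both metric fields
// that Fleet's resource-usage calculations read. The memory field name is
// an assumption, not taken from the PR.
func buildMonitoringQuery() map[string]interface{} {
	return map[string]interface{}{
		"bool": map[string]interface{}{
			"must": []interface{}{
				map[string]interface{}{
					"exists": map[string]interface{}{"field": "system.process.cpu.total.value"},
				},
				map[string]interface{}{
					"exists": map[string]interface{}{"field": "system.process.memory.size"},
				},
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(buildMonitoringQuery(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```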


Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

3 New issues
0 Security Hotspots
68.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@fearful-symmetry fearful-symmetry merged commit 4594088 into elastic:main Mar 20, 2024
9 checks passed
rdner added a commit that referenced this pull request Mar 21, 2024
fearful-symmetry added a commit to fearful-symmetry/elastic-agent that referenced this pull request Mar 21, 2024
fearful-symmetry added a commit that referenced this pull request Mar 27, 2024
… fleet naming changes (#4462)

* Reapply "Add metrics-monitoring beats to resource monitoring (#4326)" (#4451)

This reverts commit 7f83ddd.

* revert removal, make it easier to adjust unit ID output name

* update unit names
Labels
backport-skip Team:Elastic-Agent Label for the Agent team
Development

Successfully merging this pull request may close these issues.

Agent should collect and report CPU and memory usage of monitoring components
4 participants