Add metrics-monitoring beats to resource monitoring #4326
This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
NOTE: |
Do the |
That's the goal of this PR, yeah. If you want, you can verify the negative, and comment-out the lines that map in |
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
How can we verify that the Fleet memory and CPU usage calculations are actually using the fields from the new documents?
If I install an agent built from this branch, the CPU and memory in Fleet are exactly the same as for one built from main. Possibly this is because they aren't using enough resources for there to be a detectable difference, but I would have expected memory to be higher at a minimum.
[Image: Screenshot 2024-03-08 at 3 59 37 PM]
if binary != "filebeat" && binary != "metricbeat" {
	t.Errorf("expected monitoring component to be metricbeat or filebeat, got %s", binary)
}
if componentID != "filestream-monitoring" && componentID != "beat/metrics-monitoring" && componentID != "http/metrics-monitoring" {
Optional: I'm wondering if using the const monitoringFilesUnitsID might be better than the string "filestream-monitoring".
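As a rough illustration of that suggestion, the check could compare against named constants instead of string literals. This is only a sketch: monitoringFilesUnitsID is the constant named above, and its value is assumed to be "filestream-monitoring"; the other two constant names and the isMonitoringComponent helper are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// Component IDs for the monitoring inputs.
// monitoringFilesUnitsID is the constant named in the review (value assumed);
// the other two names are hypothetical, chosen for illustration only.
const (
	monitoringFilesUnitsID  = "filestream-monitoring"
	monitoringBeatMetricsID = "beat/metrics-monitoring" // hypothetical name
	monitoringHTTPMetricsID = "http/metrics-monitoring" // hypothetical name
)

// isMonitoringComponent reports whether id is one of the known
// monitoring component IDs (hypothetical helper).
func isMonitoringComponent(id string) bool {
	switch id {
	case monitoringFilesUnitsID, monitoringBeatMetricsID, monitoringHTTPMetricsID:
		return true
	}
	return false
}

func main() {
	for _, id := range []string{"filestream-monitoring", "log-default"} {
		fmt.Printf("%s: %v\n", id, isMonitoringComponent(id))
	}
}
```

With constants, a renamed component ID only has to change in one place, and a typo in the test becomes a compile error rather than a silently failing comparison.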
res, err := estools.PerformQueryForRawQuery(ctx, query, "metrics-elastic_agent*", runner.info.ESClient)
require.NoError(runner.T(), err)
runner.T().Logf("Fetched metrics for %s, got %d hits", cid, res.Hits.Total.Value)
if res.Hits.Total.Value < 5 {
Why < 5? Doesn't any number of hits mean there was at least one document matching the query?
Yah, my concern was that we could end up in some freak accident where a test misconfiguration causes us to re-use an agent install, and thus an agent ID, leading to some overlap of results. Not sure if this is realistic, though.
The agent ID you get on enrollment is unique within Fleet, so shouldn't be possible unless Fleet is completely broken.
This isn't protecting against that properly anyway IMO, it is just a magic number.
A way that should actually work would be:
- Install and enroll an agent with monitoring disabled.
- Wait one full metrics collection cycle.
- Ensure there are no hits in the metrics-* indices for that agent.
- Turn monitoring on.
- Ensure there are metrics now.
I'm not sure if all of that is actually worth it though.
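For reference, the steps above could be sketched roughly as follows. Everything here is hypothetical: the agentController and metricsStore interfaces, the verifyMonitoringToggle function, and the in-memory fakeEnv are stand-ins for illustration, not the test framework's real API. The sketch also simplifies step one by toggling monitoring off on an existing agent rather than installing and enrolling a fresh one.

```go
package main

import (
	"errors"
	"fmt"
)

// metricsStore is a hypothetical stand-in for querying the metrics-* indices.
type metricsStore interface {
	CountMetrics(agentID string) (int, error)
}

// agentController is a hypothetical stand-in for changing the agent policy.
type agentController interface {
	SetMonitoring(agentID string, enabled bool) error
	WaitCollectionCycle() // blocks for one full metrics collection cycle
}

// verifyMonitoringToggle checks that no metrics exist while monitoring is
// off, and that metrics appear once monitoring is turned on.
func verifyMonitoringToggle(agentID string, ctl agentController, store metricsStore) error {
	if err := ctl.SetMonitoring(agentID, false); err != nil {
		return err
	}
	ctl.WaitCollectionCycle()
	n, err := store.CountMetrics(agentID)
	if err != nil {
		return err
	}
	if n != 0 {
		return fmt.Errorf("expected no metrics with monitoring disabled, got %d", n)
	}

	if err := ctl.SetMonitoring(agentID, true); err != nil {
		return err
	}
	ctl.WaitCollectionCycle()
	n, err = store.CountMetrics(agentID)
	if err != nil {
		return err
	}
	if n == 0 {
		return errors.New("expected metrics after enabling monitoring, got none")
	}
	return nil
}

// fakeEnv is an in-memory fake that produces one document per collection
// cycle while monitoring is enabled.
type fakeEnv struct {
	enabled bool
	docs    int
}

func (f *fakeEnv) SetMonitoring(_ string, enabled bool) error { f.enabled = enabled; return nil }
func (f *fakeEnv) CountMetrics(_ string) (int, error)         { return f.docs, nil }
func (f *fakeEnv) WaitCollectionCycle() {
	if f.enabled {
		f.docs++
	}
}

func main() {
	env := &fakeEnv{}
	fmt.Println("result:", verifyMonitoringToggle("agent-1", env, env))
}
```

Unlike a fixed threshold, this checks for zero hits before and at least one hit after, so it does not depend on a magic number.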
},
{
	"exists": map[string]interface{}{
		"field": "system.process.cpu.total.value", // make sure we fetch documents that have the metric field used by fleet monitoring
You could also check for the memory metric.
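A minimal sketch of what that might look like, building one exists clause per required field. The memory field name system.process.memory.size is an assumption (the PR only names the CPU field), the existsClauses helper is hypothetical, and placing the clauses under bool/must is also assumed rather than taken from the test's actual query.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// existsClauses returns one Elasticsearch "exists" clause per field
// (hypothetical helper, not part of the PR).
func existsClauses(fields ...string) []map[string]interface{} {
	clauses := make([]map[string]interface{}, 0, len(fields))
	for _, f := range fields {
		clauses = append(clauses, map[string]interface{}{
			"exists": map[string]interface{}{"field": f},
		})
	}
	return clauses
}

func main() {
	query := map[string]interface{}{
		"query": map[string]interface{}{
			"bool": map[string]interface{}{
				// Assumed query shape: every clause must match.
				"must": existsClauses(
					"system.process.cpu.total.value", // CPU field from the PR
					"system.process.memory.size",     // ASSUMED memory field name
				),
			},
		},
	}

	out, err := json.MarshalIndent(query, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```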
…#4326)" (elastic#4451) This reverts commit 7f83ddd.
What does this PR do?
Closes #4082
This PR is currently a draft, as the test borrows the integration setup code from #4150. Once that PR is merged, we can refactor this to use the same code to set up and install integrations.
Why is it important?
This PR changes the behavior of the monitoring beats so that they also monitor and report metrics on themselves. This fixes an issue where the CPU and memory usage that agent reports to Fleet can be deceptive, as the figures don't include all the beats that are running under agent.
I also did a light refactor of the monitoring setup so we use constants for the monitoring beat IDs.
Checklist
./changelog/fragments
using the changelog tool