Overhaul all metrics #81

isker · 2024-10-13T21:46:38Z

- Fix names to comply with the [official guidelines](https://prometheus.io/docs/practices/naming/#metric-and-label-naming) and, where reasonable, to better mirror the names of similar timeseries from the much-more-popular cAdvisor. Also stop using the word "svc" to refer to tasks, as it is simply not correct.
- Improve `help` strings.
- Stop reporting per-CPU usage metrics. Empirically, they're only available in Fargate, but the current collector implementation assumes they're available everywhere. (They were previously available in EC2, but that stopped being the case when ecs-agent was upgraded to use cgroups v2.) Given that it's not clear why per-CPU numbers are useful in general, remove them everywhere instead of exposing disjoint metrics for Fargate and EC2. This also prevents Fargate from spontaneously breaking in the same way EC2 did.
- Fix the task-level memory limit to actually be in bytes (it previously said "bytes" but was in fact MiB).
- Correctly report container-level memory limits in all cases. The stats `limit` is nonsense if, as in Fargate, there is no container-level limit configured in the task definition. While the right data for all cases is hiding in the stats response somewhere, I have instead opted to cut out the stats middleman and use the task metadata directly to drive this metric. I think it's substantially less likely that ECS fails to apply the configured limits to cgroups correctly than it is that we fail to interrogate cgroups output correctly: the latter empirically happens with some frequency :^).
- Add metrics concerning Fargate ephemeral storage and task image pull timestamps.
- Remove the `task_arn` label on task-level metrics, as it does not distinctly identify anything within the instance - the instance is the task! Users needing the task ARN in their timeseries labels can get it by joining to `ecs_task_metadata_info`.
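
To illustrate the memory-limit items above, here is a minimal Go sketch of the unit conversion and of reporting a limit only when one is actually configured. It is not the collector's code: the `limits` type, its `MemoryMiB` field, and both helper functions are hypothetical names chosen for this example, and the only facts taken from this PR are that the metadata value is in MiB and that a container-level limit may be absent.

```go
package main

import "fmt"

// bytesPerMiB is the number of bytes in one mebibyte (2^20).
const bytesPerMiB = 1024 * 1024

// limits is a hypothetical stand-in for the limits section of the task
// metadata. MemoryMiB is assumed to be in MiB; nil means no limit is set.
type limits struct {
	MemoryMiB *int64
}

// mibToBytes converts a MiB value to bytes, the unit the metric name promises.
func mibToBytes(mib int64) float64 {
	return float64(mib) * bytesPerMiB
}

// memoryLimitBytes returns the configured limit in bytes. The second return
// value is false when no limit is configured, so the caller can skip the
// sample instead of exporting a meaningless number.
func memoryLimitBytes(l limits) (float64, bool) {
	if l.MemoryMiB == nil {
		return 0, false
	}
	return mibToBytes(*l.MemoryMiB), true
}

func main() {
	mib := int64(512)
	if v, ok := memoryLimitBytes(limits{MemoryMiB: &mib}); ok {
		fmt.Printf("limit: %.0f bytes\n", v)
	}
	if _, ok := memoryLimitBytes(limits{}); !ok {
		fmt.Println("no limit configured; nothing to report")
	}
}
```

Returning an explicit "not configured" flag is one way to avoid exporting a placeholder value; what the exporter actually reports in that case is not stated in this thread.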

I have tested these changes both in Fargate and EC2 and they look correct to me.

Closes #70 (by way of obsoleting it).
Closes #74.
Closes #69.
Closes #35.
Closes #16 (as far as I can tell).

README.md

ecscollector/collector.go

SuperQ

Thanks, looking good so far.

README.md

- Fix names to comply with the [official guidelines](https://prometheus.io/docs/practices/naming/#metric-and-label-naming) and to better mirror the names of similar timeseries from the much-more-popular cAdvisor, when reasonable. And don't use the word "svc" to refer to tasks, as it is just not correct.
- Improve `help`s.
- Stop reporting per-CPU usage metrics. They're empirically only available in Fargate, but the current collector implementation assumes they're available everywhere. (They were previously available in EC2 but that stopped being the case when ecs-agent was upgraded to use cgroups v2.) Given that it's not clear why per-CPU numbers are useful in general, remove them everywhere instead of exposing disjoint metrics for Fargate and EC2. This will also prevent Fargate from potentially spontaneously breaking in the same way EC2 did.
- Fix task-level memory limit to actually be in bytes (it previously said "bytes" but was in fact MiB).
- Correctly report container-level memory limits in all cases - the stats `limit` is nonsense if, as in Fargate, there is no container-level limit configured in the task definition. While the right data for all cases is hiding in the stats response somewhere, I have instead opted to cut out the stats middleman and use the task metadata directly to drive this metric. I think it's substantially less likely that ECS fails to effect the configured limits upon cgroups correctly than it is that we fail to interrogate cgroups output correctly: the latter empirically happens with some frequency :^).
- Add metrics concerning Fargate ephemeral storage and task image pull timestamps.
- Remove the `task_arn` label on task-level metrics, as it does not distinctly identify anything within the instance - the instance is the task! Users needing the task ARN in their timeseries labels may do so by joining to `ecs_task_metadata_info`.

I have tested these changes both in Fargate and EC2 and they look correct to me.

Signed-off-by: Ian Kerins <git@isk.haus>

SuperQ

Great work, thanks!

isker · 2024-10-17T11:56:16Z

Thanks for reviewing. We will begin relying on this exporter heavily within the next few months. So if anything ends up being off or incomplete with this overhaul, I will be following up.

If not, this exporter might be close to being feature complete, and we could consider cutting 1.0 after some time.

isker force-pushed the overhaul branch from 5c7aa78 to 18b5689 on October 13, 2024 at 21:49

isker commented Oct 13, 2024

SuperQ reviewed Oct 15, 2024

isker force-pushed the overhaul branch from 18b5689 to 9267824 on October 16, 2024 at 21:24

SuperQ reviewed Oct 16, 2024

SuperQ reviewed Oct 16, 2024

isker force-pushed the overhaul branch from 9267824 to 5c8ca62 on October 16, 2024 at 22:38

SuperQ reviewed Oct 17, 2024

isker force-pushed the overhaul branch from 5c8ca62 to 43b57c0 on October 17, 2024 at 11:41

SuperQ approved these changes Oct 17, 2024

SuperQ merged commit 64e73a4 into prometheus-community:main Oct 17, 2024
4 checks passed

isker deleted the overhaul branch October 17, 2024 11:56
