Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul all metrics #81

Merged
merged 1 commit into from
Oct 17, 2024
Merged

Overhaul all metrics #81

merged 1 commit into from
Oct 17, 2024

Conversation

isker
Copy link
Contributor

@isker isker commented Oct 13, 2024

  • Fix names to comply with the official guidelines and to better mirror the names of similar timeseries from the much-more-popular cAdvisor, when reasonable. And don't use the word "svc" to refer to tasks, as it is just not correct.
  • Improve helps.
  • Stop reporting per-CPU usage metrics. They're empirically only available in Fargate, but the current collector implementation assumes they're available everywhere. (They were previously available in EC2 but that stopped being the case when ecs-agent was upgraded to use cgroups v2.) Given that it's not clear why per-CPU numbers are useful in general, remove them everywhere instead of exposing disjoint metrics for Fargate and EC2. This will also prevent Fargate from potentially spontaneously breaking in the same way EC2 did.
  • Fix task-level memory limit to actually be in bytes (it previously said "bytes" but was in fact MiB).
  • Correctly report container-level memory limits in all cases - the stats limit is nonsense if, as in Fargate, there is no container-level limit configured in the task definition. While the right data for all cases is hiding in the stats response somewhere, I have instead opted to cut out the stats middleman and use the task metadata directly to drive this metric. I think it's substantially less likely that ECS fails to effect the configured limits upon cgroups correctly than it is that we fail to interrogate cgroups output correctly: the latter empirically happens with some frequency :^).
  • Add metrics concerning Fargate ephemeral storage and task image pull timestamps.
  • Remove the task_arn label on task-level metrics, as it does not distinctly identify anything within the instance - the instance is the task! Users needing the task ARN in their timeseries labels may do so by joining to ecs_task_metadata_info.

I have tested these changes both in Fargate and EC2 and they look correct to me.

Closes #70 (by way of obsoleting it).
Closes #74.
Closes #69.
Closes #35.
Closes #16 (as far as I can tell).

README.md Outdated Show resolved Hide resolved
ecscollector/collector.go Show resolved Hide resolved
ecscollector/collector.go Outdated Show resolved Hide resolved
ecscollector/collector.go Outdated Show resolved Hide resolved
Copy link
Contributor

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looking good so far.

README.md Outdated Show resolved Hide resolved
- Fix names to comply with the [official
  guidelines](https://prometheus.io/docs/practices/naming/#metric-and-label-naming)
  and to better mirror the names of similar timeseries from the
  much-more-popular cAdvisor, when reasonable. And don't use the word
  "svc" to refer to tasks, as it is just not correct.
- Improve `help`s.
- Stop reporting per-CPU usage metrics. They're empirically only
  available in Fargate, but the current collector implementation assumes
  they're available everywhere. (They were previously available in EC2 but
  that stopped being the case when ecs-agent was upgraded to use cgroups
  v2.)  Given that it's not clear why per-CPU numbers are useful in
  general, remove them everywhere instead of exposing disjoint metrics for
  Fargate and EC2. This will also prevent Fargate from potentially
  spontaneously breaking in the same way EC2 did.
- Fix task-level memory limit to actually be in bytes (it previously
  said "bytes" but was in fact MiB).
- Correctly report container-level memory limits in all cases - the
  stats `limit` is nonsense if, as in Fargate, there is no container-level
  limit configured in the task definition. While the right data for all
  cases is hiding in the stats response somewhere, I have instead opted to
  cut out the stats middleman and use the task metadata directly to drive
  this metric. I think it's substantially less likely that ECS fails to
  effect the configured limits upon cgroups correctly than it is that we
  fail to interrogate cgroups output correctly: the latter empirically
  happens with some frequency :^).
- Add metrics concerning Fargate ephemeral storage and task image pull
  timestamps.
- Remove the `task_arn` label on task-level metrics, as it does not
  distinctly identify anything within the instance - the instance is the
  task! Users needing the task ARN in their timeseries labels may do so
  by joining to `ecs_task_metadata_info`.

I have tested these changes both in Fargate and EC2 and they look
correct to me.

Signed-off-by: Ian Kerins <git@isk.haus>
Copy link
Contributor

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thanks!

@SuperQ SuperQ merged commit 64e73a4 into prometheus-community:main Oct 17, 2024
4 checks passed
@isker
Copy link
Contributor Author

isker commented Oct 17, 2024

Thanks for reviewing. We will begin relying on this exporter heavily within the next few months. So if anything ends up being off or incomplete with this overhaul, I will be following up.

If not, this exporter might be close to being feature complete, and we could consider cutting 1.0 after some time.

@isker isker deleted the overhaul branch October 17, 2024 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants