Overhaul all metrics #81
Conversation
Thanks, looking good so far.
- Fix names to comply with the [official guidelines](https://prometheus.io/docs/practices/naming/#metric-and-label-naming) and to better mirror the names of similar timeseries from the much-more-popular cAdvisor, where reasonable. Also stop using the word "svc" to refer to tasks, as it is simply not correct.
- Improve `help` strings.
- Stop reporting per-CPU usage metrics. They are empirically only available in Fargate, but the current collector implementation assumes they are available everywhere. (They were previously available in EC2, but that stopped being the case when ecs-agent was upgraded to use cgroups v2.) Given that it is not clear why per-CPU numbers are useful in general, remove them everywhere instead of exposing disjoint metrics for Fargate and EC2. This also prevents Fargate from potentially breaking spontaneously in the same way EC2 did.
- Fix the task-level memory limit to actually be in bytes (it previously said "bytes" but was in fact MiB).
- Correctly report container-level memory limits in all cases: the stats `limit` is nonsense if, as in Fargate, there is no container-level limit configured in the task definition. While the right data for all cases is hiding somewhere in the stats response, I have instead opted to cut out the stats middleman and use the task metadata directly to drive this metric (see the sketch after this list). I think it is substantially less likely that ECS fails to effect the configured limits upon cgroups than it is that we fail to interrogate cgroups output correctly: the latter empirically happens with some frequency :^).
- Add metrics concerning Fargate ephemeral storage and task image pull timestamps.
- Remove the `task_arn` label on task-level metrics, as it does not distinctly identify anything within the instance: the instance is the task! Users needing the task ARN in their timeseries labels may obtain it by joining to `ecs_task_metadata_info` (see the query example after this list).

I have tested these changes in both Fargate and EC2, and they look correct to me.

Closes #70 (by way of obsoleting it).
Closes #74.
Closes #69.
Closes #35.
Closes #16 (as far as I can tell).

Signed-off-by: Ian Kerins <git@isk.haus>
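As a minimal sketch of the metadata-driven approach described above: the ECS task metadata endpoint reports each container's configured limits directly, so a collector can derive the limit metric without touching the stats `limit` field. The names below (`taskMetadata`, `memLimitBytes`, etc.) are illustrative, not the exporter's actual code; the response shape follows the public task metadata v4 documentation, which reports per-container `Limits.Memory` in MiB.

```go
// Illustrative sketch only, not the exporter's implementation.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// container mirrors the subset of the v4 /task response needed here.
type container struct {
	Name   string `json:"Name"`
	Limits struct {
		Memory int64 `json:"Memory"` // MiB per the metadata docs
	} `json:"Limits"`
}

type taskMetadata struct {
	Containers []container `json:"Containers"`
}

func main() {
	// ECS injects the metadata base URI into every container's environment.
	base := os.Getenv("ECS_CONTAINER_METADATA_URI_V4")
	if base == "" {
		log.Fatal("not running under ECS: ECS_CONTAINER_METADATA_URI_V4 unset")
	}
	resp, err := http.Get(base + "/task")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var md taskMetadata
	if err := json.NewDecoder(resp.Body).Decode(&md); err != nil {
		log.Fatal(err)
	}
	for _, c := range md.Containers {
		if c.Limits.Memory == 0 {
			// No container-level limit configured in the task definition
			// (the Fargate case): emit no limit metric rather than nonsense.
			continue
		}
		// The metadata reports MiB; the metric is defined in bytes.
		memLimitBytes := c.Limits.Memory * 1024 * 1024
		fmt.Printf("container=%s mem_limit_bytes=%d\n", c.Name, memLimitBytes)
	}
}
```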
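And a hedged example of the join mentioned in the last bullet: assuming `ecs_task_metadata_info` is a constant-1 info metric that shares the `instance` label with the task-level series, a PromQL join of this shape attaches `task_arn` to a task-level metric. The metric name on the left is hypothetical.

```promql
# group_left copies the task_arn label from the info metric onto the result;
# multiplying by a constant-1 series leaves the value unchanged.
ecs_task_memory_limit_bytes
  * on (instance) group_left (task_arn)
    ecs_task_metadata_info
```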
Great work, thanks!
Thanks for reviewing. We will begin relying on this exporter heavily within the next few months, so if anything ends up being off or incomplete with this overhaul, I will follow up. If not, this exporter might be close to feature complete, and we could consider cutting a 1.0 after some time.