Overhaul all metrics
- Fix names to comply with the [official
  guidelines](https://prometheus.io/docs/practices/naming/#metric-and-label-naming)
  and to better mirror the names of similar timeseries from the
  much-more-popular cAdvisor, when reasonable. And don't use the word
  "svc" to refer to tasks, as it is just not correct.
- Improve `help`s.
- Stop reporting per-CPU usage metrics. They're empirically only
  available in Fargate, but the current collector implementation assumes
  they're available everywhere. (They were previously available in EC2 but
  that stopped being the case when ecs-agent was upgraded to use cgroups
  v2.)  Given that it's not clear why per-CPU numbers are useful in
  general, remove them everywhere instead of exposing disjoint metrics for
  Fargate and EC2. This will also prevent Fargate from potentially
  spontaneously breaking in the same way EC2 did.
- Fix task-level memory limit to actually be in bytes (it previously
  said "bytes" but was in fact MiB).
- Correctly report container-level memory limits in all cases - the
  stats `limit` is nonsense if, as in Fargate, there is no container-level
  limit configured in the task definition. While the right data for all
  cases is hiding in the stats response somewhere, I have instead opted to
  cut out the stats middleman and use the task metadata directly to drive
  this metric. I think it's substantially less likely that ECS fails to
  effect the configured limits upon cgroups correctly than it is that we
  fail to interrogate cgroups output correctly: the latter empirically
  happens with some frequency :^).
- Add metrics concerning Fargate ephemeral storage and task image pull
  timestamps.
- Remove the `task_arn` label on task-level metrics, as it does not
  distinctly identify anything within the instance - the instance is the
  task! Users who need the task ARN in their timeseries labels can get it
  by joining to `ecs_task_metadata_info`.
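
To illustrate the two memory-limit bullets above, here is a minimal Go sketch of how a container-level limit in bytes might be derived from task metadata. It is not the collector's actual code: the function name and the plain integer parameters are hypothetical, and it assumes both limits arrive in MiB, which this commit only confirms for the task-level value.

```go
package main

import "fmt"

// bytesPerMiB converts MiB (the assumed unit of the metadata limits) to bytes.
const bytesPerMiB = 1024 * 1024

// containerMemoryLimitBytes is a hypothetical helper. It prefers the
// container-level limit from the task definition and falls back to the
// task-level limit. A value of 0 is treated as "not configured"; ok is
// false when neither limit is set, in which case no sample should be emitted.
func containerMemoryLimitBytes(containerLimitMiB, taskLimitMiB int64) (limit int64, ok bool) {
	switch {
	case containerLimitMiB > 0:
		return containerLimitMiB * bytesPerMiB, true
	case taskLimitMiB > 0:
		return taskLimitMiB * bytesPerMiB, true
	default:
		return 0, false
	}
}

func main() {
	// Fargate-style case: no container-level limit, 512 MiB task-level limit.
	limit, ok := containerMemoryLimitBytes(0, 512)
	fmt.Println(limit, ok) // 536870912 true
}
```

The 512 MiB case yields 536870912 bytes, the same figure as the `5.36870912e+08` shown for `ecs_container_memory_limit_bytes` in the README's example output.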

I have tested these changes both in Fargate and EC2 and they look
correct to me.

Signed-off-by: Ian Kerins <git@isk.haus>
isker committed Oct 16, 2024
1 parent 593ea5f commit 5c8ca62
Showing 2 changed files with 295 additions and 181 deletions.
241 changes: 149 additions & 92 deletions README.md
@@ -37,52 +37,79 @@ from App Runner services.

## Labels

* **container**: Container associated with a metric.
* **cpu**: Available to CPU metrics, helps to breakdown metrics by CPU.
* **device**: Network interface device associated with the metric. Only
### On task-level metrics
None.

### On container-level metrics

* **container_name**: Name of the container (as in the ECS task definition) associated with a metric.
* **interface**: Network interface device associated with the metric. Only
available for several network metrics.
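
Because task-level metrics carry no labels, a consumer that wants the task ARN on them has to copy it over from `ecs_task_metadata_info`. In Prometheus this is normally done with a query-time join. The Go sketch below only mimics the effect on in-memory samples, to show what the join produces. The `Sample` type and `withTaskARN` function are hypothetical and not part of the exporter.

```go
package main

import "fmt"

// Sample is a hypothetical in-memory form of one scraped timeseries.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// withTaskARN returns copies of the samples with a task_arn label taken from
// the ecs_task_metadata_info sample. If that sample is absent, the copies
// keep their original labels.
func withTaskARN(samples []Sample) []Sample {
	arn := ""
	for _, s := range samples {
		if s.Name == "ecs_task_metadata_info" {
			arn = s.Labels["task_arn"]
			break
		}
	}
	out := make([]Sample, 0, len(samples))
	for _, s := range samples {
		// Copy the label map so the input samples are not mutated.
		labels := make(map[string]string, len(s.Labels)+1)
		for k, v := range s.Labels {
			labels[k] = v
		}
		if arn != "" {
			labels["task_arn"] = arn
		}
		out = append(out, Sample{Name: s.Name, Labels: labels, Value: s.Value})
	}
	return out
}

func main() {
	samples := []Sample{
		{Name: "ecs_task_metadata_info", Labels: map[string]string{"task_arn": "arn:aws:ecs:example"}, Value: 1},
		{Name: "ecs_task_memory_limit_bytes", Value: 536870912},
	}
	for _, s := range withTaskARN(samples) {
		fmt.Println(s.Name, s.Labels["task_arn"])
	}
}
```

Keeping the ARN on a single info metric means every other task-level series stays label-free, and the ARN is attached only where a query asks for it.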

## Example output

```
# HELP ecs_cpu_seconds_total Total CPU usage in seconds.
# TYPE ecs_cpu_seconds_total counter
ecs_cpu_seconds_total{container="ecs-metadata-proxy",cpu="0"} 1.746774278e+08
ecs_cpu_seconds_total{container="ecs-metadata-proxy",cpu="1"} 1.7417992266e+08
# HELP ecs_memory_bytes Memory usage in bytes.
# TYPE ecs_memory_bytes gauge
ecs_memory_bytes{container="ecs-metadata-proxy"} 4.440064e+06
# HELP ecs_memory_limit_bytes Memory limit in bytes.
# TYPE ecs_memory_limit_bytes gauge
ecs_memory_limit_bytes{container="ecs-metadata-proxy"} 9.223372036854772e+18
# HELP ecs_memory_max_bytes Maximum memory usage in bytes.
# TYPE ecs_memory_max_bytes gauge
ecs_memory_max_bytes{container="ecs-metadata-proxy"} 9.023488e+06
# HELP ecs_network_receive_bytes_total Network received in bytes.
# TYPE ecs_network_receive_bytes_total counter
ecs_network_receive_bytes_total{container="ecs-metadata-proxy",device="eth1"} 4.2851757e+07
# HELP ecs_network_receive_dropped_total Network packets dropped in receiving.
# TYPE ecs_network_receive_dropped_total counter
ecs_network_receive_dropped_total{container="ecs-metadata-proxy",device="eth1"} 0
# HELP ecs_network_receive_errors_total Network errors in receiving.
# TYPE ecs_network_receive_errors_total counter
ecs_network_receive_errors_total{container="ecs-metadata-proxy",device="eth1"} 0
# HELP ecs_network_receive_packets_total Network packets received.
# TYPE ecs_network_receive_packets_total counter
ecs_network_receive_packets_total{container="ecs-metadata-proxy",device="eth1"} 516239
# HELP ecs_network_transmit_bytes_total Network transmitted in bytes.
# TYPE ecs_network_transmit_bytes_total counter
ecs_network_transmit_bytes_total{container="ecs-metadata-proxy",device="eth1"} 1.28412758e+08
# HELP ecs_network_transmit_dropped_total Network packets dropped in transmit.
# TYPE ecs_network_transmit_dropped_total counter
ecs_network_transmit_dropped_total{container="ecs-metadata-proxy",device="eth1"} 0
# HELP ecs_network_transmit_errors_total Network errors in transmit.
# TYPE ecs_network_transmit_errors_total counter
ecs_network_transmit_errors_total{container="ecs-metadata-proxy",device="eth1"} 0
# HELP ecs_network_transmit_packets_total Network packets transmitted.
# TYPE ecs_network_transmit_packets_total counter
ecs_network_transmit_packets_total{container="ecs-metadata-proxy",device="eth1"} 429472
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# HELP ecs_container_cpu_usage_seconds_total Cumulative total container CPU usage in seconds.
# TYPE ecs_container_cpu_usage_seconds_total counter
ecs_container_cpu_usage_seconds_total{container_name="ecs-exporter"} 0.027095748000000003
# HELP ecs_container_memory_limit_bytes Configured container memory limit in bytes, set from the container-level limit in the task definition if any, otherwise the task-level limit.
# TYPE ecs_container_memory_limit_bytes gauge
ecs_container_memory_limit_bytes{container_name="ecs-exporter"} 5.36870912e+08
# HELP ecs_container_memory_page_cache_size_bytes Current container memory page cache size in bytes. This is not a subset of used bytes.
# TYPE ecs_container_memory_page_cache_size_bytes gauge
ecs_container_memory_page_cache_size_bytes{container_name="ecs-exporter"} 0
# HELP ecs_container_memory_usage_bytes Current container memory usage in bytes.
# TYPE ecs_container_memory_usage_bytes gauge
ecs_container_memory_usage_bytes{container_name="ecs-exporter"} 4.452352e+06
# HELP ecs_container_network_receive_bytes_total Cumulative total size of container network packets received in bytes.
# TYPE ecs_container_network_receive_bytes_total counter
ecs_container_network_receive_bytes_total{container_name="ecs-exporter",interface="eth1"} 1.1112267e+07
# HELP ecs_container_network_receive_errors_total Cumulative total count of container network errors in receiving.
# TYPE ecs_container_network_receive_errors_total counter
ecs_container_network_receive_errors_total{container_name="ecs-exporter",interface="eth1"} 0
# HELP ecs_container_network_receive_packets_dropped_total Cumulative total count of container network packets dropped in receiving.
# TYPE ecs_container_network_receive_packets_dropped_total counter
ecs_container_network_receive_packets_dropped_total{container_name="ecs-exporter",interface="eth1"} 0
# HELP ecs_container_network_receive_packets_total Cumulative total count of container network packets received.
# TYPE ecs_container_network_receive_packets_total counter
ecs_container_network_receive_packets_total{container_name="ecs-exporter",interface="eth1"} 8039
# HELP ecs_container_network_transmit_bytes_total Cumulative total size of container network packets transmitted in bytes.
# TYPE ecs_container_network_transmit_bytes_total counter
ecs_container_network_transmit_bytes_total{container_name="ecs-exporter",interface="eth1"} 165338
# HELP ecs_container_network_transmit_dropped_total Cumulative total count of container network packets dropped in transmit.
# TYPE ecs_container_network_transmit_dropped_total counter
ecs_container_network_transmit_dropped_total{container_name="ecs-exporter",interface="eth1"} 0
# HELP ecs_container_network_transmit_errors_total Cumulative total count of container network errors in transmit.
# TYPE ecs_container_network_transmit_errors_total counter
ecs_container_network_transmit_errors_total{container_name="ecs-exporter",interface="eth1"} 0
# HELP ecs_container_network_transmit_packets_total Cumulative total count of container network packets transmitted.
# TYPE ecs_container_network_transmit_packets_total counter
ecs_container_network_transmit_packets_total{container_name="ecs-exporter",interface="eth1"} 713
# HELP ecs_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which ecs_exporter was built, and the goos and goarch for the build.
# TYPE ecs_exporter_build_info gauge
ecs_exporter_build_info{branch="",goarch="arm64",goos="linux",goversion="go1.23.2",revision="unknown",tags="unknown",version=""} 1
# HELP ecs_task_cpu_limit_vcpus Configured task CPU limit in vCPUs (1 vCPU = 1024 CPU units). This is optional when running on EC2; if no limit is set, this metric has no value.
# TYPE ecs_task_cpu_limit_vcpus gauge
ecs_task_cpu_limit_vcpus 0.25
# HELP ecs_task_ephemeral_storage_allocated_bytes Configured Fargate task ephemeral storage allocated size in bytes.
# TYPE ecs_task_ephemeral_storage_allocated_bytes gauge
ecs_task_ephemeral_storage_allocated_bytes 2.1491613696e+10
# HELP ecs_task_ephemeral_storage_used_bytes Current Fargate task ephemeral storage usage in bytes.
# TYPE ecs_task_ephemeral_storage_used_bytes gauge
ecs_task_ephemeral_storage_used_bytes 3.7748736e+07
# HELP ecs_task_image_pull_start_timestamp_seconds The time at which the task started pulling docker images for its containers.
# TYPE ecs_task_image_pull_start_timestamp_seconds gauge
ecs_task_image_pull_start_timestamp_seconds 1.7291179014941156e+09
# HELP ecs_task_image_pull_stop_timestamp_seconds The time at which the task stopped (i.e. completed) pulling docker images for its containers.
# TYPE ecs_task_image_pull_stop_timestamp_seconds gauge
ecs_task_image_pull_stop_timestamp_seconds 1.7291179144469e+09
# HELP ecs_task_memory_limit_bytes Configured task memory limit in bytes. This is optional when running on EC2; if no limit is set, this metric has no value.
# TYPE ecs_task_memory_limit_bytes gauge
ecs_task_memory_limit_bytes 5.36870912e+08
# HELP ecs_task_metadata_info ECS task metadata, sourced from the task metadata endpoint version 4.
# TYPE ecs_task_metadata_info gauge
ecs_task_metadata_info{availability_zone="us-east-1a",cluster="arn:aws:ecs:us-east-1:829490980523:cluster/prom-ecs-exporter-sandbox",desired_status="RUNNING",family="prom-ecs-exporter-sandbox-isker-fargate",known_status="RUNNING",launch_type="FARGATE",pull_started_at="2024-10-16T22:31:41.494115693Z",pull_stopped_at="2024-10-16T22:31:54.446899683Z",revision="11",task_arn="arn:aws:ecs:us-east-1:829490980523:task/prom-ecs-exporter-sandbox/0c7f6b0414dc47d0a15019a099cd919b"} 1
# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
@@ -91,87 +118,117 @@ go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_gc_gogc_percent Heap size target percentage configured by the user, otherwise 100. This value is set by the GOGC environment variable, and the runtime/debug.SetGCPercent function. Sourced from /gc/gogc:percent
# TYPE go_gc_gogc_percent gauge
go_gc_gogc_percent 100
# HELP go_gc_gomemlimit_bytes Go runtime memory limit configured by the user, otherwise math.MaxInt64. This value is set by the GOMEMLIMIT environment variable, and the runtime/debug.SetMemoryLimit function. Sourced from /gc/gomemlimit:bytes
# TYPE go_gc_gomemlimit_bytes gauge
go_gc_gomemlimit_bytes 9.223372036854776e+18
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8
go_goroutines 9
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.16.3"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
go_info{version="go1.23.2"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated in heap and currently in use. Equals to /memory/classes/heap/objects:bytes.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 595760
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
go_memstats_alloc_bytes 2.38768e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated in heap until now, even if released already. Equals to /gc/heap/allocs:bytes.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 595760
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
go_memstats_alloc_bytes_total 2.38768e+06
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. Equals to /memory/classes/profiling/buckets:bytes.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 4092
# HELP go_memstats_frees_total Total number of frees.
go_memstats_buck_hash_sys_bytes 4772
# HELP go_memstats_frees_total Total number of heap objects frees. Equals to /gc/heap/frees:objects + /gc/heap/tiny/allocs:objects.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 123
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
go_memstats_frees_total 237
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. Equals to /memory/classes/metadata/other:bytes.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 3.97448e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
go_memstats_gc_sys_bytes 1.595176e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and currently in use, same as go_memstats_alloc_bytes. Equals to /memory/classes/heap/objects:bytes.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 595760
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
go_memstats_heap_alloc_bytes 2.38768e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. Equals to /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 6.508544e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
go_memstats_heap_idle_bytes 3.801088e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.59744e+06
# HELP go_memstats_heap_objects Number of allocated objects.
go_memstats_heap_inuse_bytes 4.030464e+06
# HELP go_memstats_heap_objects Number of currently allocated objects. Equals to /gc/heap/objects:objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 2439
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
go_memstats_heap_objects 13702
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. Equals to /memory/classes/heap/released:bytes.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 6.508544e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
go_memstats_heap_released_bytes 3.801088e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes + /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.668288e+07
go_memstats_heap_sys_bytes 7.831552e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# HELP go_memstats_mallocs_total Total number of heap objects allocated, both live and gc-ed. Semantically a counter version for go_memstats_heap_objects gauge. Equals to /gc/heap/allocs:objects + /gc/heap/tiny/allocs:objects.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 2562
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
go_memstats_mallocs_total 13939
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. Equals to /memory/classes/metadata/mcache/inuse:bytes.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 9600
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
go_memstats_mcache_inuse_bytes 2400
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. Equals to /memory/classes/metadata/mcache/inuse:bytes + /memory/classes/metadata/mcache/free:bytes.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
go_memstats_mcache_sys_bytes 15600
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. Equals to /memory/classes/metadata/mspan/inuse:bytes.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 37400
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
go_memstats_mspan_inuse_bytes 74720
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. Equals to /memory/classes/metadata/mspan/inuse:bytes + /memory/classes/metadata/mspan/free:bytes.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 49152
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
go_memstats_mspan_sys_bytes 81600
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. Equals to /gc/heap/goal:bytes.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.473924e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. Equals to /memory/classes/other:bytes.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 497348
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
go_memstats_other_sys_bytes 587412
# HELP go_memstats_stack_inuse_bytes Number of bytes obtained from system for stack allocator in non-CGO environments. Equals to /memory/classes/heap/stacks:bytes.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 425984
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
go_memstats_stack_inuse_bytes 524288
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. Equals to /memory/classes/heap/stacks:bytes + /memory/classes/os-stacks:bytes.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 425984
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
go_memstats_stack_sys_bytes 524288
# HELP go_memstats_sys_bytes Number of bytes obtained from system. Equals to /memory/classes/total:byte.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.165032e+07
go_memstats_sys_bytes 1.06404e+07
# HELP go_sched_gomaxprocs_threads The current runtime.GOMAXPROCS setting, or the number of operating system threads that can execute user-level Go code simultaneously. Sourced from /sched/gomaxprocs:threads
# TYPE go_sched_gomaxprocs_threads gauge
go_sched_gomaxprocs_threads 2
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 7
go_threads 5
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.02
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 65535
# HELP process_network_receive_bytes_total Number of bytes received by the process over the network.
# TYPE process_network_receive_bytes_total counter
process_network_receive_bytes_total 1.0833544e+07
# HELP process_network_transmit_bytes_total Number of bytes sent by the process over the network.
# TYPE process_network_transmit_bytes_total counter
process_network_transmit_bytes_total 153323
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 8
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.6584704e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72911791496e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.269272576e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1