Overhaul all metrics
- Fix names to comply with the [official
  guidelines](https://prometheus.io/docs/practices/naming/#metric-and-label-naming)
  and to better mirror the names of similar timeseries from the
  much-more-popular cAdvisor, when reasonable. And don't use the word
  "svc" to refer to tasks, as it is just not correct.
- Improve `help` strings.
- Stop reporting per-CPU usage metrics. They're empirically only
  available in Fargate, but the current collector implementation assumes
  they're available everywhere. (They were previously available in EC2 but
  that stopped being the case when ecs-agent was upgraded to use cgroups
  v2.) Given that it's not clear why per-CPU numbers are useful in
  general, remove them everywhere instead of exposing disjoint metrics for
  Fargate and EC2. This will also prevent Fargate from potentially
  spontaneously breaking in the same way EC2 did.
- Fix task-level memory limit to actually be in bytes (it previously
  said "bytes" but was in fact MiB).
- Correctly report container-level memory limits in all cases - the
  stats `limit` is nonsense if, as in Fargate, there is no container-level
  limit configured in the task definition. While the right data for all
  cases is hiding in the stats response somewhere, I have instead opted to
  cut out the stats middleman and use the task metadata directly to drive
  this metric. I think it's substantially less likely that ECS fails to
  effect the configured limits upon cgroups correctly than it is that we
  fail to interrogate cgroups output correctly: the latter empirically
  happens with some frequency :^).
- Add metrics concerning Fargate ephemeral storage and task image pull
  timestamps.
- Remove the `task_arn` label on task-level metrics, as it does not
  distinctly identify anything within the instance - the instance is the
  task! Users needing the task ARN in their timeseries labels can get it
  by joining to `ecs_task_metadata_info`.
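
A minimal sketch of the two memory-limit fixes above, written in Python for illustration only. The function names and inputs are hypothetical and are not the exporter's actual code. It assumes the task metadata reports limits in MiB and that an unconfigured limit shows up as `None` or `0`.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mib_to_bytes(mib):
    """Convert a limit expressed in MiB to bytes."""
    return mib * MIB


def container_memory_limit_bytes(container_limit_mib, task_limit_mib):
    """Return the effective container memory limit in bytes, or None.

    Prefer the container-level limit when one is configured; otherwise
    fall back to the task-level limit. Both arguments are hypothetical
    values in MiB, with None or 0 meaning "not configured".
    """
    if container_limit_mib:
        return mib_to_bytes(container_limit_mib)
    if task_limit_mib:
        return mib_to_bytes(task_limit_mib)
    return None


# A Fargate-style task: no container-level limit, 512 MiB at the task level.
print(container_memory_limit_bytes(None, 512))
```

If neither limit is configured, the sketch returns `None` so the caller can skip the metric instead of reporting a meaningless number.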

I have tested these changes both in Fargate and EC2 and they look
correct to me.

Signed-off-by: Ian Kerins <git@isk.haus>
isker committed Oct 17, 2024
1 parent 593ea5f commit 43b57c0
Showing 2 changed files with 215 additions and 232 deletions.
207 changes: 69 additions & 138 deletions README.md
@@ -37,149 +37,80 @@ from App Runner services.

## Labels

* **container**: Container associated with a metric.
* **cpu**: Available to CPU metrics, helps to breakdown metrics by CPU.
* **device**: Network interface device associated with the metric. Only
### On task-level metrics
None.

### On container-level metrics

* **container_name**: Name of the container (as in the ECS task definition) associated with a metric.
* **interface**: Network interface device associated with the metric. Only
available for several network metrics.

## Example output

(With `--web.disable-exporter-metrics` passed, such that standard Go metrics are not included here.)

```
# HELP ecs_cpu_seconds_total Total CPU usage in seconds.
# TYPE ecs_cpu_seconds_total counter
ecs_cpu_seconds_total{container="ecs-metadata-proxy",cpu="0"} 1.746774278e+08
ecs_cpu_seconds_total{container="ecs-metadata-proxy",cpu="1"} 1.7417992266e+08
# HELP ecs_memory_bytes Memory usage in bytes.
# TYPE ecs_memory_bytes gauge
ecs_memory_bytes{container="ecs-metadata-proxy"} 4.440064e+06
# HELP ecs_memory_limit_bytes Memory limit in bytes.
# TYPE ecs_memory_limit_bytes gauge
ecs_memory_limit_bytes{container="ecs-metadata-proxy"} 9.223372036854772e+18
# HELP ecs_memory_max_bytes Maximum memory usage in bytes.
# TYPE ecs_memory_max_bytes gauge
ecs_memory_max_bytes{container="ecs-metadata-proxy"} 9.023488e+06
# HELP ecs_network_receive_bytes_total Network received in bytes.
# TYPE ecs_network_receive_bytes_total counter
ecs_network_receive_bytes_total{container="ecs-metadata-proxy",device="eth1"} 4.2851757e+07
# HELP ecs_network_receive_dropped_total Network packets dropped in receiving.
# TYPE ecs_network_receive_dropped_total counter
ecs_network_receive_dropped_total{container="ecs-metadata-proxy",device="eth1"} 0
# HELP ecs_network_receive_errors_total Network errors in receiving.
# TYPE ecs_network_receive_errors_total counter
ecs_network_receive_errors_total{container="ecs-metadata-proxy",device="eth1"} 0
# HELP ecs_network_receive_packets_total Network packets received.
# TYPE ecs_network_receive_packets_total counter
ecs_network_receive_packets_total{container="ecs-metadata-proxy",device="eth1"} 516239
# HELP ecs_network_transmit_bytes_total Network transmitted in bytes.
# TYPE ecs_network_transmit_bytes_total counter
ecs_network_transmit_bytes_total{container="ecs-metadata-proxy",device="eth1"} 1.28412758e+08
# HELP ecs_network_transmit_dropped_total Network packets dropped in transmit.
# TYPE ecs_network_transmit_dropped_total counter
ecs_network_transmit_dropped_total{container="ecs-metadata-proxy",device="eth1"} 0
# HELP ecs_network_transmit_errors_total Network errors in transmit.
# TYPE ecs_network_transmit_errors_total counter
ecs_network_transmit_errors_total{container="ecs-metadata-proxy",device="eth1"} 0
# HELP ecs_network_transmit_packets_total Network packets transmitted.
# TYPE ecs_network_transmit_packets_total counter
ecs_network_transmit_packets_total{container="ecs-metadata-proxy",device="eth1"} 429472
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.16.3"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 595760
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 595760
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 4092
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 123
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 3.97448e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 595760
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 6.508544e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.59744e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 2439
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 6.508544e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.668288e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 2562
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 9600
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 37400
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 49152
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.473924e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 497348
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 425984
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 425984
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.165032e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 7
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
# HELP ecs_container_cpu_usage_seconds_total Cumulative total container CPU usage in seconds.
# TYPE ecs_container_cpu_usage_seconds_total counter
ecs_container_cpu_usage_seconds_total{container_name="ecs-exporter"} 0.027095748000000003
# HELP ecs_container_memory_limit_bytes Configured container memory limit in bytes, set from the container-level limit in the task definition if any, otherwise the task-level limit.
# TYPE ecs_container_memory_limit_bytes gauge
ecs_container_memory_limit_bytes{container_name="ecs-exporter"} 5.36870912e+08
# HELP ecs_container_memory_page_cache_size_bytes Current container memory page cache size in bytes. This is not a subset of used bytes.
# TYPE ecs_container_memory_page_cache_size_bytes gauge
ecs_container_memory_page_cache_size_bytes{container_name="ecs-exporter"} 0
# HELP ecs_container_memory_usage_bytes Current container memory usage in bytes.
# TYPE ecs_container_memory_usage_bytes gauge
ecs_container_memory_usage_bytes{container_name="ecs-exporter"} 4.452352e+06
# HELP ecs_container_network_receive_bytes_total Cumulative total size of container network packets received in bytes.
# TYPE ecs_container_network_receive_bytes_total counter
ecs_container_network_receive_bytes_total{container_name="ecs-exporter",interface="eth1"} 1.1112267e+07
# HELP ecs_container_network_receive_errors_total Cumulative total count of container network errors in receiving.
# TYPE ecs_container_network_receive_errors_total counter
ecs_container_network_receive_errors_total{container_name="ecs-exporter",interface="eth1"} 0
# HELP ecs_container_network_receive_packets_dropped_total Cumulative total count of container network packets dropped in receiving.
# TYPE ecs_container_network_receive_packets_dropped_total counter
ecs_container_network_receive_packets_dropped_total{container_name="ecs-exporter",interface="eth1"} 0
# HELP ecs_container_network_receive_packets_total Cumulative total count of container network packets received.
# TYPE ecs_container_network_receive_packets_total counter
ecs_container_network_receive_packets_total{container_name="ecs-exporter",interface="eth1"} 8039
# HELP ecs_container_network_transmit_bytes_total Cumulative total size of container network packets transmitted in bytes.
# TYPE ecs_container_network_transmit_bytes_total counter
ecs_container_network_transmit_bytes_total{container_name="ecs-exporter",interface="eth1"} 165338
# HELP ecs_container_network_transmit_dropped_total Cumulative total count of container network packets dropped in transmit.
# TYPE ecs_container_network_transmit_dropped_total counter
ecs_container_network_transmit_dropped_total{container_name="ecs-exporter",interface="eth1"} 0
# HELP ecs_container_network_transmit_errors_total Cumulative total count of container network errors in transmit.
# TYPE ecs_container_network_transmit_errors_total counter
ecs_container_network_transmit_errors_total{container_name="ecs-exporter",interface="eth1"} 0
# HELP ecs_container_network_transmit_packets_total Cumulative total count of container network packets transmitted.
# TYPE ecs_container_network_transmit_packets_total counter
ecs_container_network_transmit_packets_total{container_name="ecs-exporter",interface="eth1"} 713
# HELP ecs_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which ecs_exporter was built, and the goos and goarch for the build.
# TYPE ecs_exporter_build_info gauge
ecs_exporter_build_info{branch="",goarch="arm64",goos="linux",goversion="go1.23.2",revision="unknown",tags="unknown",version=""} 1
# HELP ecs_task_cpu_limit_vcpus Configured task CPU limit in vCPUs (1 vCPU = 1024 CPU units). This is optional when running on EC2; if no limit is set, this metric has no value.
# TYPE ecs_task_cpu_limit_vcpus gauge
ecs_task_cpu_limit_vcpus 0.25
# HELP ecs_task_ephemeral_storage_allocated_bytes Configured Fargate task ephemeral storage allocated size in bytes.
# TYPE ecs_task_ephemeral_storage_allocated_bytes gauge
ecs_task_ephemeral_storage_allocated_bytes 2.1491613696e+10
# HELP ecs_task_ephemeral_storage_used_bytes Current Fargate task ephemeral storage usage in bytes.
# TYPE ecs_task_ephemeral_storage_used_bytes gauge
ecs_task_ephemeral_storage_used_bytes 3.7748736e+07
# HELP ecs_task_image_pull_start_timestamp_seconds The time at which the task started pulling docker images for its containers.
# TYPE ecs_task_image_pull_start_timestamp_seconds gauge
ecs_task_image_pull_start_timestamp_seconds 1.7291179014941156e+09
# HELP ecs_task_image_pull_stop_timestamp_seconds The time at which the task stopped (i.e. completed) pulling docker images for its containers.
# TYPE ecs_task_image_pull_stop_timestamp_seconds gauge
ecs_task_image_pull_stop_timestamp_seconds 1.7291179144469e+09
# HELP ecs_task_memory_limit_bytes Configured task memory limit in bytes. This is optional when running on EC2; if no limit is set, this metric has no value.
# TYPE ecs_task_memory_limit_bytes gauge
ecs_task_memory_limit_bytes 5.36870912e+08
# HELP ecs_task_metadata_info ECS task metadata, sourced from the task metadata endpoint version 4.
# TYPE ecs_task_metadata_info gauge
ecs_task_metadata_info{availability_zone="us-east-1a",cluster="arn:aws:ecs:us-east-1:829490980523:cluster/prom-ecs-exporter-sandbox",desired_status="RUNNING",family="prom-ecs-exporter-sandbox-isker-fargate",known_status="RUNNING",launch_type="FARGATE",revision="11",task_arn="arn:aws:ecs:us-east-1:829490980523:task/prom-ecs-exporter-sandbox/0c7f6b0414dc47d0a15019a099cd919b"} 1
```
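
The `task_arn` label on `ecs_task_metadata_info` can be attached to the other series with a join. In PromQL this would look roughly like `ecs_task_memory_limit_bytes * on (instance) group_left (task_arn) ecs_task_metadata_info`, assuming both series share an `instance` label. The Python sketch below imitates that join on plain dictionaries. The sample shape, the function name, and the label values are all hypothetical.

```python
def attach_task_arn(samples, info_samples, on="instance"):
    """Copy task_arn from info samples onto samples sharing the `on` label.

    Samples with no matching info sample are dropped, which mirrors how
    an unmatched series disappears from a PromQL vector match.
    """
    arn_by_key = {
        s["labels"][on]: s["labels"]["task_arn"] for s in info_samples
    }
    joined = []
    for s in samples:
        key = s["labels"].get(on)
        if key in arn_by_key:
            labels = dict(s["labels"], task_arn=arn_by_key[key])
            # The info metric's value is 1, so multiplying leaves the value unchanged.
            joined.append({"name": s["name"], "labels": labels, "value": s["value"]})
    return joined


# Hypothetical scrape results for one exporter instance.
info = [{
    "name": "ecs_task_metadata_info",
    "labels": {"instance": "10.0.0.1:9779", "task_arn": "arn:aws:ecs:example:task/abc"},
    "value": 1.0,
}]
limits = [{
    "name": "ecs_task_memory_limit_bytes",
    "labels": {"instance": "10.0.0.1:9779"},
    "value": 536870912.0,
}]
print(attach_task_arn(limits, info))
```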

## Example task definition