Use upstream ecs-agent types for deserializing API responses
This project rolls its own types for deserializing ECS task metadata and
container stats responses. Maintaining these types can be tedious, as
the documentation of these API endpoints is underspecified, in the sense
that properties included in API responses are broadly available in the
documentation, but their precise types (including whether properties are
optional) are sometimes not. If we roll our own types, we are stuck
reverse-engineering the specifics of the ECS API responses, which is
made more tedious by the fact that these can differ between EC2 and
Fargate.

Good news: rolling our own types is not necessary. The ECS Agent is open
source and written in Go. We can depend on it as a library, purely to
get at the structs that it uses for API responses. We were already doing
a similar thing for docker stats; this is just more comprehensive. (The
ECS Agent project itself depends on the same docker library we were
using for these stats, and includes them in its API responses.)

Switching to these ECS Agent types has revealed situations in which our
types were incorrect, silently relying on implicit type conversions of
JSON values in Go's JSON deserializer. One of these situations resulted
in invalid data being served: ECS tasks on EC2 need not specify
task-level resource limits at all, so these properties are optional in
the JSON response. Our types for them were not pointers, so we
incorrectly reported the derived metrics as zero when they should not
exist at all.
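
A minimal sketch of the optional-limits pitfall described above, using only
encoding/json. The struct names, helper functions, and payload are made up
for illustration; they are simplified stand-ins, not the ECS Agent's actual
types or a recorded API response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// valueTask mimics hand-rolled types with plain values: an absent "Limits"
// object decodes to zeros, indistinguishable from a real limit of zero.
type valueTask struct {
	Limits struct {
		CPU    float64 `json:"CPU"`
		Memory float64 `json:"Memory"`
	} `json:"Limits"`
}

// pointerTask is a hypothetical stand-in for pointer-based types: absent
// properties stay nil, so the caller can skip the metric entirely.
type pointerTask struct {
	Limits *struct {
		CPU    *float64 `json:"CPU,omitempty"`
		Memory *int64   `json:"Memory,omitempty"`
	} `json:"Limits,omitempty"`
}

// decodeValue unmarshals a payload into the value-based shape.
func decodeValue(payload string) (valueTask, error) {
	var t valueTask
	err := json.Unmarshal([]byte(payload), &t)
	return t, err
}

// decodePointer unmarshals a payload into the pointer-based shape.
func decodePointer(payload string) (pointerTask, error) {
	var t pointerTask
	err := json.Unmarshal([]byte(payload), &t)
	return t, err
}

func main() {
	// Illustrative payload for a task without task-level limits.
	const noLimits = `{"TaskARN": "arn:aws:ecs:example"}`

	v, err := decodeValue(noLimits)
	if err != nil {
		panic(err)
	}
	// The value-based shape cannot tell "absent" from "zero".
	fmt.Println("value types report CPU limit:", v.Limits.CPU)

	p, err := decodePointer(noLimits)
	if err != nil {
		panic(err)
	}
	if p.Limits == nil {
		fmt.Println("pointer types: no task-level limits, emit no metric")
	}
}
```

With the pointer-based shape, the collector can check for nil before
emitting a limit metric, instead of exporting a misleading zero.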

The downsides of doing this:
- The ECS Agent project has no detectable Go module version. I am not an
  expert on this, but I think it's related to a single repo containing
  multiple Go modules. I don't think this is a big deal, as the existing
  docker dependency already did not have a Go module version.
- We have to upgrade to Go 1.21, as the ECS Agent project declares that
  it requires it in its go.mod.
- Binary size grows, about 12MB -> 18MB on my laptop. I am surprised
  that simply switching to use these structs blew up binary size so much,
  but I don't think binary size is taken very seriously in the Go
  ecosystem anyway, so I don't think this is a big problem.

I think these downsides are worth it in that, going forward, we can
more reliably develop metrics derived from ECS Agent API responses,
because we're using types that are much more likely to be correct.

I plan to add more metrics, and improve existing ones, using these types
in the future.

I have validated these changes by recording output on EC2 and Fargate
before and after this change here:
https://github.com/isker/ecs-exporter-cdk/tree/master/experiments/use-official-types.
The resulting diffs are as expected.

Signed-off-by: Ian Kerins <git@isk.haus>
isker committed Oct 11, 2024
1 parent 8c4f85e commit 1cbfbab
Showing 6 changed files with 117 additions and 147 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -6,7 +6,7 @@ executors:
   # Whenever the Go version is updated here, .promu.yml should also be updated.
   golang:
     docker:
-      - image: cimg/go:1.20
+      - image: cimg/go:1.21
 jobs:
   test:
     executor: golang
2 changes: 1 addition & 1 deletion .promu.yml
@@ -1,7 +1,7 @@
 go:
     # Whenever the Go version is updated here, .travis.yml and
     # .circle/config.yml should also be updated.
-    version: 1.20
+    version: 1.21
 repository:
     path: github.com/prometheus-community/ecs_exporter
 build:
58 changes: 33 additions & 25 deletions ecscollector/collector.go
@@ -19,6 +19,7 @@ import (
 	"context"
 	"fmt"
 	"log"
+	"time"
 
 	"github.com/prometheus-community/ecs_exporter/ecsmetadata"
 	"github.com/prometheus/client_golang/prometheus"
@@ -179,35 +180,42 @@ func (c *collector) Collect(ch chan<- prometheus.Metric) {
 		metadata.Revision,
 		metadata.DesiredStatus,
 		metadata.KnownStatus,
-		metadata.PullStartedAt,
-		metadata.PullStoppedAt,
+		metadata.PullStartedAt.Format(time.RFC3339Nano),
+		metadata.PullStoppedAt.Format(time.RFC3339Nano),
 		metadata.AvailabilityZone,
 		metadata.LaunchType,
 	)
 
-	ch <- prometheus.MustNewConstMetric(
-		svcCpuLimitDesc,
-		prometheus.GaugeValue,
-		float64(metadata.Limits.CPU),
-		svcLabels...,
-	)
-
-	ch <- prometheus.MustNewConstMetric(
-		svcMemLimitDesc,
-		prometheus.GaugeValue,
-		float64(metadata.Limits.Memory),
-		svcLabels...,
-	)
+	// Task CPU/memory limits are optional when running on EC2 - the relevant
+	// limits may only exist at the container level.
+	if metadata.Limits != nil {
+		if metadata.Limits.CPU != nil {
+			ch <- prometheus.MustNewConstMetric(
+				svcCpuLimitDesc,
+				prometheus.GaugeValue,
+				*metadata.Limits.CPU,
+				metadata.TaskARN,
+			)
+		}
+		if metadata.Limits.Memory != nil {
+			ch <- prometheus.MustNewConstMetric(
+				svcMemLimitDesc,
+				prometheus.GaugeValue,
+				float64(*metadata.Limits.Memory),
+				metadata.TaskARN,
+			)
+		}
+	}
 
 	stats, err := c.client.RetrieveTaskStats(ctx)
 	if err != nil {
 		log.Printf("Failed to retrieve container stats: %v", err)
 		return
 	}
 	for _, container := range metadata.Containers {
-		s := stats[container.DockerID]
+		s := stats[container.ID]
 		if s == nil {
-			log.Printf("Couldn't find container with ID %q in stats", container.DockerID)
+			log.Printf("Couldn't find container with ID %q in stats", container.ID)
 			continue
 		}
 
@@ -248,14 +256,14 @@ func (c *collector) Collect(ch chan<- prometheus.Metric) {
 			networkLabelVals := append(labelVals, iface)
 
 			for desc, value := range map[*prometheus.Desc]float64{
-				networkRxBytesDesc: netStats.RxBytes,
-				networkRxPacketsDesc: netStats.RxPackets,
-				networkRxDroppedDesc: netStats.RxDropped,
-				networkRxErrorsDesc: netStats.RxErrors,
-				networkTxBytesDesc: netStats.TxBytes,
-				networkTxPacketsDesc: netStats.TxPackets,
-				networkTxDroppedDesc: netStats.TxDropped,
-				networkTxErrorsDesc: netStats.TxErrors,
+				networkRxBytesDesc: float64(netStats.RxBytes),
+				networkRxPacketsDesc: float64(netStats.RxPackets),
+				networkRxDroppedDesc: float64(netStats.RxDropped),
+				networkRxErrorsDesc: float64(netStats.RxErrors),
+				networkTxBytesDesc: float64(netStats.TxBytes),
+				networkTxPacketsDesc: float64(netStats.TxPackets),
+				networkTxDroppedDesc: float64(netStats.TxDropped),
+				networkTxErrorsDesc: float64(netStats.TxErrors),
 			} {
 				ch <- prometheus.MustNewConstMetric(
 					desc,
103 changes: 17 additions & 86 deletions ecsmetadata/client.go
@@ -24,7 +24,7 @@ import (
 	"net/url"
 	"os"
 
-	dockertypes "github.com/docker/docker/api/types"
+	tmdsv4 "github.com/aws/amazon-ecs-agent/ecs-agent/tmds/handlers/v4/state"
 )
 
 type Client struct {
@@ -58,14 +58,26 @@ func NewClientFromEnvironment() (*Client, error) {
 	return NewClient(endpoint), nil
 }
 
-func (c *Client) RetrieveTaskStats(ctx context.Context) (map[string]*ContainerStats, error) {
-	out := make(map[string]*ContainerStats)
+func (c *Client) RetrieveTaskStats(ctx context.Context) (map[string]*tmdsv4.StatsResponse, error) {
+	// https://github.com/aws/amazon-ecs-agent/blob/cf8c7a6b65043c550533f330b10aef6d0a342214/agent/handlers/v4/tmdsstate.go#L202
+	out := make(map[string]*tmdsv4.StatsResponse)
 	err := c.request(ctx, c.endpoint+"/task/stats", &out)
 	return out, err
 }
 
-func (c *Client) RetrieveTaskMetadata(ctx context.Context) (*TaskMetadata, error) {
-	var out TaskMetadata
+func (c *Client) RetrieveTaskMetadata(ctx context.Context) (*tmdsv4.TaskResponse, error) {
+	// https://github.com/aws/amazon-ecs-agent/blob/cf8c7a6b65043c550533f330b10aef6d0a342214/agent/handlers/v4/tmdsstate.go#L174
+	//
+	// Note that EC2 and Fargate return slightly different task metadata
+	// responses. At time of writing, as per the documentation, only EC2 has `ServiceName`,
+	// while only Fargate has `EphemeralStorageMetrics`, `ClockDrift`, and
+	// `Containers[].Snapshotter`. Ref:
+	// https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v4-fargate-response.html
+	// https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v4-response.html
+	//
+	// But `TaskResponse` is the _union_ of these two responses. It has all the
+	// fields.
+	var out tmdsv4.TaskResponse
 	err := c.request(ctx, c.endpoint+"/task", &out)
 	return &out, err
 }
@@ -88,84 +100,3 @@ func (c *Client) request(ctx context.Context, uri string, out interface{}) error
 	}
 	return json.Unmarshal(body, out)
 }
-
-type ContainerStats struct {
-	Name string `json:"name"`
-	ID string `json:"id"`
-	NumProcs float64 `json:"num_procs"`
-	Read string `json:"read"`
-	PreRead string `json:"preread"`
-
-	CPUStats dockertypes.CPUStats `json:"cpu_stats"`
-	PreCPUStats dockertypes.CPUStats `json:"precpu_stats"`
-	MemoryStats dockertypes.MemoryStats `json:"memory_stats"`
-	BlkioStats dockertypes.BlkioStats `json:"blkio_stats"`
-
-	Networks map[string]struct {
-		RxBytes float64 `json:"rx_bytes"`
-		RxPackets float64 `json:"rx_packets"`
-		RxErrors float64 `json:"rx_errors"`
-		RxDropped float64 `json:"rx_dropped"`
-		TxBytes float64 `json:"tx_bytes"`
-		TxPackets float64 `json:"tx_packets"`
-		TxErrors float64 `json:"tx_errors"`
-		TxDropped float64 `json:"tx_dropped"`
-	} `json:"networks"`
-
-	NetworkRateStats struct {
-		RxBytesPerSec float64 `json:"rx_bytes_per_sec"`
-		TxBytesPerSec float64 `json:"tx_bytes_per_sec"`
-	} `json:"network_rate_stats"`
-}
-
-type TaskMetadataLimits struct {
-	CPU float64 `json:"CPU"`
-	Memory float64 `json:"Memory"`
-}
-
-type TaskMetadata struct {
-	Cluster string `json:"Cluster"`
-	TaskARN string `json:"TaskARN"`
-	Family string `json:"Family"`
-	Revision string `json:"Revision"`
-	DesiredStatus string `json:"DesiredStatus"`
-	KnownStatus string `json:"KnownStatus"`
-	Limits TaskMetadataLimits `json:"Limits"`
-	PullStartedAt string `json:"PullStartedAt"`
-	PullStoppedAt string `json:"PullStoppedAt"`
-	AvailabilityZone string `json:"AvailabilityZone"`
-	LaunchType string `json:"LaunchType"`
-	Containers []struct {
-		DockerID string `json:"DockerId"`
-		Name string `json:"Name"`
-		DockerName string `json:"DockerName"`
-		Image string `json:"Image"`
-		ImageID string `json:"ImageID"`
-		Labels map[string]string `json:"Labels"`
-		DesiredStatus string `json:"DesiredStatus"`
-		KnownStatus string `json:"KnownStatus"`
-		CreatedAt string `json:"CreatedAt"`
-		StartedAt string `json:"StartedAt"`
-		Type string `json:"Type"`
-		Networks []struct {
-			NetworkMode string `json:"NetworkMode"`
-			IPv4Addresses []string `json:"IPv4Addresses"`
-			IPv6Addresses []string `json:"IPv6Addresses"`
-			AttachmentIndex float64 `json:"AttachmentIndex"`
-			MACAddress string `json:"MACAddress"`
-			IPv4SubnetCIDRBlock string `json:"IPv4SubnetCIDRBlock"`
-			IPv6SubnetCIDRBlock string `json:"IPv6SubnetCIDRBlock"`
-			DomainNameServers []string `json:"DomainNameServers"`
-			DomainNameSearchList []string `json:"DomainNameSearchList"`
-			PrivateDNSName string `json:"PrivateDNSName"`
-			SubnetGatewayIpv4Address string `json:"SubnetGatewayIpv4Address"`
-		} `json:"Networks"`
-		ClockDrift []struct {
-			ClockErrorBound float64 `json:"ClockErrorBound"`
-			ReferenceTimestamp string `json:"ReferenceTimestamp"`
-			ClockSynchronizationStatus string `json:"ClockSynchronizationStatus"`
-		} `json:"ClockDrift"`
-		ContainerARN string `json:"ContainerARN"`
-		LogDriver string `json:"LogDriver"`
-	} `json:"Containers"`
-}
24 changes: 14 additions & 10 deletions go.mod
@@ -1,28 +1,32 @@
 module github.com/prometheus-community/ecs_exporter
 
-go 1.20
+go 1.21
 
 require (
-	github.com/docker/docker v24.0.2+incompatible
+	github.com/aws/amazon-ecs-agent/ecs-agent v0.0.0-20240920192628-cf8c7a6b6504
 	github.com/prometheus/client_golang v1.15.1
 )
 
 require (
+	github.com/aws/aws-sdk-go v1.51.3 // indirect
 	github.com/beorn7/perks v1.0.1 // indirect
 	github.com/cespare/xxhash/v2 v2.2.0 // indirect
+	github.com/cihub/seelog v0.0.0-20170130134532-f561c5e57575 // indirect
+	github.com/docker/docker v24.0.9+incompatible // indirect
 	github.com/docker/go-connections v0.4.0 // indirect
-	github.com/docker/go-units v0.4.0 // indirect
-	github.com/gogo/protobuf v1.1.1 // indirect
-	github.com/golang/protobuf v1.5.3 // indirect
+	github.com/docker/go-units v0.5.0 // indirect
+	github.com/gogo/protobuf v1.3.2 // indirect
+	github.com/golang/protobuf v1.5.4 // indirect
+	github.com/gorilla/mux v1.8.0 // indirect
+	github.com/jmespath/go-jmespath v0.4.0 // indirect
 	github.com/matttproud/golang_protobuf_extensions v1.0.4 // indirect
 	github.com/opencontainers/go-digest v1.0.0 // indirect
-	github.com/opencontainers/image-spec v1.0.2 // indirect
+	github.com/opencontainers/image-spec v1.1.0-rc3 // indirect
 	github.com/pkg/errors v0.9.1 // indirect
-	github.com/pmezard/go-difflib v1.0.0 // indirect
 	github.com/prometheus/client_model v0.3.0 // indirect
 	github.com/prometheus/common v0.42.0 // indirect
 	github.com/prometheus/procfs v0.9.0 // indirect
-	golang.org/x/sys v0.6.0 // indirect
-	google.golang.org/protobuf v1.30.0 // indirect
-	gotest.tools/v3 v3.3.0 // indirect
+	golang.org/x/exp v0.0.0-20231006140011-7918f672742d // indirect
+	golang.org/x/sys v0.25.0 // indirect
+	google.golang.org/protobuf v1.33.0 // indirect
 )