Remove the constraint of queue data points #2335

Closed · wants to merge 1 commit

Conversation

@yumex93 (Contributor) commented Jan 12, 2020

Summary

Currently, this is how the agent works:

  1. By default, the agent publishes metrics to TACS every 20s, and the default interval for polling Docker container stats is 15s.
  2. After the agent polls container stats, it pushes a data point onto a queue. At least two data points must be in the queue before the data sent to TACS can be calculated (see the sketch after this list): https://github.com/aws/amazon-ecs-agent/blob/master/agent/stats/engine.go#L597.
  3. The agent does not send task metrics to TACS if there is no data for that task.
  4. TACS uses the task metrics to decide which tasks are running on the instance and uses this to calculate reservation data.
    Thus, if ECS_POLL_METRICS is enabled and the polling interval is greater than or equal to 10s, there will sometimes not be enough data points in the queue, so the task metrics will not be sent to TACS even though the task exists on the instance. The reservation metrics are not accurate in this situation.
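
To make step 2 concrete, here is a minimal Go sketch of the two-data-point constraint. The statsQueue type, the readyToPublish method, and the minDatapoints constant are hypothetical names used only for illustration; the real check lives at the engine.go line linked above and may differ in detail.

```go
// Minimal illustrative sketch; names here are hypothetical and do not mirror
// the actual agent/stats/engine.go implementation.
package main

import "fmt"

// minDatapoints is the current constraint: at least two data points are
// required before a task's metrics are computed and sent to TACS.
const minDatapoints = 2

type statsQueue struct {
	datapoints []float64
}

// readyToPublish reports whether the queue holds enough data points to
// compute the delta-based metrics that are sent to TACS.
func (q *statsQueue) readyToPublish() bool {
	return len(q.datapoints) >= minDatapoints
}

func main() {
	// With ECS_POLL_METRICS enabled and a poll interval of 10s or more, a 20s
	// publish window can contain only a single poll, so the queue holds one
	// data point and the task's metrics are silently dropped.
	q := &statsQueue{datapoints: []float64{42.0}}
	fmt.Println("ready to publish:", q.readyToPublish()) // false
}
```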

Implementation details

  1. Updated the constraint that the queue needs at least two data points before processing it (a rough sketch of the change follows below). Since the TACS side aggregates all data points within one minute, relaxing the constraint will not affect the accuracy of the CW metrics.
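
A rough before/after sketch of the relaxed check, reusing the same hypothetical statsQueue shape as in the summary sketch; the actual change is made in agent/stats/engine.go and may differ in detail.

```go
// Illustrative only; readyBefore/readyAfter are hypothetical helpers showing
// the intent of the change, not the agent's real functions.
package main

import "fmt"

type statsQueue struct{ datapoints []float64 }

// Before this change: the task is skipped unless at least two data points exist.
func readyBefore(q *statsQueue) bool { return len(q.datapoints) >= 2 }

// After this change: a single data point is enough. TACS aggregates all data
// points received within a one-minute window, so CW metric accuracy holds.
func readyAfter(q *statsQueue) bool { return len(q.datapoints) >= 1 }

func main() {
	q := &statsQueue{datapoints: []float64{42.0}}
	fmt.Println("before:", readyBefore(q)) // false: task dropped from the payload
	fmt.Println("after: ", readyAfter(q))  // true: task reported to TACS
}
```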

Testing

We have functional tests that verify the CW metrics for CPU/memory utilization, and this PR passed all of them. I also manually verified that the CPU/memory reservation metrics are correct when poll metrics is enabled, and that the network I/O and storage I/O metrics match the numbers on the instance.

New tests cover the changes:

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -62,7 +62,7 @@ const (
 // DefaultPollingMetricsWaitDuration specifies the default value for polling metrics wait duration
 // This is only used when PollMetrics is set to true
-DefaultPollingMetricsWaitDuration = 15 * time.Second
+DefaultPollingMetricsWaitDuration = 9 * time.Second

Contributor:

If the tcs poll duration < docker stats duration, do we have the risk of repeating the same stats?

Contributor Author (yumex93):

tcs poll duration? You mean the publish metrics duration?

Contributor:

yes

Contributor Author (yumex93):

Firstly, even with the value before this change, the maximum Docker stats duration a customer can set is the same as the publish metrics duration, so in theory this case will not happen. Secondly, even in that case, the agent will not send metrics to TACS if there are not enough data points in the queue, and once the data is sent to TACS, the agent clears the queue. So, from my understanding, we will not repeat the same stats. Is this what you were asking about?
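
A minimal sketch of the publish-and-reset behavior described in this reply, again with a hypothetical statsQueue type (the real agent code may differ); the point is only that the queue is cleared after each flush, so the same data points cannot be sent twice.

```go
// Illustrative only; publish is a hypothetical stand-in for the agent's
// metrics flush path.
package main

import "fmt"

type statsQueue struct{ datapoints []float64 }

func (q *statsQueue) add(v float64) { q.datapoints = append(q.datapoints, v) }

// publish sends the queued data points to TACS (if there are enough of them)
// and then resets the queue, so a later publish cannot re-send the same stats.
func (q *statsQueue) publish() {
	if len(q.datapoints) < 2 {
		fmt.Println("not enough data points, nothing sent")
		return
	}
	fmt.Printf("sent %d data points to TACS\n", len(q.datapoints))
	q.datapoints = q.datapoints[:0] // clear the queue after a successful send
}

func main() {
	q := &statsQueue{}
	q.add(10)
	q.add(12)
	q.publish() // sends 2 data points, then clears the queue
	q.publish() // nothing to re-send: the earlier stats are not repeated
}
```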

@yumex93 yumex93 requested a review from a team January 13, 2020 17:56
@shubham2892 (Contributor):

Can we have a test case around this particular scenario?

@sharanyad (Contributor):

What exactly is the concern if there are not enough data points to report? Wouldn't the metrics be batched in the next request?

Since we are decreasing the polling interval, this may defeat the original purpose of reducing the number of times metrics are obtained from Docker.

@yumex93 (Contributor Author) commented Jan 13, 2020:

> What exactly is the concern if there are not enough data points to report? Wouldn't the metrics be batched in the next request?
>
> Since we are decreasing the polling interval, this may defeat the original purpose of reducing the number of times metrics are obtained from Docker.

I got a customer report that the CPU/memory reservation metrics are not correct once the ECS_POLL_METRICS flag is enabled. I tried to root-cause it and found it is caused by not having enough data points.

The agent currently resets the queue after the data is sent to TACS. So from the TACS point of view, the task sometimes appears to exist on the instance and sometimes appears to disappear.

Also, do you know how the original values for poll metrics were chosen? That is, why the default value is 15s and the maximum value is 20s.

@yumex93 (Contributor Author) commented Jan 13, 2020:

> Can we have a test case around this particular scenario?

I am considering adding a functional test, which will not be included in this PR.

@sharanyad (Contributor):

#1646 (comment)

#1475

@yumex93 yumex93 force-pushed the dev branch 2 times, most recently from 4ce5c09 to 64b23c1 on February 3, 2020 06:44
@yumex93 yumex93 changed the title from "Update polling metrics wait duration" to "Remove the constraint of queue data points" on Feb 11, 2020