Remove the constraint of queue data points #2335

Closed · wants to merge 1 commit

Conversation

@yumex93 (Contributor) commented Jan 12, 2020

Summary

Currently, this is how the agent works:

  1. By default, the agent publishes metrics to TACS every 20s, and the default interval for polling Docker container stats is 15s.
  2. After the agent polls container stats, it pushes a data point onto a queue. At least two data points must be in the queue before the data sent to TACS can be calculated (see the sketch after this list): https://github.com/aws/amazon-ecs-agent/blob/master/agent/stats/engine.go#L597.
  3. The agent does not send task metrics to TACS if there is no data for that task.
  4. TACS uses the task metrics to decide which tasks are running on the instance and uses this to calculate reservation data.
    Thus, if ECS_POLL_METRICS is enabled and the polling interval is greater than or equal to 10s, there will sometimes not be enough data points in the queue, so the task metrics will not be sent to TACS even though the task exists on the instance. The reservation metrics are not accurate in this situation.
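
To make step 2 concrete, here is a minimal Go sketch of the two-data-point constraint. The statsQueue type, the readyToPublish method, and the minDatapoints constant are hypothetical names used only for illustration; the real check lives at the engine.go line linked above and may differ in detail.

```go
// Minimal illustrative sketch; names here are hypothetical and do not mirror
// the actual agent/stats/engine.go implementation.
package main

import "fmt"

// minDatapoints is the current constraint: at least two data points are
// required before a task's metrics are computed and sent to TACS.
const minDatapoints = 2

type statsQueue struct {
	datapoints []float64
}

// readyToPublish reports whether the queue holds enough data points to
// compute the delta-based metrics that are sent to TACS.
func (q *statsQueue) readyToPublish() bool {
	return len(q.datapoints) >= minDatapoints
}

func main() {
	// With ECS_POLL_METRICS enabled and a poll interval of 10s or more, a 20s
	// publish window can contain only a single poll, so the queue holds one
	// data point and the task's metrics are silently dropped.
	q := &statsQueue{datapoints: []float64{42.0}}
	fmt.Println("ready to publish:", q.readyToPublish()) // false
}
```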

Implementation details

  1. Updated the constraint that the queue needs at least two data points before processing it (a rough sketch of the change follows below). Since the TACS side aggregates all data points within one minute, relaxing the constraint will not affect the accuracy of the CW metrics.
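
A rough before/after sketch of the relaxed check, reusing the same hypothetical statsQueue shape as in the summary sketch; the actual change is made in agent/stats/engine.go and may differ in detail.

```go
// Illustrative only; readyBefore/readyAfter are hypothetical helpers showing
// the intent of the change, not the agent's real functions.
package main

import "fmt"

type statsQueue struct{ datapoints []float64 }

// Before this change: the task is skipped unless at least two data points exist.
func readyBefore(q *statsQueue) bool { return len(q.datapoints) >= 2 }

// After this change: a single data point is enough. TACS aggregates all data
// points received within a one-minute window, so CW metric accuracy holds.
func readyAfter(q *statsQueue) bool { return len(q.datapoints) >= 1 }

func main() {
	q := &statsQueue{datapoints: []float64{42.0}}
	fmt.Println("before:", readyBefore(q)) // false: task dropped from the payload
	fmt.Println("after: ", readyAfter(q))  // true: task reported to TACS
}
```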

Testing

We have functional tests that verify the CW metrics for CPU/memory utilization, and this PR passed all of them. I also manually verified that the CPU/memory reservation metrics are correct when poll metrics is enabled, and that the network I/O and storage I/O metrics match the numbers on the instance.

New tests cover the changes:

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -62,7 +62,7 @@ const (
 // DefaultPollingMetricsWaitDuration specifies the default value for polling metrics wait duration
 // This is only used when PollMetrics is set to true
-DefaultPollingMetricsWaitDuration = 15 * time.Second
+DefaultPollingMetricsWaitDuration = 9 * time.Second

Contributor:

If the tcs poll duration < docker stats duration, do we have the risk of repeating the same stats?

Contributor Author (yumex93):

tcs poll duration? You mean the publish metrics duration?

Contributor:

yes

Contributor Author (yumex93):

Firstly, even with the value before this change, the maximum Docker stats duration a customer can set is the same as the publish metrics duration, so in theory this case will not happen. Secondly, even in that case, the agent will not send metrics to TACS if there are not enough data points in the queue, and once the data is sent to TACS, the agent clears the queue. So, from my understanding, we will not repeat the same stats. Is this what you were asking about?
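
A minimal sketch of the publish-and-reset behavior described in this reply, again with a hypothetical statsQueue type (the real agent code may differ); the point is only that the queue is cleared after each flush, so the same data points cannot be sent twice.

```go
// Illustrative only; publish is a hypothetical stand-in for the agent's
// metrics flush path.
package main

import "fmt"

type statsQueue struct{ datapoints []float64 }

func (q *statsQueue) add(v float64) { q.datapoints = append(q.datapoints, v) }

// publish sends the queued data points to TACS (if there are enough of them)
// and then resets the queue, so a later publish cannot re-send the same stats.
func (q *statsQueue) publish() {
	if len(q.datapoints) < 2 {
		fmt.Println("not enough data points, nothing sent")
		return
	}
	fmt.Printf("sent %d data points to TACS\n", len(q.datapoints))
	q.datapoints = q.datapoints[:0] // clear the queue after a successful send
}

func main() {
	q := &statsQueue{}
	q.add(10)
	q.add(12)
	q.publish() // sends 2 data points, then clears the queue
	q.publish() // nothing to re-send: the earlier stats are not repeated
}
```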

@yumex93 yumex93 requested a review from a team January 13, 2020 17:56
@shubham2892 (Contributor):

Can we have a test case around this particular scenario?

@sharanyad (Contributor):

What exactly is the concern if there are not enough data points to report? Wouldn't the metrics be batched in the next request?

Since we are decreasing the polling interval, this may defeat the original purpose of reducing the number of times metrics are obtained from Docker.

@yumex93 (Contributor Author) commented Jan 13, 2020:

> What exactly is the concern if there are not enough data points to report? Wouldn't the metrics be batched in the next request?
>
> Since we are decreasing the polling interval, this may defeat the original purpose of reducing the number of times metrics are obtained from Docker.

I got a customer report that the CPU/memory reservation metrics are not correct once the ECS_POLL_METRICS flag is enabled. I tried to root-cause it and found it is caused by not having enough data points.

The agent currently resets the queue after the data is sent to TACS. So from the TACS point of view, the task sometimes appears to exist on the instance and sometimes appears to disappear.

Also, do you know how the original values for poll metrics were chosen? That is, why the default value is 15s and the maximum value is 20s.

@yumex93 (Contributor Author) commented Jan 13, 2020:

> Can we have a test case around this particular scenario?

I am considering adding a functional test, which will not be included in this PR.

@sharanyad (Contributor):

#1646 (comment)

#1475

@yumex93 yumex93 force-pushed the dev branch 2 times, most recently from 4ce5c09 to 64b23c1 on February 3, 2020 06:44
@yumex93 yumex93 changed the title from "Update polling metrics wait duration" to "Remove the constraint of queue data points" on Feb 11, 2020