-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove the constraint of queue data points #2335
Conversation
agent/config/config.go
Outdated
@@ -62,7 +62,7 @@ const ( | |||
|
|||
// DefaultPollingMetricsWaitDuration specifies the default value for polling metrics wait duration | |||
// This is only used when PollMetrics is set to true | |||
DefaultPollingMetricsWaitDuration = 15 * time.Second | |||
DefaultPollingMetricsWaitDuration = 9 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the tcs poll duration < docker stats duration, do we have the risk of repeating the same stats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tcs poll duration? You mean the publish metrics duration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Firstly, even for the value before the change, the maximum docker stats duration value that customer can set is the same as publish metrics duration. So in theory, this case will not happen. Secondly, for this case, agent will not send metrics to TACS if there is no enough dps in the queue. And once the data is sent to TACS, agent clears queue. So from my understanding, we will not repeat the same stats. Is this what you asked for?
Can we have a test case around this particular scenario? |
what exactly is the concern if there are no enough data points to report? wouldn't the metrics be batched in the next request? Since we are decreasing the time interval for polling, it may break the original purpose of reducing the number of times metrics are obtained from Docker.. |
So I got customer report the issue about cpu/memory reservation metrics not correct once the ECS_POLL_METRICS flag is enabled. I tried to root cause it and found it is caused not having enough data points. Agent currently will reset the queue after the data is sent to TACS. So from the TACS point of view, it will become sometimes the task exists in the instance however, sometimes it disappears. Also, do you know how the original value is chosen for the poll metrics? Like why we chose default value is 15s and maximum value is 20s. |
I am considering adding a functional test which will not be included here. |
4ce5c09
to
64b23c1
Compare
Summary
Currently, this is how agent works:
Thus, if ECS_POLL_METRICS is enabled and if the interval is gt/eq 10s, sometimes there will not be enough dp in the queue so that the task metrics will not be sent to TACS even the task exists on the instance. The reservation metrics is not accurate under this situation.
Implementation details
Testing
We have functional tests to verify the CW metrics for CPU/memory utilization. The pr passed all these tests. Also, I did manually test to verify that the CPU/memory reservation metrics is correct when poll metrics is enabled. Besides, I also verify the network I/O and storage I/O metrics matches the number on the instance.
New tests cover the changes:
Description for the changelog
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.