-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added ability to poll for container stats instead of constant stream #1646
Added ability to poll for container stats instead of constant stream #1646
Conversation
…ub.com/inmar/amazon-ecs-agent into inmar-add-polling-option-for-getting-metrics
07d1d5d
to
3a831a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes lgtm, thanks @jcbowman
@yunhee-l: does this need an integ test to test the polling behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall.
@@ -104,6 +104,8 @@ additional details on each available environment variable. | |||
| `ECS_UPDATES_ENABLED` | <true | false> | Whether to exit for an updater to apply updates when requested. | false | false | | |||
| `ECS_UPDATE_DOWNLOAD_DIR` | /cache | Where to place update tarballs within the container. | | | | |||
| `ECS_DISABLE_METRICS` | <true | false> | Whether to disable metrics gathering for tasks. | false | true | | |||
| `ECS_POLL_METRICS` | <true | false> | Whether to poll or stream when gathering metrics for tasks. | false | false | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an Agent-specific decision to be able to turn off metrics completely? (How does this affect TACS if two Agents in a cluster poll at different frequencies?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems that way, if this is turned off we fall back to the streaming
} else { | ||
seelog.Infof("DockerGoClient: Starting to Poll for metrics") | ||
//Sleep for a random period time up to the polling interval. This will help make containers ask for stats at different times | ||
time.Sleep(time.Second * time.Duration(rand.Intn(int(dg.config.PollingMetricsWaitDuration.Seconds())))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good intuition but I don't think this is needed. The only time all containers of a Task would poll at the same time is if Agent stops and starts while the Task is running. Otherwise, assuming go routines run at approximately the same speed (or even randomly), containers are added to the stats engine iteratively so they will already be staggered to some degree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is value in adding extra jitter here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add unit tests
agent/config/config.go
Outdated
cfg.platformOverrides() | ||
|
||
return nil | ||
} | ||
|
||
func (cfg *Config) pollMetricsOverrides() { | ||
if cfg.PollMetrics { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -81,6 +85,14 @@ const ( | |||
// 'stuck' in the pull / unpack step. Very small values are unsafe and lead to high failure rate. | |||
minimumImagePullInactivityTimeout = 1 * time.Minute | |||
|
|||
// minimumPollingMetricsWaitDuration specifies the minimum duration to wait before polling for new stats | |||
// from docker. This is only used when PollMetrics is set to true | |||
minimumPollingMetricsWaitDuration = 1 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious, how did we come up with the min and max values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
customer did, there are some testing/benchmarking info in the original PR
CHANGELOG.md
Outdated
@@ -1,4 +1,7 @@ | |||
# Changelog | |||
## 1.22.1-dev | |||
* Enhancement - Configurable poll duration for container stats [@jcbowman](https://github.com/jcbowman) [#1646](https://github.com/aws/amazon-ecs-agent/pull/1646) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
@@ -184,6 +189,12 @@ func TestInvalidLoggingDriver(t *testing.T) { | |||
assert.Error(t, conf.validateAndOverrideBounds(), "Should be error with invalid-logging-driver") | |||
} | |||
|
|||
func TestDefaultPollMetricsWithoutECSDataDir(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does WithoutECSDataDir
indicate? without config..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's what it means (see TestDefaultCheckpointWithoutECSDataDir below), if we do not have config file it should take default value.
} | ||
}() | ||
} else { | ||
seelog.Infof("DockerGoClient: Starting to Poll for metrics") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also log the container id here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
seelog.Infof("DockerGoClient: Unable to retrieve stats for container %s: %v", | ||
id, statsErr) | ||
if !dg.config.PollMetrics { | ||
seelog.Infof("DockerGoClient: Starting to Stream for metrics") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seelog.Info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added container ID to the logging line instead
} else { | ||
seelog.Infof("DockerGoClient: Starting to Poll for metrics") | ||
//Sleep for a random period time up to the polling interval. This will help make containers ask for stats at different times | ||
time.Sleep(time.Second * time.Duration(rand.Intn(int(dg.config.PollingMetricsWaitDuration.Seconds())))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean this could end up sleeping for the whole wait duration, skipping a cycle of poll metrics? not sure if this could lead to any unexpected behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at this point the polling cycle has not started, this is adding jitter before starting it.
changelog updated
95e0023
to
9902dd9
Compare
BLOCKED: Holding off on adding unit tests, as sdk migration will end up in more conflicts. I'll update this PR again after migration branch is merged. |
a7ffa9b
to
5d2e394
Compare
increasing fnl test timeout by 2 mins
973d1fc
to
d079426
Compare
Integ and functional tests have been added. |
}, | ||
} | ||
params.StartTime = aws.Time(RoundTimeUp(time.Now(), time.Minute).UTC()) | ||
params.EndTime = aws.Time((*params.StartTime).Add(waitMetricsInCloudwatchDuration).UTC()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think the params here are used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I shortened this test's checks and didn't remove the set up part. Removed.
func TestStatsEngineWithNewContainersWithPolling(t *testing.T) { | ||
// additional config fields to use polling instead of stream | ||
cfg.PollMetrics = true | ||
cfg.PollingMetricsWaitDuration = 1 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cfg object is shared by many tests in this package.. we probably should set the value back if we are changing it for this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to clean up how cfg is shared by all integ tests, but it's a bigger refactoring job. I'll work on it separately, but for this one I reset it to false/default duration at the end of the test.
Is this pr for merging dev to inmar-add-polling-option-for-getting-metrics or the other way? |
merging into dev |
Summary
Push blocked until next agent release
Allows for container stat polling instead of forcing constant stream. This allows for a significant decrease in CPU usage when a lot of containers are deployed. Duplicate/fix of #1475
addresses issue #1422
Implementation details
merging PR #1475 and fixing conflicts, and errors in unit tests.
Testing
make release
)go build -out amazon-ecs-agent.exe ./agent
)make test
) passgo test -timeout=25s ./agent/...
) passmake run-integ-tests
) pass.\scripts\run-integ-tests.ps1
) passmake run-functional-tests
) pass.\scripts\run-functional-tests.ps1
) passNew tests cover the changes: yes
Description for the changelog
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.