Support Linux cgroup v2 #3117

Closed

Phylu opened this issue Jan 14, 2022 · 2 comments
Labels
kind/enhancement, kind/proposal, kind/tracking (This issue is being tracked internally)

Comments

@Phylu

Phylu commented Jan 14, 2022

Summary

Support Linux host systems that use cgroup v2.

Description

cgroup v2 was added to the Linux kernel in 2014, as described on the Linux kernel mailing list [1], and has been adopted by Linux distributions since 2019 [2]. The ECS Agent does not work on Linux distributions that use cgroup v2; it only works with cgroup v1.

This has been mentioned on the AWS Containers Roadmap [3] and causes issues when using Flatcar Linux as the host system in ECS clusters [4][5]. The only workaround in this case is to use an older version of Flatcar, which has known security vulnerabilities, as described in [6].

Expected Behavior

The ECS Agent works on Linux distributions with cgroup v2.

Observed Behavior

The ECS Agent only works on Linux distributions with cgroup v1.

Environment Details

This issue occurs, for example, when using Flatcar Linux >= v2983.2.0 [7].

Additional Links

[1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
[2] https://medium.com/nttlabs/cgroup-v2-596d035be4d7
[3] aws/containers-roadmap#1535
[4] flatcar/Flatcar#585
[5] https://www.flatcar.org/docs/latest/installing/cloud/aws-ec2/#known-issues
[6] aws/containers-roadmap#1535 (comment)
[7] https://www.flatcar.org/releases/#release-2983.2.0

@yinyic
Contributor

yinyic commented Jan 21, 2022

Hi, thank you for opening the issue.

Looking into the CloudWatch metrics error described in your linked containers roadmap issue [1]:

cloudwatch metrics for container XXX not collected, reason (cpu): need at least 2 data points in queue to calculate CW stats set" module=engine.go

We suspect the root cause is a change in the Docker Engine API response.

The Agent depends on Docker for creating container cgroups, and it uses the Docker ContainerStats API for monitoring and collecting container statistics [2]. The API response model is different when Docker uses cgroup v2 instead of v1 [3]. Specifically, cpu_stats.cpu_usage.percpu_usage is not set, which causes the Agent to fail to validate the ContainerStats response [4].

To fix this particular issue, the Agent needs to handle the Docker Engine ContainerStats API response model for both cgroup v1 and v2. The Agent itself does not technically need to support cgroup v2 in order to publish container metrics. However, if the Agent does not support cgroup v2, it won't be able to enforce task-level cgroup resource limits or pass them down to the container cgroups. If we delivered the fix for the Docker stats API change without Agent cgroup v2 support, customers would have visibility into, but not full control of, task resource consumption, which would be a rather incomplete experience. Therefore, the stats API fix will likely be part of the greater effort of supporting cgroup v2 in the Agent.
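
To illustrate the shape of that fix, here is a minimal sketch (not the Agent's actual validation code) using the Docker Go client's stats types; the numCPUs helper is hypothetical:

```go
package main

import (
	"fmt"

	"github.com/docker/docker/api/types"
)

// numCPUs picks the CPU count used when scaling CPU usage deltas from a
// Docker ContainerStats response. Under cgroup v1 Docker populates
// cpu_stats.cpu_usage.percpu_usage; under cgroup v2 that field is empty,
// so validation must not require it and online_cpus is used instead.
func numCPUs(s *types.StatsJSON) (uint32, error) {
	if n := s.CPUStats.OnlineCPUs; n > 0 {
		return n, nil // populated on both cgroup v1 and v2 hosts
	}
	if n := len(s.CPUStats.CPUUsage.PercpuUsage); n > 0 {
		return uint32(n), nil // cgroup v1 fallback
	}
	return 0, fmt.Errorf("stats contain neither online_cpus nor percpu_usage")
}

func main() {
	// Fabricated response, roughly what a cgroup v2 host returns:
	// total CPU usage is present, per-CPU usage is not.
	var s types.StatsJSON
	s.CPUStats.OnlineCPUs = 2
	s.CPUStats.CPUUsage.TotalUsage = 123456789
	if n, err := numCPUs(&s); err == nil {
		fmt.Println("CPUs:", n)
	}
}
```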

We will update this thread once we have a concrete plan for providing this support.

Another thing we would like to call out regarding possible workarounds: you mentioned that

The only workaround in this case is to use an older version of Flatcar

It should still be possible to use the latest Flatcar release with cgroup v2 disabled [5].
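
For reference, that workaround boils down to booting the host with the kernel command-line parameter below (a sketch; how kernel parameters are configured varies by distribution, see [5]):

```
# Switches a systemd-based host back to the legacy cgroup v1 hierarchy.
# Requires a reboot to take effect.
systemd.unified_cgroup_hierarchy=0
```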

[1] aws/containers-roadmap#1535
[2] code link
[3] https://docs.docker.com/engine/api/v1.41/#operation/ContainerStats
[4] code link
[5] https://docs.docker.com/config/containers/runmetrics/#changing-cgroup-version

@yinyic yinyic added the kind/enhancement, kind/proposal, and kind/tracking labels on Jan 21, 2022
sparrc added a commit that referenced this issue Mar 2, 2022
* Support Unified Cgroups (cgroups v2)

closes aws/containers-roadmap#1535
closes #3117

This adds support for task-level resource limits when running on unified
cgroups (aka cgroups v2) with the systemd cgroup driver.

Cgroups v2 has introduced a cgroups format that is not backward compatible
with cgroups v1. In order to support both v1 and v2, we have added a config
variable to detect which cgroup version the ecs agent is running with.
The containerd/cgroups library is used to determine which mode it is using
on agent startup.

Cgroups v2 no longer can provide per-cpu usage stats, so this validation
was removed since we never used it either.

* wip

* update cgroups library with nil panic bugfix

* Initialize and toggle cgroup controllers
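
The commit message above mentions detecting the cgroup version at agent startup via the containerd/cgroups library. A minimal sketch of that detection (not the agent's actual code) could look like this:

```go
package main

import (
	"fmt"

	"github.com/containerd/cgroups"
)

func main() {
	// cgroups.Mode() inspects /sys/fs/cgroup once and reports whether the
	// host runs legacy (v1), hybrid, or unified (v2) cgroups; the result
	// can feed a config flag that the rest of the agent checks.
	switch cgroups.Mode() {
	case cgroups.Unified:
		fmt.Println("cgroup v2 (unified hierarchy)")
	case cgroups.Hybrid:
		fmt.Println("hybrid cgroup v1/v2 hierarchy")
	case cgroups.Legacy:
		fmt.Println("cgroup v1 (legacy hierarchy)")
	default:
		fmt.Println("cgroups unavailable")
	}
}
```
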
@karthikeyanvenkatraman

Can somebody let us know when this issue will be remediated?
