Define Linux Network Devices #1271

Open. Wants to merge 1 commit into base: main
94 changes: 94 additions & 0 deletions config-linux.md
@@ -189,6 +189,98 @@ In addition to any devices configured with this setting, the runtime MUST also s
* [`/dev/ptmx`][pts.4].
A [bind-mount or symlink of the container's `/dev/pts/ptmx`][devpts].

## <a name="configLinuxNetworkDevices" />Network Devices

Linux network devices are entities that send and receive data packets.
Unlike block devices, they are not represented as files in the `/dev` directory; instead, they are represented by the [`net_device`][net_device] data structure in the Linux kernel.
Each network device belongs to exactly one network namespace and has a set of operations distinct from regular file operations. Examples of network devices include Ethernet cards, loopback devices, and virtual devices like bridges, VLANs, and MACVLANs.

This schema focuses solely on moving existing network devices identified by name from the host network namespace into the container network namespace. It does not cover the complexities of network device creation or network configuration, such as IP address assignment, routing, and DNS setup.

**`netDevices`** (object, OPTIONAL) set of network devices that MUST be made available in the container. The runtime is responsible for providing these devices; the underlying mechanism is implementation-defined.

The runtime MUST check that it is possible to move the network interface to the container namespace and MUST [generate an error](runtime.md#errors) if the check fails.

The runtime MUST set the network device state to "up" after moving it to the network namespace to allow the container to send and receive network traffic through that device.

Note that when a network namespace is deleted, all its migratable network devices are moved to the default network namespace, whereas virtual devices (veth, macvlan, ...) are destroyed.
The runtime MUST move back the network device before the network namespace is deleted.
The runtime MUST set the network device state to "down" before moving it back to ensure that the interface is no longer active and won't interfere with other network operations or cause IP address conflicts.
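
The ordering these paragraphs describe (move in, bring up; later bring down, move back) can be sketched in Go as follows. This is only an illustration under stated assumptions: `NetOps`, `recorder`, `attach`, and `detach` are hypothetical names, the namespace labels are placeholders, and a real runtime would issue netlink requests instead of recording strings. The rename steps reflect the optional `name` property of a device entry.

```go
package main

import "fmt"

// NetOps is a hypothetical abstraction over the netlink operations a runtime
// would perform; none of these names come from the specification.
type NetOps interface {
	Move(dev, fromNS, toNS string) error
	Rename(ns, oldName, newName string) error
	SetState(ns, dev string, up bool) error
}

// recorder is a fake NetOps that only records the requested operations.
type recorder struct{ log []string }

func (r *recorder) Move(dev, fromNS, toNS string) error {
	r.log = append(r.log, fmt.Sprintf("move %s %s->%s", dev, fromNS, toNS))
	return nil
}

func (r *recorder) Rename(ns, oldName, newName string) error {
	r.log = append(r.log, fmt.Sprintf("rename %s %s->%s", ns, oldName, newName))
	return nil
}

func (r *recorder) SetState(ns, dev string, up bool) error {
	state := "down"
	if up {
		state = "up"
	}
	r.log = append(r.log, fmt.Sprintf("%s %s %s", state, ns, dev))
	return nil
}

// attach moves hostName into the container namespace, renames it if
// requested, and finally brings it up.
func attach(ops NetOps, hostNS, ctrNS, hostName, ctrName string) error {
	if ctrName == "" {
		ctrName = hostName // the host name is used when no name is given
	}
	if err := ops.Move(hostName, hostNS, ctrNS); err != nil {
		return err
	}
	if ctrName != hostName {
		if err := ops.Rename(ctrNS, hostName, ctrName); err != nil {
			return err
		}
	}
	return ops.SetState(ctrNS, ctrName, true)
}

// detach brings the device down, restores its original name, and moves it
// back to the host namespace.
func detach(ops NetOps, hostNS, ctrNS, hostName, ctrName string) error {
	if ctrName == "" {
		ctrName = hostName
	}
	if err := ops.SetState(ctrNS, ctrName, false); err != nil {
		return err
	}
	if ctrName != hostName {
		if err := ops.Rename(ctrNS, ctrName, hostName); err != nil {
			return err
		}
	}
	return ops.Move(hostName, ctrNS, hostNS)
}

func main() {
	r := &recorder{}
	if err := attach(r, "host", "ctr", "eth0", "container_eth0"); err != nil {
		fmt.Println("attach failed:", err)
	}
	if err := detach(r, "host", "ctr", "eth0", "container_eth0"); err != nil {
		fmt.Println("detach failed:", err)
	}
	for _, op := range r.log {
		fmt.Println(op)
	}
}
```

Keeping the operations behind an interface makes the required ordering testable without privileges; whether a runtime can actually run the teardown half depends on it still being present when the container exits.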
Comment on lines +207 to +208
Contributor

So, what happens if container's init has exited or is killed (by means other than runtime kill)?

Documentation (network_namespaces(7)) says that "When a network namespace is freed (i.e., when the last process in the namespace terminates), its physical network devices are moved back to the initial network namespace".

Currently, runc has no way of monitoring when a container exits. Meaning, it won't be able to perform those cleanups written as a MUST here.

Author

Yeah, this point was debated at length, see #1271 (comment). I initially suggested a MAY because I'm not fully familiar with all the runtimes' details and diversity, so I think relaxing this requirement will be simpler, but the feedback from the reviews indicated that MUST is preferred.

@kolyshkin is it acceptable to rewrite this as:

Suggested change
- The runtime MUST move back the network device before the network namespace is deleted.
- The runtime MUST set the network device state to "down" before moving it back to ensure that the interface is no longer active and won't interfere with other network operations or cause IP address conflicts.
+ The runtime MUST move back the network device before the network namespace is deleted, unless it is not possible due to limitations in the runtime's ability to monitor the container's process lifecycle or to interact with the network namespace after the container's init process has exited. In such cases, the runtime SHOULD log an event indicating that it was unable to perform the network cleanup.
+ If the runtime is able to move the network device, it MUST set the network device state to "down" before moving it back to ensure that the interface is no longer active and won't interfere with other network operations or cause IP address conflicts.


The name of the network device is the entry key.
Member @AkihiroSuda, Nov 7, 2024

Does the map order matter? If so, implementation can be complicated for Go

Author

The Linux kernel guarantees the uniqueness of the name in the runtime namespace, so a set is OK. Order is not important; each network device should be independent of the others ...

Member

Should we recommend a runtime performs a uniqueness check as well?

Contributor

Uniqueness inside container should be checked, e.g. that rename operation was successful

Author

added more text to clarify runtime checks and network devices lifecycle, PTAL

Entry values are objects with the following properties:

* **`name`** *(string, OPTIONAL)* - the name of the network device inside the container namespace. If not specified, the host name is used. Network device names are unique per network namespace; if a network device with the same name already exists in the container namespace, the rename operation will fail. The runtime MAY check that the name is unique before the rename operation.
    The runtime MUST restore the original name to guarantee the idempotence of operations, so that a container that moves an interface and renames it can be created and destroyed multiple times with the same result.
* **`addresses`** *(array of strings, OPTIONAL)* - the IP addresses, IPv4 and/or IPv6, of the device within the container in CIDR format (IP address/prefix length). All IPv4 addresses SHOULD be expressed in dotted-decimal format, consisting of four decimal numbers separated by periods. Each number ranges from 0 to 255 and represents an octet of the address. IPv6 addresses SHOULD be represented in their canonical form as defined in RFC 5952.
Contributor

Is the runtime expected to set this? It looks like it is. Let us say that in the spec.

Author

This is the input to the runtime; the runtime may choose how to set them as long as it is consistent.
The context is that in Kubernetes we got bitten by this, so it is a recommendation, because we find it very hard to enforce this on the input as it may break some clients. More context in https://daniel.haxx.se/blog/2021/04/19/curl-those-funny-ipv4-addresses/

    The runtime MAY limit the number of addresses allowed.
    The runtime MAY restore the original addresses, keep the existing ones, or remove them completely; since the interface MUST be in the down state, this cannot cause a problem.
* **`hardwareAddress`** *(string, OPTIONAL)* - the hardware address (e.g. MAC address) of the device's network interface, expressed as an IEEE 802 MAC-48, EUI-48, EUI-64, or a 20-octet IP over InfiniBand link-layer address.
    The runtime MAY restore the original hardware address.
* **`mtu`** *(uint32, OPTIONAL)* - the MTU (Maximum Transmission Unit) size for the device.
    The runtime MAY restore the original MTU value.
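
As a rough illustration of the address rules in the list above, a consumer could check `addresses` and `hardwareAddress` values with Go's standard library. This is a sketch, not behaviour mandated by the text: `checkAddress` and `checkHardwareAddress` are hypothetical helpers, and `netip.ParsePrefix` is stricter than the SHOULD wording because it rejects IPv4 octets with leading zeros outright.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// checkAddress parses a CIDR string and returns its canonical text form.
// The boolean result is true when the input was already canonical.
func checkAddress(cidr string) (string, bool, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return "", false, fmt.Errorf("invalid address %q: %w", cidr, err)
	}
	canonical := p.String() // IPv6 is rendered in RFC 5952 form
	return canonical, canonical == cidr, nil
}

// checkHardwareAddress accepts the link-layer formats understood by
// net.ParseMAC (MAC-48, EUI-48, EUI-64, 20-octet IP over InfiniBand).
func checkHardwareAddress(s string) error {
	_, err := net.ParseMAC(s)
	return err
}

func main() {
	inputs := []string{"10.0.0.10/24", "2001:DB8:1:2::A/64", "010.0.0.10/24"}
	for _, in := range inputs {
		canonical, ok, err := checkAddress(in)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%s canonical=%s alreadyCanonical=%t\n", in, canonical, ok)
	}
	if err := checkHardwareAddress("32:ba:1c:b1:eb:63"); err != nil {
		fmt.Println("bad hardware address:", err)
	}
}
```

Reporting non-canonical input instead of rejecting it keeps the check in line with a recommendation rather than a hard requirement.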

### Example

#### Moving a device with a renamed interface inside the container:

```json
"netDevices": {
    "eth0": {
        "name": "container_eth0"
    }
}
```

This configuration will move the device named "eth0" from the host into the container's network namespace. Inside the container, the device will be named "container_eth0".

#### Moving a device with a specific IP address and MTU inside the container:

IPv4 address

```json
"netDevices": {
    "ens4": {
        "addresses": [
            "10.0.0.10/24"
        ],
        "hardwareAddress": "32:ba:1c:b1:eb:63",
        "mtu": 9000
    }
}
```

IPv6 address

```json
"netDevices": {
    "ens4": {
        "addresses": [
            "2001:db8:1:2::a/64"
        ],
        "hardwareAddress": "32:ba:1c:b1:eb:63",
        "mtu": 9000
    }
}
```

Dual Stack

```json
"netDevices": {
    "ens4": {
        "addresses": [
            "10.0.0.10/24",
            "2001:db8:1:2::a/64"
        ],
        "hardwareAddress": "32:ba:1c:b1:eb:63",
        "mtu": 9000
    }
}
```
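
A consumer of these examples might decode the `netDevices` object and resolve the in-container name of each device roughly as follows. This is a sketch under assumptions: `netDevice` is a local stand-in for the Go type added in this PR, `parseNetDevices` and `resolveNames` are hypothetical helpers, and the input is the object value only, since the examples above are fragments rather than complete JSON documents. Keys are sorted because map order carries no meaning, and the duplicate-name check corresponds to the optional uniqueness check discussed for the `name` property.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// netDevice mirrors the entry properties; it is a local stand-in, not the
// specs-go type.
type netDevice struct {
	Name            string   `json:"name,omitempty"`
	Addresses       []string `json:"addresses,omitempty"`
	HardwareAddress string   `json:"hardwareAddress,omitempty"`
	MTU             uint32   `json:"mtu,omitempty"`
}

// parseNetDevices decodes the value of a "netDevices" object, keyed by the
// device name on the host.
func parseNetDevices(data []byte) (map[string]netDevice, error) {
	devs := map[string]netDevice{}
	if err := json.Unmarshal(data, &devs); err != nil {
		return nil, err
	}
	return devs, nil
}

// resolveNames maps each host device name to its name inside the container
// and fails if two devices would end up with the same container name.
func resolveNames(devs map[string]netDevice) (map[string]string, error) {
	hosts := make([]string, 0, len(devs))
	for h := range devs {
		hosts = append(hosts, h)
	}
	sort.Strings(hosts) // deterministic iteration; order has no meaning

	names := make(map[string]string, len(devs))
	seen := make(map[string]string, len(devs))
	for _, h := range hosts {
		n := devs[h].Name
		if n == "" {
			n = h // the host name is used when no name is specified
		}
		if prev, ok := seen[n]; ok {
			return nil, fmt.Errorf("devices %q and %q both map to %q", prev, h, n)
		}
		seen[n] = h
		names[h] = n
	}
	return names, nil
}

func main() {
	const cfg = `{
		"eth0": {"name": "container_eth0"},
		"ens4": {"addresses": ["10.0.0.10/24"], "mtu": 9000}
	}`
	devs, err := parseNetDevices([]byte(cfg))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	names, err := resolveNames(devs)
	if err != nil {
		fmt.Println("name conflict:", err)
		return
	}
	for host, ctr := range names {
		fmt.Printf("%s -> %s (mtu %d)\n", host, ctr, devs[host].MTU)
	}
}
```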


## <a name="configLinuxControlGroups" />Control groups

Also known as cgroups, they are used to restrict resource usage for a container and handle device access.
@@ -971,6 +1063,7 @@ subset of the available options.
[devices]: https://www.kernel.org/doc/Documentation/admin-guide/devices.txt
[devpts]: https://www.kernel.org/doc/Documentation/filesystems/devpts.txt
[file]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_164
[ifreq]: https://man7.org/linux/man-pages/man7/netdevice.7.html
[libseccomp]: https://github.com/seccomp/libseccomp
[proc]: https://www.kernel.org/doc/Documentation/filesystems/proc.txt
[seccomp]: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
@@ -982,6 +1075,7 @@ subset of the available options.
[mknod.1]: https://man7.org/linux/man-pages/man1/mknod.1.html
[mknod.2]: https://man7.org/linux/man-pages/man2/mknod.2.html
[namespaces.7_2]: https://man7.org/linux/man-pages/man7/namespaces.7.html
[net_device]: https://docs.kernel.org/networking/netdevices.html
[null.4]: https://man7.org/linux/man-pages/man4/null.4.html
[personality.2]: https://man7.org/linux/man-pages/man2/personality.2.html
[pts.4]: https://man7.org/linux/man-pages/man4/pts.4.html
14 changes: 14 additions & 0 deletions features-linux.md
@@ -228,3 +228,17 @@ Irrelevant to the availability of Intel RDT on the host operating system.
}
}
```

## <a name="linuxFeaturesNetDevices" />NetDevices

**`netDevices`** (object, OPTIONAL) represents the runtime's implementation status of Linux network devices.

* **`enabled`** (bool, OPTIONAL) represents whether the runtime supports the capability to move Linux network devices into the container's network namespace.

### Example

```json
"netDevices": {
"enabled": true
}
```
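
The Go binding in this PR models `enabled` as a pointer whose nil value means "unknown" rather than "false". A caller could distinguish the three states along these lines; this is a sketch in which `netDevicesFeature` and `supportStatus` are hypothetical names and the input is assumed to be the Linux features object that contains `netDevices`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// netDevicesFeature is a local stand-in for the "netDevices" features field.
type netDevicesFeature struct {
	Enabled *bool `json:"enabled,omitempty"`
}

// supportStatus reports "supported", "unsupported", or "unknown" for the
// given Linux features object.
func supportStatus(raw []byte) (string, error) {
	var f struct {
		NetDevices *netDevicesFeature `json:"netDevices,omitempty"`
	}
	if err := json.Unmarshal(raw, &f); err != nil {
		return "", err
	}
	switch {
	case f.NetDevices == nil || f.NetDevices.Enabled == nil:
		return "unknown", nil // an absent field is not the same as false
	case *f.NetDevices.Enabled:
		return "supported", nil
	default:
		return "unsupported", nil
	}
}

func main() {
	for _, in := range []string{
		`{"netDevices": {"enabled": true}}`,
		`{"netDevices": {"enabled": false}}`,
		`{}`,
	} {
		status, err := supportStatus([]byte(in))
		if err != nil {
			fmt.Println("decode failed:", err)
			continue
		}
		fmt.Println(in, "=>", status)
	}
}
```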
6 changes: 6 additions & 0 deletions schema/config-linux.json
@@ -9,6 +9,12 @@
        "$ref": "defs-linux.json#/definitions/Device"
    }
},
"netDevices": {
    "type": "object",
    "additionalProperties": {
        "$ref": "defs-linux.json#/definitions/NetDevice"
    }
},
"uidMappings": {
    "type": "array",
    "items": {
20 changes: 20 additions & 0 deletions schema/defs-linux.json
@@ -189,6 +189,26 @@
        }
    }
},
"NetDevice": {
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        },
        "addresses": {
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "hardwareAddress": {
            "type": "string"
        },
        "mtu": {
            "$ref": "defs.json#/definitions/uint32"
        }
    }
},
"weight": {
    "$ref": "defs.json#/definitions/uint16"
},
8 changes: 8 additions & 0 deletions schema/features-linux.json
@@ -110,6 +110,14 @@
                        }
                    }
                }
            },
            "netDevices": {
                "type": "object",
                "properties": {
                    "enabled": {
                        "type": "boolean"
                    }
                }
            }
        }
    }
}
14 changes: 14 additions & 0 deletions schema/test/config/bad/linux-netdevice.json
@@ -0,0 +1,14 @@
{
    "ociVersion": "1.0.0",
    "root": {
        "path": "rootfs"
    },
    "linux": {
        "netDevices": {
            "eth0": {
                "name": "container_eth0",
                "mtu": "not_an_int"
            }
        }
    }
}
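
This fixture is meant to be rejected by JSON schema validation because `mtu` is a string. The same mistake also surfaces when the entry is decoded into a Go struct with a `uint32` field, as the sketch below illustrates. It is separate from the schema test itself: `netDevice`, `decodeDevice`, and `isTypeError` are hypothetical names used only for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// netDevice is a reduced local stand-in with only the fields used here.
type netDevice struct {
	Name string `json:"name,omitempty"`
	MTU  uint32 `json:"mtu,omitempty"`
}

// decodeDevice decodes a single netDevices entry.
func decodeDevice(raw []byte) (netDevice, error) {
	var d netDevice
	err := json.Unmarshal(raw, &d)
	return d, err
}

// isTypeError reports whether err is a JSON type mismatch, such as a string
// or an out-of-range number where a uint32 is expected.
func isTypeError(err error) bool {
	var te *json.UnmarshalTypeError
	return errors.As(err, &te)
}

func main() {
	_, err := decodeDevice([]byte(`{"name": "container_eth0", "mtu": "not_an_int"}`))
	fmt.Println("type error:", isTypeError(err))

	d, err := decodeDevice([]byte(`{"name": "container_eth0", "mtu": 9000}`))
	fmt.Println(d.MTU, err)
}
```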
41 changes: 41 additions & 0 deletions schema/test/config/good/linux-netdevice.json
@@ -0,0 +1,41 @@
{
    "ociVersion": "1.0.0",
    "root": {
        "path": "rootfs"
    },
    "linux": {
        "netDevices": {
            "eth0": {
                "name": "container_eth0"
            },
            "ens4": {
                "addresses": [
                    "10.0.0.10/24"
                ],
                "hardwareAddress": "32:ba:1c:b1:eb:63",
                "mtu": 9000
            },
            "ens5": {
                "addresses": [
                    "2001:db8:1:2::4/64"
                ],
                "mtu": 1500
            },
            "ens6": {
                "addresses": [
                    "10.0.0.10/24",
                    "2001:db8:1:2::4/64"
                ],
                "mtu": 1500
            },
            "ens7": {
                "addresses": [
                    "10.0.0.10/24",
                    "2001:db8:1:2::4/64",
                    "fd00:1::af/48"
                ],
                "mtu": 1500
            }
        }
    }
}
3 changes: 3 additions & 0 deletions schema/test/features/good/runc.json
@@ -182,6 +182,9 @@
    },
    "selinux": {
        "enabled": true
    },
    "netDevices": {
        "enabled": true
    }
},
"annotations": {
14 changes: 14 additions & 0 deletions specs-go/config.go
@@ -236,6 +236,8 @@ type Linux struct {
	Namespaces []LinuxNamespace `json:"namespaces,omitempty"`
	// Devices are a list of device nodes that are created for the container
	Devices []LinuxDevice `json:"devices,omitempty"`
	// NetDevices are key-value pairs, keyed by network device name on the host, moved to the container's network namespace.
	NetDevices map[string]LinuxNetDevice `json:"netDevices,omitempty"`
	// Seccomp specifies the seccomp security settings for the container.
	Seccomp *LinuxSeccomp `json:"seccomp,omitempty"`
	// RootfsPropagation is the rootfs mount propagation mode for the container.
@@ -491,6 +493,18 @@ type LinuxDevice struct {
	GID *uint32 `json:"gid,omitempty"`
}

// LinuxNetDevice represents a single network device to be added to the container's network namespace
type LinuxNetDevice struct {
	// Name of the device in the container namespace
	Name string `json:"name,omitempty"`
	// Addresses is the list of IP addresses, IPv4 or IPv6, in CIDR format in the container namespace
	Addresses []string `json:"addresses,omitempty"`
	// HardwareAddress represents the hardware address (e.g. MAC Address) of the device's network interface
	HardwareAddress string `json:"hardwareAddress,omitempty"`
	// MTU is the Maximum Transmission Unit of the network device in the container namespace
	MTU uint32 `json:"mtu,omitempty"`
}

// LinuxDeviceCgroup represents a device rule for the devices specified to
// the device controller
type LinuxDeviceCgroup struct {
8 changes: 8 additions & 0 deletions specs-go/features/features.go
@@ -48,6 +48,7 @@ type Linux struct {
	Selinux *Selinux `json:"selinux,omitempty"`
	IntelRdt *IntelRdt `json:"intelRdt,omitempty"`
	MountExtensions *MountExtensions `json:"mountExtensions,omitempty"`
	NetDevices *NetDevices `json:"netDevices,omitempty"`
}

// Cgroup represents the "cgroup" field.
@@ -143,3 +144,10 @@ type IDMap struct {
	// Nil value means "unknown", not "false".
	Enabled *bool `json:"enabled,omitempty"`
}

// NetDevices represents the "netDevices" field.
type NetDevices struct {
	// Enabled is true if network devices support is compiled in.
	// Nil value means "unknown", not "false".
	Enabled *bool `json:"enabled,omitempty"`
}