Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define Linux Network Devices #1271

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

aojea
Copy link

@aojea aojea commented Nov 7, 2024

The proposed "netdevices" field provides a declarative way to specify which host network devices should be moved into a container's network namespace.

This approach is similar than the existing "devices" field used for block devices but uses a dictionary keyed by the interface name instead.

The proposed scheme is based on the existing representation of network device by the struct net_device
https://docs.kernel.org/networking/netdevices.html.

This proposal focuses solely on moving existing network devices into the container namespace. It does not cover the complexities of network configuration or network interface creation, emphasizing the separation of device management and network configuration.

Fixes: #1239

@aojea
Copy link
Author

aojea commented Nov 7, 2024

/assign @samuelkarp

config-linux.md Outdated Show resolved Hide resolved
config-linux.md Outdated Show resolved Hide resolved
config-linux.md Outdated Show resolved Hide resolved

**`netdevices`** (object, OPTIONAL) set of network devices that MUST be available in the container. The runtime MAY supply them however it likes.

The name of the network device is the entry key.
Copy link
Member

@AkihiroSuda AkihiroSuda Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the map order matter? If so, implementation can be complicated for Go

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the linux kernel guarantees the uniqueness of the name in the runtime namespace, so a set is ok. Order is not important , each network device should be independent of each other ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we recommend a runtime performs a uniqueness check as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uniqueness inside container should be checked, e.g. that rename operation was successful

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added more text to clarify runtime checks and network devices lifecycle, PTAL

@AkihiroSuda
Copy link
Member

@aojea aojea force-pushed the network-devices branch 2 times, most recently from 51e5104 to 3a666eb Compare November 12, 2024 12:26
@aojea
Copy link
Author

aojea commented Nov 12, 2024

https://github.com/opencontainers/runtime-spec/blob/main/features.md should be updated too

updated and addressed the comments


**`netdevices`** (object, OPTIONAL) set of network devices that MUST be available in the container. The runtime MAY supply them however it likes.

The name of the network device is the entry key.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we recommend a runtime performs a uniqueness check as well?

schema/config-linux.json Outdated Show resolved Hide resolved
schema/defs-linux.json Outdated Show resolved Hide resolved
schema/test/config/bad/linux-netdevice.json Outdated Show resolved Hide resolved
schema/test/config/good/linux-netdevice.json Outdated Show resolved Hide resolved
@aojea
Copy link
Author

aojea commented Nov 12, 2024

AI @aojea (document the cleanup and destroy of the network interfaces)

config-linux.md Outdated Show resolved Hide resolved
@samuelkarp
Copy link
Member

From the in-person discussion today:

  • Net device lifecycle should follow the network namespace lifecycle
  • @aojea will follow up to determine whether any cleanup actions need to be taken by the OCI runtime on a container being deleted
  • @kad was concerned about restarts and error handling
  • Should we prohibit the new netdev addition to an existing netns? IOW only allow this for containers where a new netns is created? What about containers where the root netns is used?

config-linux.md Outdated

This schema focuses solely on moving existing network devices identified by name into the container namespace. It does not cover the complexities of network device creation or network configuration, such as IP address assignment, routing, and DNS setup.

**`netDevices`** (object, OPTIONAL) set of network devices that MUST be available in the container. The runtime is responsible for providing these devices; the underlying mechanism is implementation-defined.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spec said "MUST" but, I think it can't do it in the rootless container because the rootless container doesn't have CAP_NET_ADMIN, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should take care of the rootless container.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be an error in the case of a rootless container, if the runtime is not able to satisfy the MUST condition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be an error in the case of a rootless container, if the runtime is not able to satisfy the MUST condition.

+1 but It'd be better to clarify it in the spec.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added mor explanations about runtime and network devices lifecycle and runtime checks, PTAL

@aojea
Copy link
Author

aojea commented Nov 19, 2024

om the in-person discussion today:

  • Net device lifecycle should follow the network namespace lifecycle
  • @aojea will follow up to determine whether any cleanup actions need to be taken by the OCI runtime on a container being deleted
  • @kad was concerned about restarts and error handling
  • Should we prohibit the new netdev addition to an existing netns? IOW only allow this for containers where a new netns is created? What about containers where the root netns is used?

Pushed a new commit addressing those comments, the changelog is

  • the network namespace lifecycle will move migratebale network devices and destroy virtual devides, the runtime MAY decide to do cleanup actions
  • runtime MUST check the container has enough privileges and an associated network namespace and fail if the check fail
  • removed the Mask field and use the Address field with CIDR notation (IP/Prefix) to deal with IPv4 and IPv6 addresses. Only one IP is allowed to be specified on purpose to simplify the operations and reduce risks
  • Add a HardwareAddress field for use cases that require to set a
    specific mac or infiniband address

config-linux.md Outdated Show resolved Hide resolved
config-linux.md Outdated Show resolved Hide resolved
@tianon
Copy link
Member

tianon commented Dec 9, 2024

Can you please elaborate a bit more on why this is something that runc and friends should be responsible for, when historically they've not officially touched networking?

What are the pros and cons? What are the risks (especially to some of the more esoteric runtimes, like those implemented via VMs)?

When should bundle authors use this new field instead of what they're doing currently (and how do they accurately determine if what they're already doing is something that can be converted to this new field losslessly)?

@tianon
Copy link
Member

tianon commented Dec 9, 2024

Oops, some of what I'm after is in #1239, my apologies.

That being said, I think the use cases for this are still pretty thin since it's just moves of pre-existing interfaces, and most container networking operations will be about new interfaces (veth at the very least). It feels inconsistent to include that bit of networking here when we don't have any other parts (nor plans to include them, because they're understandably complex essentially right away) -- for a higher-order runtime like Docker to take advantage of this, it'd have to split "network creation" processing code into separate paths based on what type of networking is requested (such as pre-creating veth devices before passing a bundle along so they can be moved by runc, which I think has some big rough edges too).

@aojea
Copy link
Author

aojea commented Dec 9, 2024

That being said, I think the use cases for this are still pretty thin since it's just moves of pre-existing interfaces, and most container networking operations will be about new interfaces (veth at the very least).

From the Kubernetes side, we have a considerable number of workloads and demand, specially those related to AI/ML and Telco , that require ONLY the addition of existing network interfaces on the node to be moved to the containers ... existing implementation based on CNI need to rely on out-of-band mechanisms to compensate this problem with a lot of glue and red tape that makes these operations very brittle ... this problem can be easily solved with a declaratively way to indicate "please, move mlx0 to the container" ... that is this proposal

for a higher-order runtime like Docker to take advantage of this, it'd have to split "network creation" processing code into separate paths based on what type of networking is requested (such as pre-creating veth devices before passing a bundle along so they can be moved by runc, which I think has some big rough edges too).

This proposal is about "network interface" as device interface not "networks" as "docker networks" or CNI, I've tried to make this distinction clear in several places to avoid conflating both problems that are completely different.

For a high level runtime these should just be docker run -t -i --net-device=ml0 ubuntu bash.

config-linux.md Outdated
* **`addresses`** *(array of strings, OPTIONAL)* - the IP addresses, IPv4 and or IPv6, of the device within the container in CIDR format (IP address / Prefix). All IPv4 addresses SHOULD be expressed in their decimal format, consisting of four decimal numbers separated by periods. Each number ranges from 0 to 255 and represents an octet of the address. IPv6 addresses SHOULD be represented in their canonical form as defined in RFC 5952.
The runtime MAY limit the number of addresses allowed.
The runtime MAY decide to revert back the original addreses.
* **`hardwareAddress`** *(string, OPTIONAL)* - represents the hardware address (e.g. MAC Address) of the device's network interface.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the string representation of the MAC address should be clarified

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, clarified to mach the golang ParseMac function https://pkg.go.dev/net#ParseMAC

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds too specific to Go implementation.

Probably just need to support the single standard form (HEX:HEX:HEX:HEX:HEX:HEX)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably just need to support the single standard form (HEX:HEX:HEX:HEX:HEX:HEX)

that is the IEEE 802 MAC-48

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think golang implementation just got all the common formats https://networkengineering.stackexchange.com/a/82800

@AkihiroSuda AkihiroSuda added this to the v1.3.0 (tentative) milestone Dec 10, 2024
config-linux.md Show resolved Hide resolved
config-linux.md Outdated
The runtime MUST set the network device state to "up" after moving it to the network namespace to allow the container to send and receive network traffic through that device.

Notice that after deleting a network namespace, all its migratable network devices are moved to the default network namespace, virtual devices (veth, macvlan, ...) are destroyed.
The runtime MAY decide to move back or destroy the network device before the network namespace is deleted. If the network device is moved back, the runtime MUST set its state to "down" before moving it back to ensure that the interface is no longer active and won't interfere with other network operations or cause IP address conflicts.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much the same comment as above ^, why MAY decide to move back and not MUST?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are the experts in the runtime worlds so I will follow your advice, if you prefer and agree it should be MUST I don't have more argument than I'm suggesting MAY because I always prefer to be conservative when defining APIs and spec, I'm worried if some runtimes may not be able to participate in deletion or it can be very complex for them to do it, like I imaging can be the case for kata and gvisor. Being flexible allows them to implement the spec and be fully complaint instead of having to document it as limitation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks.

Being flexible allows them to implement the spec and be fully complaint instead of having to document it as limitation.

I think either way they'd need to specify whether they support moving back ifaces into the original netns.

I'm not sure whether MUST is better than MAY here or not. If no else has anything to say, let's keep it as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think describing the current flow for intended use cases will help inform what we should do here. Are the interfaces that are moved into the containers reused or destroyed or both are valid cases?
We could have a cleanup policy field for each device if we need to support both.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put it that way, I think that should be a MUST, since a virtual interface has to be created out of this flow, does not make sense to add the responsability to destroy it to the runtime ... it also makes more clear the spec, with @mrunalp questions I recognize there is a lot of ambiguity, and despite it can be more amenable for runtime implementers it sounds like it be more confusing for users.

I will push a change to make this a MUST

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrunalp @akerouanton added a last commit with MUST return back the interface in down state 47c4d3c , so it removes ambiguity and makes simpler to build automation over it,

  1. node has interface A or user creates interface A
  2. user tells runtime "run with --net-device A"
  3. user tells runtime "stop container"
  4. node has interface A in down state

then is up to the user to do what it wants with interface A, if already exist it can not do much, if the user created the interface then is on him to delete or reuse it

config-linux.md Outdated Show resolved Hide resolved
@utam0k
Copy link
Member

utam0k commented Dec 12, 2024

@aojea Thanks for your hard work! It looks good to me. Also, I agree that it includes 1.3.0.

config-linux.md Outdated Show resolved Hide resolved
config-linux.md Outdated Show resolved Hide resolved
config-linux.md Outdated Show resolved Hide resolved
config-linux.md Outdated Show resolved Hide resolved

* **`name`** *(string, OPTIONAL)* - the name of the network device inside the container namespace. If not specified, the host name is used. The network device name is unique per network namespace, if an existing network device with the same name exists that rename operation will fail. The runtime MAY check that the name is unique before the rename operation.
The runtime MUST revert back the original name to guarantee the idempotence of operations, so a container that moves an interfaces and renames it can be created and destroyed multiple times with the same result.
* **`addresses`** *(array of strings, OPTIONAL)* - the IP addresses, IPv4 and or IPv6, of the device within the container in CIDR format (IP address / Prefix). All IPv4 addresses SHOULD be expressed in their decimal format, consisting of four decimal numbers separated by periods. Each number ranges from 0 to 255 and represents an octet of the address. IPv6 addresses SHOULD be represented in their canonical form as defined in RFC 5952.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the runtime expected to set this? It looks like it is. Let us say that in the spec.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the input to the runtime, the runtime may choose how to set them meanwhile is consistent.
The context is that from kubernetes we got bitten by this, so is a recommendation because we find very hard to enforce this as input as it may break some clients , more context in https://daniel.haxx.se/blog/2021/04/19/curl-those-funny-ipv4-addresses/

specs-go/config.go Outdated Show resolved Hide resolved
@aojea
Copy link
Author

aojea commented Dec 12, 2024

Changelog since last review, only the part that was not clear, to decide if runtime MUST or MAY bring back the interface to the host namespace, got updated to make it MUST , since is less ambiguous and covers all use cases, more on #1271 (comment)

@AkihiroSuda
Copy link
Member

Please clean up the commits

The proposed "netdevices" field provides a declarative way to
specify which host network devices should be moved into a container's
network namespace.

This approach is similar than the existing "devices" field used for block
devices but uses a dictionary keyed by the interface name instead.

The proposed scheme is based on the existing representation of network
device by the `struct net_device`
https://docs.kernel.org/networking/netdevices.html.

This proposal focuses solely on moving existing network devices into
the container namespace. It does not cover the complexities of
network configuration or network interface creation, emphasizing the
separation of device management and network configuration.

Signed-off-by: Antonio Ojea <aojea@google.com>
@aojea
Copy link
Author

aojea commented Jan 7, 2025

Please clean up the commits

squashed

Comment on lines +207 to +208
The runtime MUST move back the network device before the network namespace is deleted.
The runtime MUST set the network device state to "down" before moving it back to ensure that the interface is no longer active and won't interfere with other network operations or cause IP address conflicts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what happens if container's init has exited or is killed (by means other than runtime kill)?

Documentation (network_namespaces(7)) says that "When a network namespace is freed (i.e., when the last process in the namespace terminates), its physical network devices are moved back to the initial network namespace".

Currently, runc has no way of monitoring when a container exits. Meaning, it won't be able to perform those cleanups written as a MUST here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this point had a large debate, see #1271 (comment) , I initially suggested a MAY because I'm not fully familiar with all the runtimes details and diversity so I think relaxing this requirement will be simpler, but the feedback from the reviews indicated that MUST be preferred.

@kolyshkin is acceptable to rewrite this as:

Suggested change
The runtime MUST move back the network device before the network namespace is deleted.
The runtime MUST set the network device state to "down" before moving it back to ensure that the interface is no longer active and won't interfere with other network operations or cause IP address conflicts.
The runtime MUST move back the network device before the network namespace is deleted, unless it is not possible due to limitations in the runtime's ability to monitor the container's process lifecycle or to interact with the network namespace after the container's init process has exited. In such cases, the runtime SHOULD log an event indicating that it was unable to perform the network cleanup.
If the runtime is able to move the network device, it MUST set the network device state to "down" before moving it back to ensure that the interface is no longer active and won't interfere with other network operations or cause IP address conflicts.

@kolyshkin
Copy link
Contributor

This is somewhat strange that runtime is expected to set some device properties like

  • device name;
  • hardware (e.g. MAC) address;
  • IP address and network mask;
  • MTU;

but not any other properties. Take a look at e.g. ip-link(8) man page which lists some low-level network device properties.

The above man page does not even include the higher-level properties such as IP addresses (managed by ip-address(8) -- again, in this department there's more than just IPv4/v6 address and netmask).

It seems to me that the runtime scope should be limited to

  • moving the network device into a container's net namespace;
  • renaming it (BTW Linux allows a single network device to have multiple names for the last 10 or 15 years, so name in spec should be an array);
  • (maybe) switching it to down state.

As for the rest of the properties, this should be performed by software running inside the container. We have tools such as ip for that, and they should be used, or we may end up having to implement the entirety of ip in runtime implementations.

As for cleanup, see my earlier comment here. Similarly, it's up to a container to perform necessary cleanup for its network devices before it stops itself.

Now, if you absolutely have to have runtime configuring a network device, use a CreateContainer hook from which you can call e.g. ip. But better, make in-container software take care of that.

@aojea
Copy link
Author

aojea commented Jan 16, 2025

@kolyshkin your comments are spot on and were the ones more discussed during the proposal, I replied to the cleanup in #1271 (comment) , let me address the network configuration here:

As for the rest of the properties, this should be performed by software running inside the container. We have tools such as ip for that, and they should be used, or we may end up having to implement the entirety of ip in runtime implementations.

I very much agree with you that network configuration should be handled at the process level within the container, I make reference to that in several places in the proposal #1239 . However, I believe there's a valid case for what we can call "network initialization" to be handled by the runtime.

The key distinction here is between initialization and configuration. My proposal focuses on providing the bare minimum initialization required for basic network functionality within the container. This includes essential properties like the device name, MAC address, IP address, netmask, and MTU. These are the fundamental attributes that are necessary for the interface to be operational at all.

The benefits I see with this initial setup in the runtime is that it reduces the complexity of integration and avoids potential race conditions where the application might try to use the network before it's properly initialized. It also allow to reduce container size and complexity if we don't need to include scripts or logic to perform this basic network setup every time.

I like to think of this as: the runtime provides a basic "network plumbing" framework, while the application running inside the container is free to customize and configure the network according to its specific needs using tools like ip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: Network Devices