Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for userspace device drivers with HW offload mode #322

Merged
merged 2 commits into from
Aug 15, 2024

Conversation

kim-tae-kyung
Copy link
Contributor

Add support for SR-IOV Virtual Functions (VFs) using both hardware offload (switchdev) and userspace device drivers such as vfio-pci.
If the MAC address is provided via arguments, update the VF’s MAC address accordingly using netlink.

What this PR does / why we need it:

This feature is essential for scenarios where SR-IOV VFs require both userspace device drivers and hardware offload.
For instance, containerized network functions with Intel E810 may enable eswitch switchdev mode for hardware offload while using the vfio-pci driver for running DPDK or VPP.
Another use case is the KubeVirt SR-IOV Interface.
After merging this PR, users can create a KubeVirt VM supported by OVS hardware offload without modifying the KubeVirt source code.

Special notes for your reviewer:

Even in scenarios where userspace device drivers are employed, and there is no netdevice for the VF on the host, the CNI should still return the name and sandbox for the VF interface to be recognized by Multus.

Release note:

- Support HW offload with userspace device driver
- Support MAC address update in HW offload mode

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jul 27, 2024
@kubevirt-bot
Copy link
Collaborator

Hi @kim-tae-kyung. Thanks for your PR.

I'm waiting for a k8snetworkplumbingwg member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubevirt-bot kubevirt-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 27, 2024
@phoracek
Copy link
Member

@kim-tae-kyung thanks for the PR.

@ykulazhenkov any chance you would be interested in reviewing this? Sadly, I'm not all that familiar with the SR-IOV portion of the project, and I don't feel confident reviewing this.

Copy link
Contributor

@ykulazhenkov ykulazhenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the PR @kim-tae-kyung.

I added few comments/questions

// set VF MAC address if provided, otherwise get it from the VF
// use the PF netlink, since there is no netdevice for the VF
if mac != "" {
if err := netlink.LinkSetVfHardwareAddr(pfLink, vfIdx, hwaddr); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the sriov-cni hardware address for VFs is configured for userspace drivers also. I think we need to do the same.

https://github.com/k8snetworkplumbingwg/sriov-cni/blob/176bd65bbe9129203f51dc686e2f13a179dc3da5/cmd/sriov/main.go#L106

https://github.com/k8snetworkplumbingwg/sriov-cni/blob/176bd65bbe9129203f51dc686e2f13a179dc3da5/pkg/sriov/sriov.go#L258

For userspace - configure mac with LinkSetVfHardwareAddr
For kernel net - configure mac with LinkSetVfHardwareAddr + LinkSetHardwareAddr (for VF net dev)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. For kernel net, execute ip link set $link address $hwaddr (LinkSetHardwareAddr) as well as ip link set $link vf $vf mac $hwaddr (LinkSetVfHardwareAddr)

ovs-cni/pkg/sriov/sriov.go

Lines 200 to 206 in c65a4e6

// if MAC address is provided, set it to the VF by using PF netlink
// which is accessible in the host namespace, not in the container namespace
if hwaddr != nil {
if err := netlink.LinkSetVfHardwareAddr(pfLink, vfIdx, hwaddr); err != nil {
return err
}
}

_, err = renameLink(vfNetdevice, contIface.Name)
// in userspace driver mode, configure the smart VF netdevice via PF netlink
// otherwise, configure the smart VF netdevice directly in the container namespace
if !dpdkMode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to create a separate functions for dpdk and non-dpdk scenarios and call these functions from the SetupSriovInterface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I create two functions, setupKernelSriovContIface and setupUserspaceSriovContIface.
Thanks to you, the code is now cleaner.

ovs-cni/pkg/sriov/sriov.go

Lines 292 to 302 in 074e965

if !userspaceMode {
// configure the smart VF netdevice directly in the container namespace
if err = setupKernelSriovContIface(contNetns, contIface, deviceID, ifName, mtu); err != nil {
return nil, nil, err
}
} else {
// configure the smart VF netdevice via PF netlink
if err = setupUserspaceSriovContIface(contNetns, contIface, pfLink, vfIdx, ifName); err != nil {
return nil, nil, err
}
}

@@ -672,6 +700,11 @@ func CmdCheck(args *skel.CmdArgs) error {
return err
}

// TODO: CmdCheck for userspace driver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to implement this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not in this PR. I will make a separate commit for CmdCheck later.

}
}
// remove the VF representor port from the bridge
repIfName, err := sriov.GetNetRepresentor(cache.Netconf.DeviceID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Your review is correct. My code was a duplicate implementation, so I have removed that part.

if sriov.IsOvsHardwareOffloadEnabled(cache.Netconf.DeviceID) {
// there is no network interface in case of userspace driver, so OrigIfName is empty
if !cache.UserspaceMode {
err = sriov.ReleaseVF(args, cache.OrigIfName)
if err != nil {
// try to reset vf into original state as much as possible in case of error
if err := sriov.ResetVF(args, cache.Netconf.DeviceID, cache.OrigIfName); err != nil {
log.Printf("Failed best-effort cleanup of VF %s: %v", cache.OrigIfName, err)
}
}
}
} else {

@@ -40,6 +40,7 @@ type NetConf struct {
SocketFile string `json:"socket_file"`
LinkStateCheckRetries int `json:"link_state_check_retries"`
LinkStateCheckInterval int `json:"link_state_check_interval"`
DPDKMode bool `json:"-,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear what you trying to achieve with `json:"-,omitempty"` syntax. If you want to hide the field from the API, then the correct syntax is just `json:"-"`. Currently DPDKMode renders as "-" in the cached json.

I see two options here:

  1. use meaningful json filed name for DPDKMode in NetConf struct
  2. move DPDKMode field to the CachedNetConf struct

The first option will pollute the user API. My personal preference is to use opt 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I also prefer the second option.
Additionally, I changed DPDKMode to UserspaceMode, as the VF can be used not only for containerized DPDK but also for containerized QEMU, such as in KubeVirt.

// CachedNetConf containing NetConfig, original smartnic vf interface name
// and kernel/userspace device driver mode of the smartnic vf interface
// (the last two are set only in case of ovs hareware offload scenario).
// this is intended to be used only for storing and retrieving config
// to/from a data store (example file cache).
type CachedNetConf struct {
Netconf *NetConf
OrigIfName string
UserspaceMode bool
}

@kim-tae-kyung
Copy link
Contributor Author

@ykulazhenkov Thanks for your review!
I have reflected your comments. You made me realize that there were missing features and duplicated codes.

Copy link
Contributor

@ykulazhenkov ykulazhenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kim-tae-kyung thanks for addressing my comments.

I added one important comment and one nit.

if err != nil {
return err
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if you remove the else branch the code will have the same behavior.
If you will decide to drop the else branch, you can also change line 295 to userspaceMode := false to be more explicitly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I agree. The else branch was wired.

// check if the device driver is the type of userspace driver
userspaceMode := false
if sriov.IsOvsHardwareOffloadEnabled(netconf.DeviceID) {
userspaceMode, err = sriov.HasUserspaceDriver(netconf.DeviceID)
if err != nil {
return err
}
}

return err
}

if err := validateCache(cache, netconf); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache validation may fail in case if "bridge name auto discovery" is used. Please, check the comment on the line 689. You need to move cache validation call to be after getBridgeName() is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It was my mistake. Thanks for your comment.
I moved the IPAM check below the cache validation, instead of moving the cache validation up.

// run the IPAM plugin
// userspace driver does not support IPAM plugin,
// because there is no network interface for the VF on the host
if netconf.NetConf.IPAM.Type != "" && !cache.UserspaceMode {
err = ipam.ExecCheck(netconf.NetConf.IPAM.Type, args.StdinData)
if err != nil {
return fmt.Errorf("failed to check with IPAM plugin type %q: %v", netconf.NetConf.IPAM.Type, err)
}
}

Add support for SR-IOV VFs using both hardware offload (switchdev)
and userspace device driver such as vfio-pci.

Signed-off-by: Taekyung Kim <kim.tae.kyung@sk.com>
If the MAC address is provided from args, update the MAC address of the
VF to the provided MAC address via netlink.

Signed-off-by: Taekyung Kim <kim.tae.kyung@sk.com>
Copy link
Contributor

@ykulazhenkov ykulazhenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, looks good to me

@ykulazhenkov
Copy link
Contributor

ykulazhenkov commented Aug 12, 2024

@phoracek The PR looks good to me. Can you trigger the CI check?

@phoracek
Copy link
Member

/test pull-e2e-ovs-cni

@phoracek
Copy link
Member

Thanks a lot for the review @ykulazhenkov

@SchSeba do you also want to take a look before we merge this?

Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

nice work!

@kubevirt-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kim-tae-kyung, SchSeba, ykulazhenkov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 15, 2024
@phoracek
Copy link
Member

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 15, 2024
@kubevirt-bot kubevirt-bot merged commit d25ad9c into k8snetworkplumbingwg:main Aug 15, 2024
5 checks passed
@kim-tae-kyung
Copy link
Contributor Author

Thanks to All!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants