GPU Support #849

Closed
amrragab8080 opened this issue Jan 14, 2019 · 12 comments
Labels
Roadmap: New Request Type: Question Indicates that an issue, pull request, or discussion needs more information

Comments

@amrragab8080

The public-facing API doesn't seem to support passthrough of PCI devices yet, namely GPUs. Is this technically feasible?

@raduweiss added the Type: Question and Roadmap: New Request labels Jan 16, 2019
@goswamig

That would be super helpful to have support for GPU.

@alexandruag added the Priority: High label Feb 15, 2019
@acatangiu added the Priority: Low label and removed the Priority: High label Feb 18, 2019
@acatangiu self-assigned this Feb 18, 2019
@acatangiu
Contributor

GPU support in Firecracker is very hard/tricky at the moment. With current GPU hardware, there are two major problems:

  1. Device pass-through implies pinning physical memory, which would remove our memory oversubscription capabilities.
  2. We can only run 1 customer workload securely per physical GPU, and switching between customer workloads takes a long enough time to make it impractical.

As a result there is no known path to supporting GPUs in Firecracker.
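
A rough back-of-the-envelope sketch of the first problem follows. It uses a toy model that is not taken from Firecracker: with pinned memory every microVM is charged its full reservation, while without pinning the host is charged only what a microVM typically touches. The function name `max_vms` and all numbers (256 GiB host, 4 GiB reservation, 1 GiB typical use) are hypothetical and chosen only for illustration.

```python
def max_vms(host_mem_gib: float, vm_reserved_gib: float,
            vm_typical_gib: float, pinned: bool) -> int:
    """Count how many microVMs fit on one host under a toy memory model.

    pinned=True  -> each VM permanently holds its full reservation
                    (the situation device pass-through would force).
    pinned=False -> each VM is charged only its typical usage
                    (memory oversubscription).
    All parameters are hypothetical illustration values.
    """
    per_vm = vm_reserved_gib if pinned else vm_typical_gib
    return int(host_mem_gib // per_vm)


host_gib = 256.0
print(max_vms(host_gib, 4.0, 1.0, pinned=True))   # 64
print(max_vms(host_gib, 4.0, 1.0, pinned=False))  # 256
```

Under these assumed numbers, pinning cuts the number of microVMs per host by a factor of four. The real ratio depends entirely on how far typical usage falls below the reservation.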

@raduweiss
Contributor

@amrragab8080, we'll be looking at this as part of #1179.

@normtown

> Device pass-through implies pinning physical memory, which would remove our memory oversubscription capabilities.

Why do you want to maintain the ability to oversubscribe memory?

@raduweiss
Contributor

Oversubscription is a core part of what makes Firecracker a great way to isolate serverless workloads; that's why we took on a tenet around it [1].

[1] https://github.com/firecracker-microvm/firecracker/blob/master/CHARTER.md

@normtown

Why is over-subscription a great way to isolate serverless workloads? I genuinely don't know, so the reasoning that led to the existence of the tenet is not self-evident to me.

@raduweiss
Contributor

Like all services, serverless compute providers want to keep their servers busy and to improve their overall utilization. Ideally, every CPU cycle on the service provider's servers is running user code, and every byte of RAM is filled with user data. If servers are sitting idle, that’s inefficient.

Part of solving this optimization problem is the ability to oversubscribe a given server's hardware capacity with workloads whose hardware resource usage is statistically uncorrelated, or, even better, with workloads selected specifically to pack well together.
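
The effect of uncorrelated usage can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a description of how any real scheduler works: each workload independently uses either its full memory reservation or nothing at each time step, and the function name, parameters, and numbers are all hypothetical.

```python
import random


def peak_aggregate_usage(n_workloads: int, reservation_mib: int,
                         busy_probability: float, n_steps: int,
                         seed: int = 0) -> int:
    """Return the highest total memory use seen across n_steps.

    Toy model (an assumption for illustration): at every step each
    workload is independently busy with probability busy_probability,
    and a busy workload uses its entire reservation.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    peak = 0
    for _ in range(n_steps):
        active = sum(1 for _ in range(n_workloads)
                     if rng.random() < busy_probability)
        peak = max(peak, active * reservation_mib)
    return peak


reserved_total = 100 * 128  # 100 workloads, 128 MiB reserved each
peak = peak_aggregate_usage(100, 128, busy_probability=0.1, n_steps=1000)
print(f"sum of reservations: {reserved_total} MiB, observed peak: {peak} MiB")
```

With independent, bursty workloads the observed peak stays well below the sum of reservations, which is the headroom oversubscription relies on. Setting `busy_probability=1.0` models fully correlated demand, and in that case the peak equals the sum of reservations and there is no headroom at all.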

@normtown

What you appear to be saying is that resource over-subscription helps the hosting service (e.g. AWS Lambda or Fargate) to lower their hardware costs. (Which, in turn, passes savings on to customers...presumably.)

That is not the same as being great for isolating workloads. It seems to be the opposite, particularly when all workloads attempt to use their full resource reservations at the same time. It sounds like the design here is to bet on the workloads not calling in all their debts.

How ingrained into the Firecracker implementation is this resource-over-subscription tenet? Like, would it be remotely feasible to add a feature flag that turns over-subscription off?

P.S. As an aside, the Firecracker tenets don't seem to align with the Fargate project. Specifically the tenet that calls out favoring transient or stateless workloads over long-running or persistent workloads. The Fargate docs do not place similar restrictions on its workloads (AFAICT).

@raduweiss
Contributor

raduweiss commented Dec 10, 2019

> That is not the same as being great for isolating workloads. It seems to be the opposite, particularly when all workloads attempt to use their full resource reservations at the same time. It sounds like the design here is to bet on the workloads not calling in all their debts.

Great for isolating serverless workloads, which are bursty and pay-only-when-running. Take a look at https://www.youtube.com/watch?v=QdzV04T_kec; there's some more detail there on how Lambda multiplexes workloads.

> How ingrained into the Firecracker implementation is this resource-over-subscription tenet? Like, would it be remotely feasible to add a feature flag that turns over-subscription off?

Well, it's a tenet so we stick to it unless there's a very good reason to change it.

> P.S. As an aside, the Firecracker tenets don't seem to align with the Fargate project. Specifically the tenet that calls out favoring transient or stateless workloads over long-running or persistent workloads. The Fargate docs do not place similar restrictions on its workloads (AFAICT).

You're quite right here :) This tenet started out as a powerful simplifying assumption, but as you pointed out, it doesn't quite apply to all the serverless container workloads; we might let go of the "transient and stateless" part.

@DemiMarie

>   2. We can only run 1 customer workload securely per physical GPU, and switching between customer workloads takes a long enough time to make it impractical.

What is the reason for this? Is the attack surface of e.g. virtio-gpu or Venus excessive?

@DemiMarie

> GPU support in Firecracker is very hard/tricky at the moment. With current GPU hardware, there are two major problems:
>
>   1. Device pass-through implies pinning physical memory, which would remove our memory oversubscription capabilities.

In theory it is possible to do better by dynamically manipulating guest IOMMU mappings.

>   2. We can only run 1 customer workload securely per physical GPU, and switching between customer workloads takes a long enough time to make it impractical.

Does this also apply to SR-IOV capable GPUs? What about e.g. attacks in which the guest overwrites the GPU’s vBIOS?

@Talador12

Talador12 commented Aug 20, 2024

> GPU support in Firecracker is very hard/tricky at the moment. With current GPU hardware, there are two major problems:
>
>   1. Device pass-through implies pinning physical memory, which would remove our memory oversubscription capabilities.
>   2. We can only run 1 customer workload securely per physical GPU, and switching between customer workloads takes a long enough time to make it impractical.
>
> As a result there is no known path to supporting GPUs in Firecracker.

This is becoming increasingly important to support. It may be difficult, but we need to find a way to do this.

This can also be managed with NVIDIA MIG (cutting the physical GPU into slices) by exposing a specific slice to each VM. This does impose capacity limits in the current state, but only on GPU capacity, which is widely accepted at the moment.
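
The capacity limit mentioned above can be pictured with a toy allocator. This is only a sketch of the bookkeeping idea and does not call any NVIDIA or Firecracker API: the class `SlicedGpu`, its methods, and the default of seven slices are all hypothetical names and values chosen for illustration.

```python
from typing import List, Optional


class SlicedGpu:
    """Toy model of one physical GPU divided into fixed slices.

    Hypothetical illustration only: a slice is just a list slot,
    and each slot is owned by at most one VM at a time.
    """

    def __init__(self, n_slices: int = 7) -> None:
        self.owner: List[Optional[str]] = [None] * n_slices

    def assign(self, vm_id: str) -> int:
        """Give the first free slice to vm_id and return its index."""
        for index, current in enumerate(self.owner):
            if current is None:
                self.owner[index] = vm_id
                return index
        raise RuntimeError("no free GPU slice")

    def release(self, vm_id: str) -> None:
        """Free every slice held by vm_id."""
        self.owner = [None if o == vm_id else o for o in self.owner]


gpu = SlicedGpu(n_slices=2)
print(gpu.assign("vm-a"))  # 0
print(gpu.assign("vm-b"))  # 1
gpu.release("vm-a")
print(gpu.assign("vm-c"))  # 0
```

Once every slice has an owner, further `assign` calls fail until a VM releases its slice, so the number of GPU-backed VMs per device is capped by the slice count rather than by host memory.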

Also, this issue is not really "Closed": it is not implemented, and users are still asking for it. We should move this conversation to #1179, since that issue is still open.

8 participants