
KMS as Tapp #91

Open · kvinwang wants to merge 109 commits into master from kms-onchain

Conversation

@kvinwang kvinwang commented Jan 10, 2025

This PR implements an onchain-governed KMS that supports deploying itself as a Tapp on dstack. See the doc for the details.

Key rotation is not implemented in this PR. Since dstack supports base image upgrading, I think we can implement key rotation in a future version.

@Leechael Please pay attention to the RPC changes in KMS and Tappd.

The reproducibility of the Tapp docker image is not done yet. Here is a temporary alternative that builds the images on-site:

services:
  # Helios is an Ethereum light client
  helios:
    build:
      context: .
      dockerfile_inline: |
        FROM rust:1.80-alpine@sha256:1f5aff501e02c1384ec61bb47f89e3eebf60e287e6ed5d1c598077afc82e83d5
        WORKDIR /app
        RUN apk add --no-cache git build-base openssl-dev protobuf protobuf-dev perl
        RUN git clone https://github.com/a16z/helios && \
            cd helios && \
            git checkout 5c61864a167c16141a9a12b976c0e9398b332f07
        RUN cd helios && cargo build --release
        WORKDIR /app/helios
    command: [
      "./target/release/helios",
      "ethereum",
      "--network", "mainnet",
      "--checkpoint", "0xbee4f32f91e62060d2aa41c652f6c69431829cfb09b02ea3cad92f65bd15dcce",
      "--rpc-bind-ip", "0.0.0.0",
      "--rpc-port", "8545",
      "--consensus-rpc", "https://ethereum.operationsolarstorm.org",
      "--execution-rpc", "https://ethereum-rpc.publicnode.com"
    ]

  # Auth API is a webhook server that authenticates KMS instance and App launches
  auth-api:
    build:
      context: .
      dockerfile_inline: |
        FROM node:18-alpine@sha256:06f7bbbcec00dd10c21a3a0962609600159601b5004d84aff142977b449168e9
        WORKDIR /app

        RUN apk add --no-cache git
        RUN git clone https://github.com/Dstack-TEE/dstack.git && \
            cd dstack && \
            git checkout 4d9011d027af993e17a3fb8182a7249409b78a2e
        WORKDIR /app/dstack/kms/auth-eth
        RUN npm install && \
            npx hardhat typechain && \
            npx tsc --project tsconfig.json
        CMD node dist/src/main.js
    environment:
      - HOST=0.0.0.0
      - PORT=8000
      - ETH_RPC_URL=http://helios:8545
      - KMS_CONTRACT_ADDR=0x515cd0dd984Adc75EE8625131FcC3d2D91f7F140
    depends_on:
      - helios

  # KMS handles the TEE Remote Attestation
  kms:
    build:
      context: .
      dockerfile_inline: |
        FROM rust:1.80-alpine@sha256:1f5aff501e02c1384ec61bb47f89e3eebf60e287e6ed5d1c598077afc82e83d5
        WORKDIR /app
        RUN apk add --no-cache git build-base openssl-dev protobuf protobuf-dev
        RUN git clone https://github.com/Dstack-TEE/dstack.git && \
            cd dstack && \
            git checkout 4d9011d027af993e17a3fb8182a7249409b78a2e
        RUN cd dstack/kms && cargo build --release -p kms
        WORKDIR /app/dstack
        CMD ["./target/release/kms"]
    volumes:
      - kms-volume:/etc/kms
      - /var/run/tappd.sock:/var/run/tappd.sock
    ports:
      - 8000:8000
    depends_on:
      - auth-api

volumes:
  kms-volume:

@kvinwang kvinwang force-pushed the kms-onchain branch 2 times, most recently from 7fb876c to c70a6ed on January 16, 2025 06:17
@kvinwang kvinwang requested a review from shelvenzhou January 17, 2025 06:46
Comment on lines +83 to +90
#[derive(Deserialize, Serialize, Debug, Clone)]
pub struct LocalConfig {
    pub kms_url: Option<String>,
    pub tproxy_url: Option<String>,
    pub pccs_url: Option<String>,
    pub docker_registry: Option<String>,
    pub host_api_url: String,
}


For what purpose, and where, is this struct being used?


@kvinwang kvinwang Jan 23, 2025


The information is generated by the host-side teepod and passed into the CVM via a config.json. It is then used inside the CVM as follows (a minimal loading sketch follows the list):

  • kms_url is used to connect to the KMS to get app keys.
  • pccs_url is used when verifying the quote of the key-provider.
  • tproxy_url is used to set up the tproxy network, if enabled.
  • docker_registry is used as a cache for the official Docker Hub registry.
  • host_api_url is a VSOCK address used to notify teepod of the CVM's boot progress.
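
For illustration only, here is a minimal sketch of how the CVM side could load that config.json into the LocalConfig struct above. The file path, error handling, and main function are assumptions, not the actual teepod/CVM code.

```rust
use serde::{Deserialize, Serialize};

#[derive(Deserialize, Serialize, Debug, Clone)]
pub struct LocalConfig {
    pub kms_url: Option<String>,
    pub tproxy_url: Option<String>,
    pub pccs_url: Option<String>,
    pub docker_registry: Option<String>,
    pub host_api_url: String,
}

fn load_local_config(path: &str) -> anyhow::Result<LocalConfig> {
    // The host-side teepod writes this file into the CVM before boot.
    let raw = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&raw)?)
}

fn main() -> anyhow::Result<()> {
    // The path is a placeholder; the real location depends on where the
    // host-shared directory is mounted inside the CVM.
    let config = load_local_config("/host-shared/config.json")?;
    println!("kms_url = {:?}", config.kms_url);
    Ok(())
}
```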


I have follow-up questions here:

  • Does it make sense to have an attestation over this initial configuration hash passed by the host, so the dev can make sure the initial boot used the correct config? Because IIUC, this config is not part of the CVM's measurements (RTMR0-2). So it would probably be either extended into RTMR3 or included in the user-report-data when requesting an attestation quote, to verify the boot-up process of the CVM, correct?
  • If the answer to the above question is yes, could you refer me to that piece please?
  • Slightly unrelated question, just for my understanding: if a TApp is deployed on a similar CVM base image with the same config, how does the dev differentiate their instance from others trying to forge the same identity? Does this require pre-deployment setup, such as calculating the TApp measurements and deploying an AppAuth for each TApp, after which the light client within the CVM image checks the owner and compose hash of the App and then pulls it from the docker_registry? And vice versa, does the CVM need to verify the identity of the owner before the light client check and before pulling the TApp container? I'm asking because I'm not sure whether the dev/owner does the deployment themselves on bare metal or whether there is an intermediary host, like Azure/GCP, that initiates the deployments and controls the initial host-provided config.


@kvinwang kvinwang Jan 24, 2025


Very good questions.
Answers to questions 1 and 2:
We currently don't measure the LocalConfig. Instead of trusting these URLs, we do more concrete verifications individually (see the sketch at the end of this comment for the kms_url case):

  • For kms_url, instead of trusting the URL, the CVM connects to the KMS over RA-TLS. The KMS verifies the CVM's RA quote, and the CVM measures the KMS root public key into RTMR3; that public key is also registered on-chain. Users can compare the root public key in RTMR3 against the on-chain value to ensure the CVM gets its keys from the given KMS.
  • For pccs_url, the URL is never trusted; what it serves is verified against the root CA certificate in the verification library.
  • For tproxy_url, the connection is likewise protected by bidirectional RA-TLS verification.
  • For docker_registry, the URL is not trusted either. The best practice is to always pin image hashes in docker-compose.yaml, so that the docker engine guarantees image integrity after pulling the images.
  • For host_api_url, we don't need to trust it, as it is only used to emit boot progress and to proxy requests to the key-provider.

Regarding question 3:
Before deploying a Tapp CVM, an app-id must be registered in the KmsAuth contract and assigned an AppAuth contract that controls which compose hashes are allowed for the Tapp.
CVMs request app keys from the KMS at boot time. The KMS asks KmsAuth/AppAuth whether the compose-hash and other MRs are allowed to boot. If either KmsAuth or AppAuth rejects the boot request, the KMS will not distribute the keys and the CVM will abort the boot.
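
To make the kms_url point above concrete, here is a rough sketch of the comparison a user could perform. The event shape mirrors the KeyProviderInfo written to RTMR3 (name = "kms", id = hex of the KMS root CA public key); the function name and the way the event is extracted from the RTMR3 event log are assumptions for illustration, not the actual dstack verifier.

```rust
use serde::Deserialize;

// Mirrors the key-provider info written into RTMR3 at boot time.
#[derive(Deserialize)]
struct KeyProviderInfo {
    name: String,
    id: String,
}

/// Returns true if the key-provider event replayed from the CVM's RTMR3 event
/// log names the KMS and matches the root public key registered on-chain.
fn key_provider_matches_chain(
    event_json: &str,
    onchain_root_pubkey_hex: &str,
) -> anyhow::Result<bool> {
    let info: KeyProviderInfo = serde_json::from_str(event_json)?;
    let expected = onchain_root_pubkey_hex.trim_start_matches("0x");
    Ok(info.name == "kms" && info.id.eq_ignore_ascii_case(expected))
}
```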

return '0x' + hex;
}

async checkBoot(bootInfo: BootInfo, isKms: boolean): Promise<BootResponse> {
Collaborator Author

Review hint: This is where the KMS asks the smart contract for permission.


impl OnboardRpc for OnboardHandler {
    async fn bootstrap(self, request: BootstrapRequest) -> Result<BootstrapResponse> {
        let keys = Keys::generate(&request.domain).context("Failed to generate keys")?;
Collaborator Author

Review hint: Generating KMS root keys

})
}

async fn get_kms_key(self) -> Result<KmsKeyResponse> {
Collaborator Author

Review Hint: Here the running KMS shares its keys with a new KMS instance.

Comment on lines +49 to +53
AuthApi::Dev { dev } => Ok(BootResponse {
    is_allowed: true,
    reason: "".to_string(),
    tproxy_app_id: dev.tproxy_app_id.clone(),
}),
Collaborator Author

Review Hint: Allow any app to boot for a local dev instance.

Comment on lines +54 to +67
AuthApi::Webhook { webhook } => {
    let client = reqwest::Client::new();
    let path = if is_kms {
        "bootAuth/kms"
    } else {
        "bootAuth/app"
    };
    let url = url_join(&webhook.url, path);
    let response = client.post(&url).json(&boot_info).send().await?;
    if !response.status().is_success() {
        bail!("Failed to check boot auth: {}", response.text().await?);
    }
    Ok(response.json().await?)
}
Collaborator Author

Review Hint: Asking the webhook for boot authorization.
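
For reference, a minimal sketch of the reply shape the webhook is expected to return, mirroring the BootResponse fields used above (is_allowed, reason, tproxy_app_id). The on-wire field casing and the helper name are assumptions for illustration.

```rust
use anyhow::bail;
use serde::Deserialize;

// Mirrors the BootResponse the KMS consumes after calling the webhook.
#[derive(Deserialize, Debug)]
struct BootResponse {
    is_allowed: bool,
    reason: String,
    tproxy_app_id: String,
}

// Sketch of how the KMS side could act on the webhook's verdict.
fn require_allowed(body: &str) -> anyhow::Result<BootResponse> {
    let resp: BootResponse = serde_json::from_str(body)?;
    if !resp.is_allowed {
        bail!("boot rejected by auth webhook: {}", resp.reason);
    }
    Ok(resp)
}
```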

Comment on lines +205 to +243
let cert_pair = generate_ra_cert(tmp_ca.temp_ca_cert, tmp_ca.temp_ca_key)?;
let ra_client = RaClientConfig::builder()
    .tls_no_check(false)
    .tls_built_in_root_certs(false)
    .remote_uri(kms_url.clone())
    .tls_client_cert(cert_pair.cert_pem)
    .tls_client_key(cert_pair.key_pem)
    .tls_ca_cert(tmp_ca.ca_cert.clone())
    .cert_validator(Box::new(|cert| {
        let Some(cert) = cert else {
            bail!("Missing server cert");
        };
        let Some(usage) = cert.special_usage else {
            bail!("Missing server cert usage");
        };
        if usage != "kms:rpc" {
            bail!("Invalid server cert usage: {usage}");
        }
        Ok(())
    }))
    .build()
    .into_client()
    .context("Failed to create client")?;
let kms_client = kms_rpc::kms_client::KmsClient::new(ra_client);
let response = kms_client
    .get_app_key(GetAppKeyRequest {
        app_compose: fs::read_to_string(host_shared.dir.app_compose_file())
            .context("Failed to read app compose file")?,
    })
    .await
    .context("Failed to get app key")?;
{
    let (_, ca_pem) = x509_parser::pem::parse_x509_pem(tmp_ca.ca_cert.as_bytes())
        .context("Failed to parse ca cert")?;
    let x509 = ca_pem.parse_x509().context("Failed to parse ca cert")?;
    let id = hex::encode(x509.public_key().raw);
    let provider_info = KeyProviderInfo::new("kms".into(), id);
    emit_key_provider_info(&provider_info)?;
};
Collaborator Author

Review Hint (Important): Requesting app keys from the KMS here. The public key in the CA cert used by the remote endpoint of the TLS connection is written to RTMR3 as key-provider info.

let Some(usage) = cert.special_usage else {
bail!("Missing server cert usage");
};
if usage != "kms:rpc" {
Collaborator Author

Review Hint: A real KMS only issues certs with usage=kms:rpc to itself to serve RPC; it never issues them to apps.

Comment on lines +260 to +272
let provision = host
    .get_sealing_key()
    .await
    .context("Failed to get sealing key")?;
// write to fs
let app_keys =
    gen_app_keys_from_seed(&provision.sk).context("Failed to generate app keys")?;
let keys_json = serde_json::to_string(&app_keys).context("Failed to serialize app keys")?;
fs::write(self.app_keys_file(), keys_json).context("Failed to write app keys")?;

// write to RTMR
let provider_info = KeyProviderInfo::new("local-sgx".into(), hex::encode(provision.mr));
emit_key_provider_info(&provider_info)?;
Collaborator Author

Review Hint: For the local key provider, the SGX mr_enclave is written to RTMR3. The SGX quote is verified in fn get_sealing_key.

Comment on lines +46 to +57
let key_provider = &self.app.config.key_provider;
if !key_provider.enabled {
    bail!("Key provider is not enabled");
}
let response = get_key(request.quote, key_provider.address, key_provider.port)
    .await
    .context("Failed to get sealing key from key provider")?;

Ok(GetSealingKeyResponse {
    encrypted_key: response.encrypted_key,
    provider_quote: response.provider_quote,
})
Collaborator Author

Review Hint: A request from a CVM is delegated to the SGX key provider.

Comment on lines +122 to +156
let ra_validator = RaValidator {
    remote_app_id: self.app_keys.tproxy_app_id.clone(),
};
let cert_client =
    CertRequestClient::create(&self.app_keys, self.local_config.pccs_url.as_deref())
        .await
        .context("Failed to create cert client")?;
let client_key =
    KeyPair::generate_for(&PKCS_ECDSA_P256_SHA256).context("Failed to generate key")?;
let config = CertConfig {
    org_name: None,
    subject: "tappd".to_string(),
    subject_alt_names: vec![],
    usage_server_auth: false,
    usage_client_auth: true,
    ext_quote: true,
};
let client_certs = cert_client
    .request_cert(&client_key, config)
    .await
    .context("Failed to request cert")?;
let client_cert = client_certs.join("\n");
let ca_cert = self.app_keys.ca_cert.clone();
let client = RaClientConfig::builder()
    .remote_uri(url)
    .maybe_pccs_url(self.local_config.pccs_url.clone())
    .tls_client_cert(client_cert)
    .tls_client_key(client_key.serialize_pem())
    .tls_ca_cert(ca_cert)
    .tls_built_in_root_certs(false)
    .tls_no_check(self.app_keys.tproxy_app_id == "any")
    .cert_validator(Box::new(move |cert| ra_validator.validate(cert)))
    .build()
    .into_client()
    .context("Failed to create RA client")?;
Collaborator Author

Review Hint: Connect to tproxy here. If the KMS (smart contract) allows connecting to any tproxy, we don't validate the remote cert; otherwise we do bidirectional RA verification and ensure the remote app-id matches the one the KMS returned.
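
A minimal sketch of that app-id check, assuming the peer's app-id has already been extracted from its attested certificate; the struct and function names here are stand-ins, not the actual RaValidator API.

```rust
use anyhow::ensure;

// App identity extracted from the remote peer's attested TLS certificate.
struct PeerIdentity {
    app_id: String,
}

fn validate_tproxy_peer(expected_app_id: &str, peer: &PeerIdentity) -> anyhow::Result<()> {
    // "any" means the KMS (smart contract) allows connecting to any tproxy,
    // in which case the remote cert is not checked against an app-id.
    if expected_app_id == "any" {
        return Ok(());
    }
    ensure!(
        peer.app_id == expected_app_id,
        "tproxy app-id mismatch: expected {expected_app_id}, got {}",
        peer.app_id
    );
    Ok(())
}
```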

@kvinwang

@h4x3rotab I've added some review hints.

@@ -0,0 +1,80 @@
// SPDX-License-Identifier: MIT
Collaborator Author

Review Hint: And please pay attention to the smart contracts.

}

// Function to check if KMS is allowed to boot
function isKmsAllowed(

Not something for right now, but something we have spent a lot of time discussing is censorship resistance of de-registering measurements.
The worst case for the KMS is one where:

  • a vulnerability is discovered, and the corresponding measurement is de-registered
  • however, existing KMS instances are blocked from processing new blocks, and therefore don't actually see the de-registration transaction
  • an attacker can spend as long as they need to extract the root secret, since the KMS instance will never see the de-registration transaction
  • the de-registration might as well not have happened at all

To counteract this scenario, the KMS instance needs to make sure it's not being censored, neither in the short term nor the long term. What we have arrived at as a possible solution is interacting onchain (at KMS startup and then periodically requiring a random nonce to appear onchain), and limiting the number of actions allowed without forcing an onchain interaction. This way you never allow unlimited attempts at breaking a KMS instance: there are only so many attempts before you need to either reboot the instance (which requires an onchain interaction) or hit the action limit (which also requires an onchain interaction).
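
Not part of this PR, just a rough sketch of the action-limiting idea described above; the struct, method names, and the reset-on-fresh-nonce detail are assumptions for illustration.

```rust
/// Censorship-resistance guard: the KMS refuses to keep serving once too many
/// sensitive actions have happened since a freshly requested random nonce was
/// last observed on-chain.
struct OnchainLivenessGuard {
    actions_since_nonce: u32,
    max_actions_without_nonce: u32,
}

impl OnchainLivenessGuard {
    fn new(max_actions_without_nonce: u32) -> Self {
        Self { actions_since_nonce: 0, max_actions_without_nonce }
    }

    /// Call before every key-release or onboarding action.
    fn try_consume_action(&mut self) -> Result<(), &'static str> {
        if self.actions_since_nonce >= self.max_actions_without_nonce {
            return Err("onchain liveness proof required before further actions");
        }
        self.actions_since_nonce += 1;
        Ok(())
    }

    /// Call once the expected nonce has been seen in a recent block.
    fn record_fresh_onchain_nonce(&mut self) {
        self.actions_since_nonce = 0;
    }
}
```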

@kvinwang kvinwang Feb 5, 2025

That's an excellent consideration.

Let me see if I understand the proposed solution correctly:
The KMS node needs to interact with the chain both at startup and periodically by sending transactions, to verify it has access to up-to-date chain data.

I'm wondering if we could also consider using trusted time sources like NTS as an additional verification mechanism? Specifically, we could compare the block timestamp with the trusted timestamp; if the difference is within an acceptable threshold, it would help ensure the node is operating on fresh data rather than being fed stale blocks by an attacker.
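
A minimal sketch of that comparison (the threshold value is an assumption; the trusted timestamp would come from an NTS query and the block timestamp from the light client's chain head):

```rust
/// Returns true if the chain head looks fresh: its timestamp is within
/// `max_lag_secs` of a timestamp obtained from a trusted source such as NTS.
fn chain_head_is_fresh(block_timestamp: u64, trusted_timestamp: u64, max_lag_secs: u64) -> bool {
    // A censored or stale view shows a chain head lagging far behind trusted time.
    trusted_timestamp.saturating_sub(block_timestamp) <= max_lag_secs
}

fn main() {
    // Example values only: head produced 120s before the trusted time, 300s allowed.
    assert!(chain_head_is_fresh(1_700_000_000, 1_700_000_120, 300));
}
```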


> The KMS node needs to interact with the chain both at startup and periodically by sending transactions, to verify it has access to up-to-date chain data.

I think that's one way of doing it, yes

> I'm wondering if we could also consider using trusted time sources like NTS

I think so. Interacting with the blockchain is of course involved and costly, so it'd be best to avoid it. With NTS specifically it would be interesting, because we avoid relying on local time and only consider what we get from NTS and the timestamp of the current chain head. I guess the thing to make sure is that NTS can't be replayed or delayed in a way that defeats the verification (my guess is that it can't: as long as it's HTTPS, responses could only be delayed, never reordered), and then the verification is as good as NTS itself (pretty good, if you ask me, in this scenario).

For what it's worth, I think resolving this issue could be done at a later time.
Cc @MoeMahhouk since you were thinking about NTS recently.

"OK"
}

if !kms_config.onboard.auto_bootstrap_domain.is_empty() {

I wonder if each instance should be whitelisted on chain before being allowed to onboard. The added visibility might be well worth it, right?

Collaborator Author

The device-id (which is implemented as a hash of the ppid) is whitelisted. Do you think this is enough?


I'd guess that onboarding to the KMS (sharing the secret) is the most security-critical component, so I'd think it'd be worth having an onchain trace of it happening. Ideally, sharing the secret would require an interactive onchain challenge (which would also force an up-to-date view of the chain), but I'd imagine at the least that all honest KMS instances would make it known that a given device requested the KMS key (be it on chain or elsewhere), for transparency.

cert_validator: Option<CertValidator>,
}

impl RaClientConfig {
@Ruteri Ruteri Jan 31, 2025

While I don't think using RA-TLS is an issue, we have lately been moving away from relying on RA. Rather, we exchange the regular TLS certificates along with their attestations (an example alternative implementation would have each KMS instance register its locally generated TLS cert + attestation on boot). If registration prior to onboarding is something you'd want anyway, this could simplify the implementation a bit.

Collaborator Author

We need the RA either at the TLS level, the HTTP level, or the Application level.
Each has its advantages and disadvantages.

  • The benefit of the Application level is that it's easy to implement in different languages and web frameworks.
  • The benefits of the TLS level are that it is easier to abstract, and the attestation doesn't need to be defined in each RPC schema; once a TLS connection is attested, all data transferred over the connection is reliable.
    We can easily reuse the same framework across several services, so far:
    • KMS: onboarding RPC, key-providing RPC
    • tproxy: registration RPC
    • CVM: Inter-App Secure Communication (WIP)

The client side can simply use a curl command with an RA-TLS enabled cert to request those different services.

We may do the abstraction at the Application level, but that requires consideration of potential MITM attacks, which is still complex.


One nice thing about doing it at the Application level is that it's more front-and-center, while RA-TLS hides the details a bit.

> CVM: Inter-App Secure Communication (WIP)

Do keep in mind that establishing the RA-TLS channel takes considerably longer (~1s on Azure, for example; if you need to fetch from PCCS it's also going to take a while). Performance is one of the things that pushed us towards exchanging attested TLS certs beforehand and using regular TLS for all but the initial connection. RA-TLS is, I think, fine for registration and other non-performance-critical use cases, but inter-app communication may need connections to be established faster.
