docs: quick deploy using terraform (#634)
Provision AKS cluster and deploy KAITO using Terraform

See ./terraform/ for more information on how to deploy
pauldotyu authored Oct 16, 2024
1 parent 870a93d commit 6481b76
@@ -36,7 +36,7 @@ The above figure presents the Kaito architecture overview. Its major components
## Installation

- Please check the installation guidance [here](./docs/
+ Please check the installation guidance [here](./docs/ for deployment using Azure CLI and [here](./terraform/ for deployment using Terraform.

## Quick start

# Local .terraform directories

# .tfstate files

# Crash log files

# Exclude all .tfvars files, which are likely to contain sensitive data, such as
# password, private keys, and other secrets. These should not be part of version
# control as they are data points which are potentially sensitive and subject
# to change depending on the environment.

# Ignore override files as they are usually used to override resources locally and so
# are not checked in

# Include override files you do wish to add to version control using negated pattern
# !

# Include tfplan files to ignore the plan output of command: terraform plan -out=tfplan
# example: *tfplan*

# Ignore CLI configuration files
# Deploy KAITO on AKS using Terraform

This sample shows how to deploy open-source KAITO on a new Azure Kubernetes Service (AKS) cluster using Terraform. It deploys the following resources:

- Azure Kubernetes Service (AKS)
- Azure Container Registry (ACR) with a short-lived, repository-scoped token
- Azure Managed Identity with Federated Credential and Role Assignment for the GPU Provisioner
- KAITO GPU Provisioner Helm chart
- KAITO Workspace Helm chart
- Kubernetes Secret for the ACR token

## Prerequisites

- Terraform 1.9.7 or later
- Azure CLI 2.65.0 or later
- kubectl 1.30.5 or later
- Helm 3.16.2 or later
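
A quick way to confirm that the tools above are installed is to look them up on the `PATH`. The sketch below is illustrative only: `missing_tools` is a hypothetical helper name that is not part of this sample, and it checks presence only, not the minimum versions listed above.

```shell
# Hypothetical helper (not part of the sample): print the name of each
# requested tool that cannot be found on the PATH. Versions are not checked.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Running `missing_tools terraform az kubectl helm` prints nothing when all four CLIs are found, and otherwise prints one missing tool per line.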

## Setup

To deploy this sample, use the Azure CLI to log in to your Azure account and set the subscription you want to use, then use the Terraform CLI to provision the Azure resources and install the Helm charts for the KAITO operators.

Log in to your Azure account and set the subscription you want to use.

az login
az account set -s <subscription-id>

Export the subscription ID for Terraform to use.

export ARM_SUBSCRIPTION_ID=$(az account show --query id -o tsv)
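
If `az account show` fails, for example because the login has expired, the command substitution above leaves `ARM_SUBSCRIPTION_ID` empty. A minimal guard, assuming a POSIX shell, could catch that before Terraform runs. The function name `check_subscription` is hypothetical and not part of this sample.

```shell
# Hypothetical guard (not part of the sample): fail early when the
# subscription ID was not exported, otherwise report which one is in use.
check_subscription() {
  if [ -z "${ARM_SUBSCRIPTION_ID:-}" ]; then
    echo "ARM_SUBSCRIPTION_ID is not set" >&2
    return 1
  fi
  echo "Using subscription ${ARM_SUBSCRIPTION_ID}"
}
```

Calling `check_subscription` right after the export returns a non-zero status when the variable is empty, so it can gate the following steps in a script.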

Initialize the Terraform providers.

terraform init

> [!NOTE]
> The following variables in the [](./ file are available for customization:
> - `location` - The Azure region to deploy the resources. Be sure you have the necessary quota in the region.
> - `kaito_gpu_provisioner_version` - The version of the KAITO GPU Provisioner.
> - `kaito_workspace_version` - The version of the KAITO Workspace.

Run the Terraform apply command and enter `yes` when prompted to deploy the Azure resources.

terraform apply
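
The variables listed in the note above can also be overridden without editing any file. One option is Terraform's `TF_VAR_<name>` environment variables, set in the shell before running `terraform apply`. In the sketch below, the region `eastus2` is only an illustrative assumption; pick a region where you have GPU quota.

```shell
# Override the `location` input variable for this shell session.
# "eastus2" is an example value, not a recommendation from the sample.
export TF_VAR_location="eastus2"
```

Environment variables have the lowest precedence among Terraform's variable sources, so a value in a `terraform.tfvars` file or a `-var` flag on the command line would take priority over this export.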

Download the AKS cluster credentials so that kubectl and Helm can connect to the cluster.

az aks get-credentials -g $(terraform output -raw rg_name) -n $(terraform output -raw aks_name)

Verify installation of the KAITO operators.

helm list -n gpu-provisioner
helm list -n kaito-workspace

Check the status of the KAITO pods.

kubectl get po -n gpu-provisioner
kubectl get po -n kaito-workspace
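
Instead of polling `kubectl get po` by hand, a script could block until the pods report `Ready`. The sketch below uses `kubectl wait` for both namespaces. The function name `wait_for_kaito` and the default timeout of `300s` are assumptions made for illustration, not part of this sample.

```shell
# Hypothetical helper (not part of the sample): wait until every pod in
# both KAITO namespaces is Ready. The optional first argument is the
# per-namespace timeout; 300s is an assumed default.
wait_for_kaito() {
  for ns in gpu-provisioner kaito-workspace; do
    kubectl wait --for=condition=Ready pod --all -n "$ns" --timeout="${1:-300s}" || return 1
  done
}
```

For example, `wait_for_kaito 120s` waits up to two minutes per namespace. Note that `kubectl wait` reports an error when a namespace contains no pods yet, so this is best run after `terraform apply` has finished.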

## Cleanup

Run the Terraform destroy command and enter `yes` when prompted to delete the Azure resources.

terraform destroy
- name: LOCATION
value: ${LOCATION}
value: ${AKS_NAME}
value: ${AKS_NRG_NAME}
value: ${RG_NAME}
value: "false"
tenantId: ${AZURE_TENANT_ID}
clusterName: ${AKS_NAME}
# Create managed identity that the gpu-provisioner will use to interact with Azure
resource "azurerm_user_assigned_identity" "kaito" {
  resource_group_name =
  location            = azurerm_resource_group.example.location
  name                = "kaitoprovisioner"
}

# Grant the managed identity the Contributor role to create new AKS nodes
resource "azurerm_role_assignment" "kaito_aks_contributor" {
  principal_id                     = azurerm_user_assigned_identity.kaito.principal_id
  scope                            =
  role_definition_name             = "Contributor"
  skip_service_principal_aad_check = true
}

# Create a federated identity credential for the managed identity to be used by the gpu-provisioner via workload identity
resource "azurerm_federated_identity_credential" "kaito" {
  resource_group_name =
  parent_id           =
  name                = "kaitoprovisioner"
  issuer              = azurerm_kubernetes_cluster.example.oidc_issuer_url
  audience            = ["api://AzureADTokenExchange"]
  subject             = "system:serviceaccount:gpu-provisioner:gpu-provisioner"
}

# Install the gpu-provisioner chart
resource "helm_release" "gpu_provisioner" {
name = "gpu-provisioner"
chart = "${var.kaito_gpu_provisioner_version}.tgz"
namespace = "gpu-provisioner"
create_namespace = true

values = [
AZURE_TENANT_ID = data.azurerm_client_config.current.tenant_id
AZURE_SUBSCRIPTION_ID = data.azurerm_client_config.current.subscription_id
LOCATION = azurerm_resource_group.example.location
AKS_NRG_NAME = azurerm_kubernetes_cluster.example.node_resource_group
KAITO_IDENTITY_CLIENT_ID = azurerm_user_assigned_identity.kaito.client_id

# Install the kaito-workspace chart
resource "helm_release" "kaito_workspace" {
  name             = "kaito-workspace"
  chart            = "${var.kaito_workspace_version}.tgz"
  namespace        = "kaito-workspace"
  create_namespace = true
}

# Create a secret to store the Azure Container Registry credentials for the workspace to refer to when pushing and pulling images from the registry
resource "kubernetes_secret" "example" {
metadata {
name = "myregistrysecret"

type = ""

data = {
".dockerconfigjson" = jsonencode({
auths = {
"${azurerm_container_registry.example.login_server}" = {
"username" =
"password" = azurerm_container_registry_token_password.example.password1
resource "azurerm_kubernetes_cluster" "example" {
  resource_group_name       =
  location                  = azurerm_resource_group.example.location
  name                      = "aks-${local.random_name}"
  dns_prefix                = "aks-${local.random_name}"
  oidc_issuer_enabled       = true
  workload_identity_enabled = true

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D2_v2"

    upgrade_settings {
      drain_timeout_in_minutes      = 0
      max_surge                     = "10%"
      node_soak_duration_in_minutes = 0
    }
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_role_assignment" "aks_acr_pull" {
  principal_id                     = azurerm_kubernetes_cluster.example.kubelet_identity[0].object_id
  scope                            =
  role_definition_name             = "AcrPull"
  skip_service_principal_aad_check = true
}
locals {
  random_name = "kaitodemo${random_integer.example.result}"
}
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=4.5.0"
    }

    random = {
      source  = "hashicorp/random"
      version = "=3.6.3"
    }

    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "=2.33.0"
    }

    helm = {
      source  = "hashicorp/helm"
      version = "=2.16.1"
    }
  }
}

provider "azurerm" {
  features {
    resource_group {
      prevent_deletion_if_contains_resources = false
    }
  }
}

provider "kubernetes" {
  host                   =
  username               = azurerm_kubernetes_cluster.example.kube_config.0.username
  password               = azurerm_kubernetes_cluster.example.kube_config.0.password
  client_certificate     = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.client_certificate)
  client_key             = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.client_key)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.cluster_ca_certificate)
}

provider "helm" {
  kubernetes {
    host                   =
    username               = azurerm_kubernetes_cluster.example.kube_config.0.username
    password               = azurerm_kubernetes_cluster.example.kube_config.0.password
    client_certificate     = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.client_certificate)
    client_key             = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.client_key)
    cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.example.kube_config.0.cluster_ca_certificate)
  }
}

data "azurerm_client_config" "current" {}

resource "random_integer" "example" {
  min = 10
  max = 99
}

resource "azurerm_resource_group" "example" {
  name     = "rg-${local.random_name}"
  location = var.location
}
output "rg_name" {
  value =
}

output "aks_name" {
  value =
}
resource "azurerm_container_registry" "example" {
  resource_group_name    =
  location               = azurerm_resource_group.example.location
  name                   = "acr${local.random_name}"
  sku                    = "Standard"
  admin_enabled          = false
  anonymous_pull_enabled = false
}

resource "azurerm_container_registry_scope_map" "example" {
name = "default"
container_registry_name =
resource_group_name =

actions = [

resource "azurerm_container_registry_token" "example" {
  name                    = "default"
  container_registry_name =
  resource_group_name     =
  scope_map_id            =
}

resource "azurerm_container_registry_token_password" "example" {
  container_registry_token_id =

  password1 {
    expiry = timeadd(timestamp(), "168h") # 7 days
  }
}
variable "location" {
  type        = string
  default     = "brazilsouth"
  description = "value of location"
}

variable "kaito_gpu_provisioner_version" {
  type        = string
  default     = "0.2.0"
  description = "kaito gpu provisioner version"
}

variable "kaito_workspace_version" {
  type        = string
  default     = "0.3.1"
  description = "kaito workspace version"
}

variable "registry_repository_name" {
  type        = string
  default     = "fine-tuned-adapters/kubernetes"
  description = "container registry repository name"
}

