Kubeflow is fun.
However, deploying Kubeflow can be challenging because it has many components, both built-in and third-party dependencies.
This repository provides a comprehensive guide and the necessary configurations to deploy Kubeflow on a Kubernetes (k8s) cluster within an OpenStack cloud platform. The deployment process leverages tools such as Ansible, Kubespray, and Kubeflow Manifests to automate and streamline the setup. It includes the following tasks:
- Prepare a bastion node: Set up a control node where Kubespray will be executed to deploy Kubernetes.
- Create OpenStack instances: Provision three instances—one for the control plane and two for worker nodes.
- Run Ansible playbooks: Install necessary prerequisites on all nodes before deploying Kubernetes.
- Deploy Kubernetes with Kubespray: Use Kubespray to set up a Kubernetes cluster on the prepared nodes.
- Apply Kubeflow common manifests: Deploy common Kubeflow components.
- Apply Kubeflow application manifests: Deploy Kubeflow applications like Notebooks.
The following repositories and resources were used for this deployment:
- Kubeflow: The primary repository for Kubeflow, a machine learning toolkit for Kubernetes.
- Kubespray: A collection of Ansible playbooks for provisioning and managing Kubernetes clusters.
- cloud-provider-openstack: Repository for OpenStack cloud provider integrations with Kubernetes.
- Helm: The Kubernetes package manager used for deploying additional components.
- Kubeflow Manifests: The repository containing manifests for deploying and managing Kubeflow components.
├── LICENSE
├── README.md
├── ansible
│ ├── ansible.cfg
│ ├── inventory
│ └── playbooks
├── cloud-provider-openstack
│ └── manifests
├── helm
│ ├── echo-server
│ ├── ingress.sh
│ ├── matrix.sh
│ └── storage.yaml
├── kubeflow-manifests
│ ├── apps
│ ├── common
│ └── tests
└── kubespray-inventory
└── funkube
- ansible/: Contains Ansible configurations and playbooks for preparing nodes.
  - ansible.cfg: Ansible configuration file.
  - inventory: Hosts inventory file.
  - playbooks/: Directory with Ansible playbooks.
- cloud-provider-openstack/: Holds manifests for integrating Kubernetes with OpenStack. This directory contains only the changes made to the original repo, cloud-provider-openstack.
  - manifests/: OpenStack-specific Kubernetes manifests.
- helm/: Includes Helm charts and scripts for deploying additional services.
  - echo-server/: Helm chart for deploying an Echo server.
  - ingress.sh: Script to set up ingress controllers.
  - matrix.sh: Script for deploying the metrics component.
  - storage.yaml: Configuration for persistent storage.
- kubeflow-manifests/: Contains manifests for deploying Kubeflow components.
  - apps/: Application-specific manifests (e.g., Notebooks).
  - common/: Common Kubeflow components.
  - tests/: Test manifests for validation. The apps and common folders are copied from the original repo, Kubeflow Manifests, but include only the components deployed/tested as of the latest update here. You do NOT need the original repo to run.
- kubespray-inventory/: Inventory files for Kubespray deployment.
  - funkube/: Specific inventory for the Kubernetes cluster. This folder holds only the customisations applied on top of the original repo, Kubespray. You DO need the original repo to run.
- Access to an OpenStack cloud environment (with an openrc.sh credentials file).
- A bastion instance created in OpenStack.
- Basic understanding of Ansible, Kubernetes, and Kubeflow.
- System update: ensure the latest kernel is installed.
sudo apt update
sudo apt upgrade -y
sudo reboot
- Install the OpenStack CLI
sudo apt install -y build-essential
sudo apt install -y python3 python3-pip python3-venv
python3 -m venv env
source env/bin/activate
pip install python-openstackclient
source openrc.sh
openstack server list
- Install Ansible: Ensure Ansible is installed on the bastion node.
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install -y ansible
- Set up Kubespray: Clone the Kubespray repository.
git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
- Provision Instances: Create three instances in OpenStack.
- Control Plane Node: Manages cluster state.
- Worker Nodes: Run workloads.
- Configure Networking: Ensure instances can communicate and are accessible from the bastion node.
- Assign Floating IPs: If necessary, assign floating IPs for external access. Otherwise, we can use an SSH tunnel for testing.
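Without floating IPs, one way to reach a service on a private node is an SSH tunnel through the bastion. A sketch, where the key path, hostnames, IPs, and ports are all placeholders, not values from this repo:

```shell
# Forward local port 8080 to port 80 on a private node (10.0.0.11),
# jumping through the bastion host. -N means no remote command, just the tunnel.
ssh -i ~/.ssh/id_rsa -J ubuntu@bastion.example.com \
    -L 8080:10.0.0.11:80 ubuntu@10.0.0.11 -N
```

While the tunnel is open, the service is reachable at http://localhost:8080 on the bastion (or on your workstation if you chain another tunnel).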
Have the IP and host information ready and make your own copy of the hosts file.
- Run Playbooks: Execute the playbooks to install prerequisites.
ansible-playbook -i inventory playbooks/prepare_nodes.yml
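A hosts file for the three nodes might look like the following sketch; the hostnames, group names, and IPs are placeholders, not the repo's actual values:

```shell
# Write a sample Ansible inventory for one control-plane node and two workers.
# All hostnames and IPs below are illustrative placeholders.
cat > inventory <<'EOF'
[control_plane]
k8s-cp1 ansible_host=10.0.0.11

[workers]
k8s-worker1 ansible_host=10.0.0.12
k8s-worker2 ansible_host=10.0.0.13

[k8s_cluster:children]
control_plane
workers
EOF
```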
Have the IP and host information ready and make your own copy of the inventory file.
- Copy Inventory:
cp -r kubespray/inventory/sample kubespray-inventory/funkube
- Update Inventory: Modify kubespray-inventory/funkube/hosts.yaml with your cluster nodes.
- Deploy Cluster:
cp -r kubespray-inventory/funkube kubespray/inventory/funkube
cp kubeflowfun/ansible/inventory/hosts kubespray/inventory/funkube/inventory.ini
cd kubespray
ansible-playbook -i inventory/funkube/inventory.ini cluster.yml
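A hosts.yaml for three nodes might look like this sketch; node names and IPs are placeholders, and the group names follow Kubespray's sample inventory layout:

```yaml
# Hypothetical kubespray-inventory/funkube/hosts.yaml (IPs are placeholders)
all:
  hosts:
    node1:
      ansible_host: 10.0.0.11
      ip: 10.0.0.11
      access_ip: 10.0.0.11
    node2:
      ansible_host: 10.0.0.12
      ip: 10.0.0.12
      access_ip: 10.0.0.12
    node3:
      ansible_host: 10.0.0.13
      ip: 10.0.0.13
      access_ip: 10.0.0.13
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
```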
To prepare for the Kubeflow deployment, we set up:
- an echo server, to verify the k8s API works properly
- an ingress server, to test web app traffic routing
- a metrics component (via matrix.sh), to monitor cluster load
- a storage class, to dynamically provision and manage storage resources
Note: the commands below assume alias k=kubectl.
- Deploy Echo Server:
cd helm/echo-server
helm install echo-server .
- Set Up Ingress:
cd helm
./ingress.sh
- Deploy Matrix Component:
cd helm
./matrix.sh
k top nodes
k top pods
- Configure Storage:
k apply -f storage.yaml
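For a Cinder-backed cluster, storage.yaml plausibly defines something like the sketch below; the class name and options here are assumptions, so check the actual file in helm/:

```yaml
# Hypothetical default StorageClass for Cinder CSI dynamic provisioning
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cinder-csi
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: cinder.csi.openstack.org
reclaimPolicy: Delete
allowVolumeExpansion: true
```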
- Install kubeflow namespace:
cd kubeflow-manifests/common/
k apply -k kubeflow-namespace/base
- Install kubeflow roles:
cd kubeflow-manifests/common/
k apply -k kubeflow-roles/base
- Install cert-manager:
cd kubeflow-manifests/common/
k apply -k cert-manager/base
k apply -k kubeflow-issuer/base
k get pods -n cert-manager
k get apiservices | grep cert-manager
- Test cert-manager:
cd kubeflow-manifests/tests/
cd cert-manager/
k apply -f self-signed-issuer.yaml
k apply -f test-certificate.yaml
k describe certificate test-certificate -n default
k get secret test-certificate-secret -n default
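The two test manifests applied above follow the standard cert-manager v1 resources; a sketch along these lines (resource names match the commands above, the common name is illustrative):

```yaml
# Self-signed issuer plus a certificate it signs, for a quick smoke test
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: self-signed-issuer
  namespace: default
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-certificate
  namespace: default
spec:
  secretName: test-certificate-secret
  commonName: test.example.com
  issuerRef:
    name: self-signed-issuer
    kind: Issuer
```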
- Install Istio:
cd kubeflow-manifests/common/
k apply -k istio-1-23/istio-crds/base/
k apply -k istio-1-23/istio-namespace/base
k apply -k istio-1-23/istio-install/overlays/oauth2-proxy/
k apply -k istio-1-23/kubeflow-istio-resources/base
k wait --for=condition=Ready pods --all -n istio-system --timeout 300s
k get pods -n istio-system
k get svc -n istio-system
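Without a load balancer, one way to check the Istio ingress gateway is a port-forward from the bastion; the local port here is an arbitrary choice:

```shell
# Forward local port 8080 to the istio-ingressgateway service's port 80,
# then browse http://localhost:8080 (or tunnel 8080 from your workstation).
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```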
- Test Istio:
Follow this file: Istio test
Also check out the resources
k get all -n istio-system
k get all -n kubeflow
k get gateway -n istio-system
k get clusterroles | grep kubeflow-istio
k get virtualservice -A
k get pods -n istio-system
- Install oauth2-proxy:
cd kubeflow-manifests/common/
k apply -k oauth2-proxy/overlays/m2m-dex-and-kind/
k wait --for=condition=ready pod -l 'app.kubernetes.io/name=oauth2-proxy' --timeout=180s -n oauth2-proxy
k wait --for=condition=ready pod -l 'app.kubernetes.io/name=cluster-jwks-proxy' --timeout=180s -n istio-system
k get all -n oauth2-proxy
- Install dex:
cd kubeflow-manifests/common/
k apply -k dex/overlays/oauth2-proxy/
k wait --for=condition=ready pods --all --timeout=180s -n auth
- Install networkpolicies:
cd kubeflow-manifests/common/
k apply -k networkpolicies/base
k get networkpolicy -A
k describe networkpolicy jupyter-web-app -n kubeflow
- Install kubeflow roles:
cd kubeflow-manifests/common/
k apply -k kubeflow-roles/base
- Install user namespace:
cd kubeflow-manifests/common/
k apply -k user-namespace/base
- CentralDashboard:
cd kubeflow-manifests/apps/
k apply -k centraldashboard/upstream/base
k apply -k centraldashboard/overlays/oauth2-proxy/
- Jupyter web app:
cd kubeflow-manifests/apps/
k apply -k jupyter/notebook-controller/upstream/overlays/kubeflow/
k apply -k jupyter/jupyter-web-app/upstream/overlays/istio/
- Profiles:
cd kubeflow-manifests/apps/
k apply -k profiles/upstream/default/
k apply -k profiles/upstream/overlays/kubeflow/
- Admission-webhook:
cd kubeflow-manifests/apps/
k apply -k admission-webhook/upstream/overlays/
k apply -k admission-webhook/upstream/overlays/cert-manager/
k get secret webhook-certs -n kubeflow
k describe validatingwebhookconfiguration -A
- Cinder CSI Plugin
cd cloud-provider-openstack/manifests/cinder-csi-plugin
k apply -f cinder-csi-controllerplugin.yaml
k apply -f cinder-csi-nodeplugin.yaml
k apply -f csi-cinder-driver.yaml
k apply -f csi-secret-cinderplugin.yaml
k apply -f cinder-csi-nodeplugin-rbac.yaml
k apply -f cinder-csi-controllerplugin-rbac.yaml
k get secret cloud-config -n kube-system
- Cinder CSI Plugin Fix
Switch authentication to clouds.yaml, since cloud.conf was not working.
Changes made here: Cloud Provider OpenStack Repo
cd cloud-provider-openstack/manifests/cinder-csi-plugin
k apply -f cinder-csi-controllerplugin.yaml
k get svc -n kubeflow
k delete secret cloud-config -n kube-system
k create secret generic cloud-config -n kube-system --from-file=cloud.conf=/path/to/cloud.conf --from-file=clouds.yaml=/path/to/clouds.yaml
k get secret cloud-config -n kube-system -o yaml
k rollout restart deployment csi-cinder-controllerplugin -n kube-system
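A clouds.yaml for application-credential auth typically looks like this sketch; every value is a placeholder to be filled from your OpenStack project:

```yaml
# Hypothetical clouds.yaml; obtain real values from your OpenStack dashboard
clouds:
  openstack:
    auth:
      auth_url: https://openstack.example.com:5000/v3
      application_credential_id: "<credential-id>"
      application_credential_secret: "<credential-secret>"
    region_name: RegionOne
    auth_type: v3applicationcredential
```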
- PVC viewer (volumes)
cd kubeflow-manifests/apps
k apply -k pvcviewer-controller/upstream/base
k apply -k pvcviewer-controller/upstream/default/
k apply -k volumes-web-app/upstream/overlays/istio/
- Test volume
cd kubeflow-manifests/tests
k apply -f test-pvc.yaml
k get pvc test-pvc
k get pod test-pod
k exec -it test-pod -- cat /data/testfile
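The test-pvc.yaml lives in tests/; a minimal sketch of such a PVC-plus-pod check, assuming the default storage class set up earlier (the image and file contents here are illustrative):

```yaml
# Claim a small volume, then mount it and write a file to prove provisioning works
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "echo hello > /data/testfile && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc
```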
- Tensorboard
cd kubeflow-manifests/apps
k apply -k tensorboard/tensorboard-controller/upstream/overlays/kubeflow/
k apply -k tensorboard/tensorboards-web-app/upstream/overlays/istio/
- Drain a node to update the kernel
- Add a new node to the cluster using Kubespray
- Run a DNS test
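A quick in-cluster DNS check could look like the following; the pod name and image tag are illustrative:

```shell
# Run a throwaway pod that resolves the in-cluster API service name,
# then cleans itself up (--rm).
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```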
This project is licensed under the Apache License.
Feel free to open issues or submit pull requests for improvements or fixes.