
Implement KubernetesPod class #15

Merged
merged 2 commits into AntreasAntoniou:main on May 20, 2024

Conversation

@gautierdag (Contributor) commented May 20, 2024

This implements the KubernetesPod class in an analogous way to KubernetesJob (in jobs.py).

Previously:

        job = KubernetesJob(
            name=job_name,
            image="mydockerimage",
            command=["/bin/bash", "-c", "--"],
            args=[command],
            cpu_request=cfg.launch.cpu_request,
            ram_request=cfg.launch.ram_request,
            gpu_type="nvidia.com/gpu",
            gpu_limit=cfg.launch.gpu_limit,
            gpu_product=cfg.launch.gpu_product,
            backoff_limit=0,
            user_email="myuseremail@ed.ac.uk",
            namespace=cfg.launch.namespace,
            kueue_queue_name=KueueQueue.INFORMATICS,
            secret_env_vars=cfg.launch.env_vars,
            volume_mounts={
                "nfs": {"mountPath": "/nfs", "server": "10.24.1.255", "path": "/"}
            },
        )

Now:

        pod = KubernetesPod(
            name=pod_name,
            image="mydockerimage",
            command=["/bin/bash", "-c", "--"],
            args=[command],
            cpu_request=cfg.launch.cpu_request,
            ram_request=cfg.launch.ram_request,
            gpu_type="nvidia.com/gpu",
            gpu_limit=cfg.launch.gpu_limit,
            gpu_product=cfg.launch.gpu_product,
            user_email="myuseremail@ed.ac.uk",
            namespace=cfg.launch.namespace,
            secret_env_vars=cfg.launch.env_vars,
            volume_mounts={
                "nfs": {"mountPath": "/nfs", "server": "10.24.1.255", "path": "/"}
            },
        )

The main reason for implementing this is that pods are a simpler level of abstraction to manage: they don't stay alive or force restarts/retries if parameters are set badly.
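
For completeness, here is a minimal sketch of submitting the pod once it has been constructed, assuming KubernetesPod exposes the same run() entry point as KubernetesJob (the module path and method name below are assumptions, not part of this diff):

    # Minimal sketch, assuming KubernetesPod mirrors the KubernetesJob interface.
    from kubejobs.pods import KubernetesPod  # assumed module path

    pod = KubernetesPod(
        name="pod-example",
        image="mydockerimage",
        command=["/bin/bash", "-c", "--"],
        args=["echo hello from the pod"],
        user_email="myuseremail@ed.ac.uk",
        namespace="informatics",  # assumed namespace value
    )
    pod.run()  # assumed to generate the pod spec and apply it to the cluster

    # The pod can then be inspected with standard kubectl commands, e.g.:
    #   kubectl logs -f pod/pod-example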

@AntreasAntoniou (Owner) commented:

Testing these things is not straightforward, since a proper test requires the cluster to be available.

Can you add some 'example scripts' that you pretest so we know this all works? :)

@gautierdag (Contributor, Author) commented:

> Can you add some 'example scripts' that you pretest so we know this all works? :)

Done ✅

I've added a script in examples/example_pod.py and updated the README.

It behaves as expected when run:

$ kubectl logs -f pod/pod-test-info-40gb-full-20240520

Filesystem      Size  Used Avail Use% Mounted on
overlay         1.6T  709G  843G  46% /
tmpfs            64M     0   64M   0% /dev
tmpfs           449G     0  449G   0% /sys/fs/cgroup
10.24.1.255:/    24T   17T  6.8T  72% /nfs
tmpfs           1.0G     0  1.0G   0% /dev/shm
/dev/vda1       1.6T  709G  843G  46% /etc/hosts
tmpfs           1.0G   12K  1.0G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           449G   12K  449G   1% /proc/driver/nvidia
tmpfs            90G  1.6G   89G   2% /run/nvidia-persistenced/socket
tmpfs           449G     0  449G   0% /proc/acpi
tmpfs           449G     0  449G   0% /proc/scsi
tmpfs           449G     0  449G   0% /sys/firmware

This shows the NFS mapping is working, and I've also tested that GPU allocation works as expected.
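
For readers without cluster access, the sketch below gives a rough idea of what a script like examples/example_pod.py could look like, based on the usage shown above; the image, namespace, gpu_product string, and the run() call are illustrative assumptions rather than the exact contents of the committed script.

    # Rough sketch of an example pod script; several values below are assumptions.
    from kubejobs.pods import KubernetesPod  # assumed module path

    pod = KubernetesPod(
        name="pod-test-info-40gb-full-20240520",
        image="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04",  # assumed image
        command=["/bin/bash", "-c", "--"],
        args=["df -h && nvidia-smi"],  # prints mounts (as in the log above) and GPU info
        gpu_type="nvidia.com/gpu",
        gpu_limit=1,
        gpu_product="NVIDIA-A100-SXM4-40GB",  # assumed; matches the '40gb' in the pod name
        user_email="myuseremail@ed.ac.uk",
        namespace="informatics",  # assumed namespace
        volume_mounts={
            "nfs": {"mountPath": "/nfs", "server": "10.24.1.255", "path": "/"}
        },
    )
    pod.run()  # assumed to apply the generated pod spec to the cluster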

@AntreasAntoniou (Owner) commented:

Excellent. I'll merge this now.

@AntreasAntoniou AntreasAntoniou merged commit 80a5f53 into AntreasAntoniou:main May 20, 2024
1 of 2 checks passed