
Implement KubernetesPod class #15

Merged
merged 2 commits into AntreasAntoniou:main on May 20, 2024

Conversation

@gautierdag (Contributor) commented May 20, 2024

This implements the KubernetesPod class in an analogous way to KubernetesJob (in jobs.py).

Previously:

        job = KubernetesJob(
            name=job_name,
            image="mydockerimage",
            command=["/bin/bash", "-c", "--"],
            args=[command],
            cpu_request=cfg.launch.cpu_request,
            ram_request=cfg.launch.ram_request,
            gpu_type="nvidia.com/gpu",
            gpu_limit=cfg.launch.gpu_limit,
            gpu_product=cfg.launch.gpu_product,
            backoff_limit=0,
            user_email="myuseremail@ed.ac.uk",
            namespace=cfg.launch.namespace,
            kueue_queue_name=KueueQueue.INFORMATICS,
            secret_env_vars=cfg.launch.env_vars,
            volume_mounts={
                "nfs": {"mountPath": "/nfs", "server": "10.24.1.255", "path": "/"}
            },
        )

Now:

        pod = KubernetesPod(
            name=pod_name,
            image="mydockerimage",
            command=["/bin/bash", "-c", "--"],
            args=[command],
            cpu_request=cfg.launch.cpu_request,
            ram_request=cfg.launch.ram_request,
            gpu_type="nvidia.com/gpu",
            gpu_limit=cfg.launch.gpu_limit,
            gpu_product=cfg.launch.gpu_product,
            user_email="myuseremail@ed.ac.uk",
            namespace=cfg.launch.namespace,
            secret_env_vars=cfg.launch.env_vars,
            volume_mounts={
                "nfs": {"mountPath": "/nfs", "server": "10.24.1.255", "path": "/"}
            },
        )

The main reason for implementing this is that pods are a simpler level of abstraction to manage: they don't stay alive or force restarts/retries if parameters are set badly.
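
For completeness, here is a minimal sketch of submitting the pod once it has been constructed, assuming KubernetesPod exposes the same run() entry point as KubernetesJob (the module path and method name below are assumptions, not part of this diff):

    # Minimal sketch, assuming KubernetesPod mirrors the KubernetesJob interface.
    from kubejobs.pods import KubernetesPod  # assumed module path

    pod = KubernetesPod(
        name="pod-example",
        image="mydockerimage",
        command=["/bin/bash", "-c", "--"],
        args=["echo hello from the pod"],
        user_email="myuseremail@ed.ac.uk",
        namespace="informatics",  # assumed namespace value
    )
    pod.run()  # assumed to generate the pod spec and apply it to the cluster

    # The pod can then be inspected with standard kubectl commands, e.g.:
    #   kubectl logs -f pod/pod-example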

@AntreasAntoniou (Owner) commented:

Testing these things is not straightforward, since a proper test requires the cluster to be available.

Can you add some 'example scripts' that you pretest so we know this all works? :)

@gautierdag (Contributor, Author) commented:

> Can you add some 'example scripts' that you pretest so we know this all works? :)

Done ✅

I've added a script in examples/example_pod.py and updated the README.

It behaves as expected when run:

$ kubectl logs -f pod/pod-test-info-40gb-full-20240520

Filesystem      Size  Used Avail Use% Mounted on
overlay         1.6T  709G  843G  46% /
tmpfs            64M     0   64M   0% /dev
tmpfs           449G     0  449G   0% /sys/fs/cgroup
10.24.1.255:/    24T   17T  6.8T  72% /nfs
tmpfs           1.0G     0  1.0G   0% /dev/shm
/dev/vda1       1.6T  709G  843G  46% /etc/hosts
tmpfs           1.0G   12K  1.0G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           449G   12K  449G   1% /proc/driver/nvidia
tmpfs            90G  1.6G   89G   2% /run/nvidia-persistenced/socket
tmpfs           449G     0  449G   0% /proc/acpi
tmpfs           449G     0  449G   0% /proc/scsi
tmpfs           449G     0  449G   0% /sys/firmware

This shows the NFS mapping is working, and I've also tested that GPU allocation works as expected.
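
For readers without cluster access, the sketch below gives a rough idea of what a script like examples/example_pod.py could look like, based on the usage shown above; the image, namespace, gpu_product string, and the run() call are illustrative assumptions rather than the exact contents of the committed script.

    # Rough sketch of an example pod script; several values below are assumptions.
    from kubejobs.pods import KubernetesPod  # assumed module path

    pod = KubernetesPod(
        name="pod-test-info-40gb-full-20240520",
        image="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04",  # assumed image
        command=["/bin/bash", "-c", "--"],
        args=["df -h && nvidia-smi"],  # prints mounts (as in the log above) and GPU info
        gpu_type="nvidia.com/gpu",
        gpu_limit=1,
        gpu_product="NVIDIA-A100-SXM4-40GB",  # assumed; matches the '40gb' in the pod name
        user_email="myuseremail@ed.ac.uk",
        namespace="informatics",  # assumed namespace
        volume_mounts={
            "nfs": {"mountPath": "/nfs", "server": "10.24.1.255", "path": "/"}
        },
    )
    pod.run()  # assumed to apply the generated pod spec to the cluster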

@AntreasAntoniou (Owner) commented:

Excellent. I'll merge this now.

@AntreasAntoniou AntreasAntoniou merged commit 80a5f53 into AntreasAntoniou:main May 20, 2024
1 of 2 checks passed