feat: allow patching of resource class #1728

olevski · 2023-12-12T14:47:50Z

This allows the resource class on a running server to be patched. But only if the server is hibernated.

The reason for the limitation is to avoid people from inadvertantly restarting their sessions. Because if we patch the resource class when the server is running k8s will restart the session pod.

/deploy renku-ui=leafty/2941-feat-modify-session renku=rclone-sessions extra-values=amalthea.image.repository=registry.dev.renku.ch/tasko.olevski/renku-images/amalthea,amalthea.image.tag=0.10.0-0.dev.git.205.hb1d4c56

RenkuBot · 2023-12-12T14:54:32Z

You can access the deployment of this PR at https://renku-ci-nb-1728.dev.renku.ch

renku_notebooks/api/schemas/servers_patch.py

renku_notebooks/api/notebooks.py

leafty · 2024-01-11T11:11:31Z

I am also seeing a case where the server got running but is still reported as failed:

{
        "flora-2eth-another-2dplayground-2dp-465de908": {
            "annotations": {
                "renku.io/branch": "master",
                "renku.io/commit-sha": "3869e389a4353455d82801b4806b04cbb43373d8",
                "renku.io/default_image_used": "False",
                "renku.io/git-host": "gitlab.dev.renku.ch",
                "renku.io/gitlabProjectId": "100761",
                "renku.io/hibernatedSecondsThreshold": "259200",
                "renku.io/hibernation": "",
                "renku.io/hibernationBranch": "",
                "renku.io/hibernationCommitSha": "",
                "renku.io/hibernationDate": "",
                "renku.io/hibernationDirty": "",
                "renku.io/hibernationSynchronized": "",
                "renku.io/idleSecondsThreshold": "86400",
                "renku.io/lastActivityDate": "",
                "renku.io/namespace": "johann.thiebaut1",
                "renku.io/projectName": "another-playground-project",
                "renku.io/repository": "https://gitlab.dev.renku.ch/johann.thiebaut1/another-playground-project",
                "renku.io/resourceClassId": "2",
                "renku.io/servername": "flora-2eth-another-2dplayground-2dp-465de908",
                "renku.io/username": "flora.thiebaut@sdsc.ethz.ch"
            },
            "cloudstorage": [],
            "image": "registry.dev.renku.ch/johann.thiebaut1/another-playground-project:3869e38",
            "name": "flora-2eth-another-2dplayground-2dp-465de908",
            "resources": {
                "requests": {
                    "cpu": 0.5,
                    "memory": "2G",
                    "storage": "4G"
                },
                "usage": {}
            },
            "started": "2024-01-11T10:45:17+00:00",
            "state": {
                "pod_name": "flora-2eth-another-2dplayground-2dp-465de908-0"
            },
            "status": {
                "details": [
                    {
                        "status": "ready",
                        "step": "Initialization"
                    },
                    {
                        "status": "ready",
                        "step": "Downloading server image"
                    },
                    {
                        "status": "ready",
                        "step": "Cloning and configuring the repository"
                    },
                    {
                        "status": "ready",
                        "step": "Git credentials services"
                    },
                    {
                        "status": "ready",
                        "step": "Authentication and proxying services"
                    },
                    {
                        "status": "ready",
                        "step": "Auxiliary session services"
                    },
                    {
                        "status": "ready",
                        "step": "Starting session"
                    }
                ],
                "message": "The resource quota has been exceeded.",
                "readyNumContainers": 7,
                "state": "failed",
                "totalNumContainers": 7,
                "warnings": []
            },
            "url": "https://renku-ci-ui-2942.dev.renku.ch/sessions/flora-2eth-another-2dplayground-2dp-465de908"
        }
    }

leafty · 2024-01-11T15:18:02Z

Also, it seems that updates to GPU annotations are not taken into account.

olevski · 2024-01-11T18:11:30Z

@leafty the action to use a specific amalthea from a specific branch is not published and it was being used only in this PR. If the deployment from your PR in the ui repo amalthea was not getting deployed. I manually updated the image to test a few times and that is how we got the proper amalthea image which stuck around for a bit.

But then when we redeployed this got wiped and we reverted to the regular amalthea image.

I also found the issue when the session is getting stuck on failed and fixed that.

Lastly, I updated the PR description in your PR in the ui repo and in here. So now things should work.

leafty · 2024-01-12T07:57:32Z

@leafty the action to use a specific amalthea from a specific branch is not published and it was being used only in this PR. If the deployment from your PR in the ui repo amalthea was not getting deployed. I manually updated the image to test a few times and that is how we got the proper amalthea image which stuck around for a bit.

But then when we redeployed this got wiped and we reverted to the regular amalthea image.

I also found the issue when the session is getting stuck on failed and fixed that.

Lastly, I updated the PR description in your PR in the ui repo and in here. So now things should work.

The PATCH endpoint seems to work except that GPU annotations are not cleared when needed (change from class with GPU to one with none).

Also, I don't know how feasible it would be to report that the session may be starting soon. On the UI side, the session is shown as failing for one or two minutes before the state goes to "starting". For users, it feels that nothing is happening during that time.

olevski · 2024-01-12T13:39:14Z

@leafty thanks for testing all of this out and finding these edge cases. I addressed the problem with the updates to the pod - it was the result of the statefulset not always causing the pod to restart. And the status of the session should now switch much quicker from Failing to Starting when you update the resource class. After the fix when I tested it took just a second or two to go from Failing to Starting.

leafty · 2024-01-15T13:27:35Z

@leafty thanks for testing all of this out and finding these edge cases. I addressed the problem with the updates to the pod - it was the result of the statefulset not always causing the pod to restart. And the status of the session should now switch much quicker from Failing to Starting when you update the resource class. After the fix when I tested it took just a second or two to go from Failing to Starting.

Thanks 🙏 this seems to be working now.
One note: when changing multiple times the session class before it can start (e.g. selecting another class which cannot be scheduled), it seems that the session takes a while to start (added ~2 minutes).

olevski · 2024-01-15T14:51:53Z

(e.g. selecting another class which cannot be scheduled), it seems that the session takes a while to start (added ~2 minutes).

This is most likely because you cause the session to be scheduled on another node where the image is not cached and has to be downloaded again. Also we have had issues with how quick switch/k8s can remount the drives when the node changes. There is not much that I can do about his unfortunately.

Panaetius

should there be a test for this?

renku_notebooks/api/classes/k8s_client.py

Co-authored-by: Ralf Grubenmann <ralf.grubenmann@sdsc.ethz.ch>

feat: allow patching of resource class

aee7a2a

olevski requested a review from a team as a code owner December 12, 2023 14:47

olevski marked this pull request as draft December 12, 2023 14:47

olevski temporarily deployed to renku-ci-nb-1728 December 12, 2023 14:48 — with GitHub Actions Inactive

olevski temporarily deployed to renku-ci-nb-1728 December 12, 2023 16:35 — with GitHub Actions Inactive

leafty mentioned this pull request Dec 13, 2023

Allow sessions to be modified SwissDataScienceCenter/renku-ui#2941

Closed

leafty reviewed Dec 13, 2023

View reviewed changes

renku_notebooks/api/schemas/servers_patch.py Show resolved Hide resolved

Merge branch 'master' into feat-patch-resource-class

f04c85e

olevski temporarily deployed to renku-ci-nb-1728 December 20, 2023 17:15 — with GitHub Actions Inactive

chore: patch in resource class

4cc55f6

olevski had a problem deploying to renku-ci-nb-1728 December 20, 2023 21:38 — with GitHub Actions Error

squashme: fix tests

763017f

olevski temporarily deployed to renku-ci-nb-1728 December 20, 2023 21:44 — with GitHub Actions Inactive

olevski marked this pull request as ready for review December 20, 2023 21:45

Merge branch 'master' into feat-patch-resource-class

cf748f4

olevski temporarily deployed to renku-ci-nb-1728 January 4, 2024 13:22 — with GitHub Actions Inactive

leafty reviewed Jan 9, 2024

View reviewed changes

renku_notebooks/api/notebooks.py Outdated Show resolved Hide resolved

fix: patch the resource quota

79bc600

olevski temporarily deployed to renku-ci-nb-1728 January 9, 2024 20:04 — with GitHub Actions Inactive

olevski requested a review from leafty January 9, 2024 20:27

minor fix

4e8cea9

olevski temporarily deployed to renku-ci-nb-1728 January 9, 2024 20:45 — with GitHub Actions Inactive

leafty reviewed Jan 10, 2024

View reviewed changes

renku_notebooks/api/notebooks.py Show resolved Hide resolved

fix: enable patching when the server is failing

2c36ccf

olevski temporarily deployed to renku-ci-nb-1728 January 11, 2024 09:50 — with GitHub Actions Inactive

leafty reviewed Jan 11, 2024

View reviewed changes

renku_notebooks/api/notebooks.py Show resolved Hide resolved

olevski temporarily deployed to renku-ci-nb-1728 January 11, 2024 17:00 — with GitHub Actions Inactive

olevski temporarily deployed to renku-ci-nb-1728 January 11, 2024 18:09 — with GitHub Actions Inactive

Merge branch 'master' into feat-patch-resource-class

f625c7f

olevski temporarily deployed to renku-ci-nb-1728 January 12, 2024 11:15 — with GitHub Actions Inactive

squashme: minor fix

861aad5

olevski temporarily deployed to renku-ci-nb-1728 January 12, 2024 13:06 — with GitHub Actions Inactive

olevski temporarily deployed to renku-ci-nb-1728 January 12, 2024 13:13 — with GitHub Actions Inactive

olevski temporarily deployed to renku-ci-nb-1728 January 12, 2024 13:22 — with GitHub Actions Inactive

Merge branch 'master' into feat-patch-resource-class

84f018f

olevski had a problem deploying to renku-ci-nb-1728 January 15, 2024 14:48 — with GitHub Actions Error

Panaetius requested changes Jan 15, 2024

View reviewed changes

renku_notebooks/api/classes/k8s_client.py Outdated Show resolved Hide resolved

squashme: accept suggestions

4708e2f

Co-authored-by: Ralf Grubenmann <ralf.grubenmann@sdsc.ethz.ch>

olevski had a problem deploying to renku-ci-nb-1728 January 15, 2024 15:11 — with GitHub Actions Failure

squashme: fix linting

d441f59

olevski had a problem deploying to renku-ci-nb-1728 January 15, 2024 15:17 — with GitHub Actions Failure

olevski had a problem deploying to renku-ci-nb-1728 January 15, 2024 15:23 — with GitHub Actions Error

Panaetius approved these changes Jan 15, 2024

View reviewed changes

olevski had a problem deploying to renku-ci-nb-1728 January 15, 2024 15:42 — with GitHub Actions Failure

olevski temporarily deployed to renku-ci-nb-1728 January 15, 2024 16:50 — with GitHub Actions Inactive

olevski temporarily deployed to renku-ci-nb-1728 January 16, 2024 10:36 — with GitHub Actions Inactive

olevski temporarily deployed to renku-ci-nb-1728 January 16, 2024 13:34 — with GitHub Actions Inactive

olevski temporarily deployed to renku-ci-nb-1728 January 16, 2024 14:13 — with GitHub Actions Inactive

olevski deployed to renku-ci-nb-1728 January 16, 2024 14:16 — with GitHub Actions View deployment

olevski merged commit 5219719 into master Jan 16, 2024
36 of 48 checks passed

olevski deleted the feat-patch-resource-class branch January 16, 2024 14:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: allow patching of resource class #1728

feat: allow patching of resource class #1728

olevski commented Dec 12, 2023 •

edited

Loading

RenkuBot commented Dec 12, 2023

leafty commented Jan 11, 2024

leafty commented Jan 11, 2024

olevski commented Jan 11, 2024

leafty commented Jan 12, 2024 •

edited

Loading

olevski commented Jan 12, 2024

leafty commented Jan 15, 2024

olevski commented Jan 15, 2024

Panaetius left a comment

feat: allow patching of resource class #1728

feat: allow patching of resource class #1728

Conversation

olevski commented Dec 12, 2023 • edited Loading

RenkuBot commented Dec 12, 2023

leafty commented Jan 11, 2024

leafty commented Jan 11, 2024

olevski commented Jan 11, 2024

leafty commented Jan 12, 2024 • edited Loading

olevski commented Jan 12, 2024

leafty commented Jan 15, 2024

olevski commented Jan 15, 2024

Panaetius left a comment

Choose a reason for hiding this comment

olevski commented Dec 12, 2023 •

edited

Loading

leafty commented Jan 12, 2024 •

edited

Loading