Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow patching of resource class #1728

Merged
merged 13 commits into from
Jan 16, 2024
Merged

Conversation

olevski
Copy link
Member

@olevski olevski commented Dec 12, 2023

This allows the resource class on a running server to be patched. But only if the server is hibernated.

The reason for the limitation is to avoid people from inadvertantly restarting their sessions. Because if we patch the resource class when the server is running k8s will restart the session pod.

/deploy renku-ui=leafty/2941-feat-modify-session renku=rclone-sessions extra-values=amalthea.image.repository=registry.dev.renku.ch/tasko.olevski/renku-images/amalthea,amalthea.image.tag=0.10.0-0.dev.git.205.hb1d4c56

@olevski olevski requested a review from a team as a code owner December 12, 2023 14:47
@olevski olevski marked this pull request as draft December 12, 2023 14:47
@olevski olevski temporarily deployed to renku-ci-nb-1728 December 12, 2023 14:48 — with GitHub Actions Inactive
@RenkuBot
Copy link
Contributor

You can access the deployment of this PR at https://renku-ci-nb-1728.dev.renku.ch

@olevski olevski temporarily deployed to renku-ci-nb-1728 December 20, 2023 17:15 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 December 20, 2023 21:44 — with GitHub Actions Inactive
@olevski olevski marked this pull request as ready for review December 20, 2023 21:45
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 4, 2024 13:22 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 9, 2024 20:04 — with GitHub Actions Inactive
@olevski olevski requested a review from leafty January 9, 2024 20:27
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 9, 2024 20:45 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 11, 2024 09:50 — with GitHub Actions Inactive
@leafty
Copy link
Member

leafty commented Jan 11, 2024

I am also seeing a case where the server got running but is still reported as failed:

{
        "flora-2eth-another-2dplayground-2dp-465de908": {
            "annotations": {
                "renku.io/branch": "master",
                "renku.io/commit-sha": "3869e389a4353455d82801b4806b04cbb43373d8",
                "renku.io/default_image_used": "False",
                "renku.io/git-host": "gitlab.dev.renku.ch",
                "renku.io/gitlabProjectId": "100761",
                "renku.io/hibernatedSecondsThreshold": "259200",
                "renku.io/hibernation": "",
                "renku.io/hibernationBranch": "",
                "renku.io/hibernationCommitSha": "",
                "renku.io/hibernationDate": "",
                "renku.io/hibernationDirty": "",
                "renku.io/hibernationSynchronized": "",
                "renku.io/idleSecondsThreshold": "86400",
                "renku.io/lastActivityDate": "",
                "renku.io/namespace": "johann.thiebaut1",
                "renku.io/projectName": "another-playground-project",
                "renku.io/repository": "https://gitlab.dev.renku.ch/johann.thiebaut1/another-playground-project",
                "renku.io/resourceClassId": "2",
                "renku.io/servername": "flora-2eth-another-2dplayground-2dp-465de908",
                "renku.io/username": "flora.thiebaut@sdsc.ethz.ch"
            },
            "cloudstorage": [],
            "image": "registry.dev.renku.ch/johann.thiebaut1/another-playground-project:3869e38",
            "name": "flora-2eth-another-2dplayground-2dp-465de908",
            "resources": {
                "requests": {
                    "cpu": 0.5,
                    "memory": "2G",
                    "storage": "4G"
                },
                "usage": {}
            },
            "started": "2024-01-11T10:45:17+00:00",
            "state": {
                "pod_name": "flora-2eth-another-2dplayground-2dp-465de908-0"
            },
            "status": {
                "details": [
                    {
                        "status": "ready",
                        "step": "Initialization"
                    },
                    {
                        "status": "ready",
                        "step": "Downloading server image"
                    },
                    {
                        "status": "ready",
                        "step": "Cloning and configuring the repository"
                    },
                    {
                        "status": "ready",
                        "step": "Git credentials services"
                    },
                    {
                        "status": "ready",
                        "step": "Authentication and proxying services"
                    },
                    {
                        "status": "ready",
                        "step": "Auxiliary session services"
                    },
                    {
                        "status": "ready",
                        "step": "Starting session"
                    }
                ],
                "message": "The resource quota has been exceeded.",
                "readyNumContainers": 7,
                "state": "failed",
                "totalNumContainers": 7,
                "warnings": []
            },
            "url": "https://renku-ci-ui-2942.dev.renku.ch/sessions/flora-2eth-another-2dplayground-2dp-465de908"
        }
    }

@leafty
Copy link
Member

leafty commented Jan 11, 2024

Also, it seems that updates to GPU annotations are not taken into account.

@olevski olevski temporarily deployed to renku-ci-nb-1728 January 11, 2024 17:00 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 11, 2024 18:09 — with GitHub Actions Inactive
@olevski
Copy link
Member Author

olevski commented Jan 11, 2024

@leafty the action to use a specific amalthea from a specific branch is not published and it was being used only in this PR. If the deployment from your PR in the ui repo amalthea was not getting deployed. I manually updated the image to test a few times and that is how we got the proper amalthea image which stuck around for a bit.

But then when we redeployed this got wiped and we reverted to the regular amalthea image.

I also found the issue when the session is getting stuck on failed and fixed that.

Lastly, I updated the PR description in your PR in the ui repo and in here. So now things should work.

@leafty
Copy link
Member

leafty commented Jan 12, 2024

@leafty the action to use a specific amalthea from a specific branch is not published and it was being used only in this PR. If the deployment from your PR in the ui repo amalthea was not getting deployed. I manually updated the image to test a few times and that is how we got the proper amalthea image which stuck around for a bit.

But then when we redeployed this got wiped and we reverted to the regular amalthea image.

I also found the issue when the session is getting stuck on failed and fixed that.

Lastly, I updated the PR description in your PR in the ui repo and in here. So now things should work.

The PATCH endpoint seems to work except that GPU annotations are not cleared when needed (change from class with GPU to one with none).

Also, I don't know how feasible it would be to report that the session may be starting soon. On the UI side, the session is shown as failing for one or two minutes before the state goes to "starting". For users, it feels that nothing is happening during that time.

@olevski olevski temporarily deployed to renku-ci-nb-1728 January 12, 2024 11:15 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 12, 2024 13:06 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 12, 2024 13:13 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 12, 2024 13:22 — with GitHub Actions Inactive
@olevski
Copy link
Member Author

olevski commented Jan 12, 2024

@leafty thanks for testing all of this out and finding these edge cases. I addressed the problem with the updates to the pod - it was the result of the statefulset not always causing the pod to restart. And the status of the session should now switch much quicker from Failing to Starting when you update the resource class. After the fix when I tested it took just a second or two to go from Failing to Starting.

@leafty
Copy link
Member

leafty commented Jan 15, 2024

@leafty thanks for testing all of this out and finding these edge cases. I addressed the problem with the updates to the pod - it was the result of the statefulset not always causing the pod to restart. And the status of the session should now switch much quicker from Failing to Starting when you update the resource class. After the fix when I tested it took just a second or two to go from Failing to Starting.

Thanks 🙏 this seems to be working now.
One note: when changing multiple times the session class before it can start (e.g. selecting another class which cannot be scheduled), it seems that the session takes a while to start (added ~2 minutes).

@olevski
Copy link
Member Author

olevski commented Jan 15, 2024

(e.g. selecting another class which cannot be scheduled), it seems that the session takes a while to start (added ~2 minutes).

This is most likely because you cause the session to be scheduled on another node where the image is not cached and has to be downloaded again. Also we have had issues with how quick switch/k8s can remount the drives when the node changes. There is not much that I can do about his unfortunately.

Copy link
Member

@Panaetius Panaetius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be a test for this?

renku_notebooks/api/classes/k8s_client.py Outdated Show resolved Hide resolved
Co-authored-by: Ralf Grubenmann <ralf.grubenmann@sdsc.ethz.ch>
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 15, 2024 16:50 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 16, 2024 10:36 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 16, 2024 13:34 — with GitHub Actions Inactive
@olevski olevski temporarily deployed to renku-ci-nb-1728 January 16, 2024 14:13 — with GitHub Actions Inactive
@olevski olevski merged commit 5219719 into master Jan 16, 2024
36 of 48 checks passed
@olevski olevski deleted the feat-patch-resource-class branch January 16, 2024 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants