-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow patching of resource class #1728
Conversation
You can access the deployment of this PR at https://renku-ci-nb-1728.dev.renku.ch |
I am also seeing a case where the server got running but is still reported as failed: {
"flora-2eth-another-2dplayground-2dp-465de908": {
"annotations": {
"renku.io/branch": "master",
"renku.io/commit-sha": "3869e389a4353455d82801b4806b04cbb43373d8",
"renku.io/default_image_used": "False",
"renku.io/git-host": "gitlab.dev.renku.ch",
"renku.io/gitlabProjectId": "100761",
"renku.io/hibernatedSecondsThreshold": "259200",
"renku.io/hibernation": "",
"renku.io/hibernationBranch": "",
"renku.io/hibernationCommitSha": "",
"renku.io/hibernationDate": "",
"renku.io/hibernationDirty": "",
"renku.io/hibernationSynchronized": "",
"renku.io/idleSecondsThreshold": "86400",
"renku.io/lastActivityDate": "",
"renku.io/namespace": "johann.thiebaut1",
"renku.io/projectName": "another-playground-project",
"renku.io/repository": "https://gitlab.dev.renku.ch/johann.thiebaut1/another-playground-project",
"renku.io/resourceClassId": "2",
"renku.io/servername": "flora-2eth-another-2dplayground-2dp-465de908",
"renku.io/username": "flora.thiebaut@sdsc.ethz.ch"
},
"cloudstorage": [],
"image": "registry.dev.renku.ch/johann.thiebaut1/another-playground-project:3869e38",
"name": "flora-2eth-another-2dplayground-2dp-465de908",
"resources": {
"requests": {
"cpu": 0.5,
"memory": "2G",
"storage": "4G"
},
"usage": {}
},
"started": "2024-01-11T10:45:17+00:00",
"state": {
"pod_name": "flora-2eth-another-2dplayground-2dp-465de908-0"
},
"status": {
"details": [
{
"status": "ready",
"step": "Initialization"
},
{
"status": "ready",
"step": "Downloading server image"
},
{
"status": "ready",
"step": "Cloning and configuring the repository"
},
{
"status": "ready",
"step": "Git credentials services"
},
{
"status": "ready",
"step": "Authentication and proxying services"
},
{
"status": "ready",
"step": "Auxiliary session services"
},
{
"status": "ready",
"step": "Starting session"
}
],
"message": "The resource quota has been exceeded.",
"readyNumContainers": 7,
"state": "failed",
"totalNumContainers": 7,
"warnings": []
},
"url": "https://renku-ci-ui-2942.dev.renku.ch/sessions/flora-2eth-another-2dplayground-2dp-465de908"
}
} |
Also, it seems that updates to GPU annotations are not taken into account. |
@leafty the action to use a specific amalthea from a specific branch is not published and it was being used only in this PR. If the deployment from your PR in the ui repo amalthea was not getting deployed. I manually updated the image to test a few times and that is how we got the proper amalthea image which stuck around for a bit. But then when we redeployed this got wiped and we reverted to the regular amalthea image. I also found the issue when the session is getting stuck on failed and fixed that. Lastly, I updated the PR description in your PR in the ui repo and in here. So now things should work. |
The PATCH endpoint seems to work except that GPU annotations are not cleared when needed (change from class with GPU to one with none). Also, I don't know how feasible it would be to report that the session may be starting soon. On the UI side, the session is shown as failing for one or two minutes before the state goes to "starting". For users, it feels that nothing is happening during that time. |
@leafty thanks for testing all of this out and finding these edge cases. I addressed the problem with the updates to the pod - it was the result of the statefulset not always causing the pod to restart. And the status of the session should now switch much quicker from Failing to Starting when you update the resource class. After the fix when I tested it took just a second or two to go from Failing to Starting. |
Thanks 🙏 this seems to be working now. |
This is most likely because you cause the session to be scheduled on another node where the image is not cached and has to be downloaded again. Also we have had issues with how quick switch/k8s can remount the drives when the node changes. There is not much that I can do about his unfortunately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be a test for this?
Co-authored-by: Ralf Grubenmann <ralf.grubenmann@sdsc.ethz.ch>
This allows the resource class on a running server to be patched. But only if the server is hibernated.
The reason for the limitation is to avoid people from inadvertantly restarting their sessions. Because if we patch the resource class when the server is running k8s will restart the session pod.
/deploy renku-ui=leafty/2941-feat-modify-session renku=rclone-sessions extra-values=amalthea.image.repository=registry.dev.renku.ch/tasko.olevski/renku-images/amalthea,amalthea.image.tag=0.10.0-0.dev.git.205.hb1d4c56