Implement the alpha version of RayClusterReplicaSet and RayClusterFleet #161
We will split the ReplicaSet and Fleet implementations into two PRs. After that, we will have a cleanup PR to fix all e2e testing issues.
Check CRD
Update: operator-framework/operator-sdk#6558 add
Problem 1: RayCluster status is not stable (resolved with workaround)
Checking the logs shows this error:
failed to try resolving symlinks in path "/var/log/pods/aibrix-system_rs-jd4ck-head-fccgd_38fd0805-f8b5-4439-838a-dfac97709590/ray-head/2.log": lstat /var/log/pods/aibrix-system_rs-jd4ck-head-fccgd_38fd0805-f8b5-4439-838a-dfac97709590/ray-head/2.log: no such file or directory
Update 2: could this be a ray[default] issue? When I run the Ray image locally, I am not sure what happens; it does not seem stable. Found a related issue: ray-project/ray#45041
Update 3: I tried to disable the dashboard by removing
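One way to rule out the dashboard is to not start it at all. A hypothetical head-group snippet, assuming this CRD follows the KubeRay-style `rayStartParams` convention (which maps to `ray start` flags such as `--include-dashboard`); the exact field names may differ here:

```yaml
# Hypothetical RayCluster template fragment; key names are assumptions
# based on the KubeRay RayCluster spec and ray start's flags.
headGroupSpec:
  rayStartParams:
    include-dashboard: "false"   # do not start the dashboard on the head node
```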
Problem 4: KubeAPIWarningLogger creationTimestamp (not resolved, pending)
The message is shown after I apply the rs object. See elastic/cloud-on-k8s#6379
Update: created kubernetes-sigs/controller-runtime#2956 to track it
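The warning is triggered by a serialized `metadata.creationTimestamp: null` field on applied objects. A minimal stdlib-only sketch of stripping that null field from the serialized JSON before submitting it; `stripNullCreationTimestamp` is a hypothetical helper, not part of this controller:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stripNullCreationTimestamp removes a null metadata.creationTimestamp from a
// serialized Kubernetes object, so the API server's warning logger
// (surfaced by KubeAPIWarningLogger) has nothing to report. Hypothetical
// workaround sketch, not this controller's actual code.
func stripNullCreationTimestamp(raw []byte) ([]byte, error) {
	var obj map[string]interface{}
	if err := json.Unmarshal(raw, &obj); err != nil {
		return nil, err
	}
	if meta, ok := obj["metadata"].(map[string]interface{}); ok {
		// JSON null unmarshals to a nil interface value.
		if ts, present := meta["creationTimestamp"]; present && ts == nil {
			delete(meta, "creationTimestamp")
		}
	}
	return json.Marshal(obj)
}

func main() {
	in := []byte(`{"metadata":{"name":"rs-1","creationTimestamp":null}}`)
	out, err := stripNullCreationTimestamp(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"metadata":{"name":"rs-1"}}
}
```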
Problem 5: updating replicas does not seem to trigger scale up/down (resolved)
Update: expectations usage issue; I used the wrong key and the wrong workflow earlier.
Problem 6: deletion and recreation do not work as expected (resolved)
Update: expectations usage issue; I used the wrong key and the wrong workflow earlier. The root cause is the same as Problem 5.
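For context on the "wrong key" bug in Problems 5 and 6: controllers that borrow the kube-controller-manager expectations pattern must key pending creations by the parent object's namespace/name, not the child's, or `Satisfied` never flips and syncs are skipped. A stdlib-only sketch of the pattern (simplified; the real `ControllerExpectations` in kubernetes also tracks deletions and timeouts):

```go
package main

import (
	"fmt"
	"sync"
)

// expectations tracks how many child creations a controller still expects to
// observe, keyed by the PARENT's namespace/name. Keying by the child was the
// bug behind Problems 5 and 6 in this issue.
type expectations struct {
	mu      sync.Mutex
	pending map[string]int
}

func newExpectations() *expectations {
	return &expectations{pending: map[string]int{}}
}

// ExpectCreations records that a sync just issued n creates for parentKey.
func (e *expectations) ExpectCreations(parentKey string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[parentKey] += n
}

// CreationObserved is called from the informer's Add event handler.
func (e *expectations) CreationObserved(parentKey string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.pending[parentKey] > 0 {
		e.pending[parentKey]--
	}
}

// Satisfied reports whether it is safe to run another sync for parentKey.
func (e *expectations) Satisfied(parentKey string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.pending[parentKey] == 0
}

func main() {
	exp := newExpectations()
	key := "aibrix-system/rs-jd4ck" // the parent replicaset's key, not a child cluster's
	exp.ExpectCreations(key, 2)
	fmt.Println(exp.Satisfied(key)) // false: two creates not yet observed
	exp.CreationObserved(key)
	exp.CreationObserved(key)
	fmt.Println(exp.Satisfied(key)) // true: safe to sync again
}
```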
Better to convert to YAML for consistency.
RayCluster Fleet issues
Problem 1: if kind is a CRD, it should be installed before calling Start (resolved)
Confirmed it exists; the problem is the start sequence. If the dependency always goes first, we do not have such issues.
Problem 2: control plane goes down once a fleet is created (resolved)
The controller failed to create the single ReplicaSet, and as a result the apiserver failed to respond on Docker Desktop for Mac. I reproduced this issue using kind. Update: it seems to be the ReplicaSet controller's problem; it repeatedly creates many clusters. Make sure the
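The runaway creation described above is the classic symptom of a sync loop that creates children without counting the ones it already owns. A simplified sketch of the guard (`clustersToCreate` is a hypothetical helper; the real reconciler would list children by owner reference):

```go
package main

import "fmt"

// clustersToCreate returns how many new child clusters one sync pass should
// create: desired replicas minus children already owned by this replicaset.
// Creating `desired` children unconditionally on every sync is what floods
// the apiserver with duplicate clusters.
func clustersToCreate(desired int, existingOwned []string) int {
	n := desired - len(existingOwned)
	if n < 0 {
		return 0 // surplus children are handled by the scale-down path
	}
	return n
}

func main() {
	owned := []string{"rs-jd4ck-0", "rs-jd4ck-1"} // children found via owner refs
	fmt.Println(clustersToCreate(3, owned))       // 1: create exactly one more
	fmt.Println(clustersToCreate(2, owned))       // 0: nothing to create
}
```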
Known issues have been resolved, and the feature has been tested with simple cases. We can close this issue for now.
🚀 Feature Description and Motivation
We need to prioritize this feature a bit; let's add the controller implementation to support multi-node deployment.
Use Case
To support multi-node vLLM.
Proposed Solution
No response