
Implement the alpha version of RayClusterReplicaSet and RayClusterFleet #161

Closed
Jeffwan opened this issue Sep 11, 2024 · 17 comments · Fixed by #194
Labels: area/distributed, kind/feature, priority/critical-urgent
Milestone: v0.1.0-rc.2

Comments


Jeffwan commented Sep 11, 2024

🚀 Feature Description and Motivation

We need to prioritize this feature a bit: let's add the controller implementation to support multi-node deployment.

Use Case

To support multi-node vLLM deployments.

Proposed Solution

No response

@Jeffwan Jeffwan added this to the v0.1.0-rc.2 milestone Sep 11, 2024
@Jeffwan Jeffwan added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/feature Categorizes issue or PR as related to a new feature. area/distributed labels Sep 11, 2024

Jeffwan commented Sep 12, 2024

We will split the ReplicaSet and Fleet implementations into two PRs. After that, we will have a cleanup PR to fix all e2e testing issues.


Jeffwan commented Sep 20, 2024

[screenshot]
The fleet controller watches the resources, but the kind is not installed.

Eventually, this fails the controller deployment:

2024-09-20T23:36:59Z	ERROR	setup	problem running manager	{"error": "failed to wait for rayclusterreplicaset caches to sync: timed out waiting for cache to be synced for Kind *v1.RayCluster"}
main.main
	/workspace/cmd/controllers/main.go:208
runtime.main
	/usr/local/go/src/runtime/proc.go:271

Update:

  1. Install the CRD.
  2. Even if the CRD is installed, we still need to add the type to the scheme so the controller knows how to process such objects. This is required:
if err := rayv1.AddToScheme(mgr.GetScheme()); err != nil {
    setupLog.Error(err, "unable to add RayCluster to scheme")
    os.Exit(1)
}

@Jeffwan Jeffwan reopened this Sep 21, 2024

Jeffwan commented Sep 21, 2024

Error from server (BadRequest): error when creating "fleet.yaml": RayClusterFleet in version "v1alpha1" cannot be handled as a RayClusterFleet: strict decoding error: unknown field "spec.template.metadata.labels"

Check CRD

[screenshot]
This looks correct, similar to PodTemplateSpec.


Update: per operator-framework/operator-sdk#6558, adding generateEmbeddedObjectMeta=true to controller-gen resolves the problem.
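For reference, a sketch of the fix, assuming the CRDs are generated with controller-gen (the `manifests` target name and the other flags here are illustrative, not necessarily the repo's actual Makefile):

```makefile
# Enable embedded ObjectMeta generation so spec.template.metadata keeps its
# labels/annotations fields instead of being rejected by strict decoding.
manifests: controller-gen
	$(CONTROLLER_GEN) crd:generateEmbeddedObjectMeta=true rbac:roleName=manager-role webhook \
		paths="./..." output:crd:artifacts:config=config/crd/bases
```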


Jeffwan commented Sep 21, 2024

2024-09-21T01:18:07-07:00	ERROR	Reconciler error	{"controller": "rayclusterfleet", "controllerGroup": "orchestration.aibrix.ai", "controllerKind": "RayClusterFleet", "RayClusterFleet": {"name":"fleet","namespace":"aibrix-system"}, "namespace": "aibrix-system", "name": "fleet", "reconcileID": "01c0ad3a-0adb-4b85-a509-7a646a2f4bb0", "error": "expected pointer, but got nil"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/usr/local/go-1.21/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/usr/local/go-1.21/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/usr/local/go-1.21/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227

Update:
Hmm, a simple error that should have been avoided: `var fleet *orchestrationv1alpha1.RayClusterFleet` declares a nil pointer; it should be `fleet := &orchestrationv1alpha1.RayClusterFleet{}`.


Jeffwan commented Sep 21, 2024


$ k get rayclusterfleet
NAME    AGE
fleet   37s
$ k get rayclusterreplicaset
NAME              AGE
fleet-bbd4bb64b   36s


Jeffwan commented Sep 22, 2024

Problem 1: RayCluster Status is not stable (resolved with workaround)

[screenshot]

It keeps crashing:

[screenshots]

Update 1: I set the CPU limit to 400m, but the start script rounds it up to 1; I thought that could be the problem. After changing it to 1, it still doesn't work.

Checking the logs, I found this:
[screenshot]

failed to try resolving symlinks in path "/var/log/pods/aibrix-system_rs-jd4ck-head-fccgd_38fd0805-f8b5-4439-838a-dfac97709590/ray-head/2.log": lstat /var/log/pods/aibrix-system_rs-jd4ck-head-fccgd_38fd0805-f8b5-4439-838a-dfac97709590/ray-head/2.log: no such file or directory

Update 2: could this be a ray[default] issue? When I run the Ray image locally:

ray start --head  --num-cpus=1  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365

[screenshot]
Not sure what happened; it seems unstable.

Found a related issue: ray-project/ray#45041

tail -f /tmp/ray/session_latest/logs/dashboard_agent.log

2024-09-21 23:40:32,268	INFO http_server_agent.py:78 -- Registered 20 routes.
2024-09-21 23:40:32,271	INFO process_watcher.py:46 -- raylet pid is 197
2024-09-21 23:40:32,271	INFO process_watcher.py:66 -- check_parent_via_pipe
2024-09-21 23:40:32,273	INFO event_agent.py:56 -- Report events to 10.1.3.138:38937
2024-09-21 23:40:32,273	INFO event_utils.py:130 -- Monitor events logs modified after 1726985432.0650578 on /tmp/ray/session_2024-09-21_23-40-27_373576_1/logs/events, the source types are all.
2024-09-21 23:40:37,335	WARNING utils.py:592 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 1.2 to 1.
2024-09-21 23:40:39,183	INFO web_log.py:211 -- 127.0.0.1 [21/Sep/2024:22:40:39 -0800] "GET /api/local_raylet_healthz HTTP/1.1" 200 163 "-" "Wget/1.21.2"
command terminated with exit code 137

Update 3: I tried disabling the dashboard by removing --dashboard-host=0.0.0.0, since that works well in local container testing. However, rayStartParams: {} still injects --dashboard-host=0.0.0.0, which is definitely a bug. In the end I used the very old rayproject/ray:2.10.0 image, which works well on a single node, but it still didn't pass the health check, so I added the following env to bypass the probe injection:

        env:
        - name: ENABLE_PROBES_INJECTION
          value: "false"


Jeffwan commented Sep 22, 2024

Problem 2: Restarts lead to recreation (resolved)



Update:

Even though a Ray cluster exists, it returns nil here:
[screenshot]
Checking the implementation: a TODO had been left here.
[screenshot]

Updating the code resolves the problem.


Jeffwan commented Sep 22, 2024

Problem 3: Delete RS doesn't trigger the cascaded deletion. (resolved)



Update: there's a typo in the object; the package should be orchestration.

[screenshot]


Jeffwan commented Sep 22, 2024

Problem 4: KubeAPIWarningLogger creationTimestamp warning (not resolved, pending)

[screenshot]

The message is shown after I apply the RS object.

Related issue: elastic/cloud-on-k8s#6379

Kubernetes version:

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2

resources


Update:

k explain rayclusters.spec.headGroupSpec.template.metadata

GROUP:      ray.io
KIND:       RayCluster
VERSION:    v1

FIELD: metadata <Object>

DESCRIPTION:
    <empty>
FIELDS:
  annotations	<map[string]string>
    <no description>

  finalizers	<[]string>
    <no description>

  labels	<map[string]string>
    <no description>

  name	<string>
    <no description>

  namespace	<string>
    <no description>
  1. creationTimestamp is not a known field from the API's point of view; it may be reserved.
  2. We don't set any timestamp in the code.

Created an issue, kubernetes-sigs/controller-runtime#2956, to track it.


Jeffwan commented Sep 22, 2024

Problem 5: updating replicas doesn't seem to trigger scale up/down


Update: an expectations usage issue; I used the wrong key and the wrong workflow earlier.
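For context, a minimal sketch of the expectations pattern (modeled on the upstream ReplicaSet controller's ControllerExpectations; the names here are illustrative, not this repo's actual code): the key must be the controller's namespace/name, ExpectCreations is called before issuing creates, and CreationObserved is called from the informer's add handler. The controller skips a full sync until expectations are satisfied.

```go
package main

import (
	"fmt"
	"sync"
)

// Expectations tracks how many creations a controller still expects to
// observe for a given controller key (namespace/name of the ReplicaSet).
type Expectations struct {
	mu   sync.Mutex
	adds map[string]int
}

func NewExpectations() *Expectations {
	return &Expectations{adds: map[string]int{}}
}

// ExpectCreations is called before the controller issues n create calls.
func (e *Expectations) ExpectCreations(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.adds[key] += n
}

// CreationObserved is called from the informer add handler for each new
// object whose owner resolves to this controller key.
func (e *Expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.adds[key] > 0 {
		e.adds[key]--
	}
}

// SatisfiedExpectations reports whether the controller may run a full sync.
func (e *Expectations) SatisfiedExpectations(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.adds[key] == 0
}

func main() {
	exp := NewExpectations()
	key := "aibrix-system/fleet-8f544c95c" // the controller's key, not a child's
	exp.ExpectCreations(key, 2)
	fmt.Println(exp.SatisfiedExpectations(key)) // false: 2 creates pending
	exp.CreationObserved(key)
	exp.CreationObserved(key)
	fmt.Println(exp.SatisfiedExpectations(key)) // true: safe to sync again
}
```

Using a child object's key (or skipping CreationObserved in the event handler) leaves expectations permanently unsatisfied or prematurely satisfied, which matches the missing scale up/down symptom above.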


Jeffwan commented Sep 22, 2024

Problem 6: deletion and recreation don't work as expected


Update: the same expectations usage issue (wrong key and wrong workflow); the root cause is the same as Problem 5.


Jeffwan commented Sep 22, 2024

Problem 7: Operation cannot be fulfilled



Update: not done yet.
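For the record, the usual remedy for this class of error (not necessarily what will be applied here) is to retry the status update against the latest object version, as client-go's retry.RetryOnConflict does. A self-contained sketch of the pattern, with a simulated conflict instead of a real apiserver:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict simulates "Operation cannot be fulfilled ... the object has
// been modified; please apply your changes to the latest version".
var errConflict = errors.New("the object has been modified")

// retryOnConflict re-runs update (which should re-fetch the latest object
// and reapply the change) until it succeeds, hits a non-conflict error,
// or exhausts its attempts.
func retryOnConflict(attempts int, update func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = update()
		if err == nil || !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	calls := 0
	err := retryOnConflict(3, func() error {
		calls++
		if calls == 1 {
			return errConflict // first write races with another update
		}
		return nil // second attempt, rebased on the latest version, succeeds
	})
	fmt.Println(err, calls) // <nil> 2
}
```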


Jeffwan commented Sep 22, 2024

# Add the Helm repo
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Confirm the repo exists
helm search repo kuberay --devel

# Install both CRDs and KubeRay operator v1.2.1
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.1

Better to convert this to YAML for consistency.


Jeffwan commented Sep 23, 2024

RayClusterFleet issues


Jeffwan commented Sep 23, 2024

Problem 1: if kind is a CRD, it should be installed before calling Start (resolved)

2024-09-23T05:31:01Z	ERROR	controller-runtime.source.EventHandler	if kind is a CRD, it should be installed before calling Start	{"kind": "RayCluster.ray.io", "error": "no matches for kind \"RayCluster\" in version \"ray.io/v1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/source/kind.go:56

Confirmed the CRD exists; the problem is the startup sequence. If the dependency always goes first, we don't have such issues.

$ k get crds
rayclusters.ray.io                              2024-09-23T05:31:41Z


Jeffwan commented Sep 23, 2024

Problem 2: control plane goes down once fleet is created (resolved)

2024-09-23T05:43:31Z	INFO	KubeAPIWarningLogger	unknown field "spec.template.metadata.creationTimestamp"
2024-09-23T05:43:31Z	INFO	KubeAPIWarningLogger	unknown field "spec.template.spec.headGroupSpec.template.metadata.creationTimestamp"
2024-09-23T05:43:31Z	DEBUG	events	Scaled up replica set fleet-8f544c95c to 1	{"type": "Normal", "object": {"kind":"RayClusterFleet","namespace":"aibrix-system","name":"fleet","uid":"9fe394e5-4d27-472d-809c-95a365783ef9","apiVersion":"orchestration.aibrix.ai/v1alpha1","resourceVersion":"3319"}, "reason": "ScalingReplicaSet"}
2024-09-23T05:43:31Z	INFO	KubeAPIWarningLogger	unknown field "spec.headGroupSpec.template.metadata.creationTimestamp"
E0923 05:43:31.253350       1 rayclusterreplicaset_controller.go:228] "unable to update ReplicaSet status" err="Operation cannot be fulfilled on rayclusterreplicasets.orchestration.aibrix.ai \"fleet-8f544c95c\": the object has been modified; please apply your changes to the latest version and try again"
2024-09-23T05:43:31Z	ERROR	Reconciler error	{"controller": "rayclusterreplicaset", "controllerGroup": "orchestration.aibrix.ai", "controllerKind": "RayClusterReplicaSet", "RayClusterReplicaSet": {"name":"fleet-8f544c95c","namespace":"aibrix-system"}, "namespace": "aibrix-system", "name": "fleet-8f544c95c", "reconcileID": "e48de167-c48a-42e1-9e2c-e6e4bd2d44e7", "error": "Operation cannot be fulfilled on rayclusterreplicasets.orchestration.aibrix.ai \"fleet-8f544c95c\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227

[screenshot]

The controller failed to create the single ReplicaSet, and as a result the apiserver failed to respond on Mac desktop. I reproduced this issue using kind.


Update: it seems to be the RS controller's problem; it repeatedly creates many clusters.
A few issues: the labels were not created for the RS earlier, and the RS now carries the pod-template label, which should not be in the cluster filters.

[screenshot]

Making sure newRSTemplate.Labels clones the pod-template-hash label resolves the issue.
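A small sketch of the label handling (the helper name is hypothetical; the real code builds the RS template from the fleet spec): copy the template labels into a fresh map and add pod-template-hash, so the fleet's own label map is not mutated and the RS's children carry the hash.

```go
package main

import "fmt"

// cloneWithHash returns a new label map containing the template labels
// plus the pod-template-hash label. Copying into a fresh map avoids
// mutating the fleet's template labels in place.
func cloneWithHash(templateLabels map[string]string, hash string) map[string]string {
	out := make(map[string]string, len(templateLabels)+1)
	for k, v := range templateLabels {
		out[k] = v
	}
	out["pod-template-hash"] = hash
	return out
}

func main() {
	fleetLabels := map[string]string{"app": "fleet"}
	rsLabels := cloneWithHash(fleetLabels, "8f544c95c")
	fmt.Println(rsLabels["pod-template-hash"]) // 8f544c95c
	fmt.Println(len(fleetLabels))              // 1: the original map is untouched
}
```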


Jeffwan commented Sep 23, 2024

The known issues have been resolved, and the feature has been tested with simple cases. We can close this issue for now.

@Jeffwan Jeffwan closed this as completed Sep 23, 2024