Enhance the model adapter stability to alpha grade #136

Closed
Jeffwan opened this issue Sep 6, 2024 · 8 comments · Fixed by #260
Assignees
Labels
area/lora kind/enhancement New feature or request priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
Milestone

Comments

Collaborator

Jeffwan commented Sep 6, 2024

🚀 Feature Description and Motivation

  1. The model adapter controller lacks sufficient test coverage.
  2. Some behaviors, such as spec.baseModel and secretRef, are not smooth yet -> we need a Model abstraction later.
  3. The per-LoRA scheduler name may not work well. Could it be a controller-level configuration? -> As with pods, we should allow it.
  4. Status and conditions need to be double-checked.
  5. Limit the maximum rank supported by base model replicas; otherwise they may run out of memory (OOM) later -> controlled by the scheduler at the moment.

Use Case

No response

Proposed Solution

No response

@Jeffwan Jeffwan added kind/enhancement New feature or request area/lora priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Sep 6, 2024
@Jeffwan Jeffwan added this to the v0.1.0-rc.2 milestone Sep 6, 2024
@Jeffwan Jeffwan self-assigned this Sep 6, 2024
Collaborator Author

Jeffwan commented Sep 23, 2024

A few follow-ups:

  1. If the scheduler restarts, how does it know the LoRA rank size? -> Retrieve it from the model list?
  2. If we'd like to support replicas, what is the primary change, and how do we guarantee there are no duplicate-creation issues?

@Jeffwan Jeffwan modified the milestones: v0.1.0-rc.2, v0.1.0-rc.3 Sep 24, 2024
Collaborator Author

Jeffwan commented Sep 27, 2024

Stability Improvement

Scenario 0: Any status or unloaded issues.

Unloaded issues are normally related to pod restarts and similar events, after which the LoRA information is missing. The Pod watch has now been corrected, and all related issues should be resolved.

Status issues should be resolved for most major workflows. This needs more improvement later.

Scenario 1: Delete base model, then delete lora

image
image

The adapter still shows as running. This is definitely a bug: the controller should enqueue the right model adapter objects so that it can update their status.

Scenario 2: Create adapter, delete adapter, create adapter.

#239 is reproducible.

Scenario 3: Model adapter deletion fails or takes a long time

Deletion doesn't get stuck; it works fine.
image

Collaborator Author

Jeffwan commented Sep 29, 2024

Scenario 3: Some pods are in Terminating status when the adapter is applied

2024-09-29T17:21:47Z	INFO	Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler	{"controller": "model-adapter-controller", "controllerGroup": "model.aibrix.ai", "controllerKind": "ModelAdapter", "ModelAdapter": {"name":"lora-1","namespace":"aibrix-system"}, "namespace": "aibrix-system", "name": "lora-1", "reconcileID": "3c12e51b-5dcf-4757-93e9-ebcd3bd094a4"}
2024-09-29T17:21:47Z	ERROR	Reconciler error	{"controller": "model-adapter-controller", "controllerGroup": "model.aibrix.ai", "controllerKind": "ModelAdapter", "ModelAdapter": {"name":"lora-1","namespace":"aibrix-system"}, "namespace": "aibrix-system", "name": "lora-1", "reconcileID": "3c12e51b-5dcf-4757-93e9-ebcd3bd094a4", "error": "pod does not exist in the cache: llama2-70b-8fd6c849b-hkwnl"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
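
The warning above is emitted by controller-runtime when a reconciler returns a non-zero result together with a non-nil error; the result is then ignored. One possible way to avoid it is sketched below with local stand-in types. `Result`, `errPodNotInCache`, `lookupPod`, and `reconcileOnce` are hypothetical names for illustration, not the AIBrix code, and the 5-second delay is an arbitrary assumption. The idea is to treat a missing pod as an expected transient state and return a delayed requeue with a nil error, and to return an error alone for everything else.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result is a local stand-in for controller-runtime's reconcile.Result.
type Result struct {
	RequeueAfter time.Duration
}

// errPodNotInCache is a hypothetical sentinel for "pod does not exist in the cache".
var errPodNotInCache = errors.New("pod does not exist in the cache")

// lookupPod is a hypothetical cache lookup; it fails for pods that are gone.
func lookupPod(cache map[string]bool, name string) error {
	if !cache[name] {
		return fmt.Errorf("%w: %s", errPodNotInCache, name)
	}
	return nil
}

// reconcileOnce returns either a requeue result or an error, never both.
func reconcileOnce(cache map[string]bool, podName string) (Result, error) {
	err := lookupPod(cache, podName)
	switch {
	case err == nil:
		return Result{}, nil
	case errors.Is(err, errPodNotInCache):
		// Expected while a pod is terminating: retry later without an error.
		return Result{RequeueAfter: 5 * time.Second}, nil
	default:
		// Unexpected failure: return the error alone and rely on backoff.
		return Result{}, err
	}
}

func main() {
	cache := map[string]bool{"base-model-pod-a": true}
	res, err := reconcileOnce(cache, "base-model-pod-b")
	fmt.Println(res.RequeueAfter, err)
}
```

Whether a missing pod should be retried or surfaced as an error is a design choice; the sketch only shows that the two return values are not mixed.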

Collaborator Author

Jeffwan commented Sep 29, 2024

Misc 1: model adapter didn't filter out unrelated pods (resolved) #255

I0929 17:12:28.773609       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="envoy-gateway-system/envoy-gateway-8484c95484-hkhr5"
I0929 17:12:28.773829       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="kube-system/vpnkit-controller"
I0929 17:12:28.773852       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="envoy-gateway-system/envoy-aibrix-system-aibrix-eg-903790dc-865ff88d5f-5bvxn"
I0929 17:12:28.773871       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="aibrix-system/aibrix-kuberay-operator-7648cb948-59tcq"
I0929 17:12:28.773886       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="kube-system/coredns-76f75df574-k5n5b"
I0929 17:12:28.773908       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="kube-system/kube-apiserver-docker-desktop"
I0929 17:12:28.773944       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="kube-system/kube-controller-manager-docker-desktop"
I0929 17:12:28.773962       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="kube-system/kube-scheduler-docker-desktop"
I0929 17:12:28.773983       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="aibrix-system/aibrix-controller-manager-56576666d6-xq5cp"
I0929 17:12:28.773998       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="kube-system/coredns-76f75df574-9jrxk"
I0929 17:12:28.774013       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="kube-system/etcd-docker-desktop"
I0929 17:12:28.774026       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="kube-system/storage-provisioner"
I0929 17:12:28.774042       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="aibrix-system/aibrix-gateway-users-6c5dc67c96-w25zp"
I0929 17:12:28.774055       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="aibrix-system/aibrix-redis-master-84769768cb-zmsm8"
I0929 17:12:28.774068       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="kube-system/kube-proxy-fkgpd"
I0929 17:12:28.774085       1 modeladapter_controller.go:176] "ModelAdapter resource not found. Ignoring since object mush be deleted" modelAdapter="aibrix-system/aibrix-gateway-plugins-9c68c4bb7-ztfs7"

Unrelated pods should be filtered out. The noise occurs because the model adapter controller calls Watches(&corev1.Pod{}, &handler.EnqueueRequestForObject{}) without filtering for related pods, for example by ownership type or labels. Adding label predicates resolves the issue, but in the future we have to use a specific label to indicate the pod-to-LoRA relationship.
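
The label-based filtering described above can be sketched with plain Go types. This is a minimal model, not the fix merged in #255: `Pod` is a local stand-in for corev1.Pod, and the label key `example.com/lora-enabled` is a hypothetical placeholder. In a real controller, a function like `hasAdapterLabel` would typically be wrapped with controller-runtime's `predicate.NewPredicateFuncs` and attached to the Pod watch via `builder.WithPredicates`.

```go
package main

import "fmt"

// Pod is a local stand-in for corev1.Pod, reduced to the fields used here.
type Pod struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// adapterLabelKey is a hypothetical label marking pods that can host LoRA adapters.
const adapterLabelKey = "example.com/lora-enabled"

// hasAdapterLabel reports whether events for this pod should reach the reconciler.
func hasAdapterLabel(p Pod) bool {
	return p.Labels[adapterLabelKey] == "true"
}

// filterPods keeps only the pods accepted by the predicate.
func filterPods(pods []Pod, keep func(Pod) bool) []Pod {
	var out []Pod
	for _, p := range pods {
		if keep(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{Namespace: "kube-system", Name: "coredns"},
		{
			Namespace: "aibrix-system",
			Name:      "base-model-pod",
			Labels:    map[string]string{adapterLabelKey: "true"},
		},
	}
	for _, p := range filterPods(pods, hasAdapterLabel) {
		fmt.Printf("enqueue %s/%s\n", p.Namespace, p.Name)
	}
}
```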

Collaborator Author

Jeffwan commented Sep 29, 2024

Misc 2: Base model pod events cannot trigger adapter reconciliation

image

Since the pod is not owned by the ModelAdapter, we cannot easily trigger adapter reconciliation. The controller should do a reverse lookup over the adapter list to find which adapters have this pod name in their status, then put those model adapters in the queue.

&handler.EnqueueRequestForObject{} was not used correctly here. EnqueueRequestForObject puts the pod itself in the queue, and EnqueueRequestForOwner puts the owner in the queue, but we cannot set an owner. EnqueueRequestsFromMapFunc makes sense because it supports mapping to an arbitrary collection of objects.
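
The reverse lookup described above can be sketched as a pure mapping function. The types are local stand-ins, and the `Instances` status field is a hypothetical name for wherever the adapter records the pods it is loaded on. In a real controller, the same logic would live inside the map function passed to `handler.EnqueueRequestsFromMapFunc`, listing ModelAdapter objects through the client instead of taking a slice.

```go
package main

import "fmt"

// ModelAdapter is a local stand-in for the ModelAdapter resource.
// Instances is a hypothetical status field listing pods that loaded the adapter.
type ModelAdapter struct {
	Namespace string
	Name      string
	Instances []string
}

// Request is a local stand-in for reconcile.Request.
type Request struct {
	Namespace string
	Name      string
}

// adaptersForPod maps a pod event to the adapters whose status references that pod.
func adaptersForPod(adapters []ModelAdapter, podNamespace, podName string) []Request {
	var reqs []Request
	for _, a := range adapters {
		if a.Namespace != podNamespace {
			continue
		}
		for _, inst := range a.Instances {
			if inst == podName {
				reqs = append(reqs, Request{Namespace: a.Namespace, Name: a.Name})
				break
			}
		}
	}
	return reqs
}

func main() {
	adapters := []ModelAdapter{
		{Namespace: "aibrix-system", Name: "lora-1", Instances: []string{"base-model-pod-a"}},
		{Namespace: "aibrix-system", Name: "lora-2", Instances: []string{"base-model-pod-b"}},
	}
	for _, r := range adaptersForPod(adapters, "aibrix-system", "base-model-pod-a") {
		fmt.Printf("enqueue adapter %s/%s\n", r.Namespace, r.Name)
	}
}
```

A linear scan is enough for a sketch; with many adapters, a field index on the pod name would avoid listing every adapter on each pod event.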

https://book-v1.book.kubebuilder.io/basics/simple_controller

Collaborator Author

Jeffwan commented Sep 30, 2024

Misc 3: Creating a LoRA always hits the outdated-object issue

24-09-30T05:31:20Z	ERROR	Reconciler error	{"controller": "model-adapter-controller", "controllerGroup": "model.aibrix.ai", "controllerKind": "ModelAdapter", "ModelAdapter": {"name":"lora-1","namespace":"aibrix-system"}, "namespace": "aibrix-system", "name": "lora-1", "reconcileID": "04f54bb5-9ff3-4f98-8d2a-de18f65a7745", "error": "Operation cannot be fulfilled on modeladapters.model.aibrix.ai \"lora-1\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
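
The "object has been modified" error above is the API server's optimistic concurrency check: an update carrying a stale resource version is rejected. A common remedy is to re-read the latest object and reapply the change, which is what client-go's `retry.RetryOnConflict` helper does. The sketch below models that pattern with an in-memory store; `Store`, `Object`, `errConflict`, and `updateWithRetry` are hypothetical names, and this is not necessarily how the issue was resolved in the controller.

```go
package main

import (
	"errors"
	"fmt"
)

// Object is a minimal stand-in for a versioned API object.
type Object struct {
	ResourceVersion int
	Phase           string
}

// errConflict models the API server's conflict response.
var errConflict = errors.New("the object has been modified")

// Store is an in-memory stand-in for the API server.
type Store struct {
	obj Object
}

// Get returns a copy of the stored object.
func (s *Store) Get() Object { return s.obj }

// Update rejects writes based on a stale resource version.
func (s *Store) Update(o Object) error {
	if o.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.obj = o
	return nil
}

// updateWithRetry re-reads the latest object before each attempt.
func updateWithRetry(s *Store, attempts int, mutate func(*Object)) error {
	for i := 0; i < attempts; i++ {
		latest := s.Get()
		mutate(&latest)
		err := s.Update(latest)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return errConflict
}

func main() {
	s := &Store{obj: Object{ResourceVersion: 1}}
	raced := false
	err := updateWithRetry(s, 3, func(o *Object) {
		if !raced {
			// Simulate a concurrent writer bumping the version once.
			raced = true
			s.obj.ResourceVersion++
		}
		o.Phase = "Bound"
	})
	fmt.Println(err, s.Get().Phase, s.Get().ResourceVersion)
}
```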

Collaborator Author

Jeffwan commented Sep 30, 2024

Testing after base model pod is deleted.

image

image

t0: still the same as usual
t1: IP not found, pod in Terminating status
t2: eventually failed

Conclusion: This is a reconciliation loop issue. We should check whether the selected pod (selectPod) is terminating and, if so, take the corresponding action.
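
The check from the conclusion above can be sketched as follows. In Kubernetes, a pod that is being deleted has a non-nil deletion timestamp, so that field is the usual signal for "terminating". `Pod` is a local stand-in for corev1.Pod, and the action names are hypothetical: the thread does not specify the corresponding action, so rescheduling the adapter to another pod is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a local stand-in for corev1.Pod, reduced to the fields used here.
type Pod struct {
	Name              string
	DeletionTimestamp *time.Time
}

// Action is a hypothetical outcome of inspecting the selected pod.
type Action string

const (
	ActionKeep       Action = "keep"
	ActionReschedule Action = "reschedule"
)

// isTerminating reports whether the pod has been marked for deletion.
func isTerminating(p Pod) bool {
	return p.DeletionTimestamp != nil
}

// nextAction decides what the reconcile loop does with the selected pod.
// Rescheduling on termination is an assumption, not the project's documented behavior.
func nextAction(selected *Pod) Action {
	if selected == nil || isTerminating(*selected) {
		return ActionReschedule
	}
	return ActionKeep
}

// newTerminatingPod builds a pod that is already marked for deletion.
func newTerminatingPod(name string) Pod {
	now := time.Now()
	return Pod{Name: name, DeletionTimestamp: &now}
}

func main() {
	running := Pod{Name: "base-model-pod-a"}
	terminating := newTerminatingPod("base-model-pod-b")
	fmt.Println(running.Name, nextAction(&running))
	fmt.Println(terminating.Name, nextAction(&terminating))
}
```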

Collaborator Author

Jeffwan commented Sep 30, 2024

A new pod arrives and the pending adapter is scheduled, but the timestamp is not updated.

image

image

@Jeffwan Jeffwan changed the title from "Enhance the model adapter stability to production grade" to "Enhance the model adapter stability to alpha grade" Oct 1, 2024