Distributed Workloads v0.1.0 Release #910

anishasthana · 2023-08-09T22:19:48Z

This updates the Distributed Workloads manifests (CodeFlare and Ray) to the latest as part of the v0.1.0 release. Release notes can be found at https://github.com/opendatahub-io/distributed-workloads/releases/tag/v0.1.0.

cc @KPostOffice @Maxusmusti @MichaelClifford @tedhtchang @jbusche @LaVLaS

LaVLaS · 2023-08-10T03:39:03Z

Execution of the CI tests is most likely blocked due to an outage in openshift-ci.

anishasthana · 2023-08-10T13:49:07Z

/retest

anishasthana · 2023-08-10T16:11:01Z

/retest

jbusche · 2023-08-10T16:27:19Z

Testing looked good. I tried it using this kfdef applied to the 0.8.0 ODH operator:

apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: codeflare-stack
  namespace: opendatahub
spec:
  applications:
  # CodeFlare
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: codeflare-stack
    name: codeflare-stack
  # KubeRay
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: ray/operator
    name: ray-operator
  repos:
    # ODH Core component manifests
  - name: manifests
    #uri: https://github.com/opendatahub-io/odh-manifests/tarball/master
    uri: https://github.com/anishasthana/odh-manifests/tarball/dw_0.1.1

jbusche

LGTM

tedhtchang

I didn't actually run an e2e test, but the change looks good to me. One question: do we need to specify the new kuberay version somewhere?

      parameters:
      - name: odh-kuberay-operator-controller-image
        value: docker.io/kuberay/operator:v0.6.0

in
kfdef/codeflare-stack-kfdef.yaml
tests/resources/codeflare-stack/codeflare-stack-kfdef.yaml
kfdef/ray-minimal-kfdef.yaml

ray/operator/base/params.env

anishasthana · 2023-08-10T19:27:29Z

/rebase

dimakis

LGTM, tested with a batch job and checked all versions. Everything working as expected.

openshift-ci · 2023-08-11T09:33:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dimakis, jbusche
Once this PR has been reviewed and has the lgtm label, please ask for approval from anishasthana. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

anishasthana · 2023-08-11T13:39:47Z

/retest

openshift-ci · 2023-08-11T13:49:18Z

New changes are detected. LGTM label has been removed.

anishasthana · 2023-08-11T17:18:32Z

/retest

LaVLaS · 2023-08-11T17:21:57Z

@anishasthana Can you verify the test install script to ensure that the codeflare-operator install is in line with the distributed workloads feature set in this PR?

LaVLaS · 2023-08-15T04:09:12Z

@anishasthana Attached is the pod log from the failed jupyter notebook run during your test
odh-manifests-PR910-jupyter-nb-kube-3aadmin-0-jupyter-nb-kube-3aadmin.log

anishasthana · 2023-08-15T17:04:46Z

/retest

anishasthana · 2023-08-15T19:03:29Z

/retest

anishasthana · 2023-08-15T19:08:31Z

/retest

anishasthana · 2023-08-15T21:50:34Z

tests/resources/codeflare-stack/mnist_ray_mini.ipynb

@@ -50,7 +50,8 @@
   },
   "outputs": [],
   "source": [
-    "cluster.wait_ready()"
+    "cluster.wait_ready()\n",
+    "sleep(30)"


The reason for this is that the Ray dashboard takes a few seconds to become accessible, so there is a gap between wait_ready returning and job submission being usable. The short-term solution is simply to add a small sleep before submitting the job. The long-term solution (for the next codeflare release) will likely involve waiting longer in wait_ready for the dashboard to be accessible, or adding better messaging in this edge case.
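For illustration only, here is a minimal sketch of what "wait for the dashboard instead of sleeping a fixed 30 seconds" could look like in the notebook. It is not the codeflare-sdk implementation and not what this PR ships; `dashboard_is_up` and `wait_until` are hypothetical helper names, and the dashboard URL is assumed to be supplied by the caller.

```python
import time
import urllib.error
import urllib.request


def dashboard_is_up(url, request_timeout=5.0):
    """Return True if an HTTP GET on `url` answers with a 2xx status.

    Hypothetical helper: `url` is whatever address the Ray dashboard is
    exposed on in your cluster; nothing here is specific to codeflare.
    """
    try:
        with urllib.request.urlopen(url, timeout=request_timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def wait_until(check, timeout=120.0, interval=2.0):
    """Call `check()` every `interval` seconds until it returns True.

    Returns True as soon as the check passes, or False once `timeout`
    seconds have elapsed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook cell, `sleep(30)` could then be replaced by something like `wait_until(lambda: dashboard_is_up(dashboard_url))`, where `dashboard_url` is an assumed variable holding the dashboard address. Unlike a fixed sleep, this returns as soon as the dashboard answers and reports a clear failure if it never does.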

We will create a follow-on issue for the wait_ready condition. Relatedly, the notebook pods restart almost immediately and then end up in a crashloop at a different (earlier) point in the notebook. This makes debugging much harder, because the "useful" logs only show up in the first instance of the crashloop -- every other instance fails due to an unrelated (but not critical) issue.

Thanks @MichaelClifford @Maxusmusti

anishasthana · 2023-08-16T13:51:26Z

/retest

Signed-off-by: Anish Asthana <anishasthana1@gmail.com>

openshift-ci · 2023-08-17T20:36:20Z

@anishasthana: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/odh-manifests-e2e | `4fb8b89` | link | true | `/test odh-manifests-e2e` |
| ci/prow/411-odh-manifests-e2e | `4fb8b89` | link | true | `/test 411-odh-manifests-e2e` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

anishasthana · 2023-08-17T20:50:40Z

/retest

LaVLaS · 2023-08-17T21:21:23Z

Merging this since it appears to have resolved the Distributed Workloads test. The modelmesh test is still failing and will be fixed in #918, based on an offline conversation with @VedantMahabaleshwarkar.

openshift-ci bot requested review from jbusche and KPostOffice August 9, 2023 22:21

anishasthana mentioned this pull request Aug 9, 2023

First incubation of the dashboard with v2.14.0 #903

Merged

zdtsw mentioned this pull request Aug 10, 2023

Fix: missing kuberay imageparam variable #911

Closed

3 tasks

jbusche approved these changes Aug 10, 2023

openshift-ci bot assigned jbusche Aug 10, 2023

openshift-ci bot added the lgtm label Aug 10, 2023

tedhtchang reviewed Aug 10, 2023

ray/operator/base/params.env

jbusche mentioned this pull request Aug 10, 2023

Update demos for v0.6.1 project-codeflare/codeflare-sdk#296

Merged

4 tasks

dimakis approved these changes Aug 11, 2023

openshift-ci bot assigned dimakis Aug 11, 2023

anishasthana force-pushed the dw_0.1.1 branch from 995d54a to d5d1ff5 Compare August 11, 2023 13:49

openshift-ci bot removed the lgtm label Aug 11, 2023

anishasthana force-pushed the dw_0.1.1 branch from d5d1ff5 to 89db724 Compare August 11, 2023 22:00

anishasthana force-pushed the dw_0.1.1 branch 2 times, most recently from b7aa64c to f32317f Compare August 15, 2023 16:02

anishasthana force-pushed the dw_0.1.1 branch 2 times, most recently from 96b8e15 to 3fbef52 Compare August 15, 2023 21:46

anishasthana commented Aug 15, 2023

anishasthana force-pushed the dw_0.1.1 branch 2 times, most recently from 0bb1203 to e618b5b Compare August 17, 2023 02:01

zdtsw mentioned this pull request Aug 17, 2023

Update permission on operator level opendatahub-io/opendatahub-operator#447

Merged

3 tasks

anishasthana force-pushed the dw_0.1.1 branch 2 times, most recently from 3e4e845 to cf4b04b Compare August 17, 2023 17:38

Distributed Workloads v0.1.0 Release

4fb8b89

Signed-off-by: Anish Asthana <anishasthana1@gmail.com>

anishasthana force-pushed the dw_0.1.1 branch from cf4b04b to 4fb8b89 Compare August 17, 2023 17:52

LaVLaS mentioned this pull request Aug 17, 2023

Disable virtual host functionality of kserve #916

Merged

3 tasks

LaVLaS merged commit 8884e70 into opendatahub-io:master Aug 17, 2023

anishasthana deleted the dw_0.1.1 branch August 17, 2023 21:28
