This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

Distributed Workloads v0.1.0 Release #910

Merged
LaVLaS merged 1 commit into opendatahub-io:master from anishasthana:dw_0.1.1 on Aug 17, 2023

Conversation

anishasthana
Member

This updates the Distributed Workloads manifests (CodeFlare and Ray) to the latest as part of the v0.1.0 release. Release notes can be found at https://github.com/opendatahub-io/distributed-workloads/releases/tag/v0.1.0.

cc @KPostOffice @Maxusmusti @MichaelClifford @tedhtchang @jbusche @LaVLaS

@LaVLaS
Contributor

LaVLaS commented Aug 10, 2023

Execution of the CI tests is most likely blocked due to an outage in openshift-ci.

@anishasthana
Member Author

/retest

1 similar comment
@anishasthana
Member Author

/retest

@jbusche
Contributor

jbusche commented Aug 10, 2023

Testing looked good. I tried it using this KfDef applied to the 0.8.0 ODH operator:

apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: codeflare-stack
  namespace: opendatahub
spec:
  applications:
  # CodeFlare
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: codeflare-stack
    name: codeflare-stack
  # KubeRay
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: ray/operator
    name: ray-operator
  repos:
    # ODH Core component manifests
  - name: manifests
    #uri: https://github.com/opendatahub-io/odh-manifests/tarball/master
    uri: https://github.com/anishasthana/odh-manifests/tarball/dw_0.1.1

Contributor

@jbusche jbusche left a comment

LGTM

Member

@tedhtchang tedhtchang left a comment

I didn't actually run an e2e test, but the change looks good to me. One question: do we need to specify the new KubeRay version somewhere?

      parameters:
      - name: odh-kuberay-operator-controller-image
        value: docker.io/kuberay/operator:v0.6.0

in
kfdef/codeflare-stack-kfdef.yaml
tests/resources/codeflare-stack/codeflare-stack-kfdef.yaml
kfdef/ray-minimal-kfdef.yaml
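
One way to answer this kind of question is to scan the listed KfDef files for the parameter and print whatever value each one pins. The following is a minimal sketch, not part of this PR: the helper names `find_param_values` and `report` are hypothetical, and the scan is a plain-text regex match that assumes the `value:` line directly follows the `- name:` line, as in the snippet above.

```python
import re
from pathlib import Path

# Parameter name taken from the review comment above.
PARAM_NAME = "odh-kuberay-operator-controller-image"


def find_param_values(text: str, name: str = PARAM_NAME) -> list[str]:
    """Return every `value:` that directly follows `- name: <name>` in `text`."""
    pattern = re.compile(
        r"-\s*name:\s*" + re.escape(name) + r"\s*\n\s*value:\s*(\S+)"
    )
    return pattern.findall(text)


def report(paths: list[str]) -> dict[str, list[str]]:
    """Map each path that exists as a file to the parameter values found in it."""
    result = {}
    for p in paths:
        path = Path(p)
        if path.is_file():
            result[p] = find_param_values(path.read_text())
    return result


# Hypothetical usage from the repository root, using the files named above:
# report([
#     "kfdef/codeflare-stack-kfdef.yaml",
#     "tests/resources/codeflare-stack/codeflare-stack-kfdef.yaml",
#     "kfdef/ray-minimal-kfdef.yaml",
# ])
```

A file that maps to an empty list does not set the parameter at all, which is itself useful to know when deciding where the version should be specified.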

ray/operator/base/params.env Show resolved Hide resolved
@anishasthana
Member Author

/rebase

@dimakis dimakis left a comment

LGTM, tested with a batch job and checked all versions. Everything working as expected.

@openshift-ci

openshift-ci bot commented Aug 11, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dimakis, jbusche
Once this PR has been reviewed and has the lgtm label, please ask for approval from anishasthana. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@anishasthana
Member Author

/retest

@openshift-ci

openshift-ci bot commented Aug 11, 2023

New changes are detected. LGTM label has been removed.

@anishasthana
Member Author

/retest

@LaVLaS
Contributor

LaVLaS commented Aug 11, 2023

@anishasthana Can you verify the test install script to ensure that the codeflare-operator install is in line with the distributed workloads feature set in this PR?

@LaVLaS
Contributor

LaVLaS commented Aug 15, 2023

@anishasthana Attached is the pod log from the failed Jupyter notebook run during your test:
odh-manifests-PR910-jupyter-nb-kube-3aadmin-0-jupyter-nb-kube-3aadmin.log

@anishasthana anishasthana force-pushed the dw_0.1.1 branch 2 times, most recently from b7aa64c to f32317f Compare August 15, 2023 16:02
@anishasthana
Member Author

/retest

2 similar comments
@anishasthana
Member Author

/retest

@anishasthana
Member Author

/retest

@anishasthana anishasthana force-pushed the dw_0.1.1 branch 2 times, most recently from 96b8e15 to 3fbef52 Compare August 15, 2023 21:46
@@ -50,7 +50,8 @@
    },
    "outputs": [],
    "source": [
-    "cluster.wait_ready()"
+    "cluster.wait_ready()\n",
+    "sleep(30)"
Member Author

The reason for this change is that the Ray dashboard takes a few seconds to become accessible, so there is a gap between wait_ready returning and job submission being usable. The short-term solution is simply to add a small sleep before submitting the job. The long-term solution (for the next CodeFlare release) will likely involve either waiting longer in wait_ready until the dashboard is accessible or adding better messaging for this edge case.
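
As an illustration of the "wait until the dashboard is accessible" idea, a fixed sleep could be replaced by polling with a timeout. The following is only a sketch under stated assumptions, not the CodeFlare SDK's implementation: `dashboard_is_up` and `wait_for` are hypothetical helpers, and the dashboard URL is a placeholder that would have to come from the cluster's route.

```python
import time
import urllib.error
import urllib.request


def dashboard_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def wait_for(probe, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Call `probe()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage in the notebook (the URL is a placeholder):
# cluster.wait_ready()
# if not wait_for(lambda: dashboard_is_up("http://<ray-dashboard-route>")):
#     raise TimeoutError("Ray dashboard did not become accessible in time")
```

Compared with `sleep(30)`, polling returns as soon as the dashboard responds and fails with an explicit message when it never does, which would also help with the messaging concern.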

We will create a follow-on issue for the wait_ready condition. Relatedly, the notebook pods restart almost immediately and then end up in a crashloop at a different (earlier) point in the notebook. This makes debugging much harder, because the "useful" logs only show up in the first instance of the crashloop; every other instance fails due to an unrelated (but not critical) issue.

Thanks @MichaelClifford @Maxusmusti

@anishasthana
Member Author

/retest

Signed-off-by: Anish Asthana <anishasthana1@gmail.com>
@openshift-ci

openshift-ci bot commented Aug 17, 2023

@anishasthana: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/odh-manifests-e2e | 4fb8b89 | link | true | /test odh-manifests-e2e |
| ci/prow/411-odh-manifests-e2e | 4fb8b89 | link | true | /test 411-odh-manifests-e2e |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@anishasthana
Member Author

/retest

@LaVLaS
Contributor

LaVLaS commented Aug 17, 2023

Merging this since it appears to have resolved the Distributed Workloads test. The modelmesh test is failing and will be fixed in #918 based on an offline conversation by @VedantMahabaleshwarkar

@LaVLaS LaVLaS merged commit 8884e70 into opendatahub-io:master Aug 17, 2023
@anishasthana anishasthana deleted the dw_0.1.1 branch August 17, 2023 21:28