Execution of the CI tests is most likely blocked due to an outage in openshift-ci.
/retest
1 similar comment
/retest
Testing looked good. I tried it using this kfdef applied to the 0.8.0 ODH operator:
LGTM
I didn't actually run an e2e test. The change looks good to me. One question: do we need to specify the new kuberay version somewhere?
parameters:
  - name: odh-kuberay-operator-controller-image
    value: docker.io/kuberay/operator:v0.6.0

in:
kfdef/codeflare-stack-kfdef.yaml
tests/resources/codeflare-stack/codeflare-stack-kfdef.yaml
kfdef/ray-minimal-kfdef.yaml
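For context, a minimal sketch of where such an image parameter typically sits inside a KfDef manifest. Only the parameter name and image value are from this PR; the surrounding application/repoRef structure is an assumed example, not copied from the files above:

# Hypothetical KfDef excerpt for illustration only. The application name and
# repoRef path are assumptions about where such an override usually lives.
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: codeflare-stack
spec:
  applications:
    - name: ray-operator
      kustomizeConfig:
        parameters:
          - name: odh-kuberay-operator-controller-image
            value: docker.io/kuberay/operator:v0.6.0
        repoRef:
          name: manifests
          path: ray/operator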
/rebase
LGTM, tested with a batch job and checked all versions. Everything is working as expected.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: dimakis, jbusche. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
New changes are detected. LGTM label has been removed.
/retest
@anishasthana Can you verify the test install script to ensure that the codeflare-operator install is in line with the distributed workloads feature set in this PR?
@anishasthana Attached is the pod log from the failed Jupyter notebook run during your test.
Force-pushed from b7aa64c to f32317f
/retest
2 similar comments
/retest
/retest
Force-pushed from 96b8e15 to 3fbef52
@@ -50,7 +50,8 @@
     },
     "outputs": [],
     "source": [
-        "cluster.wait_ready()"
+        "cluster.wait_ready()\n",
+        "sleep(30)"
The reason for this is that the Ray dashboard takes a few seconds to become accessible, so there is a gap between wait_ready returning and job submission being usable. The short-term solution is to simply add a small sleep before submitting the job; the long-term solution (for the next codeflare release) will likely involve waiting longer in wait_ready for the dashboard to become accessible, or adding better messaging for this edge case.
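As a hedged sketch of that longer-term direction: instead of a fixed sleep(30), the notebook could poll the dashboard endpoint until it responds. The wait_for_dashboard helper and the placeholder URL below are hypothetical illustrations, not part of the CodeFlare SDK:

# Sketch only: poll the Ray dashboard until it answers HTTP requests,
# rather than relying on a fixed sleep. Helper and URL are hypothetical.
import time
import requests

def wait_for_dashboard(dashboard_url: str, timeout: int = 120, interval: int = 5) -> None:
    """Block until the Ray dashboard responds, or raise after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(dashboard_url, timeout=5).status_code == 200:
                return  # dashboard is serving requests; safe to submit jobs
        except requests.RequestException:
            pass  # dashboard not reachable yet; keep polling
        time.sleep(interval)
    raise TimeoutError(f"Ray dashboard at {dashboard_url} not ready after {timeout}s")

# Hypothetical usage in the notebook, replacing the fixed sleep:
# cluster.wait_ready()
# wait_for_dashboard("http://<ray-dashboard-route>")  # placeholder URL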
We will create a follow-on issue for the wait_ready condition. Relatedly, the notebook pods end up restarting almost immediately and then entering a crash loop at a different (earlier) point in the notebook. This makes it much harder to debug, as the "useful" logs only show up in the first instance of the crash loop; every other instance fails due to an unrelated (but non-critical) issue.
Thanks @MichaelClifford @Maxusmusti
/retest
Force-pushed from 0bb1203 to e618b5b
Force-pushed from 3e4e845 to cf4b04b
Signed-off-by: Anish Asthana <anishasthana1@gmail.com>
@anishasthana: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
Merging this since it appears to have resolved the Distributed Workloads test. The modelmesh test is failing and will be fixed in #918, based on an offline conversation with @VedantMahabaleshwarkar.
This updates the Distributed Workloads manifests (CodeFlare and Ray) to the latest versions as part of the v0.1.0 release. Release notes can be found at https://github.com/opendatahub-io/distributed-workloads/releases/tag/v0.1.0.
cc @KPostOffice @Maxusmusti @MichaelClifford @tedhtchang @jbusche @LaVLaS