Quickstart pods crash loop when using 0.9 #215

Closed
Xaenalt opened this issue Aug 23, 2022 · 4 comments · Fixed by #321
Labels
bug Something isn't working

Comments

@Xaenalt
Xaenalt commented Aug 23, 2022

In attempting to run the quickstart on my local OpenShift, I noticed both the etcd and minio pods crash loop on start. Logs attached

To Reproduce
Steps to reproduce the behavior:

  1. Clone modelmesh-serving
  2. Run scripts/install.sh --namespace modelmesh-serving --quickstart
  3. Observe pods

Expected behavior

Quickstart pods come up

Screenshots
etcd-crash.log
minio-crash.log

Environment (please complete the following information):

  • OS: OpenShift 4.11
  • Browser: N/A
  • Version: release-0.9 branch

Xaenalt added the bug label Aug 23, 2022
@njhill
Member

njhill commented Aug 25, 2022

@Xaenalt for etcd it looks like the same issue as discussed in #210, and from the log the minio issue looks similar. PRs to address these would be welcome!

@deleeuwblue

I was able to successfully run the quickstart script on OpenShift, using the workaround mentioned in #210, for both etcd and minio. However, changing the minio 'working directory' from /data1 to /tmp/data1 means that the default models (for pytorch, sklearn, tensorflow etc) in /data1 cannot be accessed. I also needed to manually create the default bucket 'modelmesh-example-models' before minio could write new objects.

For minio I think a better solution is to change the permissions on the original /data1 directory, so that OpenShift's arbitrarily assigned user ID has read & write access. According to the Red Hat docs, this should be done in the Dockerfile like this:

RUN chgrp -R 0 /some/directory && \
    chmod -R g=u /some/directory
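As a rough illustration of what the `chmod -R g=u` half of that recipe does (this is a sketch, not the project's Dockerfile), the script below applies it to a scratch directory. The directory and file names are hypothetical stand-ins for /data1, and the `chgrp -R 0` step is left out because it normally needs root.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory standing in for /data1 (an assumption for this demo).
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/data1"
touch "$demo_dir/data1/model.bin"

# Start with owner-only permissions, as files created by root in an image might have.
chmod 700 "$demo_dir/data1"
chmod 600 "$demo_dir/data1/model.bin"

# Copy the owner's permission bits to the group, recursively.
# In the Dockerfile this follows `chgrp -R 0`, which is skipped here (needs root).
chmod -R g=u "$demo_dir/data1"

# Print the resulting modes so the group bits can be inspected.
ls -ld "$demo_dir/data1" "$demo_dir/data1/model.bin"
```

Because OpenShift's arbitrarily assigned user ID is a member of the root group (GID 0), giving group 0 the owner's permissions is what lets that user read and write the directory.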

I don't see the Dockerfile for kserve/modelmesh-minio-examples:v0.9.0, is it available? If so, I would be happy to verify the change and create a PR.

For etcd, I suppose the quickstart script could be changed to use a data-dir of /tmp/etcd.data, or the image quay.io/coreos/etcd:v3.5.4 could be extended with modified permissions on the existing data-dir which is $HOME.
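To see why a data-dir under /tmp helps, a small diagnostic like the one below can report which directories the current user is able to write to. This is only a sketch: `check_writable` is a hypothetical helper, and the directories checked are examples rather than anything taken from the quickstart manifests.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: report whether the current user can write to a directory.
check_writable() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    echo "writable: $1"
  else
    echo "not writable: $1"
  fi
}

# Show the effective UID and group memberships of the current user.
id

# Example locations: /tmp, and $HOME (falling back to / if HOME is unset).
check_writable /tmp
check_writable "${HOME:-/}"
```

If the container image ships a shell, something similar could be run inside the crashing container's image to compare $HOME with /tmp under the UID that OpenShift assigns.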

@njhill
Member

njhill commented Jan 11, 2023

@deleeuwblue thanks for this. When I responded earlier in this issue, I had forgotten that the quickstart uses a custom minio image with some sample models; I was thinking more about the case where the vanilla minio dockerhub image is used.

Given that we do publish this minio image, you're absolutely right that it would make sense to fix the permissions in it to work as-is on OpenShift.

> I don't see the Dockerfile for kserve/modelmesh-minio-examples:v0.9.0, is it available? If so, I would be happy to verify the change and create a PR.

Thank you for the offer! Coincidentally, we discovered recently that this Dockerfile hadn't been included in the repo. We are in the process of adding it now, and are just trying to figure out the best place to store the corresponding example model files.

We're also in the middle of doing a 0.10.0 release, not sure yet whether this fix will make it into that but hopefully we can get it done today. Otherwise you may still need to tweak the install to point to a newer minio examples image.

> For etcd, I suppose the quickstart script could be changed to use a data-dir of /tmp/etcd.data, or the image quay.io/coreos/etcd:v3.5.4 could be extended with modified permissions on the existing data-dir which is $HOME.

Also a good suggestion. We don't build our own etcd image but we could modify the quickstart manifest to do the former.

cc @ckadner @tedhtchang

@ckadner
Member

ckadner commented Jan 25, 2023

> > I don't see the Dockerfile for kserve/modelmesh-minio-examples:v0.9.0, is it available? If so, I would be happy to verify the change and create a PR.
>
> Thank you for the offer! Coincidentally we discovered recently that this Dockerfile hadn't been included in the repo, we are in the process of adding it now, just trying to figure out the best place to store the corresponding example model files.
>
> We're also in the middle of doing a 0.10.0 release, not sure yet whether this fix will make it into that but hopefully we can get it done today. Otherwise you may still need to tweak the install to point to a newer minio examples image.

@Xaenalt -- we do have a new repo for the kserve/modelmesh-minio-examples along with the modified Dockerfile now.

@njhill changed the /data1 access permissions, and the image now runs as non-root user 1000 (modelmesh).

The new minio examples image is part of our v0.10.0 release but you can pull it directly (and probably swap it out on your v0.9.0 deployment) from here:

https://hub.docker.com/r/kserve/modelmesh-minio-dev-examples/tags

ckadner added a commit to ckadner/modelmesh-serving that referenced this issue Jan 26, 2023
Resolves kserve#210
Resolves kserve#215

Signed-off-by: Christian Kadner <ckadner@us.ibm.com>
kserve-oss-bot pushed a commit that referenced this issue Feb 2, 2023
#### Motivation

Addressing file access permission issues on OpenShift reported in issues #210 and #215 for the `etcd` deployment.

#### Modifications

Adding `data-dir` parameter to the container `args`:

```
- --data-dir
- /tmp/etcd.data
```


#### Result

The `etcd` pod comes up fine and spot-testing a few basic use cases went fine. I tested on IBM Cloud Kubernetes 1.24 and OCP 4.10

---

Resolves #210
Resolves #215

Signed-off-by: Christian Kadner <ckadner@us.ibm.com>
njhill pushed a commit that referenced this issue Feb 2, 2023
(cherry picked from commit f2a4a30)