Release 1.7 - TFX taxi cab example failing the deploy step #692
Comments
Is this a recurring error, or does it happen only rarely? I think I've come across this error in the past.
Three runs failed consistently. Kicking off a fourth run.
Quick update: the fourth run also failed.
I think this is what's happening: when you have multiple runs of the deployer step, the deployments are not unique. There is only one deployment (model-server-v1). So when the step tries to fetch the TFServing pod corresponding to your run, it lists all pods, including those created by previous runs. It then picks the alphanumerically first pod, and since that is not the pod corresponding to the current run, the step fails.
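To illustrate the failure mode described above, here is a minimal sketch in plain Python (no Kubernetes client involved). The pod names, the run-specific deployment names, and both helper functions are hypothetical and only model the selection logic; they are not the deployer's actual code.

```python
def pick_first(pod_names):
    """Model of the suspected behavior: take the alphanumerically first pod."""
    return sorted(pod_names)[0]


def pick_for_deployment(pod_names, deployment_name):
    """Only consider pods whose name starts with the given deployment name."""
    matches = [n for n in pod_names if n.startswith(deployment_name + "-")]
    return sorted(matches)[0] if matches else None


# Hypothetical pods when every run shares the deployment "model-server-v1".
shared = ["model-server-v1-zzzzz",   # pod of the current run (assumed)
          "model-server-v1-aaaaa"]   # leftover pod from an earlier run (assumed)
print(pick_first(shared))            # picks the leftover pod, not the current one

# Hypothetical pods when each run gets its own deployment name.
unique = ["model-server-run-a-xxxxx", "model-server-run-b-yyyyy"]
print(pick_for_deployment(unique, "model-server-run-b"))
```

With a shared deployment name, no prefix can tell the runs apart, which is why a per-run deployment name matters.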
A temporary fix is to delete the deployment.
In fact, the deployer already uses a configurable name for the deployment, and the examples generate that name from {{workflow.name}}.
Tried this. I passed the workflow name as a parameter using the --server-name flag, and the failure no longer occurs.
Found the bug: the deployer component truncates the deployment name to 64 bytes, which removes the distinct part of the workflow name and thus causes a naming collision.
Solved in #704.
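The collision can be reproduced with a small Python sketch. The base name and the two suffixes below are made-up examples, and `truncate_keep_suffix` is only one possible way to keep names distinct; it is an assumption for illustration and not necessarily what #704 does.

```python
MAX_NAME_LEN = 64  # limit mentioned in this thread


def truncate_name(name, limit=MAX_NAME_LEN):
    """Plain truncation; the names here are ASCII, so bytes equal characters."""
    return name[:limit]


def truncate_keep_suffix(name, suffix_len=5, limit=MAX_NAME_LEN):
    """Hypothetical alternative: shorten the head but keep the unique tail."""
    if len(name) <= limit:
        return name
    suffix = name[-suffix_len:]
    head = name[: limit - suffix_len - 1].rstrip("-")
    return head + "-" + suffix


# Hypothetical long prefix followed by a per-run workflow suffix.
base = "taxi-cab-classification-model-tfx-taxi-cab-classification-pipeline-example"
run_a = base + "-5lxrp"
run_b = base + "-67xc9"

# Both names differ only after the 64th character, so truncation collides.
print(truncate_name(run_a) == truncate_name(run_b))
# Keeping the suffix preserves the distinction within the same limit.
print(truncate_keep_suffix(run_a), truncate_keep_suffix(run_b))
```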
On a fresh deployment, the taxi cab TFX example fails at the deploy step with:
++ kubectl get po taxi-cab-classification-model-tfx-taxi-cab-classification-5lxrp taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9 --namespace kubeflow -o 'jsonpath={.status.containerStatuses[0].state.running}'
Error from server (NotFound): pods "taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9" not found
++ date +%s
++ expr 1547611953 + 1 - 1547610952
timeout