Prometheus support in TF Job #988
Comments
Thanks @johnugeorge. Is anyone planning on picking this up in the next 2 weeks? |
Yes. This will be in 0.6 release. @krishnadurai is working on it |
Hi, this is very useful for production, and here are some requirements about this feature:
|
@ScorpioCPH thanks for posting these requirements. I'll include these in the PR. |
Metrics being targeted to track using Prometheus:
|
Thanks for the summary. Will your PR be ready after 0.6 release? |
@gaocegege It is planned for upcoming v1 release |
* Adds basic monitoring through prometheus setup
  Instructions for deploying prometheus-operator with Helm
  Sample queries for getting monitoring metrics for:
  - CPU
  - Memory
  - Network I/O
  Issue #988
* Adds prometheus metrics for:
  - Is Leader for tf-operator pod
  - Documents keep-alive check
  Issue #988
* Adds new metrics:
  - Jobs created
  - Jobs deleted
  - Jobs successful
  - Reference to container based GPU metrics through cAdvisor
  Updates packages for promauto and removes unnecessary ones
* Accepts monitoring port through CLI
  Code is now go formatted
  README modified to remove installation instructions
  Review comments addressed
  Issue #988
* Restoring unintended changes
* Adds job failure metric
  tf-master changed to tf-chief
  Issue #988
* Adds job restart metric
* Adds missed failure condition
This is now done. |
@richardsliu: Closing this issue. In response to this:
Reference: kubeflow/common#22
Since the common repo is not ready yet, Prometheus support will be added to TF/PyTorch first, and the code will later be moved to the common repo.
/assign @johnugeorge
/cc @richardsliu