Prometheus support in TF Job #988

johnugeorge · 2019-05-09T17:51:42Z

Since the common repo is not ready yet, Prometheus support will be added to TF/PyTorch first, and the code will be moved to the common repo later.

/assign @johnugeorge
/cc @richardsliu

issue-label-bot · 2019-05-09T17:51:45Z

Issue-Label Bot is automatically applying the label improvement/enhancement to this issue, with a confidence of 0.74. Please mark this comment with 👍 or 👎 to give our bot feedback!


jlewi · 2019-05-13T01:20:51Z

Thanks @johnugeorge. Is anyone planning on picking this up in the next 2 weeks?

johnugeorge · 2019-05-13T04:08:41Z

Yes. This will be in the 0.6 release. @krishnadurai is working on it.

ScorpioCPH · 2019-05-23T08:14:30Z

Hi, this is very useful in production. Here are some requirements for this feature:

- Export leader info: report 1 if this instance is the leader and 0 if it is not.
- Report TFJob object creation and deletion info, such as how many objects are created per second.
- Report CPU and memory usage.

krishnadurai · 2019-05-24T06:35:28Z

@ScorpioCPH thanks for posting these requirements. I'll include these in the PR.

krishnadurai · 2019-05-25T11:18:28Z

gaocegege · 2019-05-26T04:33:59Z

@krishnadurai

Thanks for the summary. Will your PR be ready after 0.6 release?

johnugeorge · 2019-05-26T06:58:29Z

@gaocegege It is planned for the upcoming v1 release.

* Adds basic monitoring through prometheus setup
  Instructions for deploying prometheus-operator with Helm
  Sample queries for getting monitoring metrics for:
  - CPU
  - Memory
  - Network I/O
  Issue #988
* Adds prometheus metrics for:
  - Is Leader for tf-operator pod
  - Documents keep-alive check
  Issue #988
* Adds new metrics:
  - Jobs created
  - Jobs deleted
  - Jobs successful
  - Reference to container based GPU metrics through cAdvisor
  Updates packages for promauto and removes unnecessary ones
* Accepts monitoring port through CLI
  Code is now go formatted
  README modified to remove installation instructions
  Review comments addressed
  Issue #988
* Restoring unintended changes
* Adds job failure metric
  tf-master changed to tf-chief
  Issue #988
* Adds job restart metric
* Adds missed failure condition

richardsliu · 2019-06-03T16:46:40Z

This is now done.
/close

k8s-ci-robot · 2019-06-03T16:46:42Z

@richardsliu: Closing this issue.

In response to this:

This is now done.
/close


k8s-ci-robot assigned johnugeorge May 9, 2019

issue-label-bot bot added the improvement/enhancement label May 9, 2019

richardsliu added the area/0.6.0 label May 9, 2019

richardsliu mentioned this issue May 9, 2019

TFJob 1.0 #968

Closed

4 tasks

jlewi added the priority/p1 label May 13, 2019

krishnadurai mentioned this issue May 28, 2019

Prometheus Monitoring for TF Operator #1018

Merged

12 tasks

krishnadurai added a commit to krishnadurai/tf-operator that referenced this issue May 28, 2019

Adds prometheus metrics for:

9defad7

- Is Leader for tf-operator pod
- Documents keep-alive check
Issue kubeflow#988

krishnadurai added a commit to krishnadurai/tf-operator that referenced this issue May 31, 2019

Accepts monitoring port through CLI

98e4161

Code is now go formatted
README modified to remove installation instructions
Review comments addressed
Issue kubeflow#988

krishnadurai mentioned this issue May 31, 2019

Adds prometheus annotation and port configurations for tf-operator kubeflow/kubeflow#3378

Merged

krishnadurai added a commit to krishnadurai/tf-operator that referenced this issue Jun 1, 2019

Adds job failure metric

93efa92

tf-master changed to tf-chief
Issue kubeflow#988

k8s-ci-robot closed this as completed Jun 3, 2019
