Prometheus support in TF Job #988

Closed
johnugeorge opened this issue May 9, 2019 · 10 comments

Comments

@johnugeorge
Member

johnugeorge commented May 9, 2019

Reference: kubeflow/common#22

Since the common repo is not ready yet, Prometheus support will be added to the TF and PyTorch operators first, and the code will be moved to the common repo later.

/assign @johnugeorge
/cc @richardsliu

@issue-label-bot

Issue-Label Bot is automatically applying the label improvement/enhancement to this issue, with a confidence of 0.74. Please mark this comment with 👍 or 👎 to give our bot feedback!

@jlewi
Contributor

jlewi commented May 13, 2019

Thanks @johnugeorge. Is anyone planning on picking this up in the next 2 weeks?

@johnugeorge
Member Author

johnugeorge commented May 13, 2019

Yes. This will be in the 0.6 release. @krishnadurai is working on it.

@ScorpioCPH
Member

Hi, this is very useful for production. Here are some requirements for this feature:

  • Export leader info: report 1 if this instance is the leader and 0 if it is not.
  • Report TFJob object creation and deletion info, such as how many objects are created per second.
  • Report CPU and memory usage.
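
For illustration, the leader requirement above can be sketched as a gauge rendered in the Prometheus text exposition format. This is a minimal, standard-library-only sketch and not the operator's code: the `metric` type, the `render` and `leaderValue` helpers, and the metric name `tf_operator_is_leader` are all hypothetical. A real operator would more likely use the Prometheus Go client library.

```go
package main

import (
	"fmt"
	"strings"
)

// metric is a hypothetical, minimal stand-in for a Prometheus metric.
type metric struct {
	name  string
	help  string
	kind  string // "gauge" or "counter"
	value float64
}

// render writes one metric in the Prometheus text exposition format:
// a HELP line, a TYPE line, then the sample itself.
func render(m metric) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", m.name, m.help)
	fmt.Fprintf(&b, "# TYPE %s %s\n", m.name, m.kind)
	fmt.Fprintf(&b, "%s %g\n", m.name, m.value)
	return b.String()
}

// leaderValue maps leadership to the 1/0 convention requested above.
func leaderValue(isLeader bool) float64 {
	if isLeader {
		return 1
	}
	return 0
}

func main() {
	// Hypothetical metric name, chosen only for this example.
	leader := metric{
		name:  "tf_operator_is_leader",
		help:  "1 if this operator instance is the leader, 0 otherwise.",
		kind:  "gauge",
		value: leaderValue(true),
	}
	fmt.Print(render(leader))
}
```

A gauge fits here because leadership can flip in both directions, whereas creation and deletion counts only ever grow and are better modelled as counters.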

@krishnadurai
Contributor

@ScorpioCPH thanks for posting these requirements. I'll include these in the PR.

@krishnadurai
Contributor

krishnadurai commented May 25, 2019

Metrics targeted for tracking with Prometheus:

  1. For each pod (tf-operator, tf-master, tf-ps and tf-worker), report on:
     • CPU usage
     • GPU usage
     • Memory usage
     • Network usage
     • I/O usage
     • Keep-Alive check
     • Is Leader check
  2. TFJob metrics:
     • Job creation
     • Job deletion
     • Jobs created per hour
     • Successful job completions
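
As a rough sketch of the TFJob metrics listed above: creation, deletion and success are naturally monotonic counters, and "jobs created per hour" can be derived from two samples of the creation counter. The `counter` type and the `perHour` helper below are hypothetical and use only the Go standard library. With a Prometheus server in place, this rate would normally be computed at query time (for example with the `increase()` function over a one-hour range) rather than inside the operator.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// counter is a hypothetical monotonically increasing count,
// safe for concurrent use.
type counter struct{ n uint64 }

func (c *counter) Inc()          { atomic.AddUint64(&c.n, 1) }
func (c *counter) Value() uint64 { return atomic.LoadUint64(&c.n) }

// perHour estimates an hourly rate from two counter samples taken
// `elapsed` apart. It returns 0 if the interval is not positive or the
// counter appears to have gone backwards (for example after a restart).
func perHour(before, after uint64, elapsed time.Duration) float64 {
	if elapsed <= 0 || after < before {
		return 0
	}
	return float64(after-before) / elapsed.Hours()
}

func main() {
	var created, succeeded counter

	start := created.Value()
	for i := 0; i < 6; i++ {
		created.Inc() // one increment per TFJob creation event
	}
	succeeded.Inc() // one increment per successful completion

	fmt.Printf("created=%d succeeded=%d\n", created.Value(), succeeded.Value())
	// Pretend the six creations above happened over 30 minutes.
	fmt.Printf("jobs created per hour: %.1f\n",
		perHour(start, created.Value(), 30*time.Minute))
}
```

Exposing only the raw counters and leaving rates to the query layer keeps the operator simple and lets users choose their own time windows.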

@gaocegege
Member

@krishnadurai

Thanks for the summary. Will your PR be ready after 0.6 release?

@johnugeorge
Member Author

@gaocegege It is planned for the upcoming v1 release.

krishnadurai added a commit to krishnadurai/tf-operator that referenced this issue May 28, 2019
Instructions for deploying prometheus-operator with Helm
Sample queries for getting monitoring metrics for:
CPU
Memory
Network
I/O

Issue kubeflow#988
krishnadurai added a commit to krishnadurai/tf-operator that referenced this issue May 28, 2019
- Is Leader for tf-operator pod
- Documents keep-alive check

Issue kubeflow#988
krishnadurai added a commit to krishnadurai/tf-operator that referenced this issue May 31, 2019
Code is now go formatted
README modified to remove installation instructions
Review comments addressed
Issue kubeflow#988
krishnadurai added a commit to krishnadurai/tf-operator that referenced this issue Jun 1, 2019
tf-master changed to tf-chief

Issue kubeflow#988
k8s-ci-robot pushed a commit that referenced this issue Jun 3, 2019
* Adds basic monitoring through prometheus setup
Instructions for deploying prometheus-operator with Helm
Sample queries for getting monitoring metrics for:
CPU
Memory
Network
I/O

Issue #988

* Adds prometheus metrics for:
- Is Leader for tf-operator pod
- Documents keep-alive check

Issue #988

* Adds new metrics:
- Jobs created
- Jobs deleted
- Jobs successful
- Reference to container based GPU metrics through cAdvisor

Updates packages for promauto and removes unnecessary ones

* Accepts monitoring port through CLI
Code is now go formatted
README modified to remove installation instructions
Review comments addressed
Issue #988

* Restoring unintended changes

* Adds job failure metric
tf-master changed to tf-chief

Issue #988

* Adds job restart metric

* Adds missed failure condition
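
The "Accepts monitoring port through CLI" item in the commit above can be illustrated with a short sketch based on Go's standard `flag` package. The flag name `monitoring-port`, its default of 8443, and the `parsePort` helper are assumptions made for this example and are not taken from the commit itself.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// parsePort reads a hypothetical -monitoring-port flag from args and
// validates that it is a usable TCP port.
func parsePort(args []string) (int, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text quiet; errors are returned instead
	// Flag name and default value are assumptions for this sketch.
	port := fs.Int("monitoring-port", 8443, "port on which to expose /metrics")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *port < 1 || *port > 65535 {
		return 0, fmt.Errorf("monitoring port %d is out of range", *port)
	}
	return *port, nil
}

func main() {
	port, err := parsePort(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	// A real operator would start an HTTP server here that serves
	// the metrics endpoint on this port.
	fmt.Printf("metrics would be served on port %d at /metrics\n", port)
}
```

Using a dedicated `FlagSet` instead of the global flag state keeps the parsing function easy to test with arbitrary argument lists.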
@richardsliu
Contributor

This is now done.
/close

@k8s-ci-robot

@richardsliu: Closing this issue.

