Deploy monitoring-stack for prow #12854

Closed
23 tasks done
hongkailiu opened this issue Jun 3, 2019 · 24 comments
Labels
area/metrics area/prow Issues or PRs related to prow kind/feature Categorizes issue or PR as related to a new feature. sig/testing Categorizes an issue or PR as relevant to SIG Testing.
Milestone

Comments

@hongkailiu
Contributor

hongkailiu commented Jun 3, 2019

What would you like to be added:
Add a monitoring stack (Prometheus, Alertmanager, Grafana) for prow.k8s.io.

Why is this needed:

  • Scrape Prow components with Prometheus to collect the data that their existing metrics endpoints already expose.
  • Use Grafana dashboards to watch how Prow components are running on the Kubernetes cluster, and fire alerts through Alertmanager when necessary, e.g. when a service is down or another alert condition is met.

Steps:

  • Check prerequisites: k8s version >=1.8.0
  • Deploy CRDs: Prometheus, ServiceMonitor, PrometheusRule, Alertmanager
  • Deploy prometheus operator and required RBAC
  • Deploy custom object: Prometheus
  • Deploy grafana
  • Expose the ingress via the vanity URL
  • Deploy dashboards
    • use mixins: blocked
    • fix the repo name in tide board
    • add UID into ghproxy board
  • Add services for prow components that do not have /metrics services yet
  • Update config.yaml to enable serving /metrics
  • Add ServiceMonitors for all prow-components
  • Deploy ServiceMonitors for prometheus and grafana
  • grafana admin access issue: replicas=1
  • https only on grafana ingress: nginx is the only promising solution
  • Link in the bazel targets to apply all the monitoring objects with the normal Prow release target for postsubmit: trails; slack-discussion
  • Deploy ServiceMonitor for alertmanager
  • Deploy custom object: Alertmanager
  • Add more alerts
  • Test firing alerts with AlertManager
  • Add readme to do the port-forwarding/proxy for alertmanager and prometheus to make them simpler
  • Decide if we want to HA grafana (view-only) and dev.monitoring.prow.k8s.io (admin): @Katharine and @stevekuznetsov are for this: We leave this as future work.
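
For the ServiceMonitor steps in the checklist above, here is a minimal sketch of what one such object might look like. It is generated from Python so the JSON can be piped to `kubectl apply -f -`. The namespaces (`prow-monitoring`, `default`), the `app` label, the port name `metrics` and the `30s` interval are assumptions for illustration, not values taken from the test-infra repository.

```python
import json


def service_monitor(component, namespace="prow-monitoring",
                    target_namespace="default", port="metrics",
                    interval="30s"):
    """Build a ServiceMonitor manifest (a prometheus-operator CRD) as a dict.

    All default values are hypothetical placeholders.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": component,
            "namespace": namespace,
            "labels": {"app": component},
        },
        "spec": {
            # Select the Service that fronts the component's /metrics endpoint.
            "selector": {"matchLabels": {"app": component}},
            # Look for that Service in the namespace where the component runs.
            "namespaceSelector": {"matchNames": [target_namespace]},
            # "port" refers to a named port on the Service, not a number.
            "endpoints": [{"port": port, "interval": interval}],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML.
    print(json.dumps(service_monitor("tide"), indent=2))
```

One ServiceMonitor per component keeps the scrape configuration next to the Service it targets. The Prometheus custom object then has to select these ServiceMonitors by label for them to take effect.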
@hongkailiu hongkailiu added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 3, 2019
@stevekuznetsov
Contributor

/cc @cjwagner @fejta @krzyzacy

@stevekuznetsov stevekuznetsov added this to the v1.15 milestone Jun 3, 2019
@stevekuznetsov stevekuznetsov added area/metrics area/prow Issues or PRs related to prow labels Jun 3, 2019
hongkailiu added a commit to hongkailiu/test-infra that referenced this issue Jun 4, 2019
This commit initiates the work for kubernetes#12854
@hongkailiu
Contributor Author

tony-yang [7:04 PM]
1.13.5-gke.10

hongkailiu added a commit to hongkailiu/test-infra that referenced this issue Jun 6, 2019
This commit initiates the work for kubernetes#12854
@stevekuznetsov
Contributor

Thought: add a bazel target that does the port-forwarding/proxying for alertmanager and prometheus, to make them easier to reach

@stevekuznetsov
Contributor

stevekuznetsov commented Jun 10, 2019

As well: link in the bazel targets to apply all the monitoring objects with the normal Prow release target for postsubmit (bazel run //prow/cluster/production.apply)

@hongkailiu
Contributor Author

Also need to find a convenient way to test queries:

  • ingress for prometheus
  • edit access (to a restricted set of people) to grafana (This might be preferable given the circumstance ^_^)
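
Until one of the options above exists, queries could be tested against the Prometheus HTTP API through a local port-forward. The sketch below assumes Prometheus is reachable at `http://localhost:9090`; that address and the helper names are hypothetical. It uses the instant-query endpoint `/api/v1/query`.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Return the instant-query URL of the Prometheus HTTP API for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def run_query(base_url, promql, timeout=10):
    """Run an instant query and return the "result" list from the response."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError("query failed: %r" % (body,))
    return body["data"]["result"]


if __name__ == "__main__":
    # Only prints the URL; call run_query() once a port-forward is in place.
    print(query_url("http://localhost:9090", "up == 0"))
```

With a port-forward running, `run_query("http://localhost:9090", "up == 0")` would list the targets that Prometheus currently sees as down.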

@hongkailiu
Contributor Author

hongkailiu commented Jun 18, 2019

monitoring.prow.k8s.io is not served over HTTPS yet.

@cjwagner
Member

SSL is working now, but is still optional. We need to force SSL redirection for HTTP requests to make it safe to use basic auth, but unfortunately this isn't possible to configure with ingress-gce: kubernetes/ingress-gce#51

I think we'll need to add an nginx proxy solely for this purpose like we used to have to do for Prow before we added SSL redirection to deck. Grafana doesn't appear to have any SSL redirect or HSTS options.
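
To illustrate the behaviour such a proxy would need, here is a minimal sketch of an HTTP listener that does nothing but redirect to HTTPS. It is not the nginx configuration itself, and the port `8080` and all names are assumptions. It answers every plain-HTTP request with a 301 pointing at the same host and path over HTTPS.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def https_location(host, path):
    """Return the HTTPS URL that a plain-HTTP request should be redirected to."""
    # Drop any explicit HTTP port; IPv6 literals are not handled in this sketch.
    hostname = host.split(":", 1)[0]
    return "https://" + hostname + path


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET/HEAD request with a permanent redirect to HTTPS."""

    def _redirect(self):
        host = self.headers.get("Host", "")
        if not host:
            # Without a Host header there is no target to redirect to.
            self.send_error(400, "missing Host header")
            return
        self.send_response(301)
        self.send_header("Location", https_location(host, self.path))
        self.end_headers()

    do_GET = _redirect
    do_HEAD = _redirect


def serve(port=8080):
    """Serve redirects forever on the given (hypothetical) port."""
    HTTPServer(("", port), RedirectHandler).serve_forever()
```

The redirect alone does not stop a client from sending credentials over HTTP on its first request. That is why an HSTS header on the HTTPS side is usually paired with it.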

@hongkailiu
Contributor Author

SSL is working now, but is still optional.

Would the annotation kubernetes.io/ingress.allow-http: "false" help, as described in 51#issuecomment-335960782?
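
For reference, a small sketch of how that annotation would be set on an Ingress manifest. The manifest is handled as a plain dict and the Ingress name `grafana` in the usage note is hypothetical. Only the annotation key and value come from the comment above.

```python
import copy

# Annotation key quoted in the comment above.
ALLOW_HTTP = "kubernetes.io/ingress.allow-http"


def disable_http(ingress):
    """Return a copy of an Ingress manifest with the allow-http annotation set to "false"."""
    patched = copy.deepcopy(ingress)
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    # Annotation values must be strings, hence "false" and not False.
    annotations[ALLOW_HTTP] = "false"
    return patched
```

For example, `disable_http({"metadata": {"name": "grafana"}})` returns a manifest whose `metadata.annotations` contains the key. As far as the ingress-gce behaviour is understood here, this turns the HTTP listener off rather than redirecting to HTTPS, so plain-HTTP requests would fail instead of being upgraded.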

@hongkailiu
Contributor Author

@cjwagner
I would like to make another attempt at selling the idea of mixins (or jsonnet) for the monitoring stack.

  • reusable grafana dashboards: this matters more when deploy/grafana has more than one replica (for HA). Provisioning dashboards via json files ensures a consistent view across grafana instances, compared to editing via the UI. We could still edit in the UI, but only for debugging purposes. Once we decide to include a new panel or board, we generate the json file and provision it to all instances.
  • it is fairly popular in the community, e.g., kubernetes-mixin. It eases code review too (to some extent ^_^ @stevekuznetsov can tell more on this point).
  • dependency management: the underlying libs (e.g., grafana/grafonnet-lib) will be maintained by grafana gurus, and the upgrade procedure should be smoother when we decide to change the version of grafana in the future.

Let me know your decision.
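
To make the "generate the json file and provision it" idea concrete, here is a rough sketch in Python rather than jsonnet. It builds a dashboard in Grafana's JSON model with a fixed `uid`, so every replica serves the same board. The dashboard title, the datasource name `prometheus` and the query `up{job="tide"}` are assumptions for illustration.

```python
import json


def graph_panel(panel_id, title, expr, datasource="prometheus"):
    """Build one graph panel with a single Prometheus query."""
    return {
        "id": panel_id,
        "type": "graph",
        "title": title,
        "datasource": datasource,
        "targets": [{"expr": expr, "refId": "A"}],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
    }


def dashboard(uid, title, panels):
    """Build a dashboard; a stable uid keeps its URL identical across instances."""
    return {
        "uid": uid,
        "title": title,
        "editable": False,
        "time": {"from": "now-6h", "to": "now"},
        "panels": panels,
    }


def render(board):
    """Serialize with sorted keys so regenerated files produce small review diffs."""
    return json.dumps(board, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(dashboard("tide", "Tide", [graph_panel(1, "Tide up", 'up{job="tide"}')])))
```

The generated file would then be mounted into each Grafana pod through a provisioning directory. UI edits stay throwaway and the reviewed source remains the single definition.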

@spiffxp
Member

spiffxp commented Jul 9, 2019

/milestone v1.16
checklist above is still incomplete, I'm assuming we have more to do here

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.15, v1.16 Jul 9, 2019
@spiffxp
Member

spiffxp commented Jul 9, 2019

/sig testing

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jul 9, 2019
@hongkailiu
Contributor Author

@spiffxp Yes. We are working on this issue.

@spiffxp
Member

spiffxp commented Jul 10, 2019

Currently available at https://monitoring.prow.k8s.io/

@stevekuznetsov
Contributor

All we have left is the AlertManager deployment, then we're done!

@stevekuznetsov
Contributor

stevekuznetsov commented Jul 10, 2019

@hongkailiu
Contributor Author

Created kubernetes/community#3902, requesting slack incoming webhook URL.

@hongkailiu
Contributor Author

@stevekuznetsov
Contributor

/woof

@k8s-ci-robot
Contributor

@stevekuznetsov: dog image

In response to this:

/woof

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krzyzacy
Member

/meowvie sunglass

@k8s-ci-robot
Contributor

@krzyzacy: cat image

In response to this:

/meowvie sunglass


7 participants
@spiffxp @krzyzacy @hongkailiu @cjwagner @stevekuznetsov @k8s-ci-robot and others