
Expose workqueue stats #75

Merged · 1 commit · Sep 17, 2018

Conversation

rlguarino
Contributor

This CL adds an http endpoint with one handler "/metrics" that serves
Prometheus metrics exported by MetaController. The metrics are collected
using OpenCensus and exported using the Prometheus HTTP exposition
format.

This CL also configures the Kubernetes workqueue package to collect its
statistics and expose them via the Prometheus http endpoint. We collect
and expose all of the workqueue metrics (Depth, Adds, Latency,
WorkDuration, Retries) tagged with the name of the queue.
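
Roughly, the wiring described above looks like the sketch below. It is illustrative only: the import paths are approximate for the era, and `ocworkqueue.Views()` / `ocworkqueue.MetricsProvider()` are assumed names for whatever the ocworkqueue package exports; the exporter, `view.Register`, and `workqueue.SetProvider` calls are the standard OpenCensus and client-go APIs.

```go
package main

import (
	"net/http"

	"github.com/0xRLG/ocworkqueue"
	"github.com/golang/glog"
	"go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats/view"
	"k8s.io/client-go/util/workqueue"
)

// newMetricsMux wires up the /metrics handler described above.
func newMetricsMux() *http.ServeMux {
	// The OpenCensus Prometheus exporter is also an http.Handler that serves
	// the Prometheus HTTP exposition format.
	exporter, err := prometheus.NewExporter(prometheus.Options{})
	if err != nil {
		glog.Fatalf("cannot create Prometheus exporter: %v", err)
	}
	view.RegisterExporter(exporter)

	// Register the workqueue views (Depth, Adds, Latency, WorkDuration,
	// Retries) and point client-go's workqueue at the OpenCensus provider,
	// which tags each measurement with the queue name.
	// ocworkqueue.Views() and ocworkqueue.MetricsProvider() are assumed names.
	if err := view.Register(ocworkqueue.Views()...); err != nil {
		glog.Fatalf("cannot register workqueue views: %v", err)
	}
	workqueue.SetProvider(ocworkqueue.MetricsProvider())

	mux := http.NewServeMux()
	mux.Handle("/metrics", exporter)
	return mux
}
```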

main.go Outdated
mux := http.NewServeMux()
mux.Handle("/metrics", exporter)
srv := &http.Server{
Addr: ":9999",
Contributor Author


@enisoc I want to get your opinion on how we should configure this. My initial guess would be to just use an environment variable, but I wanted to discuss.

Contributor Author


I've added a flag to configure this.

"os"
"os/signal"
"sync"
"syscall"
"time"

"github.com/0xRLG/ocworkqueue"


Is there a benefit to going via OpenCensus instead of reusing the Prometheus provider from core k8s? Whenever possible, I prefer to reuse code and patterns from the core since they're well-exercised.

Contributor Author

@rlguarino · Sep 11, 2018


I didn't even know that existed. I've been using this package for a long time and only just made it open source. There isn't a huge difference between the two: plain old Prometheus and OpenCensus both result in Prometheus metrics being exposed on the debug endpoint. I think OpenCensus ends up being cheaper to collect if you don't report the metrics anywhere, but we don't do that in this PR.

To be honest, since this project is in GoogleCloudPlatform and OpenCensus is driven by Google, using the two together made sense to me.

I really don't mind updating the PR to use the Prometheus exporter from client-go.



Alright that makes sense. Let's just keep OpenCensus. This PR should be good to go then whenever you have a chance to make the other changes below.

main.go Outdated
@@ -42,6 +48,7 @@ import (
var (
discoveryInterval = flag.Duration("discovery-interval", 30*time.Second, "How often to refresh discovery cache to pick up newly-installed resources")
informerRelist = flag.Duration("cache-flush-interval", 30*time.Minute, "How often to flush local caches and relist objects from the API server")
debugAddr = flag.String("debug-addr", "localhost:9999", "The address to bind the debug http endpoints")


Binding to localhost by default means it won't be accessible from outside the Pod, doesn't it? Is that intentional? It looks like earlier you had just ":9999".

Contributor Author


Yeah, I think you're right. I'll change it.
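
For illustration, the default might end up like this (a sketch of the suggested change, not the final diff):

```go
// Bind on all interfaces by default so /metrics is reachable from outside
// the Pod; operators can still restrict it via --debug-addr.
debugAddr = flag.String("debug-addr", ":9999", "The address to bind the debug http endpoints")
```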

main.go Outdated
Handler: mux,
}
go func() {
glog.Fatalf("cannot serve debug endpoint: %v", srv.ListenAndServe())


Maybe this should be logged but not fatal? The debug endpoint isn't strictly necessary for Metacontroller to perform its main duties. If I'm relying on Metacontroller to manage my app, I wouldn't want it to give up just because, for example, the metrics endpoint couldn't bind due to a port conflict.

In any case, we should probably ignore ErrServerClosed here because we call Shutdown() below.
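
A small sketch of that suggestion, reusing the names from the surrounding diff (not the final code):

```go
go func() {
	// Shutdown() below makes ListenAndServe return http.ErrServerClosed,
	// which is a clean exit. Any other error is logged but not fatal, since
	// the debug endpoint isn't required for Metacontroller's core work.
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		glog.Errorf("cannot serve debug endpoint: %v", err)
	}
}()
```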

Contributor Author


You're right. I remember fighting with this log line quite a bit; I should have paid more attention. I'll change this to an Errorf.

@rlguarino force-pushed the ross/2 branch 3 times, most recently from a87d998 to 527886a on September 17, 2018 17:10
@rlguarino
Contributor Author

I addressed the comments and rebased this off master. I think the Gopkg.lock should now be the result of master's Gopkg.lock with my changes to Gopkg.toml, after running dep ensure.

I ran:

$ git checkout origin/master Gopkg.lock
$ rm -rf vendor/
$ dep ensure


@enisoc left a comment


LGTM. Thanks!

@enisoc merged commit 31394a5 into GoogleCloudPlatform:master on Sep 17, 2018