chore: refactor metrics endpoint #216

bavarianbidi · 2024-02-19T15:25:10Z

This refactoring is needed to make the metrics package usable from within the runner package, so that further metrics can be added there.

It also makes the metrics collector independent of requests to the /metrics endpoint.

This refactoring is needed to continue work on #181.

I will raise a couple of additional PRs to introduce at least some counters, e.g. instance_created_total and instance_created_error_total.
To introduce further metrics, I will follow this pattern: https://promlabs.com/blog/2023/09/19/errors-successes-totals-which-metrics-should-i-expose-to-prometheus/#recommended-for-binary-outcomes-exposing-errors-and-totals

The future code might look like this, for example:

diff --git a/runner/providers/external/external.go b/runner/providers/external/external.go
index d157404..9ce94d2 100644
--- a/runner/providers/external/external.go
+++ b/runner/providers/external/external.go
@@ -14,6 +14,7 @@ import (
 	garmErrors "github.com/cloudbase/garm-provider-common/errors"
 	garmExec "github.com/cloudbase/garm-provider-common/util/exec"
 	"github.com/cloudbase/garm/config"
+	"github.com/cloudbase/garm/metrics"
 	"github.com/cloudbase/garm/params"
 	"github.com/cloudbase/garm/runner/common"
 
@@ -86,6 +87,10 @@ func (e *external) CreateInstance(ctx context.Context, bootstrapParams commonPar
 	if err != nil {
 		return commonParams.ProviderInstance{}, garmErrors.NewProviderError("provider binary %s returned error: %s", e.execPath, err)
 	}
+	metrics.InstanceCreatedCount.WithLabelValues(
+		bootstrapParams.PoolID, // label: pool_id
+		e.cfg.Name,             // label: provider
+	).Inc()
 
 	var param commonParams.ProviderInstance
 	if err := json.Unmarshal(out, &param); err != nil {

refactoring is needed to make the metrics package usable from within the runner package for further metrics. This change also makes the metric-collector independent from requests to the /metrics endpoint

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>

bavarianbidi · 2024-02-19T15:27:56Z

@gabriel-samfira while refactoring all the metrics, I noticed that we sometimes add the labels hostname and controller_id (via r.GetControllerInfo(ctx)) to the defined metrics, and sometimes we don't.

I will do another round in this PR and align all the labels so that every metric has at least controller_id as a label. We do not need the hostname label, but if you prefer to have it, please let me know.

gabriel-samfira · 2024-02-19T15:51:59Z

This is great! As part of the next release, I was thinking of extending the external provider interface to allow an RPC API of some sort that external providers can call into for logging, telemetry, etc. We can also do metrics there once that happens, like how long executing an operation against a provider took and things of that sort.

Will do an in-depth review soon.

gabriel-samfira

I love the separation of concerns and how much cleaner the code is. I have just a couple of comments, but overall this looks much nicer!


gabriel-samfira

Just a couple of really tiny suggestions to align with the rest of the code base. Overall this looks great! Thanks for all the work!


bavarianbidi · 2024-02-20T14:24:16Z

cmd/garm/main.go

+		if err := metrics.RegisterMetrics(); err != nil {
+			log.Fatal(err)
+		}


One downside of this implementation (registering the metrics here instead of in init()):

For any further metric we want to update, we have to guard the call, e.g.

if cfg.Metrics.Enable {
	metrics.InstanceCreatedCount.WithLabelValues(
		bootstrapParams.PoolID, // label: pool_id
		e.cfg.Name,             // label: provider
	).Inc()
}

I'm thinking of either registering the metrics regardless of whether cfg.Metrics.Enable is set, or dropping cfg.Metrics.Enable entirely and always running garm with /metrics.

WDYT @gabriel-samfira ?

Wouldn't we just add it to the RegisterMetrics() function and it gets registered here?

In essence, RegisterMetrics() would act like the init() function, but would be capable of returning an error we can potentially handle. So instead of using the init() function, we define our own and call it here to register the metrics.

If metrics are enabled, we register them and start the collector loop.
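
A minimal sketch of the idea discussed above: an explicit RegisterMetrics() that does the job of init() but returns an error the caller can handle. The registry type here is a hypothetical stdlib-only stand-in, and metricsEnabled stands in for cfg.Metrics.Enable. In client_golang the same distinction exists between prometheus.Register, which returns an error, and prometheus.MustRegister, which panics; how garm actually wires this up is not shown in this thread.

```go
package main

import (
	"fmt"
	"log"
)

// registry is a hypothetical stand-in for a Prometheus registerer.
type registry struct {
	names map[string]struct{}
}

// Register returns an error for duplicate names instead of panicking.
func (r *registry) Register(name string) error {
	if _, ok := r.names[name]; ok {
		return fmt.Errorf("metric %q already registered", name)
	}
	r.names[name] = struct{}{}
	return nil
}

var defaultRegistry = &registry{names: make(map[string]struct{})}

// RegisterMetrics plays the role of init(), but reports failures to the caller.
func RegisterMetrics() error {
	names := []string{"instance_created_total", "instance_created_error_total"}
	for _, name := range names {
		if err := defaultRegistry.Register(name); err != nil {
			return fmt.Errorf("registering metrics: %w", err)
		}
	}
	return nil
}

func main() {
	metricsEnabled := true // stands in for cfg.Metrics.Enable
	if metricsEnabled {
		if err := RegisterMetrics(); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("metrics registered")
}
```

Adding a new metric then means adding it to RegisterMetrics() once; the caller in main decides what to do when registration fails.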


gabriel-samfira requested changes Feb 19, 2024

metrics/metrics.go (outdated, resolved)

runner/metrics/metrics.go (outdated, resolved)

runner/metrics/metrics.go (outdated, resolved)

runner/metrics/metrics.go (outdated, resolved)

bavarianbidi added 2 commits February 20, 2024 06:33

fix: improve metrics collection loop

97f172e

by adding the context from main and make auth.GetAdminContext accepting a context we are now able to stop the metrics collection loop once the context is canceled

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>

feat: define a default duration for metrics update

3e025dd

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>

gabriel-samfira requested changes Feb 20, 2024

config/config.go (resolved)

cmd/garm/main.go (outdated, resolved)

runner/metrics/metrics.go (outdated, resolved)

runner/metrics/metrics.go (outdated, resolved)

runner/metrics/metrics.go (outdated, resolved)

bavarianbidi added 2 commits February 20, 2024 14:27

chore: rework prometheus metrics registration

17d74df

fail if metric registration panics

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>

fix: stop metrics collector ticker on ctx.Done

0a53b8f

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>

bavarianbidi requested a review from gabriel-samfira February 20, 2024 14:16

bavarianbidi commented Feb 20, 2024

gabriel-samfira approved these changes Feb 20, 2024

gabriel-samfira reviewed Feb 20, 2024

runner/enterprises_test.go (outdated, resolved)

bavarianbidi added 2 commits February 20, 2024 16:39

fix: pass context.TODO by getting admin context

2a3e4d6

fix linter warnings

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>

fix: switch to context.Background() for adminctx

b1cbfac

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>

gabriel-samfira merged commit e108140 into cloudbase:main Feb 20, 2024
4 checks passed
