Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: refactor metrics endpoint #216

Merged
merged 7 commits into from
Feb 20, 2024

Conversation

bavarianbidi
Copy link
Contributor

@bavarianbidi bavarianbidi commented Feb 19, 2024

refactoring is needed to make the metrics package usable from within the runner package for further metrics.

This change also makes the metric-collector independent from requests to the /metrics endpoint

This refactoring is needed to continue on #181

I will raise a couple of additional PRs to at least introduce some counters, like e.g. instance_created_total and instance_created_error_total.
I will follow this pattern -> https://promlabs.com/blog/2023/09/19/errors-successes-totals-which-metrics-should-i-expose-to-prometheus/#recommended-for-binary-outcomes-exposing-errors-and-totals to introduce some more metrics.

The future code might look like e.g.

diff --git a/runner/providers/external/external.go b/runner/providers/external/external.go
index d157404..9ce94d2 100644
--- a/runner/providers/external/external.go
+++ b/runner/providers/external/external.go
@@ -14,6 +14,7 @@ import (
 	garmErrors "github.com/cloudbase/garm-provider-common/errors"
 	garmExec "github.com/cloudbase/garm-provider-common/util/exec"
 	"github.com/cloudbase/garm/config"
+	"github.com/cloudbase/garm/metrics"
 	"github.com/cloudbase/garm/params"
 	"github.com/cloudbase/garm/runner/common"
 
@@ -86,6 +87,10 @@ func (e *external) CreateInstance(ctx context.Context, bootstrapParams commonPar
 	if err != nil {
 		return commonParams.ProviderInstance{}, garmErrors.NewProviderError("provider binary %s returned error: %s", e.execPath, err)
 	}
+	metrics.InstanceCreatedCount.WithLabelValues(
+		bootstrapParams.PoolID, // label: pool_id
+		e.cfg.Name,             // label: provider
+	).Inc()
 
 	var param commonParams.ProviderInstance
 	if err := json.Unmarshal(out, &param); err != nil {

refactoring is needed to make the metrics package usable from within the
runner package for further metrics.

This change also makes the metric-collector independent from requests to
the /metrics endpoint

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
@bavarianbidi
Copy link
Contributor Author

@gabriel-samfira by refactoring all the metrics, i've seen that we sometimes add the labels hostname and controller_id (by using r.GetControllerInfo(ctx)) to the defined metrics. But sometimes we don't add this.

Will do another round in this PR and align all the labels to have at least the controller_id as label. We do not need the hostname label, but if you prefer to have it, please let me know.

@gabriel-samfira
Copy link
Member

This is great! As part of the next release, I was thinking of extending the external provider interface to allow an RPC API of some sort that external providers can call into for logging, telemetry, etc. We can also do metrics there once that happens, like how long executing an operation against a provider took and things of that sort.

Will do an in-depth review soon.

Copy link
Member

@gabriel-samfira gabriel-samfira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the separation of concerns and how much cleaner the code is. I have just a couple of comments, but overall this looks much nicer!

by adding the context from main and make auth.GetAdminContext accepting a
context we are now able to stop the metrics collection loop once the
context is canceled

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
Copy link
Member

@gabriel-samfira gabriel-samfira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of really tiny suggestions to align with the rest of the code base. Overall this looks great! Thanks for all the work!

fail if metric registration panics

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
Comment on lines +224 to +226
if err := metrics.RegisterMetrics(); err != nil {
log.Fatal(err)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one downside of this implementation (register the metrics here instead of doing it in the init):

for any further metric we want to change, we have to do a e.g.

if cfg.Metrics.Enable {
       metrics.InstanceCreatedCount.WithLabelValues(
		bootstrapParams.PoolID, // label: pool_id
		e.cfg.Name,             // label: provider
	).Inc()
}

i'm either thinking of registering the metrics, no matter if cfg.Metrics.Enable is set or directly drop cfg.Metrics.Enable and always run garm with /metrics.

WDYT @gabriel-samfira ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we just add it to the RegisterMetrics() function and it gets registered here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In essence, RegisterMetrics() would act like the init() function, but would be capable of returning an error we can potentially handle. So instead of using the init() function, we define our own and call it here to register the metrics.

If metrics are enabled, we register them and start the collector loop.

fix linter warnings

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
@gabriel-samfira gabriel-samfira merged commit e108140 into cloudbase:main Feb 20, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants