
[SURE-4340] Add Prometheus Metrics #2172

Merged: 3 commits into rancher:main on Apr 23, 2024

Conversation

@p-se (Contributor) commented Feb 22, 2024

Refers to #1408

Expose Prometheus metrics of the fleet-controller for the following
controllers with corresponding E2E tests:

  • GitRepo
  • Bundle
  • BundleDeployment
  • Cluster
  • ClusterGroup

@mig4ng commented Feb 27, 2024

Thank you for this work, @p-se!
Last week I had some workloads down that I was not alerted about, because a PrometheusRule in one of my GitRepos had failed to deploy.

Looking forward to seeing this in Fleet and subsequently in Rancher. 🚀

@p-se p-se marked this pull request as ready for review February 29, 2024 11:09
@p-se p-se requested a review from a team as a code owner February 29, 2024 11:09
p-se added a commit to p-se/fleet that referenced this pull request Feb 29, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
p-se added a commit to p-se/fleet that referenced this pull request Feb 29, 2024
p-se added a commit to p-se/fleet that referenced this pull request Feb 29, 2024
@p-se (author) commented Feb 29, 2024

It might be worth looking at exactly when the data is collected in the controllers' reconciliation loops; I'm not sure the current placement is ideal.

@weyfonk (Contributor) left a comment

Thanks for this great effort! Leaving mostly nitpicks and some questions, as I am not very familiar with Prometheus.
Happy to discuss :)

@@ -66,6 +66,9 @@ priorityClassName: ""
gitops:
  enabled: true

metrics:
  enabled: true
Contributor:

Do we want metrics to be enabled by default? 🤔

@p-se (author):

I'm open to discussion, but I thought it would not be too expensive to simply collect the metrics and make them available by default.

Member:

Let's try enabling by default and see if it's costly

@@ -167,6 +168,8 @@ func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
		logger.V(1).Error(err, "Reconcile failed final update to cluster status", "status", cluster.Status)
	}

	metrics.CollectClusterMetrics(cluster)
@bigkevmcd (Contributor), Mar 11, 2024:

I wonder if, in addition to these metrics, we want to record the duration of the Reconcile call.

Record the start time and then record the duration at the end of the function (via defer). This could be useful to indicate how frequently the cluster can be reconciled, or to surface reconciliation issues beyond simple errors (overload cases?).

(Ideally this would be a Histogram; as https://prometheus.io/docs/practices/histograms/ explains, it is very similar to a request duration.)
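
For reference, the defer-based timing pattern being suggested could look roughly like the sketch below. This is illustrative only: the metric name, help text, and stub reconciler are assumptions, not code from this PR.

package controller

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	ctrl "sigs.k8s.io/controller-runtime"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// reconcileDuration is a hypothetical histogram of reconcile durations.
var reconcileDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "fleet_reconcile_duration_seconds", // illustrative name
		Help:    "Duration of reconcile calls.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"controller"},
)

func init() {
	// controller-runtime exposes a shared registry for custom collectors.
	ctrlmetrics.Registry.MustRegister(reconcileDuration)
}

// ClusterReconciler is a stand-in for the real reconciler type.
type ClusterReconciler struct{}

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	start := time.Now()
	// Record the elapsed time once Reconcile returns, regardless of outcome.
	defer func() {
		reconcileDuration.WithLabelValues("cluster").Observe(time.Since(start).Seconds())
	}()

	// ... reconciliation logic ...
	return ctrl.Result{}, nil
}

As noted in the reply below, controller-runtime already records this for every registered controller, so the pattern is shown mainly for completeness.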

@p-se (author):

controller-runtime does this out-of-the-box for us, e.g.:

controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.005"} 81
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.01"} 113
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.025"} 165
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.05"} 319
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.1"} 413
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.15"} 453
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.2"} 470
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.25"} 490
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.3"} 493
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.35"} 496
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.4"} 499
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.45"} 501
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.5"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.6"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.7"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.8"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.9"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1.25"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1.75"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="2"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="2.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="3"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="3.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="4"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="4.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="6"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="7"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="8"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="9"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="10"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="15"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="20"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="25"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="30"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="40"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="50"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="60"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="+Inf"} 503

@bigkevmcd (Contributor):

I think it'd be useful to also gauge the number of Paused resources; this would help highlight cases where folks have things that are not being applied.
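
One way such a gauge could look, sketched against the GitRepo type (hypothetical: the metric name and collector wiring are assumptions, not code from this PR; the actual follow-up is tracked in #2314 below):

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"

	fleet "github.com/rancher/fleet/pkg/apis/fleet.cattle.io/v1alpha1"
)

// gitRepoPaused is a hypothetical gauge: 1 if the GitRepo is paused, 0 otherwise.
var gitRepoPaused = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "fleet_gitrepo_paused",
		Help: "Whether the GitRepo is paused (1) or not (0).",
	},
	[]string{"name", "namespace"},
)

func init() {
	ctrlmetrics.Registry.MustRegister(gitRepoPaused)
}

// collectPaused would be called from the GitRepo reconciler after a
// successful status update.
func collectPaused(gitrepo *fleet.GitRepo) {
	value := 0.0
	if gitrepo.Spec.Paused {
		value = 1.0
	}
	gitRepoPaused.WithLabelValues(gitrepo.Name, gitrepo.Namespace).Set(value)
}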

p-se added a commit to p-se/fleet that referenced this pull request Mar 15, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
@p-se p-se marked this pull request as draft April 3, 2024 09:31
@p-se p-se force-pushed the fleet-metrics branch 2 times, most recently from d46809f to 8ec593f on April 4, 2024 12:47

@@ -0,0 +1,181 @@
package metrics
Member:

Is there a more descriptive file name?

@p-se (author):

Is exporter.go fine?

nil,
)
if expectedExist {
Expect(err).ToNot(HaveOccurred())
Member:

We could weaken the test and just check for existence?
On the other hand, since metrics is a separate suite already, we could run it without parallelism and even set the specs to Ordered, so the number of resources is predictable?
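
For reference, marking a container as Ordered in Ginkgo v2 looks roughly like the following sketch (illustrative only, not code from this PR):

package metrics_test

import (
	. "github.com/onsi/ginkgo/v2"
)

// Ordered makes Ginkgo run the specs in this container sequentially and in
// the order they are defined, so the number of cluster resources seen by
// each spec stays predictable.
var _ = Describe("Cluster metrics", Ordered, func() {
	BeforeAll(func() {
		// set up fixtures once for the whole ordered container
	})

	It("exposes metrics for existing clusters", func() {
		// assertions against the metrics endpoint would go here
	})
})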

@p-se (author):

I haven't fully figured out how to test cases in which a cluster is modified (if that is even possible, and supposed to be possible) or deleted.

On the other topics: the test is "weakened" in the sense that it does not care about the values of the metrics and only considers existing cluster resources, however many there might be. It works well when run in parallel, as do the other tests.

Adding one or two other tests will be done in a separate issue: #2315

@p-se (author) commented Apr 10, 2024

Regarding "I think it'd be useful to also gauge the number of Paused resources": #2314

p-se added a commit to p-se/fleet that referenced this pull request Apr 10, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
p-se added a commit to p-se/fleet that referenced this pull request Apr 10, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
@p-se p-se marked this pull request as ready for review April 11, 2024 08:02
@p-se p-se requested review from manno, weyfonk and bigkevmcd April 11, 2024 08:10
p-se added a commit to p-se/fleet that referenced this pull request Apr 11, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
Comment on lines +89 to +92
Context(
	"when the GitRepo (and therefore Bundle) is changed",
	Label("bundle-altered"),
	func() {
Member:

nit: newlines look strange to me

Suggested change:
-Context(
-	"when the GitRepo (and therefore Bundle) is changed",
-	Label("bundle-altered"),
-	func() {
+When("the GitRepo (and therefore Bundle) is changed", Label("bundle-altered"), func() {

@weyfonk (Contributor) left a comment

Looking good, thanks for this effort 🚀
Leaving a few comments and doubts.

return ctrl.Result{}, err
}

if bundle.Status.ObservedGeneration != bundle.Generation {
if err := setResourceKey(context.Background(), &bundle.Status, bundle, manifest, r.isNamespaced); err != nil {
updateDisplay(&bundle.Status)
metrics.BundleCollector.Collect(bundle)
Contributor:

In this case and the above two, my understanding is that collected metrics will not match the state of the bundle in the cluster, as the bundle's status has only been updated in the Fleet controller's memory.

Is that likely to be an issue? In other words, wouldn't this result in inconsistency that may be confusing to users?

@p-se (author):

@@ -159,6 +165,7 @@ func (r *BundleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
	}

	updateDisplay(&bundle.Status)
	metrics.BundleCollector.Collect(bundle)
Contributor:

Do we want to run this no matter what the result of RetryOnConflict below turns out to be, or would we rather collect metrics only in success cases, as done for bundle deployments?

@p-se (author):

You're right, we only want to collect in success cases.
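
A rough sketch of that ordering inside the Bundle reconciler (an illustrative fragment under assumptions about the surrounding code, not the actual diff):

// Inside (r *BundleReconciler) Reconcile, after the status has been prepared
// in memory. retry.RetryOnConflict comes from k8s.io/client-go/util/retry.
err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
	t := &fleet.Bundle{}
	if err := r.Get(ctx, req.NamespacedName, t); err != nil {
		return err
	}
	t.Status = bundle.Status
	return r.Status().Update(ctx, t)
})
if err != nil {
	// The status update failed: skip collection so the exported metrics
	// do not drift from the state actually stored in the cluster.
	return ctrl.Result{}, err
}

// Only collect once the status update has succeeded.
metrics.BundleCollector.Collect(bundle)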

@@ -163,6 +165,8 @@ func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
		logger.V(1).Error(err, "Reconcile failed final update to cluster status", "status", cluster.Status)
	}

	metrics.ClusterCollector.Collect(cluster)
Contributor:

Same question as for bundles: should this metrics collection run only if RetryOnConflict was successful?


var existingClusters clusters
err = json.Unmarshal([]byte(clustersOut), &existingClusters)
Expect(err).ToNot(HaveOccurred())
Contributor:

Do we perhaps want to check that existingClusters contains at least one cluster, so that the for loop below doesn't become a no-op?
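
For instance, a guard like the following could be added (hypothetical; the field name on the clusters struct is assumed):

// Fail early if no clusters were returned, so the loop below cannot
// silently become a no-op.
Expect(existingClusters.Items).ToNot(BeEmpty())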

}

It(
"should have as many clusters in metrics as there are objects in the cluster",
Contributor:

I don't understand the meaning of this statement, or how it maps to the logic below.

@p-se (author):

The purpose is to make sure that every cluster resource that exists at the time of testing has corresponding metrics. I've changed the description accordingly.

Interestingly, thinking about it after having parallelized all the other tests: even though this one appears to work, it may not have been such a good idea to check every existing cluster resource for the existence of its corresponding metrics. It appears to work coincidentally when the tests run in parallel, but there is no guarantee that it always will. This may not be obvious now, but it will become more obvious in the follow-up PR that adds tests to this file, since those tests will create new cluster resources, and new cluster resources can be created and destroyed at any time while the tests run in parallel.

The issue described above is not crucial for this PR and not even necessary for our CI. As another PR is going to follow, I can fix it there; and as long as we don't actually run those tests in parallel, which we currently don't, it won't be an issue for CI at all. It is just something that would be good to fix to ensure the independence of test cases. I would still like to see those tests run in parallel and in shuffled order someday, if only to ensure that they are not dependent on each other. But seeing them complete in under 15 seconds is nice, too.

"fleet_cluster_group_resource_count_ready": true,
"fleet_cluster_group_resource_count_unknown": true,
"fleet_cluster_group_resource_count_waitapplied": true,
}
Contributor:

I think a comment about the state metric, similar to the one in the cluster metrics tests, would make sense here. WDYT?

@p-se (author):

Yes!

func (l promLabels) String() string {
	r := ""
	for k, v := range l {
		r += fmt.Sprintf("%s=%q, ", k, v)
Contributor:

nit: not very important here since we're dealing with test code, but this could be done with a strings.Builder.

@p-se (author):

Thank you both! I first thought of using strings.Builder, mostly because I wanted the result to look exactly like a metric does. Then I realized that my implementation didn't behave exactly as I wanted and switched to strings.Join. I also got curious about the performance implications of the different options and found that fmt.Sprintf comes with the greatest performance penalty, while using + for small strings does not seem to be a big issue. In the end I kept fmt.Sprintf for readability and because it is just test code, as you also said, @weyfonk.
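
For illustration, the strings.Join variant mentioned above might look roughly like this (a self-contained sketch, not the code that was merged):

package main

import (
	"fmt"
	"strings"
)

type promLabels map[string]string

// String renders labels similarly to the Prometheus exposition format,
// e.g. name="test", namespace="default". Note that map iteration order is
// not deterministic, matching the behaviour of the original helper.
func (l promLabels) String() string {
	pairs := make([]string, 0, len(l))
	for k, v := range l {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, v))
	}
	return strings.Join(pairs, ", ")
}

func main() {
	fmt.Println(promLabels{"name": "test", "namespace": "default"})
}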

@bigkevmcd (Contributor):

I'm somewhat surprised to see internal/metrics with no tests added; I think this has to be remediated before merging.
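
As an illustration of what a unit test for such a package could look like, here is a small sketch using the prometheus testutil helpers (the gauge and its name are hypothetical, not taken from internal/metrics):

package metrics_test

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestGaugeIsExported(t *testing.T) {
	// A stand-in gauge; a real test would exercise a collector from internal/metrics.
	gauge := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "fleet_example_resource_count",
		Help: "Example resource count.",
	}, []string{"name"})

	gauge.WithLabelValues("test").Set(3)

	expected := `
# HELP fleet_example_resource_count Example resource count.
# TYPE fleet_example_resource_count gauge
fleet_example_resource_count{name="test"} 3
`
	// CollectAndCompare gathers the metric and diffs it against the expected
	// text-format exposition.
	if err := testutil.CollectAndCompare(gauge, strings.NewReader(expected)); err != nil {
		t.Fatal(err)
	}
}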

p-se added a commit to p-se/fleet that referenced this pull request Apr 22, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
p-se added a commit to p-se/fleet that referenced this pull request Apr 22, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
@p-se p-se force-pushed the fleet-metrics branch 2 times, most recently from 648cd4e to b0732a6 on April 23, 2024 07:27
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
weyfonk and others added 2 commits April 23, 2024 13:05
This prevents nil pointer errors when deploying Fleet without any
`shards` Helm value, in which case a single controller should and now
will be deployed.
@manno manno merged commit d5d1b44 into rancher:main Apr 23, 2024
8 checks passed
@mig4ng commented Apr 23, 2024

Amazing! 🎉
