
[SURE-4340] Add Prometheus Metrics #2172

Merged: 3 commits into rancher:main on Apr 23, 2024

Conversation

@p-se (Contributor) commented Feb 22, 2024

Refers to #1408

Expose Prometheus metrics of the fleet-controller for the following
controllers with corresponding E2E tests:

  • GitRepo
  • Bundle
  • BundleDeployment
  • Cluster
  • ClusterGroup

@mig4ng commented Feb 27, 2024

Thank you for this work, @p-se!
Last week I had some workloads down that I was not alerted about, because a PrometheusRule in one of my GitRepos had failed to deploy.

Looking forward to seeing this in Fleet and subsequently in Rancher. 🚀

@p-se p-se marked this pull request as ready for review February 29, 2024 11:09
@p-se p-se requested a review from a team as a code owner February 29, 2024 11:09
p-se added a commit to p-se/fleet that referenced this pull request Feb 29, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
p-se added a commit to p-se/fleet that referenced this pull request Feb 29, 2024
p-se added a commit to p-se/fleet that referenced this pull request Feb 29, 2024
@p-se (author) commented Feb 29, 2024

It might be worth looking at exactly when the data is collected in the controllers' reconciliation loops; I'm not sure the current placement is ideal.

@weyfonk (Contributor) left a comment

Thanks for this great effort! Leaving mostly nitpicks and some questions, as I am not very familiar with Prometheus.
Happy to discuss :)

@@ -66,6 +66,9 @@ priorityClassName: ""
gitops:
  enabled: true

metrics:
  enabled: true
Contributor:

Do we want metrics to be enabled by default? 🤔

@p-se (author):

I'm open to discussion, but I thought it would not be too expensive to simply collect the metrics and make them available by default.

Member:

Let's try enabling by default and see if it's costly

@@ -167,6 +168,8 @@ func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
		logger.V(1).Error(err, "Reconcile failed final update to cluster status", "status", cluster.Status)
	}

	metrics.CollectClusterMetrics(cluster)
@bigkevmcd (Contributor), Mar 11, 2024:

I wonder if, in addition to these metrics, we want to record the duration of the Reconcile call.

Record the start time and then record the duration at the end of the function (via defer). This could be useful to indicate how frequently the cluster can be reconciled, or to surface reconciliation issues beyond simple errors (overload cases?).

(Ideally this would be a Histogram; as https://prometheus.io/docs/practices/histograms/ explains, it is very similar to a request duration.)
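
For reference, the defer-based timing pattern being suggested could look roughly like the sketch below. This is illustrative only: the metric name, help text, and stub reconciler are assumptions, not code from this PR.

package controller

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	ctrl "sigs.k8s.io/controller-runtime"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// reconcileDuration is a hypothetical histogram of reconcile durations.
var reconcileDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "fleet_reconcile_duration_seconds", // illustrative name
		Help:    "Duration of reconcile calls.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"controller"},
)

func init() {
	// controller-runtime exposes a shared registry for custom collectors.
	ctrlmetrics.Registry.MustRegister(reconcileDuration)
}

// ClusterReconciler is a stand-in for the real reconciler type.
type ClusterReconciler struct{}

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	start := time.Now()
	// Record the elapsed time once Reconcile returns, regardless of outcome.
	defer func() {
		reconcileDuration.WithLabelValues("cluster").Observe(time.Since(start).Seconds())
	}()

	// ... reconciliation logic ...
	return ctrl.Result{}, nil
}

As noted in the reply below, controller-runtime already records this for every registered controller, so the pattern is shown mainly for completeness.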

@p-se (author):

controller-runtime does this out-of-the-box for us, e.g.:

controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.005"} 81
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.01"} 113
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.025"} 165
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.05"} 319
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.1"} 413
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.15"} 453
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.2"} 470
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.25"} 490
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.3"} 493
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.35"} 496
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.4"} 499
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.45"} 501
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.5"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.6"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.7"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.8"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.9"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1.25"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1.75"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="2"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="2.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="3"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="3.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="4"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="4.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="6"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="7"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="8"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="9"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="10"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="15"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="20"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="25"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="30"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="40"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="50"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="60"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="+Inf"} 503

@bigkevmcd (Contributor):

I think it'd be useful to also gauge the number of Paused resources; this would help highlight cases where folks have things that are not being applied.
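
One way such a gauge could look, sketched against the GitRepo type (hypothetical: the metric name and collector wiring are assumptions, not code from this PR; the actual follow-up is tracked in #2314 below):

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"

	fleet "github.com/rancher/fleet/pkg/apis/fleet.cattle.io/v1alpha1"
)

// gitRepoPaused is a hypothetical gauge: 1 if the GitRepo is paused, 0 otherwise.
var gitRepoPaused = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "fleet_gitrepo_paused",
		Help: "Whether the GitRepo is paused (1) or not (0).",
	},
	[]string{"name", "namespace"},
)

func init() {
	ctrlmetrics.Registry.MustRegister(gitRepoPaused)
}

// collectPaused would be called from the GitRepo reconciler after a
// successful status update.
func collectPaused(gitrepo *fleet.GitRepo) {
	value := 0.0
	if gitrepo.Spec.Paused {
		value = 1.0
	}
	gitRepoPaused.WithLabelValues(gitrepo.Name, gitrepo.Namespace).Set(value)
}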

p-se added a commit to p-se/fleet that referenced this pull request Mar 15, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
@p-se p-se marked this pull request as draft April 3, 2024 09:31
@p-se p-se force-pushed the fleet-metrics branch 2 times, most recently from d46809f to 8ec593f on April 4, 2024 12:47

@@ -0,0 +1,181 @@
package metrics
Member:

Is there a more descriptive file name?

@p-se (author):

Is exporter.go fine?

nil,
)
if expectedExist {
Expect(err).ToNot(HaveOccurred())
Member:

We could weaken the test and just check for existence?
On the other hand, since metrics is a separate suite already, we could run it without parallelism and even set the specs to Ordered, so the number of resources is predictable?
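
For reference, marking a container as Ordered in Ginkgo v2 looks roughly like the following sketch (illustrative only, not code from this PR):

package metrics_test

import (
	. "github.com/onsi/ginkgo/v2"
)

// Ordered makes Ginkgo run the specs in this container sequentially and in
// the order they are defined, so the number of cluster resources seen by
// each spec stays predictable.
var _ = Describe("Cluster metrics", Ordered, func() {
	BeforeAll(func() {
		// set up fixtures once for the whole ordered container
	})

	It("exposes metrics for existing clusters", func() {
		// assertions against the metrics endpoint would go here
	})
})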

@p-se (author):

I haven't fully figured out how to test cases in which a cluster is modified (if that is even possible, and supposed to be possible) or deleted.

On the other topics: the test is "weakened" in the sense that it does not care about the values of the metrics and only considers existing cluster resources, however many there might be. It works well when run in parallel, as do the other tests.

Adding one or two other tests will be done in a separate issue: #2315

@p-se (author) commented Apr 10, 2024

Regarding "I think it'd be useful to also gauge the number of Paused resources": #2314

p-se added a commit to p-se/fleet that referenced this pull request Apr 10, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
p-se added a commit to p-se/fleet that referenced this pull request Apr 10, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
@p-se p-se marked this pull request as ready for review April 11, 2024 08:02
@p-se p-se requested review from manno, weyfonk and bigkevmcd April 11, 2024 08:10
p-se added a commit to p-se/fleet that referenced this pull request Apr 11, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
Comment on lines +89 to +92
Context(
	"when the GitRepo (and therefore Bundle) is changed",
	Label("bundle-altered"),
	func() {
Member:

nit: newlines look strange to me

Suggested change:
-Context(
-	"when the GitRepo (and therefore Bundle) is changed",
-	Label("bundle-altered"),
-	func() {
+When("the GitRepo (and therefore Bundle) is changed", Label("bundle-altered"), func() {

@weyfonk (Contributor) left a comment

Looking good, thanks for this effort 🚀
Leaving a few comments and doubts.

return ctrl.Result{}, err
}

if bundle.Status.ObservedGeneration != bundle.Generation {
if err := setResourceKey(context.Background(), &bundle.Status, bundle, manifest, r.isNamespaced); err != nil {
updateDisplay(&bundle.Status)
metrics.BundleCollector.Collect(bundle)
Contributor:

In this case and the above two, my understanding is that collected metrics will not match the state of the bundle in the cluster, as the bundle's status has only been updated in the Fleet controller's memory.

Is that likely to be an issue? In other words, wouldn't this result in inconsistency that may be confusing to users?

@p-se (author):

@@ -159,6 +165,7 @@ func (r *BundleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
	}

	updateDisplay(&bundle.Status)
	metrics.BundleCollector.Collect(bundle)
Contributor:

Do we want to run this no matter what the result of RetryOnConflict below turns out to be, or would we rather collect metrics only in success cases, as done for bundle deployments?

@p-se (author):

You're right, we only want to collect in success cases.
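
A rough sketch of that ordering inside the Bundle reconciler (an illustrative fragment under assumptions about the surrounding code, not the actual diff):

// Inside (r *BundleReconciler) Reconcile, after the status has been prepared
// in memory. retry.RetryOnConflict comes from k8s.io/client-go/util/retry.
err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
	t := &fleet.Bundle{}
	if err := r.Get(ctx, req.NamespacedName, t); err != nil {
		return err
	}
	t.Status = bundle.Status
	return r.Status().Update(ctx, t)
})
if err != nil {
	// The status update failed: skip collection so the exported metrics
	// do not drift from the state actually stored in the cluster.
	return ctrl.Result{}, err
}

// Only collect once the status update has succeeded.
metrics.BundleCollector.Collect(bundle)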

@@ -163,6 +165,8 @@ func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
		logger.V(1).Error(err, "Reconcile failed final update to cluster status", "status", cluster.Status)
	}

	metrics.ClusterCollector.Collect(cluster)
Contributor:

Same question as for bundles: should this metrics collection run only if RetryOnConflict was successful?


var existingClusters clusters
err = json.Unmarshal([]byte(clustersOut), &existingClusters)
Expect(err).ToNot(HaveOccurred())
Contributor:

Do we perhaps want to check that existingClusters contains at least one cluster, so that the for loop below doesn't become a no-op?
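
For instance, a guard like the following could be added (hypothetical; the field name on the clusters struct is assumed):

// Fail early if no clusters were returned, so the loop below cannot
// silently become a no-op.
Expect(existingClusters.Items).ToNot(BeEmpty())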

}

It(
"should have as many clusters in metrics as there are objects in the cluster",
Contributor:

I don't understand the meaning of this statement, or how it maps to the logic below.

@p-se (author):

The purpose is to make sure that every cluster resource that exists at the time of testing has corresponding metrics. I've changed the description accordingly.

Interestingly, thinking about it after having parallelized all the other tests: even though this one appears to work, it may not have been such a good idea to check every existing cluster resource for the existence of its corresponding metrics. It appears to work coincidentally when the tests run in parallel, but there is no guarantee that it always will. This may not be obvious now, but it will become more obvious in the follow-up PR that adds tests to this file, since those tests will create new cluster resources, and new cluster resources can be created and destroyed at any time while the tests run in parallel.

The issue described above is not crucial for this PR and not even necessary for our CI. As another PR is going to follow, I can fix it there; and as long as we don't actually run those tests in parallel, which we currently don't, it won't be an issue for CI at all. It is just something that would be good to fix to ensure the independence of test cases. I would still like to see those tests run in parallel and in shuffled order someday, if only to ensure that they are not dependent on each other. But seeing them complete in under 15 seconds is nice, too.

"fleet_cluster_group_resource_count_ready": true,
"fleet_cluster_group_resource_count_unknown": true,
"fleet_cluster_group_resource_count_waitapplied": true,
}
Contributor:

I think a comment about the state metric, similar to the one in the cluster metrics tests, would make sense here. WDYT?

@p-se (author):

Yes!

func (l promLabels) String() string {
	r := ""
	for k, v := range l {
		r += fmt.Sprintf("%s=%q, ", k, v)
Contributor:

nit: not very important here since we're dealing with test code, but this could be done with a strings.Builder.

@p-se (author):

Thank you both! I first thought of using strings.Builder, mostly because I wanted the result to look exactly like a metric does. Then I realized that my implementation didn't behave exactly as I wanted and switched to strings.Join. I also got curious about the performance implications of the different options and found that fmt.Sprintf comes with the greatest performance penalty, while using + for small strings does not seem to be a big issue. In the end I kept fmt.Sprintf for readability and because it is just test code, as you also said, @weyfonk.
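
For illustration, the strings.Join variant mentioned above might look roughly like this (a self-contained sketch, not the code that was merged):

package main

import (
	"fmt"
	"strings"
)

type promLabels map[string]string

// String renders labels similarly to the Prometheus exposition format,
// e.g. name="test", namespace="default". Note that map iteration order is
// not deterministic, matching the behaviour of the original helper.
func (l promLabels) String() string {
	pairs := make([]string, 0, len(l))
	for k, v := range l {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, v))
	}
	return strings.Join(pairs, ", ")
}

func main() {
	fmt.Println(promLabels{"name": "test", "namespace": "default"})
}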

@bigkevmcd (Contributor):

I'm somewhat surprised to see internal/metrics with no tests added; I think this has to be remediated before merging.
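
As an illustration of what a unit test for such a package could look like, here is a small sketch using the prometheus testutil helpers (the gauge and its name are hypothetical, not taken from internal/metrics):

package metrics_test

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestGaugeIsExported(t *testing.T) {
	// A stand-in gauge; a real test would exercise a collector from internal/metrics.
	gauge := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "fleet_example_resource_count",
		Help: "Example resource count.",
	}, []string{"name"})

	gauge.WithLabelValues("test").Set(3)

	expected := `
# HELP fleet_example_resource_count Example resource count.
# TYPE fleet_example_resource_count gauge
fleet_example_resource_count{name="test"} 3
`
	// CollectAndCompare gathers the metric and diffs it against the expected
	// text-format exposition.
	if err := testutil.CollectAndCompare(gauge, strings.NewReader(expected)); err != nil {
		t.Fatal(err)
	}
}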

p-se added a commit to p-se/fleet that referenced this pull request Apr 22, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
p-se added a commit to p-se/fleet that referenced this pull request Apr 22, 2024
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
@p-se p-se force-pushed the fleet-metrics branch 2 times, most recently from 648cd4e to b0732a6 on April 23, 2024 07:27
Refers to rancher#2172, SURE-4340

Expose Prometheus metrics of the fleet-controller for the following
controllers:

- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
weyfonk and others added 2 commits April 23, 2024 13:05
This prevents nil pointer errors when deploying Fleet without any
`shards` Helm value, in which case a single controller should and now
will be deployed.
@manno manno merged commit d5d1b44 into rancher:main Apr 23, 2024
8 checks passed
@mig4ng commented Apr 23, 2024

Amazing! 🎉
