[SURE-4340] Add Prometheus Metrics #2172
Conversation
Thank you for this work @p-se! Looking forward to seeing this in Fleet and subsequently in Rancher. 🚀
Refers to rancher#2172, SURE-4340
Expose Prometheus metrics of the fleet-controller for the following controllers:
- GitRepo
- Bundle
- BundleDeployment
- Cluster
- ClusterGroup
It might be worth looking at exactly when the data is collected in the controllers' reconciliation loops, because I'm not sure it is ideal as it currently stands.
Thanks for this great effort! Leaving mostly nitpicks and some questions, as I am not very familiar with Prometheus.
Happy to discuss :)
```
@@ -66,6 +66,9 @@ priorityClassName: ""
gitops:
  enabled: true

metrics:
  enabled: true
```
Do we want metrics to be enabled by default? 🤔
I'm open to discussion, but I thought it would not be too expensive to just collect the metrics and make them available by default.
Let's try enabling by default and see if it's costly
internal/cmd/controller/reconciler/bundledeployment_controller.go (outdated, resolved)
```
@@ -167,6 +168,8 @@ func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
		logger.V(1).Error(err, "Reconcile failed final update to cluster status", "status", cluster.Status)
	}

	metrics.CollectClusterMetrics(cluster)
```
I wonder whether, in addition to these metrics, we want to record the duration of the Reconcile call.
Record the start time, and record the duration at the end of the function (with a defer); this could be useful to indicate how frequently the cluster can be reconciled, or to surface reconciliation issues beyond simple errors (overload cases?).
Ideally this would be a histogram, as per https://prometheus.io/docs/practices/histograms/; it's very similar to a request duration.
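For illustration, a minimal client_golang sketch of the suggested defer-based timing; the metric name, label, and helper here are hypothetical, not something this PR adds:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// reconcileDuration is a hypothetical histogram of Reconcile call durations,
// labelled by controller name.
var reconcileDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "fleet_reconcile_duration_seconds",
		Help:    "Duration of Reconcile calls in seconds.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"controller"},
)

func init() {
	prometheus.MustRegister(reconcileDuration)
}

// timeReconcile records the start time and returns a function that observes
// the elapsed duration, intended for use as `defer timeReconcile("cluster")()`
// at the top of Reconcile.
func timeReconcile(controller string) func() {
	start := time.Now()
	return func() {
		reconcileDuration.WithLabelValues(controller).Observe(time.Since(start).Seconds())
	}
}
```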
controller-runtime does this out-of-the-box for us, e.g.:
```
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.005"} 81
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.01"} 113
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.025"} 165
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.05"} 319
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.1"} 413
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.15"} 453
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.2"} 470
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.25"} 490
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.3"} 493
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.35"} 496
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.4"} 499
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.45"} 501
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.5"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.6"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.7"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.8"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="0.9"} 502
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1.25"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="1.75"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="2"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="2.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="3"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="3.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="4"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="4.5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="5"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="6"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="7"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="8"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="9"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="10"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="15"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="20"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="25"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="30"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="40"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="50"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="60"} 503
controller_runtime_reconcile_time_seconds_bucket{controller="bundle",le="+Inf"} 503
```
I think it'd be useful to also gauge the number of paused resources; this helps highlight cases where folks have things that are not being applied.
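For illustration, a minimal client_golang sketch of such a gauge; the metric name and helper are hypothetical and not part of this PR:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// pausedResources is a hypothetical gauge counting paused resources by kind.
var pausedResources = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "fleet_paused_resources",
		Help: "Number of paused resources, partitioned by kind.",
	},
	[]string{"kind"},
)

func init() {
	prometheus.MustRegister(pausedResources)
}

// RecordPaused would be called from a reconciler with the current count of
// paused resources of the given kind, e.g. RecordPaused("gitrepo", 3).
func RecordPaused(kind string, count float64) {
	pausedResources.WithLabelValues(kind).Set(count)
}
```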
e2e/metrics/util.go (outdated)
```
@@ -0,0 +1,181 @@
package metrics
```
Is there a more descriptive file name?
Is `exporter.go` fine?
e2e/metrics/cluster_test.go (outdated)
```
		nil,
	)
	if expectedExist {
		Expect(err).ToNot(HaveOccurred())
```
We could weaken the test and just check for existence?
On the other hand, since metrics is a separate suite already, we could run it without parallelism and even set the specs to Ordered, so that the number of resources is predictable?
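For reference, a minimal Ginkgo v2 sketch of an Ordered container; the spec names here are made up:

```go
package metrics_test

import . "github.com/onsi/ginkgo/v2"

// Ordered makes the specs in this container run sequentially, in declaration
// order and never in parallel with each other, so resource counts produced by
// earlier specs stay predictable for later ones.
var _ = Describe("cluster metrics", Ordered, func() {
	It("creates the test clusters", func() { /* ... */ })
	It("exposes one set of metrics per cluster", func() { /* ... */ })
})
```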
I haven't fully figured out how to test cases in which a cluster is modified (if that is even possible and supposed to be possible) or deleted.
On the other topics: the test is "weakened" in the sense that it does not care about the values of the metrics and only considers existing cluster resources, however many there might be. It works well when run in parallel, as do the other tests.
Adding one or two other tests will be done in a separate issue: #2315
```
Context(
	"when the GitRepo (and therefore Bundle) is changed",
	Label("bundle-altered"),
	func() {
```
nit: newlines look strange to me
Suggested change:
```
When("the GitRepo (and therefore Bundle) is changed", Label("bundle-altered"), func() {
```
Looking good, thanks for this effort 🚀
Leaving a few comments and doubts.
```
		return ctrl.Result{}, err
	}

	if bundle.Status.ObservedGeneration != bundle.Generation {
		if err := setResourceKey(context.Background(), &bundle.Status, bundle, manifest, r.isNamespaced); err != nil {
	updateDisplay(&bundle.Status)
	metrics.BundleCollector.Collect(bundle)
```
In this case and the above two, my understanding is that collected metrics will not match the state of the bundle in the cluster, as the bundle's status has only been updated in the Fleet controller's memory.
Is that likely to be an issue? In other words, wouldn't this result in inconsistency that may be confusing to users?
Your understanding is correct. In the meantime, Mario has removed the lines that were updating the status in memory without writing it to the API server.
```
@@ -159,6 +165,7 @@ func (r *BundleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
	}

	updateDisplay(&bundle.Status)
	metrics.BundleCollector.Collect(bundle)
```
Do we want to run this no matter what the result of RetryOnConflict below turns out to be, or would we rather collect metrics only in success cases, as done for bundle deployments?
You're right, we only want to collect in success cases.
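A minimal sketch of that pattern, with `updateStatus` and `collect` as hypothetical stand-ins for the actual status update and metrics calls:

```go
package sketch

import "k8s.io/client-go/util/retry"

// reconcileStatus updates the status with conflict retries and collects
// metrics only when the update succeeded.
func reconcileStatus(updateStatus func() error, collect func()) error {
	if err := retry.RetryOnConflict(retry.DefaultRetry, updateStatus); err != nil {
		// Skip collection: the in-memory status may not match the API server.
		return err
	}
	collect()
	return nil
}
```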
```
@@ -163,6 +165,8 @@ func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
		logger.V(1).Error(err, "Reconcile failed final update to cluster status", "status", cluster.Status)
	}

	metrics.ClusterCollector.Collect(cluster)
```
Same question as for bundles: should this metrics collection run only if RetryOnConflict was successful?
e2e/metrics/cluster_test.go (outdated)
```
	var existingClusters clusters
	err = json.Unmarshal([]byte(clustersOut), &existingClusters)
	Expect(err).ToNot(HaveOccurred())
```
Do we perhaps want to check that `existingClusters` contains at least one cluster, so that the `for` loop below doesn't become a no-op?
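Something like the following Gomega assertion could guard against that; the `Items` field name is an assumption about the local `clusters` type:

```go
// Hypothetical guard: fail early if no clusters were returned, so the loop
// below cannot pass as a no-op.
Expect(existingClusters.Items).ToNot(BeEmpty())
```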
e2e/metrics/cluster_test.go (outdated)
```
	}

	It(
		"should have as many clusters in metrics as there are objects in the cluster",
```
I don't understand the meaning of this statement, or how it maps to the logic below.
The purpose is to make sure every cluster resource that exists at the time of testing has corresponding metrics. I've changed the description accordingly.
Interestingly, thinking about it after having parallelized all the other tests, and even though this one appears to work, it might not have been such a good idea to test every existing cluster resource for the existence of its corresponding metrics. It appears to work coincidentally when the tests run in parallel, but there is no guarantee that it always will. This may not be obvious now, but it will become more obvious in the follow-up PR that adds tests to this file, which will create new cluster resources. New cluster resources can be created and destroyed at any time when the tests are run in parallel.
The issue described above is not crucial for this PR and not even necessary for our CI. As another PR is going to follow, I can fix it there. And as long as we don't actually run those tests in parallel, which we currently don't, it won't be an issue for CI at all. It is just something that, I think, would be good to fix to ensure independence of the test cases. However, I would like to see those tests run in parallel and in shuffled order someday, if only to ensure that they do not depend on each other. But seeing them complete in under 15 seconds is nice, too.
"fleet_cluster_group_resource_count_ready": true, | ||
"fleet_cluster_group_resource_count_unknown": true, | ||
"fleet_cluster_group_resource_count_waitapplied": true, | ||
} |
I think a comment about the state metric, similar to the one in the cluster metrics tests, would make sense here. WDYT?
Yes!
e2e/metrics/exporter.go (outdated)
```
func (l promLabels) String() string {
	r := ""
	for k, v := range l {
		r += fmt.Sprintf("%s=%q, ", k, v)
```
nit: not very important here since we're dealing with test code, but this could be done with a `strings.Builder`.
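A minimal sketch of that variant, assuming `promLabels` is a plain `map[string]string`; trimming the trailing separator is an addition not present in the original snippet:

```go
package metrics

import (
	"fmt"
	"strings"
)

type promLabels map[string]string

// String builds the label string with a strings.Builder instead of repeated
// string concatenation, then trims the trailing ", " separator.
func (l promLabels) String() string {
	var b strings.Builder
	for k, v := range l {
		fmt.Fprintf(&b, "%s=%q, ", k, v)
	}
	return strings.TrimSuffix(b.String(), ", ")
}
```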
https://pkg.go.dev/k8s.io/apimachinery/pkg/labels#FormatLabels could also do this?
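For comparison, assuming the same `map[string]string`-based `promLabels` type, that helper would reduce the method to roughly:

```go
package metrics

import "k8s.io/apimachinery/pkg/labels"

// String uses apimachinery's FormatLabels, which renders the map as
// "k1=v1,k2=v2" (keys sorted, values unquoted), so the output differs slightly
// from the %q formatting in the snippet above.
func (l promLabels) String() string {
	return labels.FormatLabels(l)
}
```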
Thank you both! I first thought to use strings.Builder, mostly because I wanted the result to look exactly like a metric does. Then I realized that my implementation didn't behave exactly as I wanted, and I switched to strings.Join. But I got curious about the performance implications of the different options and found out that "printf" comes with the greatest performance penalty. Using + for small strings does not seem to be a big issue. Still, I kept printf for readability and because it is just test code, as you also said @weyfonk.
This prevents nil pointer errors when deploying Fleet without any `shards` Helm value, in which case a single controller should be, and now will be, deployed.
Amazing! 🎉
Refers to #1408
Expose Prometheus metrics of the fleet-controller for the following controllers with corresponding E2E tests: