-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Record Metrics for Reminder #4831
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bunch of comments, mostly aiming to help simplify your life. 😁
internal/reminder/metrics/metrics.go
Outdated
// For tracking average calculation | ||
// TODO: consider persisting this to avoid reset on restart (maybe) | ||
totalBatches int64 | ||
totalReminders int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than storing these as int64s, why not use an Int64Counter
? That will automatically record the total, but also that the metric is cumulative over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
internal/reminder/metrics/metrics.go
Outdated
// Current number of reminders in the batch | ||
BatchSize metric.Int64Gauge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use an Int64Histogram
for this, which will also give you the Average metric you're tracking next (approximately, but with more-detailed frequency information, which is probably interesting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
internal/reminder/metrics/metrics.go
Outdated
0, // immediate | ||
10, // 10 seconds | ||
20, // 20 seconds | ||
40, // 40 seconds | ||
80, // 1m 20s | ||
160, // 2m 40s | ||
320, // 5m 20s | ||
640, // 10m 40s | ||
1280, // 21m 20s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While powers of 2 are generally handy, I'd tend to align this more with human units, e.g.:
0, 10, 30, 60, 120, 300, 600, 900, 1800, 3600, 7200
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what frequency we'll run reminder
at, but 2 hours seems reasonably pessimistic for now -- it's really annoying when you run into problems and your frequency distribution has always been "it's not so good".
Given that our expected value for latency is 1/2 scan interval, I could even be convinced that < 1 minute is not particularly interesting (which could trim a couple buckets if you're worried about it -- I'm not!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with:
var delayBuckets = []float64{
60, // 1 minute
300, // 5 minutes
600, // 10 minutes
1800, // 30 minutes
3600, // 1 hour
7200, // 2 hours
10800, // 3 hours
18000, // 5 hours
25200, // 7 hours
36000, // 10 hours
}
internal/reminder/metrics/metrics.go
Outdated
// Update running average | ||
m.totalBatches++ | ||
m.totalReminders += size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should aim to get the metrics code to do this for you automatically -- in particular, if we end up with two reminder
processes running, getting a global average from reminder_avg_batch_size
is impossible without knowing the numerator and denominator for each process. Exporting them as two cumulative totals allows aggregation across multiple processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I think this should be doable now for multiple reminder processes as we can aggregate metrics which are being exposed.
internal/reminder/reminder.go
Outdated
remMetrics.SendDelay.Record(ctx, (time.Now().Sub(reminderLastSent.Time) - r.cfg.RecurrenceConfig.MinElapsed).Seconds()) | ||
} else { | ||
// TODO: Should the send delay be zero if the reminder has never been sent? | ||
remMetrics.SendDelay.Record(ctx, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will produce a bunch of +Inf
values in the histogram bucket, which will disturb any sort of Nth percentile.
It's probably better to either export nothing here, or export it under some different counter metric ("new reminders").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exposing a new metric now.
otel.SetMeterProvider(mp) | ||
|
||
// Create metrics | ||
meter := mp.Meter("reminder-service") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meter := mp.Meter("reminder-service") | |
meter := mp.Meter("reminder") |
} | ||
|
||
// Shutdown gracefully shuts down the metrics server | ||
func (p *Provider) Shutdown(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make this method private, then cancelling the context to Start
or NewProvider
becomes the only way to reach this code, which simplifies your lifecycle over having multiple callers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(among other things, you can guarantee that p
is not nil
, because you don't call the shutdown
method unless p
is non-nil.)
func (p *Provider) Metrics() *Metrics { | ||
return p.metrics | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main suggestion is to remove the Provider
object in favor of a static setup function with cancellation, but if you do keep it, I'd suggesting making Metrics
exported, rather than needing to add an accessor function.
internal/reminder/reminder.go
Outdated
@@ -143,6 +163,11 @@ func (r *reminder) sendReminders(ctx context.Context) error { | |||
return fmt.Errorf("error creating reminder messages: %w", err) | |||
} | |||
|
|||
remMetrics := r.metricsProvider.Metrics() | |||
if remMetrics != nil { | |||
remMetrics.RecordBatch(ctx, int64(len(repos))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make batches a histogram, this can just be a Record()
on that histogram.
internal/reminder/metrics/metrics.go
Outdated
func (m *Metrics) RecordSendDelay(ctx context.Context, delaySeconds float64) { | ||
m.SendDelay.Record(ctx, delaySeconds) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this function isn't doing anything -- you're already exposing SendDelay
, and there's not much semantic benefit to foo.RecordSendDelay(...)
over foo.SendDelay.Record(...)
. That also saves you from needing to write a comment.
If you transform BatchSize
into a histogram as well, then you might be able to get rid of all the methods on Metrics
, which would be nice.
This PR needs additional information before we can continue. It is now marked as stale because it has been open for 30 days with no activity. Please provide the necessary details to continue or it will be closed in 30 days. |
Opening this, I'm working on this. |
20f38c1
to
b92b0de
Compare
b92b0de
to
ccde7be
Compare
Signed-off-by: Vyom Yadav <jackhammervyom@gmail.com>
ccde7be
to
6dc6aa3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@evankanderson Because we reused eventer code in a6d9b22, which created pubWithMetrics
using minder
namespace, there are some non-reminder metrics being outputted by reminder (e.g. minder_eventer_publish_time_seconds_bucket
).
I think this can cause problems, but I assume them to use the same infra, so not sure about how big of a problem that is. (Is this something we want to tackle in this PR?)
Also updated the code to remove provider and have a fire and forget setupMetricServer
function
if r.metrics != nil { | ||
// sendDelay = Now() - ReminderLastSent - MinElapsed | ||
reminderLastSent := repo.ReminderLastSent | ||
if reminderLastSent.Valid { | ||
r.metrics.SendDelay.Record(ctx, (time.Since(reminderLastSent.Time) - r.cfg.RecurrenceConfig.MinElapsed).Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure about this metric. Let's say a repo has been frequently updated every hour (due to GH webhook events) and r.cfg.RecurrenceConfig.MinElapsed
is 2 hours. In this case, let's say after sending a reminder 18 hours back, every hour the repo was updated by GH webhook and wasn't eligible for a reminder. Now when it becomes eligible for the reminder SendDelay
would be recorded as 18h
which seems to be very wrong in this case. (I think we discussed this previously too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this feels like it should be time.Since(<last evaluation>)
here. In fact, that feels like it should be at the top of the "if" block replacing the comment on line 177.
// Recording metric for first time reminders (after how long was the initial reminder sent) | ||
newSendDelay := time.Since(repoToLastUpdated[repo.ID]) - r.cfg.RecurrenceConfig.MinElapsed | ||
r.metrics.NewSendDelay.Record(ctx, newSendDelay.Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new metric for first time reminders. (I'm thinking this is the definition of sendDelay
and what we have above is wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few notes -- I think you're right about determining the metrics for "sendDelay".
If you want to be particularly cute, you could write:
sendDelay := time.Since(...) - ...MinElapsed
recorder := r.metrics.SendDelay
if !reminderLastSent.Valid {
// Use a different metric for first-time reminders
recorder = r.metrics.NewSendDelay
}
recorder.Record(ctx, sendDelay.Seconds())
totalBatches, err := meter.Int64Counter( | ||
"total_batches", | ||
metric.WithDescription("Total number of batches processed"), | ||
) | ||
if err != nil { | ||
return nil, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have batch_size
as a histogram, the +Inf
bucket will show all batches that have been sent, since it's storing a CDF (cumulative distribution function), IIRC. (If it's actually storing a histogram, summing the buckets will give the same result).
Note that total reminders can't be directly calculated from the histogram, because the buckets cover a range of values, so that metric is still independently useful.
logger.Err(err).Msg("error shutting down metrics provider") | ||
} | ||
|
||
return server.Shutdown(shutdownCtx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shutdown sequence is a little weird, because ListenAndServe
is potentially returning before this point, which seems unusual: https://pkg.go.dev/net/http#Server.Shutdown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe ctx.Done()
and r.stop
should call server.Shutdown
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's especially weird since you don't actually wait for startMetricsServer
to exit in the main thread. (Which seems okay -- if you cut off a metrics HTTP connection in the middle, no real harm done.)
}() | ||
|
||
var err error | ||
r.metrics, err = metrics.NewMetrics(otel.Meter("reminder")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like startMetricsServer
should maybe return a metrics.Metrics
, but I don't feel strongly.
if r.metrics != nil { | ||
// sendDelay = Now() - ReminderLastSent - MinElapsed | ||
reminderLastSent := repo.ReminderLastSent | ||
if reminderLastSent.Valid { | ||
r.metrics.SendDelay.Record(ctx, (time.Since(reminderLastSent.Time) - r.cfg.RecurrenceConfig.MinElapsed).Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this feels like it should be time.Since(<last evaluation>)
here. In fact, that feels like it should be at the top of the "if" block replacing the comment on line 177.
} | ||
} | ||
|
||
return eligibleRepos, nil | ||
return eligibleRepos, eligibleReposLastUpdated, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to copy idToLastUpdate
into a new map, or could you just return idToLastUpdate
?
Summary
Provide a brief overview of the changes and the issue being addressed.
Explain the rationale and any background necessary for understanding the changes.
List dependencies required by this change, if any.
Fixes #(related issue)
Change Type
Mark the type of change your PR introduces:
Testing
Outline how the changes were tested, including steps to reproduce and any relevant configurations.
Attach screenshots if helpful.
Review Checklist: