Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record Metrics for Reminder #4831

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Vyom-Yadav
Copy link
Member

@Vyom-Yadav Vyom-Yadav commented Oct 27, 2024

Summary

Provide a brief overview of the changes and the issue being addressed.
Explain the rationale and any background necessary for understanding the changes.
List dependencies required by this change, if any.

Fixes #(related issue)

Change Type

Mark the type of change your PR introduces:

  • Bug fix (resolves an issue without affecting existing features)
  • Feature (adds new functionality without breaking changes)
  • Breaking change (may impact existing functionalities or require documentation updates)
  • Documentation (updates or additions to documentation)
  • Refactoring or test improvements (no bug fixes or new functionality)

Testing

Outline how the changes were tested, including steps to reproduce and any relevant configurations.
Attach screenshots if helpful.

Review Checklist:

  • Reviewed my own code for quality and clarity.
  • Added comments to complex or tricky code sections.
  • Updated any affected documentation.
  • Included tests that validate the fix or feature.
  • Checked that related changes are merged.

Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of comments, mostly aiming to help simplify your life. 😁

Comment on lines 31 to 34
// For tracking average calculation
// TODO: consider persisting this to avoid reset on restart (maybe)
totalBatches int64
totalReminders int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than storing these as int64s, why not use an Int64Counter? That will automatically record the total, but also that the metric is cumulative over time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 25 to 26
// Current number of reminders in the batch
BatchSize metric.Int64Gauge
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use an Int64Histogram for this, which will also give you the Average metric you're tracking next (approximately, but with more-detailed frequency information, which is probably interesting).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 10 to 18
0, // immediate
10, // 10 seconds
20, // 20 seconds
40, // 40 seconds
80, // 1m 20s
160, // 2m 40s
320, // 5m 20s
640, // 10m 40s
1280, // 21m 20s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While powers of 2 are generally handy, I'd tend to align this more with human units, e.g.:

0, 10, 30, 60, 120, 300, 600, 900, 1800, 3600, 7200

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what frequency we'll run reminder at, but 2 hours seems reasonably pessimistic for now -- it's really annoying when you run into problems and your frequency distribution has always been "it's not so good".

Given that our expected value for latency is 1/2 scan interval, I could even be convinced that < 1 minute is not particularly interesting (which could trim a couple buckets if you're worried about it -- I'm not!).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with:

var delayBuckets = []float64{
	60,    // 1 minute
	300,   // 5 minutes
	600,   // 10 minutes
	1800,  // 30 minutes
	3600,  // 1 hour
	7200,  // 2 hours
	10800, // 3 hours
	18000, // 5 hours
	25200, // 7 hours
	36000, // 10 hours
}

Comment on lines 75 to 77
// Update running average
m.totalBatches++
m.totalReminders += size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should aim to get the metrics code to do this for you automatically -- in particular, if we end up with two reminder processes running, getting a global average from reminder_avg_batch_size is impossible without knowing the numerator and denominator for each process. Exporting them as two cumulative totals allows aggregation across multiple processes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I think this should be doable now for multiple reminder processes as we can aggregate metrics which are being exposed.

remMetrics.SendDelay.Record(ctx, (time.Now().Sub(reminderLastSent.Time) - r.cfg.RecurrenceConfig.MinElapsed).Seconds())
} else {
// TODO: Should the send delay be zero if the reminder has never been sent?
remMetrics.SendDelay.Record(ctx, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will produce a bunch of +Inf values in the histogram bucket, which will disturb any sort of Nth percentile.

It's probably better to either export nothing here, or export it under some different counter metric ("new reminders").

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing a new metric now.

otel.SetMeterProvider(mp)

// Create metrics
meter := mp.Meter("reminder-service")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
meter := mp.Meter("reminder-service")
meter := mp.Meter("reminder")

}

// Shutdown gracefully shuts down the metrics server
func (p *Provider) Shutdown(ctx context.Context) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make this method private, then cancelling the context to Start or NewProvider becomes the only way to reach this code, which simplifies your lifecycle over having multiple callers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(among other things, you can guarantee that p is not nil, because you don't call the shutdown method unless p is non-nil.)

Comment on lines 130 to 132
func (p *Provider) Metrics() *Metrics {
return p.metrics
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main suggestion is to remove the Provider object in favor of a static setup function with cancellation, but if you do keep it, I'd suggesting making Metrics exported, rather than needing to add an accessor function.

@@ -143,6 +163,11 @@ func (r *reminder) sendReminders(ctx context.Context) error {
return fmt.Errorf("error creating reminder messages: %w", err)
}

remMetrics := r.metricsProvider.Metrics()
if remMetrics != nil {
remMetrics.RecordBatch(ctx, int64(len(repos)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make batches a histogram, this can just be a Record() on that histogram.

Comment on lines 82 to 105
func (m *Metrics) RecordSendDelay(ctx context.Context, delaySeconds float64) {
m.SendDelay.Record(ctx, delaySeconds)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this function isn't doing anything -- you're already exposing SendDelay, and there's not much semantic benefit to foo.RecordSendDelay(...) over foo.SendDelay.Record(...). That also saves you from needing to write a comment.

If you transform BatchSize into a histogram as well, then you might be able to get rid of all the methods on Metrics, which would be nice.

Copy link
Contributor

This PR needs additional information before we can continue. It is now marked as stale because it has been open for 30 days with no activity. Please provide the necessary details to continue or it will be closed in 30 days.

@github-actions github-actions bot added the Stale label Nov 28, 2024
@github-actions github-actions bot closed this Dec 29, 2024
@Vyom-Yadav
Copy link
Member Author

Opening this, I'm working on this.

@Vyom-Yadav Vyom-Yadav reopened this Jan 3, 2025
@Vyom-Yadav Vyom-Yadav removed the Stale label Jan 3, 2025
@Vyom-Yadav Vyom-Yadav force-pushed the add-metrics-to-reminder branch from 20f38c1 to b92b0de Compare January 3, 2025 12:27
@coveralls
Copy link

coveralls commented Jan 3, 2025

Coverage Status

coverage: 54.92% (-0.2%) from 55.145%
when pulling 6dc6aa3 on Vyom-Yadav:add-metrics-to-reminder
into a6d9b22 on mindersec:main.

@Vyom-Yadav Vyom-Yadav force-pushed the add-metrics-to-reminder branch from b92b0de to ccde7be Compare January 4, 2025 17:21
Signed-off-by: Vyom Yadav <jackhammervyom@gmail.com>
@Vyom-Yadav Vyom-Yadav force-pushed the add-metrics-to-reminder branch from ccde7be to 6dc6aa3 Compare January 4, 2025 17:24
Copy link
Member Author

@Vyom-Yadav Vyom-Yadav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evankanderson Because we reused eventer code in a6d9b22, which created pubWithMetrics using minder namespace, there are some non-reminder metrics being outputted by reminder (e.g. minder_eventer_publish_time_seconds_bucket).

I think this can cause problems, but I assume them to use the same infra, so not sure about how big of a problem that is. (Is this something we want to tackle in this PR?)


Also updated the code to remove provider and have a fire and forget setupMetricServer function

Comment on lines +176 to +180
if r.metrics != nil {
// sendDelay = Now() - ReminderLastSent - MinElapsed
reminderLastSent := repo.ReminderLastSent
if reminderLastSent.Valid {
r.metrics.SendDelay.Record(ctx, (time.Since(reminderLastSent.Time) - r.cfg.RecurrenceConfig.MinElapsed).Seconds())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure about this metric. Let's say a repo has been frequently updated every hour (due to GH webhook events) and r.cfg.RecurrenceConfig.MinElapsed is 2 hours. In this case, let's say after sending a reminder 18 hours back, every hour the repo was updated by GH webhook and wasn't eligible for a reminder. Now when it becomes eligible for the reminder SendDelay would be recorded as 18h which seems to be very wrong in this case. (I think we discussed this previously too)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this feels like it should be time.Since(<last evaluation>) here. In fact, that feels like it should be at the top of the "if" block replacing the comment on line 177.

Comment on lines +182 to +184
// Recording metric for first time reminders (after how long was the initial reminder sent)
newSendDelay := time.Since(repoToLastUpdated[repo.ID]) - r.cfg.RecurrenceConfig.MinElapsed
r.metrics.NewSendDelay.Record(ctx, newSendDelay.Seconds())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new metric for first time reminders. (I'm thinking this is the definition of sendDelay and what we have above is wrong)

@Vyom-Yadav Vyom-Yadav marked this pull request as ready for review January 4, 2025 17:32
@Vyom-Yadav Vyom-Yadav requested a review from a team as a code owner January 4, 2025 17:32
Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes -- I think you're right about determining the metrics for "sendDelay".

If you want to be particularly cute, you could write:

sendDelay := time.Since(...) - ...MinElapsed
recorder := r.metrics.SendDelay
if !reminderLastSent.Valid {
  // Use a different metric for first-time reminders
  recorder = r.metrics.NewSendDelay
}
recorder.Record(ctx, sendDelay.Seconds())

Comment on lines +75 to +81
totalBatches, err := meter.Int64Counter(
"total_batches",
metric.WithDescription("Total number of batches processed"),
)
if err != nil {
return nil, err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have batch_size as a histogram, the +Inf bucket will show all batches that have been sent, since it's storing a CDF (cumulative distribution function), IIRC. (If it's actually storing a histogram, summing the buckets will give the same result).

Note that total reminders can't be directly calculated from the histogram, because the buckets cover a range of values, so that metric is still independently useful.

logger.Err(err).Msg("error shutting down metrics provider")
}

return server.Shutdown(shutdownCtx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shutdown sequence is a little weird, because ListenAndServe is potentially returning before this point, which seems unusual: https://pkg.go.dev/net/http#Server.Shutdown

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ctx.Done() and r.stop should call server.Shutdown?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's especially weird since you don't actually wait for startMetricsServer to exit in the main thread. (Which seems okay -- if you cut off a metrics HTTP connection in the middle, no real harm done.)

}()

var err error
r.metrics, err = metrics.NewMetrics(otel.Meter("reminder"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like startMetricsServer should maybe return a metrics.Metrics, but I don't feel strongly.

Comment on lines +176 to +180
if r.metrics != nil {
// sendDelay = Now() - ReminderLastSent - MinElapsed
reminderLastSent := repo.ReminderLastSent
if reminderLastSent.Valid {
r.metrics.SendDelay.Record(ctx, (time.Since(reminderLastSent.Time) - r.cfg.RecurrenceConfig.MinElapsed).Seconds())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this feels like it should be time.Since(<last evaluation>) here. In fact, that feels like it should be at the top of the "if" block replacing the comment on line 177.

}
}

return eligibleRepos, nil
return eligibleRepos, eligibleReposLastUpdated, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to copy idToLastUpdate into a new map, or could you just return idToLastUpdate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants