-
Notifications
You must be signed in to change notification settings - Fork 900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
truncate realtime metrics instead of purging #17017
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think what you have here is a really sweet performance improvement, but with that speed, it seems like it is a more risky approach.
I am fine with that risk, given that we document/test things heavily, which I think is the one thing that I think this PR is heavily lacking. Plenty of proof that it is faster (and honestly, it just makes sense that it is), but almost no info on how the change was implemented.
Regarding the description here:
The metrics table is separated into sub tables: metrics_00 - metrics_23, one table per hour.
You should add some context that realtime metrics are not kept around for more than a day (with this change, no more than 12 hours, ideally), and that it is by design. I had to kind of infer this from the BZ and some of the code and table layout: for example, I learned from this PR, though not directly, that the sub Metric
tables are divided by hours of the day. This is in no way obvious to me, and I wouldn't expect that to be obvious to anyone, which made things like Metric.truncate
seemed to come out of no where, and seems super scary to do unless that context is known.
Maybe this exists somewhere in some documentation somewhere, but I was never made aware of it, and I work here!
app/models/metric.rb
Outdated
def self.with_table_for_time(hour = Time.now.utc.hour) | ||
Thread.current[:metric_table] = reindex_table_name(hour) | ||
yield | ||
self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you are returning self
here? Doesn't seem to be used anywhere, as far as I can tell.
app/models/metric.rb
Outdated
end | ||
|
||
# reindex_table_name # tablename for "now" | ||
# reindex_Table_name(4) # tablename for hour number 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments barely make sense to me, and only after I understood everything that was going on around them could I figure out what the purpose of this was.
Can these be written yard
format or something more standard?
app/models/metric.rb
Outdated
end | ||
|
||
def self.with_table_for_time(hour = Time.now.utc.hour) | ||
Thread.current[:metric_table] = reindex_table_name(hour) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't give a better suggestion here, but with this affecting whether either one or "all" of the tables gets truncated, this seems super scary to me doing this in a Thread
var (and this is coming from someone who used thread variables as well to do stuff in ManageIQPerformance
).
Just for slight piece of mind, can we make this variable name more verbose, something like :targeted_metric_subtable_name
, just it avoids the possibility of it being collided by some other code messing with Thread
variables and possibly re-using that indentifier.
app/models/metric/purging.rb
Outdated
@@ -61,6 +61,8 @@ def self.purge_count(older_than, interval) | |||
purge_scope(older_than, interval).count | |||
end | |||
|
|||
# TODO: delete these using the same created_at queries by date | |||
# rather than by joining to metrics - then hourly could be truncated vs deleted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having more comments around the truncation
strategy and workflow would allow this comment to reference those and be more meaningful.
spec/models/metric/purging_spec.rb
Outdated
@@ -82,4 +82,24 @@ | |||
end | |||
end | |||
end | |||
|
|||
context "#purge_realtime" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With how risky I see this change being, I feel like having more than 1 test here would be a good idea. Didn't look myself, but based on seeing you are creating this context for "#purge_realtime"
yourself, I am a bit discouraged that we didn't have tests around this in the first place 😞 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible edge cases that come to mind:
- Daylight savings time issues
- Different source times zones
- New years
Not saying that what you have here has bugs, but extra tests couldn't hurt with confidence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Along these same lines, would it make sense to write these tests in a separate PR so that they work regardless of the underlying implementation, merge them in, and then make sure this PR still passes with them in place?
With a BZ attached to this PR, I don't know how much pressure is on to take the time to do that though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
app/models/metric/purging.rb
Outdated
end | ||
end | ||
|
||
def self.determine_target_hours(older_than, end_date = Time.now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the default value for end_date
should match how you are invoking it above with Time.now.utc
instead of just Time.now
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly just omit the above argument, and default end_date
to Time.now.utc
?
app/models/metric/purging.rb
Outdated
end | ||
|
||
def self.determine_target_hours(older_than, end_date = Time.now) | ||
return [] if (end_date - older_than) > 24.hours |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean we could possibly never run metrics purging if the purge timer is greater than 24.hours
? Not here, but maybe we should fall back to the old way of doing things if the truncation strategy would be acting on no tables, just incase someone mis-configures things.
Otherwise, I don't think we have a safe guard or way of warning a user that this setting should remain under 12.hours
/24.hours
(haven't done the math on your claim in the PR description 😉 ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exactly what I was concerned about it my comments in gitter. I am almost thinking we should completely prevent setting a time greater than or equal to 24 hours in config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that in the lib/vmdb/config/validator.rb
by adding a new method, VMDB::Config::Validator#performance
, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a safeguard in here. but it seemed out of place to know about the interval and the duration.
I think it is probably: duration <= 22.hours ; duration + interval + 1.hour < 24.hours
- but it just takes longer than a day to purge each hour. So I said 12.hours to be safe.
app/models/metric/purging.rb
Outdated
return [] if (end_date - older_than) > 24.hours | ||
|
||
start_hour = older_than.utc.hour | ||
end_hour = end_date.utc.hour |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I could really see this being botched up during daylight savings...
app/models/metric/purging.rb
Outdated
end | ||
|
||
def self.truncate_child_tables(older_than) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the below method should really be documented I think. There is a decent amount of domain knowledge that comes with them to understand how they work fully (for example, the mod division logic you do to figure out the good hours for an older_than
and end_date
that might be between two different days). I would also be surprised if most developers even knew that the metrics table was split into 24 parts on hour (again, I didn't until today... and only new that it was sharded).
Anecdotal "me too": I was aware we had some table partitioning, and (unrelatedly) got curious a couple weeks ago how it works and started digging in the code, together with @yaacov (plus googling how partitioning is normally done in rails). => We guessed from the table names |
@cben Sorry you spent so much time...I could have told you it all because I wrote it. I don't mind writing it yup but where is the best place to help something like this. I considered guides/architecture but I'm curious if you had tried to look there at all in the first place. If not, then putting it there wouldn't have helped you. |
@cben Maybe just a huge code comment on the Metric model? |
No, I didn't even think of looking at guides.
Yes, first place I looked is Metric and MetricRollup models.
|
@Fryguy For the record, my comments weren't meant to be a "boo, this was designed bad" when I was talking about the table partitioning being not obvious. It was more a comment on how the new implementation of purging now very much relies on knowing the database table partitioning strategy, where before, it was not necessary (which arguably is kind of impressive).
I would give a +1 to this idea, and also suggest referencing it when talking adding other comments around the partitioning strategy, since the comments around the purging cod would be in a different file ( |
@NickLaMuro No worries... I didn't hear it that way at all. I more felt bad that time was spent trying to discover something I already knew and could have documented, but didn't, and that time was "wasted" in my mind which I feel bad for. But I can fix it :) |
@NickLaMuro would you feel more comfortable with Even if we avoided the |
121c8b2
to
f3d82ad
Compare
put in a bunch of tests. that pass before/after. I will admit to a little jury rigging with the tests. example:
When the code runs 21 minutes later (or a few hours later), the hour 19 bucket will be truncated. I feel that while this is technically different, it stays within the intent of the code. Also, if the purging code takes more than 21 minutes (current code being that case), it could potentially be truncated sooner than the existing code. |
Opened #17034 to document as discussed above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from @NickLaMuro @jrafanie and @gtanzillo
app/models/metric/purging.rb
Outdated
return if target_hours.blank? | ||
|
||
target_hours.each do |hour| | ||
Metric.with_table_for_time(hour) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per GT: please use
ActiveRecord::Base.connection.execute("truncate table #{table_name}"
app/models/metric.rb
Outdated
@@ -3,7 +3,22 @@ class Metric < ApplicationRecord | |||
|
|||
include Metric::Common | |||
|
|||
def self.reindex_table_name | |||
"metrics_#{Time.now.utc.hour + 1}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we lost the +1
here
this may effect reindexing which actually wants to reindex the next hour
app/models/metric/purging.rb
Outdated
end | ||
end | ||
|
||
def self.determine_target_hours(older_than, end_date) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
describe edge cases and why it is written
# Used for realtime metrics. | ||
def self.truncate_child_tables(older_than) | ||
target_hours = determine_target_hours(older_than, Time.now.utc) | ||
return if target_hours.blank? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add validations for these settings
if this is blank, let the user know there is a problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify: lib/vmdb/config/validator.rb
is what we can do to check this (ignore the line that this comment is on)
I opened a PR for some |
Regarding metrics: our symptom was disk usage and swapping on our database appliance. Essentially we started swapping hard (~90-100%) even though we had massive amounts of free + buff/cache memory available on the system. After ~ 14 hours, we have reduced our swap usage on the appliance to less than 1%.
By increasing The combination of these two PRs will make a large positive impact in overall application performance. |
d05f158
to
f14ab4c
Compare
f14ab4c
to
e6ce2f6
Compare
if we want to be more cautious we can patch this in + opt :truncate, "Truncate child tables for speed"
- Metric::Purging.purge(dates[interval], interval, opts[:window], opts[:limit]) do |count, _|
- pbar.progress += count
+ if interval == "realtime" && opts[:truncate]
+ Metric::Purging.truncate_child_tables(dates[interval])
+ else
+ Metric::Purging.purge(dates[interval], interval, opts[:window], opts[:limit]) do |count, _|
+ pbar.progress += count
+ end |
kicking |
This pull request is not mergeable. Please rebase and repush. |
e6ce2f6
to
424105e
Compare
Looks good as we discussed. Kickstarted travis. Hopefully it doesn't timeout again. |
@kbrock This seems to still be part of this PR, does it make sense to just remove that bullet from the PR description? Or did you still think it made sense to have that separate? |
@NickLaMuro I extracted that commit into a PR because I was thinking that we may want to backport just that functionality. It fixes a bug. It also seems easier to merge. |
I get this, but let me rephrase based things that have changed since my original comment: This PR and #17051 BOTH have the Seeing that #17051 has now been merged, I think you need to rebase the branch once again with a fresh copy of master, and then we are good. Also, you can check off the box in the PR description. I will leave that to you. |
yeah, rebasing should fix that 👍 |
The metrics table is separated into sub tables: metrics_00 - metrics_23, one table per hour. Deleting from the parent table delegating the deletes to the child tables. The batching for these deletes sometimes conflicts with the delegation and causes the process to run slowly. The purging process is aware of the delegation. It truncates the hourly table not associated with dates in the retain window. This removes the need for batching. Also, truncate is a more streamlined procedure. As long as this runs often, this will properly purge the metrics. * rule of thumb: (interval + retain) < 12.hours. * May work with total higher than 12 hours, but it is not advised * Default retain period is to keep 4 hours worth of data * Default purge interval is to run every 21 minutes. - https://bugzilla.redhat.com/show_bug.cgi?id=1537733
Checked commit kbrock@a027c9c with ruby 2.3.3, rubocop 0.52.0, haml-lint 0.20.0, and yamllint 1.10.0 |
return if target_hours.blank? | ||
|
||
target_hours.each do |hour| | ||
Metric.connection.truncate(Metric.reindex_table_name(hour), "Metric Truncate table #{hour}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this. 👍
I think a developer who didn't happen to read the comments might see this now and say "huh, what is reindex_table_name
", and hopefully would be able to grok what is happening with this before trying to make a change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thnx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am cool with this. I think the tests will be a great initial indicator that things are working as expected, and the code should definitely be an improvement from a performance end.
Thanks for the work on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great!! 👍
truncate realtime metrics instead of purging (cherry picked from commit 7d4db70)
truncate realtime metrics instead of purging (cherry picked from commit 7d4db70) https://bugzilla.redhat.com/show_bug.cgi?id=1553472
Gaprindashvili backport details:
|
Backported to Fine via #17124 |
extracted:
reindex_table_name
Background
The
metrics
table holds realtime metrics and is separated into sub tables:metrics_00
-metrics_23
, one table per hour. A trigger onmetrics
will put into the corresponding table based upon thetimestamp
column.Every 21 minutes, the scheduler calls
Metric::Purging#purge_realtime_timer
to delete all metrics older than 4 hours old. So only 5 tables should be populated at a time (note: rounding of time makes it 5 tables not 4).The purging is taking too long, and
metrics
is growing. This causes the query to run even slower.Before
Deleting from the parent table,
metrics
, delegating the deletes to the child tables (e.g.:metrics_23
). We delete in batches of 1,000 by default.The batching for these deletes doesn't quite work with the multi table abstraction and causes ~23x23 queries per delete.
For each run, we run
# records / 1,000
of the above query.After
The purging process is aware of the delegation. It truncates the hourly table not
associated with dates in the retain window. This removes the need for batching.
Also, truncate is a more streamlined procedure.
Constant number of queries.
The purging will not pickup all older records, only those with hour numbers more than 4 hours ago. We run this every 21 minutes, so this is not a problem.
Tweaking the interval and purge window:
If you run this every night at midnight, it will never purge hours 18-24. So those tables will grow and never be purged. This will need to be run at least once from 4am - 8pm for the purging to work.
The safe rule of thumb is:
interval + window < 12 hours
. The current defaults21.minutes + 4.hours < 12 hours
is well within these limits. Increasing interval to1.hour
may even be betterNumbers
Metric.count = 84,110, batch size = 1,000
comments
Extrapolated numbers
For the db with an issue, there are many more records. metrics_21 alone has 6,184,590. All metrics is 32,778,292 records.
Since each query takes ~10 minutes - and it would need to perform ~330 of them, it will take 2 1/4 days to delete a days worth of metrics. We'd never get out.
The new process takes closer to 1 minute on an appliance