Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

truncate realtime metrics instead of purging #17017

Merged
merged 1 commit into from
Mar 5, 2018

Conversation

kbrock
Copy link
Member

@kbrock kbrock commented Feb 17, 2018

extracted:

Background

The metrics table holds realtime metrics and is separated into sub tables: metrics_00 - metrics_23, one table per hour. A trigger on metrics will put into the corresponding table based upon the timestamp column.

Every 21 minutes, the scheduler calls Metric::Purging#purge_realtime_timer to delete all metrics older than 4 hours old. So only 5 tables should be populated at a time (note: rounding of time makes it 5 tables not 4).

The purging is taking too long, and metrics is growing. This causes the query to run even slower.

Before

Deleting from the parent table, metrics, delegating the deletes to the child tables (e.g.: metrics_23). We delete in batches of 1,000 by default.
The batching for these deletes doesn't quite work with the multi table abstraction and causes ~23x23 queries per delete.

For each run, we run # records / 1,000 of the above query.

After

The purging process is aware of the delegation. It truncates the hourly table not
associated with dates in the retain window. This removes the need for batching.
Also, truncate is a more streamlined procedure.

Constant number of queries.

The purging will not pickup all older records, only those with hour numbers more than 4 hours ago. We run this every 21 minutes, so this is not a problem.

Tweaking the interval and purge window:

If you run this every night at midnight, it will never purge hours 18-24. So those tables will grow and never be purged. This will need to be run at least once from 4am - 8pm for the purging to work.

The safe rule of thumb is: interval + window < 12 hours. The current defaults 21.minutes + 4.hours < 12 hours is well within these limits. Increasing interval to 1.hour may even be better

Numbers

Metric.count = 84,110, batch size = 1,000

ms bytes objects query qry ms rows comments
30,612.2 4,923,487 412,321 85 30,442.2 (85) before
173.4 585,229 5,702 23 130.9 (23) after

Extrapolated numbers

For the db with an issue, there are many more records. metrics_21 alone has 6,184,590. All metrics is 32,778,292 records.
Since each query takes ~10 minutes - and it would need to perform ~330 of them, it will take 2 1/4 days to delete a days worth of metrics. We'd never get out.

The new process takes closer to 1 minute on an appliance

Copy link
Member

@NickLaMuro NickLaMuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think what you have here is a really sweet performance improvement, but with that speed, it seems like it is a more risky approach.

I am fine with that risk, given that we document/test things heavily, which I think is the one thing that I think this PR is heavily lacking. Plenty of proof that it is faster (and honestly, it just makes sense that it is), but almost no info on how the change was implemented.


Regarding the description here:

The metrics table is separated into sub tables: metrics_00 - metrics_23, one table per hour.

You should add some context that realtime metrics are not kept around for more than a day (with this change, no more than 12 hours, ideally), and that it is by design. I had to kind of infer this from the BZ and some of the code and table layout: for example, I learned from this PR, though not directly, that the sub Metric tables are divided by hours of the day. This is in no way obvious to me, and I wouldn't expect that to be obvious to anyone, which made things like Metric.truncate seemed to come out of no where, and seems super scary to do unless that context is known.

Maybe this exists somewhere in some documentation somewhere, but I was never made aware of it, and I work here!

def self.with_table_for_time(hour = Time.now.utc.hour)
Thread.current[:metric_table] = reindex_table_name(hour)
yield
self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you are returning self here? Doesn't seem to be used anywhere, as far as I can tell.

end

# reindex_table_name # tablename for "now"
# reindex_Table_name(4) # tablename for hour number 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments barely make sense to me, and only after I understood everything that was going on around them could I figure out what the purpose of this was.

Can these be written yard format or something more standard?

end

def self.with_table_for_time(hour = Time.now.utc.hour)
Thread.current[:metric_table] = reindex_table_name(hour)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't give a better suggestion here, but with this affecting whether either one or "all" of the tables gets truncated, this seems super scary to me doing this in a Thread var (and this is coming from someone who used thread variables as well to do stuff in ManageIQPerformance).

Just for slight piece of mind, can we make this variable name more verbose, something like :targeted_metric_subtable_name, just it avoids the possibility of it being collided by some other code messing with Thread variables and possibly re-using that indentifier.

@@ -61,6 +61,8 @@ def self.purge_count(older_than, interval)
purge_scope(older_than, interval).count
end

# TODO: delete these using the same created_at queries by date
# rather than by joining to metrics - then hourly could be truncated vs deleted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having more comments around the truncation strategy and workflow would allow this comment to reference those and be more meaningful.

@@ -82,4 +82,24 @@
end
end
end

context "#purge_realtime" do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With how risky I see this change being, I feel like having more than 1 test here would be a good idea. Didn't look myself, but based on seeing you are creating this context for "#purge_realtime" yourself, I am a bit discouraged that we didn't have tests around this in the first place 😞 .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible edge cases that come to mind:

  • Daylight savings time issues
  • Different source times zones
  • New years

Not saying that what you have here has bugs, but extra tests couldn't hurt with confidence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along these same lines, would it make sense to write these tests in a separate PR so that they work regardless of the underlying implementation, merge them in, and then make sure this PR still passes with them in place?

With a BZ attached to this PR, I don't know how much pressure is on to take the time to do that though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

end
end

def self.determine_target_hours(older_than, end_date = Time.now)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the default value for end_date should match how you are invoking it above with Time.now.utc instead of just Time.now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly just omit the above argument, and default end_date to Time.now.utc?

end

def self.determine_target_hours(older_than, end_date = Time.now)
return [] if (end_date - older_than) > 24.hours
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we could possibly never run metrics purging if the purge timer is greater than 24.hours? Not here, but maybe we should fall back to the old way of doing things if the truncation strategy would be acting on no tables, just incase someone mis-configures things.

Otherwise, I don't think we have a safe guard or way of warning a user that this setting should remain under 12.hours/24.hours (haven't done the math on your claim in the PR description 😉 ).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly what I was concerned about it my comments in gitter. I am almost thinking we should completely prevent setting a time greater than or equal to 24 hours in config.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that in the lib/vmdb/config/validator.rb by adding a new method, VMDB::Config::Validator#performance, correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a safeguard in here. but it seemed out of place to know about the interval and the duration.

I think it is probably: duration <= 22.hours ; duration + interval + 1.hour < 24.hours - but it just takes longer than a day to purge each hour. So I said 12.hours to be safe.

return [] if (end_date - older_than) > 24.hours

start_hour = older_than.utc.hour
end_hour = end_date.utc.hour
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I could really see this being botched up during daylight savings...

end

def self.truncate_child_tables(older_than)
Copy link
Member

@NickLaMuro NickLaMuro Feb 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the below method should really be documented I think. There is a decent amount of domain knowledge that comes with them to understand how they work fully (for example, the mod division logic you do to figure out the good hours for an older_than and end_date that might be between two different days). I would also be surprised if most developers even knew that the metrics table was split into 24 parts on hour (again, I didn't until today... and only new that it was sharded).

@cben
Copy link
Contributor

cben commented Feb 19, 2018

.. I had to kind of infer this from the BZ and some of the code and table layout: for example, I learned from this PR, though not directly, that the sub Metric tables are divided by hours of the day. This is in no way obvious to me, and I wouldn't expect that to be obvious to anyone ..

Anecdotal "me too": I was aware we had some table partitioning, and (unrelatedly) got curious a couple weeks ago how it works and started digging in the code, together with @yaacov (plus googling how partitioning is normally done in rails). => We guessed from the table names metrics_00..23 is by hours, and metric_rollups_01..12 is by month, and found one function computing a table name with number, but failed to find the magic that ties parent model to subtables, and after about half a hour gave up with a shrug...
(Not to mention the why questions... why is this partitioning beneficial enough to justify the ugliness, and why it's not an optimization that DBs can do transparently. I don't doubt that it was done for good reasons, just saying this kind of background can be priceless to newcomers.)

@Fryguy
Copy link
Member

Fryguy commented Feb 19, 2018

@cben Sorry you spent so much time...I could have told you it all because I wrote it. I don't mind writing it yup but where is the best place to help something like this. I considered guides/architecture but I'm curious if you had tried to look there at all in the first place. If not, then putting it there wouldn't have helped you.

@Fryguy
Copy link
Member

Fryguy commented Feb 19, 2018

@cben Maybe just a huge code comment on the Metric model?

@cben
Copy link
Contributor

cben commented Feb 19, 2018 via email

@NickLaMuro
Copy link
Member

Sorry you spent so much time...

@Fryguy For the record, my comments weren't meant to be a "boo, this was designed bad" when I was talking about the table partitioning being not obvious. It was more a comment on how the new implementation of purging now very much relies on knowing the database table partitioning strategy, where before, it was not necessary (which arguably is kind of impressive).

Maybe just a huge code comment on the Metric model?

I would give a +1 to this idea, and also suggest referencing it when talking adding other comments around the partitioning strategy, since the comments around the purging cod would be in a different file (app/models/metric.rb versus app/models/metric/purging.rb). Mostly what I am worried about is someone going in to change the purging code, and not understanding the database structure fully.

@Fryguy
Copy link
Member

Fryguy commented Feb 20, 2018

@NickLaMuro No worries... I didn't hear it that way at all. I more felt bad that time was spent trying to discover something I already knew and could have documented, but didn't, and that time was "wasted" in my mind which I feel bad for. But I can fix it :)

@kbrock
Copy link
Member Author

kbrock commented Feb 20, 2018

@NickLaMuro would you feel more comfortable with truncate_table_for_hour or something like that?
I put this in there because I was finding I wanted to insert records into a particular table and select from a particular table.

Even if we avoided the truncate portion, the previous purging solution works much better using with_table_for_time.

@kbrock kbrock force-pushed the purge_truncate branch 3 times, most recently from 121c8b2 to f3d82ad Compare February 20, 2018 23:17
@kbrock
Copy link
Member Author

kbrock commented Feb 20, 2018

put in a bunch of tests. that pass before/after.

I will admit to a little jury rigging with the tests.
Since the code after rounds the hour down, the new code could potentially not delete 59 minutes of data.

example:

@23:59 keep 4 hours: (time_before = 19:59)
old:  will delete 0:00..19:58 (hours 0..18 + most of 19 but not all)
new: will delete 0:00..18:59 (hours 0..18)

When the code runs 21 minutes later (or a few hours later), the hour 19 bucket will be truncated.

I feel that while this is technically different, it stays within the intent of the code.

Also, if the purging code takes more than 21 minutes (current code being that case), it could potentially be truncated sooner than the existing code.

@Fryguy
Copy link
Member

Fryguy commented Feb 21, 2018

Opened #17034 to document as discussed above.

Copy link
Member Author

@kbrock kbrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return if target_hours.blank?

target_hours.each do |hour|
Metric.with_table_for_time(hour) do
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per GT: please use

ActiveRecord::Base.connection.execute("truncate table #{table_name}"

@@ -3,7 +3,22 @@ class Metric < ApplicationRecord

include Metric::Common

def self.reindex_table_name
"metrics_#{Time.now.utc.hour + 1}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we lost the +1 here
this may effect reindexing which actually wants to reindex the next hour

end
end

def self.determine_target_hours(older_than, end_date)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

describe edge cases and why it is written

# Used for realtime metrics.
def self.truncate_child_tables(older_than)
target_hours = determine_target_hours(older_than, Time.now.utc)
return if target_hours.blank?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add validations for these settings
if this is blank, let the user know there is a problem

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: lib/vmdb/config/validator.rb is what we can do to check this (ignore the line that this comment is on)

@rvalente
Copy link
Contributor

I opened a PR for some postgresql.conf tuning options that allows for the delete calls to complete in our environment. We are slowly reclaiming space (92% capacity down to 72% capacity) after ~ 1 hour. IMHO I think truncate is the best approach for mass deletes of metrics data, hope this other PR helps out in combination.

ManageIQ/manageiq-appliance#181

@rvalente
Copy link
Contributor

Regarding metrics: our symptom was disk usage and swapping on our database appliance. Essentially we started swapping hard (~90-100%) even though we had massive amounts of free + buff/cache memory available on the system.

After ~ 14 hours, we have reduced our swap usage on the appliance to less than 1%.

              total        used        free      shared  buff/cache   available
Mem:            63G         12G        310M         16G         50G         34G
Swap:          8.5G         85M        8.4G

By increasing maintenance_work_mem the vacuum was able to finish much more efficiently and we were able to reclaim disk space.

The combination of these two PRs will make a large positive impact in overall application performance.

@kbrock
Copy link
Member Author

kbrock commented Feb 26, 2018

if we want to be more cautious we can patch this in tools/purge_metrics.rb only:

+  opt :truncate, "Truncate child tables for speed"

-    Metric::Purging.purge(dates[interval], interval, opts[:window], opts[:limit]) do |count, _|
-      pbar.progress += count
+    if interval == "realtime" && opts[:truncate]
+      Metric::Purging.truncate_child_tables(dates[interval])
+    else
+      Metric::Purging.purge(dates[interval], interval, opts[:window], opts[:limit]) do |count, _|
+        pbar.progress += count
+      end

@kbrock
Copy link
Member Author

kbrock commented Feb 26, 2018

kicking

@kbrock kbrock closed this Feb 26, 2018
@kbrock kbrock reopened this Feb 26, 2018
@miq-bot
Copy link
Member

miq-bot commented Feb 26, 2018

This pull request is not mergeable. Please rebase and repush.

@jrafanie
Copy link
Member

Looks good as we discussed. Kickstarted travis. Hopefully it doesn't timeout again.

@NickLaMuro
Copy link
Member

@kbrock This seems to still be part of this PR, does it make sense to just remove that bullet from the PR description? Or did you still think it made sense to have that separate?

@kbrock
Copy link
Member Author

kbrock commented Feb 28, 2018

#17051 fix for reindex_table_name

@kbrock This seems to still be part of this PR, does it make sense to just remove that bullet from the PR description? Or did you still think it made sense to have that separate?

@NickLaMuro I extracted that commit into a PR because I was thinking that we may want to backport just that functionality. It fixes a bug. It also seems easier to merge.
I like having the dependency listed for future me to help if we wanted to backport.

@jrafanie
Copy link
Member

@kbrock I merged #17051, is this ready to go? Keep in mind, we'll need to backport both or none since this one depends on the other.

@NickLaMuro
Copy link
Member

NickLaMuro commented Feb 28, 2018

@NickLaMuro I extracted that commit into a PR because I was thinking that we may want to backport just that functionality. It fixes a bug. It also seems easier to merge.
I like having the dependency listed for future me to help if we wanted to backport.

I get this, but let me rephrase based things that have changed since my original comment:

This PR and #17051 BOTH have the def self.reindex_table_name code changes, and the tests associated with them. This shouldn't be the case. Period. Full Stop.


Seeing that #17051 has now been merged, I think you need to rebase the branch once again with a fresh copy of master, and then we are good. Also, you can check off the box in the PR description. I will leave that to you.

@jrafanie
Copy link
Member

jrafanie commented Mar 1, 2018

Seeing that #17051 has now been merged, I think you need to rebase the branch once again with a fresh copy of master, and then we are good. Also, you can check off the box in the PR description. I will leave that to you.

yeah, rebasing should fix that 👍

The metrics table is separated into sub tables:
metrics_00 - metrics_23, one table per hour.

Deleting from the parent table delegating the deletes to the child
tables. The batching for these deletes sometimes conflicts with the
delegation and causes the process to run slowly.

The purging process is aware of the delegation. It truncates the hourly
table not associated with dates in the retain window. This removes the
need for batching. Also, truncate is a more streamlined procedure.

As long as this runs often, this will properly purge the metrics.

* rule of thumb: (interval + retain) < 12.hours.
* May work with total higher than 12 hours, but it is not advised
* Default retain period is to keep 4 hours worth of data
* Default purge interval is to run every 21 minutes.

- https://bugzilla.redhat.com/show_bug.cgi?id=1537733
@miq-bot
Copy link
Member

miq-bot commented Mar 1, 2018

Checked commit kbrock@a027c9c with ruby 2.3.3, rubocop 0.52.0, haml-lint 0.20.0, and yamllint 1.10.0
2 files checked, 0 offenses detected
Everything looks fine. 👍

return if target_hours.blank?

target_hours.each do |hour|
Metric.connection.truncate(Metric.reindex_table_name(hour), "Metric Truncate table #{hour}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. 👍

I think a developer who didn't happen to read the comments might see this now and say "huh, what is reindex_table_name", and hopefully would be able to grok what is happening with this before trying to make a change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thnx

Copy link
Member

@NickLaMuro NickLaMuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am cool with this. I think the tests will be a great initial indicator that things are working as expected, and the code should definitely be an improvement from a performance end.

Thanks for the work on this!

Copy link
Member

@gtanzillo gtanzillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!! 👍

@gtanzillo gtanzillo added this to the Sprint 81 Ending Mar 12, 2018 milestone Mar 5, 2018
@gtanzillo gtanzillo merged commit 7d4db70 into ManageIQ:master Mar 5, 2018
@kbrock kbrock deleted the purge_truncate branch March 5, 2018 22:21
gtanzillo added a commit to gtanzillo/manageiq that referenced this pull request Mar 8, 2018
truncate realtime metrics instead of purging
(cherry picked from commit 7d4db70)
simaishi pushed a commit that referenced this pull request Mar 9, 2018
truncate realtime metrics instead of purging
(cherry picked from commit 7d4db70)

https://bugzilla.redhat.com/show_bug.cgi?id=1553472
@simaishi
Copy link
Contributor

simaishi commented Mar 9, 2018

Gaprindashvili backport details:

$ git log -1
commit 5d84e8f7c2766a613ed0bd02f7a7196432789534
Author: Gregg Tanzillo <gtanzill@redhat.com>
Date:   Mon Mar 5 16:43:02 2018 -0500

    Merge pull request #17017 from kbrock/purge_truncate
    
    truncate realtime metrics instead of purging
    (cherry picked from commit 7d4db70e26ca74c3f67e446bf7887735bfbcab92)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1553472

@simaishi
Copy link
Contributor

Backported to Fine via #17124

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants