truncate realtime metrics instead of purging #17017

kbrock · 2018-02-17T00:35:41Z

extracted:

Add reindex to job scheduler #16929 previous PR that this depends upon
test realtime metrics purging #17050 introduce metrics purging tests
Allow individual tables to be specified for metrics #17051 fix for reindex_table_name

Background

The metrics table holds realtime metrics and is separated into sub tables: metrics_00 - metrics_23, one table per hour. A trigger on metrics will put into the corresponding table based upon the timestamp column.

Every 21 minutes, the scheduler calls Metric::Purging#purge_realtime_timer to delete all metrics older than 4 hours old. So only 5 tables should be populated at a time (note: rounding of time makes it 5 tables not 4).

The purging is taking too long, and metrics is growing. This causes the query to run even slower.

Before

Deleting from the parent table, metrics, delegating the deletes to the child tables (e.g.: metrics_23). We delete in batches of 1,000 by default.
The batching for these deletes doesn't quite work with the multi table abstraction and causes ~23x23 queries per delete.

For each run, we run # records / 1,000 of the above query.

After

The purging process is aware of the delegation. It truncates the hourly table not
associated with dates in the retain window. This removes the need for batching.
Also, truncate is a more streamlined procedure.

Constant number of queries.

The purging will not pickup all older records, only those with hour numbers more than 4 hours ago. We run this every 21 minutes, so this is not a problem.

Tweaking the interval and purge window:

If you run this every night at midnight, it will never purge hours 18-24. So those tables will grow and never be purged. This will need to be run at least once from 4am - 8pm for the purging to work.

The safe rule of thumb is: interval + window < 12 hours. The current defaults 21.minutes + 4.hours < 12 hours is well within these limits. Increasing interval to 1.hour may even be better

Numbers

Metric.count = 84,110, batch size = 1,000

ms	bytes	objects	query	qry ms	rows	`comments`
30,612.2	4,923,487	412,321	85	30,442.2	(85)	before
173.4	585,229	5,702	23	130.9	(23)	after

Extrapolated numbers

For the db with an issue, there are many more records. metrics_21 alone has 6,184,590. All metrics is 32,778,292 records.
Since each query takes ~10 minutes - and it would need to perform ~330 of them, it will take 2 1/4 days to delete a days worth of metrics. We'd never get out.

The new process takes closer to 1 minute on an appliance

https://bugzilla.redhat.com/show_bug.cgi?id=1537733

NickLaMuro

Overall, I think what you have here is a really sweet performance improvement, but with that speed, it seems like it is a more risky approach.

I am fine with that risk, given that we document/test things heavily, which I think is the one thing that I think this PR is heavily lacking. Plenty of proof that it is faster (and honestly, it just makes sense that it is), but almost no info on how the change was implemented.

Regarding the description here:

The metrics table is separated into sub tables: metrics_00 - metrics_23, one table per hour.

You should add some context that realtime metrics are not kept around for more than a day (with this change, no more than 12 hours, ideally), and that it is by design. I had to kind of infer this from the BZ and some of the code and table layout: for example, I learned from this PR, though not directly, that the sub Metric tables are divided by hours of the day. This is in no way obvious to me, and I wouldn't expect that to be obvious to anyone, which made things like Metric.truncate seemed to come out of no where, and seems super scary to do unless that context is known.

Maybe this exists somewhere in some documentation somewhere, but I was never made aware of it, and I work here!

NickLaMuro · 2018-02-17T01:09:01Z

app/models/metric.rb

+  def self.with_table_for_time(hour = Time.now.utc.hour)
+    Thread.current[:metric_table] = reindex_table_name(hour)
+    yield
+    self


Is there a reason you are returning self here? Doesn't seem to be used anywhere, as far as I can tell.

NickLaMuro · 2018-02-17T01:11:15Z

app/models/metric.rb

+  end
+
+  # reindex_table_name # tablename for "now"
+  # reindex_Table_name(4) # tablename for hour number 4


These comments barely make sense to me, and only after I understood everything that was going on around them could I figure out what the purpose of this was.

Can these be written yard format or something more standard?

NickLaMuro · 2018-02-17T01:15:07Z

app/models/metric.rb

+  end
+
+  def self.with_table_for_time(hour = Time.now.utc.hour)
+    Thread.current[:metric_table] = reindex_table_name(hour)


Can't give a better suggestion here, but with this affecting whether either one or "all" of the tables gets truncated, this seems super scary to me doing this in a Thread var (and this is coming from someone who used thread variables as well to do stuff in ManageIQPerformance).

Just for slight piece of mind, can we make this variable name more verbose, something like :targeted_metric_subtable_name, just it avoids the possibility of it being collided by some other code messing with Thread variables and possibly re-using that indentifier.

NickLaMuro · 2018-02-17T01:16:28Z

app/models/metric/purging.rb

@@ -61,6 +61,8 @@ def self.purge_count(older_than, interval)
    purge_scope(older_than, interval).count
  end

+  # TODO: delete these using the same created_at queries by date
+  # rather than by joining to metrics - then hourly could be truncated vs deleted


Having more comments around the truncation strategy and workflow would allow this comment to reference those and be more meaningful.

NickLaMuro · 2018-02-17T01:21:42Z

spec/models/metric/purging_spec.rb

@@ -82,4 +82,24 @@
      end
    end
  end
+
+  context "#purge_realtime" do


With how risky I see this change being, I feel like having more than 1 test here would be a good idea. Didn't look myself, but based on seeing you are creating this context for "#purge_realtime" yourself, I am a bit discouraged that we didn't have tests around this in the first place 😞 .

Possible edge cases that come to mind:

Daylight savings time issues

Different source times zones

New years

Not saying that what you have here has bugs, but extra tests couldn't hurt with confidence.

Along these same lines, would it make sense to write these tests in a separate PR so that they work regardless of the underlying implementation, merge them in, and then make sure this PR still passes with them in place?

With a BZ attached to this PR, I don't know how much pressure is on to take the time to do that though.

NickLaMuro · 2018-02-17T01:29:15Z

app/models/metric/purging.rb

+    end
+  end
+
+  def self.determine_target_hours(older_than, end_date = Time.now)


Seems like the default value for end_date should match how you are invoking it above with Time.now.utc instead of just Time.now.

Possibly just omit the above argument, and default end_date to Time.now.utc?

NickLaMuro · 2018-02-17T01:40:53Z

app/models/metric/purging.rb

+  end
+
+  def self.determine_target_hours(older_than, end_date = Time.now)
+    return [] if (end_date - older_than) > 24.hours


Does this mean we could possibly never run metrics purging if the purge timer is greater than 24.hours? Not here, but maybe we should fall back to the old way of doing things if the truncation strategy would be acting on no tables, just incase someone mis-configures things.

Otherwise, I don't think we have a safe guard or way of warning a user that this setting should remain under 12.hours/24.hours (haven't done the math on your claim in the PR description 😉 ).

This is exactly what I was concerned about it my comments in gitter. I am almost thinking we should completely prevent setting a time greater than or equal to 24 hours in config.

We can do that in the lib/vmdb/config/validator.rb by adding a new method, VMDB::Config::Validator#performance, correct?

I had a safeguard in here. but it seemed out of place to know about the interval and the duration.

I think it is probably: duration <= 22.hours ; duration + interval + 1.hour < 24.hours - but it just takes longer than a day to purge each hour. So I said 12.hours to be safe.

NickLaMuro · 2018-02-17T01:41:47Z

app/models/metric/purging.rb

+    return [] if (end_date - older_than) > 24.hours
+
+    start_hour = older_than.utc.hour
+    end_hour = end_date.utc.hour


Yeah, I could really see this being botched up during daylight savings...

NickLaMuro · 2018-02-17T01:46:16Z

app/models/metric/purging.rb

  end

+  def self.truncate_child_tables(older_than)


This and the below method should really be documented I think. There is a decent amount of domain knowledge that comes with them to understand how they work fully (for example, the mod division logic you do to figure out the good hours for an older_than and end_date that might be between two different days). I would also be surprised if most developers even knew that the metrics table was split into 24 parts on hour (again, I didn't until today... and only new that it was sharded).

cben · 2018-02-19T10:28:14Z

.. I had to kind of infer this from the BZ and some of the code and table layout: for example, I learned from this PR, though not directly, that the sub Metric tables are divided by hours of the day. This is in no way obvious to me, and I wouldn't expect that to be obvious to anyone ..

Anecdotal "me too": I was aware we had some table partitioning, and (unrelatedly) got curious a couple weeks ago how it works and started digging in the code, together with @yaacov (plus googling how partitioning is normally done in rails). => We guessed from the table names metrics_00..23 is by hours, and metric_rollups_01..12 is by month, and found one function computing a table name with number, but failed to find the magic that ties parent model to subtables, and after about half a hour gave up with a shrug...
(Not to mention the why questions... why is this partitioning beneficial enough to justify the ugliness, and why it's not an optimization that DBs can do transparently. I don't doubt that it was done for good reasons, just saying this kind of background can be priceless to newcomers.)

Fryguy · 2018-02-19T13:23:58Z

@cben Sorry you spent so much time...I could have told you it all because I wrote it. I don't mind writing it yup but where is the best place to help something like this. I considered guides/architecture but I'm curious if you had tried to look there at all in the first place. If not, then putting it there wouldn't have helped you.

Fryguy · 2018-02-19T13:25:07Z

@cben Maybe just a huge code comment on the Metric model?

cben · 2018-02-19T13:28:48Z

No, I didn't even think of looking at guides. Yes, first place I looked is Metric and MetricRollup models.

NickLaMuro · 2018-02-19T15:59:44Z

Sorry you spent so much time...

@Fryguy For the record, my comments weren't meant to be a "boo, this was designed bad" when I was talking about the table partitioning being not obvious. It was more a comment on how the new implementation of purging now very much relies on knowing the database table partitioning strategy, where before, it was not necessary (which arguably is kind of impressive).

Maybe just a huge code comment on the Metric model?

I would give a +1 to this idea, and also suggest referencing it when talking adding other comments around the partitioning strategy, since the comments around the purging cod would be in a different file (app/models/metric.rb versus app/models/metric/purging.rb). Mostly what I am worried about is someone going in to change the purging code, and not understanding the database structure fully.

Fryguy · 2018-02-20T04:13:58Z

@NickLaMuro No worries... I didn't hear it that way at all. I more felt bad that time was spent trying to discover something I already knew and could have documented, but didn't, and that time was "wasted" in my mind which I feel bad for. But I can fix it :)

kbrock · 2018-02-20T15:28:15Z

@NickLaMuro would you feel more comfortable with truncate_table_for_hour or something like that?
I put this in there because I was finding I wanted to insert records into a particular table and select from a particular table.

Even if we avoided the truncate portion, the previous purging solution works much better using with_table_for_time.

kbrock · 2018-02-20T23:24:08Z

put in a bunch of tests. that pass before/after.

I will admit to a little jury rigging with the tests.
Since the code after rounds the hour down, the new code could potentially not delete 59 minutes of data.

example:

@23:59 keep 4 hours: (time_before = 19:59)
old:  will delete 0:00..19:58 (hours 0..18 + most of 19 but not all)
new: will delete 0:00..18:59 (hours 0..18)

When the code runs 21 minutes later (or a few hours later), the hour 19 bucket will be truncated.

I feel that while this is technically different, it stays within the intent of the code.

Also, if the purging code takes more than 21 minutes (current code being that case), it could potentially be truncated sooner than the existing code.

Fryguy · 2018-02-21T15:30:32Z

Opened #17034 to document as discussed above.

kbrock

from @NickLaMuro @jrafanie and @gtanzillo

kbrock · 2018-02-21T20:46:27Z

app/models/metric/purging.rb

+    return if target_hours.blank?
+
+    target_hours.each do |hour|
+      Metric.with_table_for_time(hour) do


per GT: please use

ActiveRecord::Base.connection.execute("truncate table #{table_name}"

kbrock · 2018-02-21T20:47:07Z

app/models/metric.rb

@@ -3,7 +3,22 @@ class Metric < ApplicationRecord

  include Metric::Common

-  def self.reindex_table_name
-    "metrics_#{Time.now.utc.hour + 1}"


we lost the +1 here
this may effect reindexing which actually wants to reindex the next hour

kbrock · 2018-02-21T21:02:40Z

app/models/metric/purging.rb

+    end
+  end
+
+  def self.determine_target_hours(older_than, end_date)


describe edge cases and why it is written

kbrock · 2018-02-21T21:03:17Z

app/models/metric/purging.rb

+  # Used for realtime metrics.
+  def self.truncate_child_tables(older_than)
+    target_hours = determine_target_hours(older_than, Time.now.utc)
+    return if target_hours.blank?


lets add validations for these settings
if this is blank, let the user know there is a problem

To clarify: lib/vmdb/config/validator.rb is what we can do to check this (ignore the line that this comment is on)

rvalente · 2018-02-21T21:46:00Z

I opened a PR for some postgresql.conf tuning options that allows for the delete calls to complete in our environment. We are slowly reclaiming space (92% capacity down to 72% capacity) after ~ 1 hour. IMHO I think truncate is the best approach for mass deletes of metrics data, hope this other PR helps out in combination.

ManageIQ/manageiq-appliance#181

rvalente · 2018-02-22T11:25:37Z

Regarding metrics: our symptom was disk usage and swapping on our database appliance. Essentially we started swapping hard (~90-100%) even though we had massive amounts of free + buff/cache memory available on the system.

After ~ 14 hours, we have reduced our swap usage on the appliance to less than 1%.

              total        used        free      shared  buff/cache   available
Mem:            63G         12G        310M         16G         50G         34G
Swap:          8.5G         85M        8.4G

By increasing maintenance_work_mem the vacuum was able to finish much more efficiently and we were able to reclaim disk space.

The combination of these two PRs will make a large positive impact in overall application performance.

kbrock · 2018-02-26T18:57:44Z

if we want to be more cautious we can patch this in tools/purge_metrics.rb only:

+  opt :truncate, "Truncate child tables for speed"

-    Metric::Purging.purge(dates[interval], interval, opts[:window], opts[:limit]) do |count, _|
-      pbar.progress += count
+    if interval == "realtime" && opts[:truncate]
+      Metric::Purging.truncate_child_tables(dates[interval])
+    else
+      Metric::Purging.purge(dates[interval], interval, opts[:window], opts[:limit]) do |count, _|
+        pbar.progress += count
+      end

kbrock · 2018-02-26T21:32:12Z

kicking

miq-bot · 2018-02-26T22:47:16Z

This pull request is not mergeable. Please rebase and repush.

jrafanie · 2018-02-27T19:20:29Z

Looks good as we discussed. Kickstarted travis. Hopefully it doesn't timeout again.

NickLaMuro · 2018-02-27T19:26:31Z

Allow individual tables to be specified for metrics #17051 fix for reindex_table_name

@kbrock This seems to still be part of this PR, does it make sense to just remove that bullet from the PR description? Or did you still think it made sense to have that separate?

kbrock · 2018-02-28T15:34:49Z

#17051 fix for reindex_table_name

@kbrock This seems to still be part of this PR, does it make sense to just remove that bullet from the PR description? Or did you still think it made sense to have that separate?

@NickLaMuro I extracted that commit into a PR because I was thinking that we may want to backport just that functionality. It fixes a bug. It also seems easier to merge.
I like having the dependency listed for future me to help if we wanted to backport.

jrafanie · 2018-02-28T16:45:20Z

@kbrock I merged #17051, is this ready to go? Keep in mind, we'll need to backport both or none since this one depends on the other.

NickLaMuro · 2018-02-28T17:08:48Z

@NickLaMuro I extracted that commit into a PR because I was thinking that we may want to backport just that functionality. It fixes a bug. It also seems easier to merge.
I like having the dependency listed for future me to help if we wanted to backport.

I get this, but let me rephrase based things that have changed since my original comment:

This PR and #17051 BOTH have the def self.reindex_table_name code changes, and the tests associated with them. This shouldn't be the case. Period. Full Stop.

Seeing that #17051 has now been merged, I think you need to rebase the branch once again with a fresh copy of master, and then we are good. Also, you can check off the box in the PR description. I will leave that to you.

jrafanie · 2018-03-01T17:11:25Z

Seeing that #17051 has now been merged, I think you need to rebase the branch once again with a fresh copy of master, and then we are good. Also, you can check off the box in the PR description. I will leave that to you.

yeah, rebasing should fix that 👍

The metrics table is separated into sub tables: metrics_00 - metrics_23, one table per hour. Deleting from the parent table delegating the deletes to the child tables. The batching for these deletes sometimes conflicts with the delegation and causes the process to run slowly. The purging process is aware of the delegation. It truncates the hourly table not associated with dates in the retain window. This removes the need for batching. Also, truncate is a more streamlined procedure. As long as this runs often, this will properly purge the metrics. * rule of thumb: (interval + retain) < 12.hours. * May work with total higher than 12 hours, but it is not advised * Default retain period is to keep 4 hours worth of data * Default purge interval is to run every 21 minutes. - https://bugzilla.redhat.com/show_bug.cgi?id=1537733

miq-bot · 2018-03-01T22:02:39Z

Checked commit kbrock@a027c9c with ruby 2.3.3, rubocop 0.52.0, haml-lint 0.20.0, and yamllint 1.10.0
2 files checked, 0 offenses detected
Everything looks fine. 👍

NickLaMuro · 2018-03-01T22:07:04Z

app/models/metric/purging.rb

+    return if target_hours.blank?
+
+    target_hours.each do |hour|
+      Metric.connection.truncate(Metric.reindex_table_name(hour), "Metric Truncate table #{hour}")


I like this. 👍

I think a developer who didn't happen to read the comments might see this now and say "huh, what is reindex_table_name", and hopefully would be able to grok what is happening with this before trying to make a change.

NickLaMuro

I am cool with this. I think the tests will be a great initial indicator that things are working as expected, and the code should definitely be an improvement from a performance end.

Thanks for the work on this!

gtanzillo

This looks great!! 👍

Manual backport of: ManageIQ#17017 ManageIQ#17051 ManageIQ#17050 Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1537733

truncate realtime metrics instead of purging (cherry picked from commit 7d4db70)

truncate realtime metrics instead of purging (cherry picked from commit 7d4db70) https://bugzilla.redhat.com/show_bug.cgi?id=1553472

simaishi · 2018-03-09T15:02:26Z

Gaprindashvili backport details:

$ git log -1
commit 5d84e8f7c2766a613ed0bd02f7a7196432789534
Author: Gregg Tanzillo <gtanzill@redhat.com>
Date:   Mon Mar 5 16:43:02 2018 -0500

    Merge pull request #17017 from kbrock/purge_truncate
    
    truncate realtime metrics instead of purging
    (cherry picked from commit 7d4db70e26ca74c3f67e446bf7887735bfbcab92)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1553472

simaishi · 2018-03-16T13:49:02Z

Backported to Fine via #17124

kbrock added bug performance providers/metrics labels Feb 17, 2018

kbrock assigned Fryguy Feb 17, 2018

NickLaMuro reviewed Feb 17, 2018

View reviewed changes

kbrock force-pushed the purge_truncate branch 3 times, most recently from 121c8b2 to f3d82ad Compare February 20, 2018 23:17

kbrock commented Feb 21, 2018

View reviewed changes

kbrock force-pushed the purge_truncate branch 2 times, most recently from d05f158 to f14ab4c Compare February 26, 2018 18:36

This was referenced Feb 26, 2018

Allow individual tables to be specified for metrics #17051

Merged

test realtime metrics purging #17050

Merged

kbrock force-pushed the purge_truncate branch from f14ab4c to e6ce2f6 Compare February 26, 2018 18:44

kbrock closed this Feb 26, 2018

kbrock reopened this Feb 26, 2018

kbrock force-pushed the purge_truncate branch from e6ce2f6 to 424105e Compare February 27, 2018 02:29

miq-bot removed the unmergeable label Feb 27, 2018

kbrock force-pushed the purge_truncate branch from 424105e to a027c9c Compare March 1, 2018 21:54

NickLaMuro reviewed Mar 1, 2018

View reviewed changes

NickLaMuro approved these changes Mar 1, 2018

View reviewed changes

gtanzillo approved these changes Mar 5, 2018

View reviewed changes

gtanzillo added this to the Sprint 81 Ending Mar 12, 2018 milestone Mar 5, 2018

gtanzillo merged commit 7d4db70 into ManageIQ:master Mar 5, 2018

gtanzillo added fine/yes labels Mar 5, 2018

kbrock deleted the purge_truncate branch March 5, 2018 22:21

gtanzillo added a commit to gtanzillo/manageiq that referenced this pull request Mar 8, 2018

Truncate realtime metrics instead of purging

537202b

Manual backport of: ManageIQ#17017 ManageIQ#17051 ManageIQ#17050 Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1537733

gtanzillo mentioned this pull request Mar 8, 2018

[FINE] Truncate realtime metrics instead of purging #17124

Merged

gtanzillo added a commit to gtanzillo/manageiq that referenced this pull request Mar 8, 2018

Merge pull request ManageIQ#17017 from kbrock/purge_truncate

79a4fba

truncate realtime metrics instead of purging (cherry picked from commit 7d4db70)

simaishi pushed a commit that referenced this pull request Mar 9, 2018

Merge pull request #17017 from kbrock/purge_truncate

5d84e8f

truncate realtime metrics instead of purging (cherry picked from commit 7d4db70) https://bugzilla.redhat.com/show_bug.cgi?id=1553472

simaishi added gaprindashvili/backported and removed gaprindashvili/yes labels Mar 9, 2018

gtanzillo mentioned this pull request Mar 13, 2018

[GAPRINDASHVILI] Truncate realtime metrics instead of purging #17151

Closed

simaishi added fine/backported and removed fine/yes labels Mar 16, 2018

truncate realtime metrics instead of purging #17017

truncate realtime metrics instead of purging #17017

Conversation

kbrock commented Feb 17, 2018 • edited Loading

Background

Before

After

Tweaking the interval and purge window:

Numbers

Extrapolated numbers

NickLaMuro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NickLaMuro Feb 17, 2018 • edited Loading

Choose a reason for hiding this comment

cben commented Feb 19, 2018

Fryguy commented Feb 19, 2018

Fryguy commented Feb 19, 2018

cben commented Feb 19, 2018 via email

NickLaMuro commented Feb 19, 2018

Fryguy commented Feb 20, 2018

kbrock commented Feb 20, 2018 • edited Loading

kbrock commented Feb 20, 2018

Fryguy commented Feb 21, 2018

kbrock left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rvalente commented Feb 21, 2018

rvalente commented Feb 22, 2018

kbrock commented Feb 26, 2018

kbrock commented Feb 26, 2018

miq-bot commented Feb 26, 2018

jrafanie commented Feb 27, 2018

NickLaMuro commented Feb 27, 2018

kbrock commented Feb 28, 2018 • edited Loading

jrafanie commented Feb 28, 2018

NickLaMuro commented Feb 28, 2018 • edited Loading

jrafanie commented Mar 1, 2018

miq-bot commented Mar 1, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NickLaMuro left a comment

Choose a reason for hiding this comment

gtanzillo left a comment

Choose a reason for hiding this comment

simaishi commented Mar 9, 2018

simaishi commented Mar 16, 2018

kbrock commented Feb 17, 2018 •

edited

Loading

NickLaMuro Feb 17, 2018 •

edited

Loading

kbrock commented Feb 20, 2018 •

edited

Loading

kbrock commented Feb 28, 2018 •

edited

Loading

NickLaMuro commented Feb 28, 2018 •

edited

Loading