
ref(sentry-metrics): Add MetricsKeyIndexer table #28914

Merged · 14 commits · Oct 5, 2021

Conversation

@MeredithAnya (Member) commented Sep 28, 2021

Context
This PR added the metrics consumer (it takes messages from Relay, translates them, and passes them along to Snuba), but just used Redis as a dummy indexer to store the string -> int (and vice versa) translations. The next phase is to add persistence, which will be Postgres.

MetricsKeyIndexer
This table will ultimately be on its own Postgres instance. My understanding is that for the sake of development I can run this migration as normal. If that's not the case, I'll need to figure out what I should be doing instead.

Schema
From my understanding, the basic columns for this table are:

string = str      // the metric name, tag key, or tag value
id     = int64    // the int version of the metric name, tag key, or tag value

pk          --> (id)
btree index --> (string)

BTree index on string
We are going to want to look up the string from the id as well as the id from the string, so the table needs to be indexed both ways: the primary key covers the id lookups, and a BTree index on string covers the string lookups.

Sequence vs UUID
We want the ids to be 64-bit integers (UUIDs would be twice as long at 128 bits) so that ClickHouse is more efficient when querying. Additionally, the UUID type does not work with the bloom filter index [1]. Ultimately we are going to want to pre-allocate multiple values at a time in Redis so that we don't have to hit Postgres every time we record a new metric (or tag key or value). For these reasons we can use a sequence (for now; this could change later).
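
For illustration, the pre-allocation could look something like this (a minimal sketch, assuming a plain redis-py client; the Redis key name and block size here are made up):

# Sketch: reserve a block of ids from the Postgres sequence and park them
# in Redis so recording a new string doesn't have to hit Postgres each time.
# "metrics-indexer:id-pool" is a placeholder key name.
import redis
from django.db import connection

client = redis.Redis()

def refill_id_pool(num: int = 100) -> None:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT nextval('sentry_metricskeyindexer_id_seq') FROM generate_series(1, %s)",
            [num],
        )
        ids = [row[0] for row in cursor.fetchall()]
    client.rpush("metrics-indexer:id-pool", *ids)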


[EDITED]: Keeping the below as historical context, but it is outdated

It seems to me there are three main ways to use a sequence:

  1. created independently from a table, literally just CREATE SEQUENCE sequence_name
  2. created independently from a table, but owned by a column in the table: CREATE SEQUENCE sequence_name OWNED BY table.column_name
  3. created behind the scenes when using an auto increment field (what all our id fields do with BoundedBigAutoField)
    • Django doesn't seem to let you use an auto increment field that isn't a primary key, even if you specify another field as the primary key. We could probably make a new Field to override this behavior, but I wasn't sure it was worth it.
    • If we used an auto increment field and tried to INSERT a different value (one that wasn't generated by Postgres), the auto increment isn't aware of that. E.g. you are at value 5, you INSERT value = 10, and then go back to using the increment: you'd be at 6 (and once the sequence reached 10 you'd get an IntegrityError).
    • I don't think there is a way to change the INCREMENT value from 1, so if we wanted it to use an INCREMENT of 100, I don't know how we'd do that (maybe we'd just have to ALTER SEQUENCE). Both of these caveats are sketched right after this list.
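
A quick sketch of those two caveats (run against a throwaway table; demo_table and its sequence are placeholders, not real migration objects):

from django.db import connection

with connection.cursor() as cursor:
    # The sequence doesn't track out-of-band inserts: with the sequence at 5,
    # a manual INSERT of id = 10 leaves nextval() returning 6, and the
    # sequence will eventually hand out 10 again -> IntegrityError.
    cursor.execute("INSERT INTO demo_table (id) VALUES (10)")
    cursor.execute("SELECT nextval('demo_table_id_seq')")  # returns 6, not 11

    # Changing the step size isn't exposed as a Django field option, so it
    # would have to be done directly in SQL:
    cursor.execute("ALTER SEQUENCE demo_table_id_seq INCREMENT BY 100")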

For the reasons listed above I started with option 2, but there are some caveats/issues:

  • Unlike the auto increment, you must already have the value when you save the record. Since this is still in dev (and we aren't pre-allocating values yet), I've implemented that myself in the save() method for now.
  • For some reason, the test db doesn't seem to have the sequence created even though the MetricsKeyIndexer table is in there. I haven't figured out why, but in the meantime I have to create the sequence in setUp (and drop it in tearDown) for the tests, as sketched below. I don't have this problem accessing Postgres locally, so I'm not sure what's going on.
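
Roughly what that test workaround looks like (a sketch; the test class is hypothetical and the sequence name mirrors the one used in this PR):

from django.db import connection
from django.test import TestCase

class MetricsKeyIndexerTest(TestCase):
    def setUp(self) -> None:
        with connection.cursor() as cursor:
            cursor.execute("CREATE SEQUENCE IF NOT EXISTS sentry_metricskeyindexer_seq")

    def tearDown(self) -> None:
        with connection.cursor() as cursor:
            cursor.execute("DROP SEQUENCE IF EXISTS sentry_metricskeyindexer_seq")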

Footnotes

  1. needs to be verified, but heard from a reputable source: @fpacifici

github-actions bot (Contributor) commented Sep 28, 2021

This PR has a migration; here is the generated SQL

BEGIN;
--
-- Create model MetricsKeyIndexer
--
CREATE TABLE "sentry_metricskeyindexer" ("id" bigserial NOT NULL PRIMARY KEY, "string" varchar(200) NOT NULL, "date_added" timestamp with time zone NOT NULL);
--
-- Create constraint unique_string on model metricskeyindexer
--
ALTER TABLE "sentry_metricskeyindexer" ADD CONSTRAINT "unique_string" UNIQUE ("string");
COMMIT;

@@ -1357,7 +1358,7 @@ def create_partitioned_queues(name):
 SENTRY_METRICS_SKIP_INTERNAL_PREFIXES = []  # Order this by most frequent prefixes.

 # Metrics product
-SENTRY_METRICS_INDEXER = "sentry.sentry_metrics.indexer.mock.MockIndexer"
+SENTRY_METRICS_INDEXER = "sentry.sentry_metrics.indexer.postgres.PGStringIndexer"
@MeredithAnya (Member, Author) commented:

I don't really like the naming; I just wanted to make it different from the Redis indexer, and I wasn't sure if we were ready to use this as the actual string indexer yet.

@@ -37,7 +37,9 @@ def _check_db_routing(migration):
 def _check_operations(operations):
     failed_ops = []
     for operation in operations:
-        if isinstance(operation, (FieldOperation, ModelOperation, RenameContentType)):
+        if isinstance(
+            operation, (FieldOperation, ModelOperation, RenameContentType, IndexOperation)
+        ):
@MeredithAnya (Member, Author) commented:

I had to add IndexOperation here because otherwise I got the missing "hints={'tables':..} argument" error for AddIndex. It seemed to me that since the AddIndex operation is model-specific, I could put this here. cc @wedamija

Member commented:

That looks good to me

    ]

    @classmethod
    def get_next_values(cls, num: int):
@MeredithAnya (Member, Author) commented:

This is not used right now, but it was my thought process for later: to test out getting n next values. This does load all the integers, which may not be ideal, so the other option would be to make the increment larger and just compute the range from the last value to the next value, e.g. if currval is 100 and the next value is 200, we'd own 101-200. A sketch of that alternative is below.
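
A sketch of that range-based alternative (assuming the sequence had been altered to INCREMENT BY num beforehand; get_next_range is a hypothetical name):

from django.db import connections, router

def get_next_range(model_cls, num: int) -> range:
    # One nextval() call reserves a whole block; the individual ids are
    # derived locally instead of materializing one row per id.
    using = router.db_for_write(model_cls)
    with connections[using].cursor() as cursor:
        cursor.execute("SELECT nextval('sentry_metricskeyindexer_id_seq')")
        last_val = cursor.fetchone()[0]
    # e.g. currval 100, nextval returns 200 -> we own 101..200
    return range(last_val - num + 1, last_val + 1)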

@jjbayer (Member) left a comment:

This is way more advanced db design than anything I've ever done, but I wonder if we could meet the same requirements with the following model:

class MetricsStringIndex(sentry.db.models.Model):
    organization_id = BoundedBigIntegerField()
    string = models.CharField(max_length=200)
    class Meta:
        unique_together = (("organization_id", "string"),)

The id field inherited from Model could serve as primary key and integer value. This would provide a sequence and efficient int -> string lookups out of the box. The unique_together prevents duplicates and adds an index under the hood, enabling efficient (org_id, string) -> int lookups.

You would still be able to do

SELECT nextval('sentry_metricsstringindex_id_seq') FROM generate_series(1, 100);

for pre-allocated values.

Maybe I'm completely missing some constraints though.

@MeredithAnya (Member, Author) replied:

> This is way more advanced db design than anything I've ever done, but I wonder if we could meet the same requirements with the following model: […]

If we want to make the value or id (whatever we want to call it) the primary key, then what you've outlined makes sense to me. Maybe the best thing for now is to use an auto increment field.

Comment on lines +23 to +30
def get_next_values(cls, num: int) -> Any:
    using = router.db_for_write(cls)
    connection = connections[using].cursor()

    connection.execute(
        "SELECT nextval('sentry_metricskeyindexer_id_seq') from generate_series(1,%s)", [num]
    )
    return connection.fetchall()
@wedamija (Member) commented:

I don't have a lot of context on this project, how will we use the values from this sequence?

It vaguely looks like you want to reserve a range of ids, and then use those ids later on to create new rows. Is that the general idea?

@MeredithAnya (Member, Author) replied:

@wedamija Sorry for not giving enough context in the PR description, I can go back and update it in a bit, but yeah:

> It vaguely looks like you want to reserve a range of ids, and then use those ids later on to create new rows. Is that the general idea?

That's basically it. Eventually we want to have Postgres off the critical path, but in order to do that we need to know the ids ahead of time. What I am unsure about is what kind of ranges we are talking about: is it 100, 1000, 10000? Since this metrics indexer will be used for metric names, tag keys, and tag values, it could be a lot of writes for high-cardinality tags.

@wedamija (Member) commented Oct 2, 2021:

Looks good for now. Once we know how many ids we're allocating per second we can decide whether we need to do something more complex here.

I'm not sure if there's a performance hit to calling nextval 10k times, as opposed to doing something like https://www.depesz.com/2008/03/20/getting-multiple-values-from-sequences/. Something we can possibly benchmark in the future.

Contributor commented:

I bet it will be desirable to avoid calling nextval 10k times (that would be 10k writes, I believe). We may not strictly need a sequence at that point, just a counter.
I think we could discuss the solutions then. At this point I am not sure it is very useful to have this method; probably better to remove it for now in case somebody decides to start depending on it for some reason.

@fpacifici (Contributor) left a comment:

Mostly some clarification on how the model is built. Otherwise seems good to me.

Comment on lines 52 to 54
constraint=models.UniqueConstraint(
    fields=("organization_id", "string"), name="unique_org_string"
),
@fpacifici (Contributor) commented:

Does this also add the index?

@MeredithAnya (Member, Author) replied:

Yeah, and the auto increment also happens for the id because its default is set to nextval() on the sequence:

sentry=# \d sentry_metricskeyindexer
                                     Table "public.sentry_metricskeyindexer"
     Column      |          Type          |                               Modifiers
-----------------+------------------------+-----------------------------------------------------------------------
 id              | bigint                 | not null default nextval('sentry_metricskeyindexer_id_seq'::regclass)
 organization_id | bigint                 | not null
 string          | character varying(200) | not null
Indexes:
    "sentry_metricskeyindexer_pkey" PRIMARY KEY, btree (id)
    "unique_org_string" UNIQUE CONSTRAINT, btree (organization_id, string)

def _bulk_record(self, org_id: int, unmapped_strings: Set[str]) -> List[Record]:
    records = []
    for string in unmapped_strings:
        obj = MetricsKeyIndexer.objects.create(organization_id=org_id, string=string)
@fpacifici (Contributor) commented:

Do we have a way to create objects in batch? It should be more efficient for Postgres.

@MeredithAnya (Member, Author) replied:

I wasn't using bulk_create before because it doesn't call save() (which I was using to set the value), but now that I've gone back to using an auto increment field I think we can use it; a sketch is below.
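
Something like this, presumably (a hypothetical rewrite, not the code in this PR; on Postgres, bulk_create fills in the auto-increment ids on the returned objects):

from typing import List, Set

def _bulk_record(self, org_id: int, unmapped_strings: Set[str]) -> List[MetricsKeyIndexer]:
    # One INSERT for all new strings instead of one round trip per string.
    return MetricsKeyIndexer.objects.bulk_create(
        MetricsKeyIndexer(organization_id=org_id, string=string)
        for string in unmapped_strings
    )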

@wedamija (Member) commented Oct 1, 2021:

@MeredithAnya I'm curious about whether we need to use this sequence at all. What are the requirements for these ids? I'm wondering whether we could use UUIDs here

@MeredithAnya (Member, Author) replied:

> @MeredithAnya I'm curious about whether we need to use this sequence at all. What are the requirements for these ids? I'm wondering whether we could use UUIDs here

@wedamija the ids have to be 64-bit integers, and according to the maths (not done by me, though) we can't use a hashing scheme because the probability of collisions is too high for comfort.
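
For a rough sense of why (my own back-of-envelope birthday bound, not the team's actual math): hashing n strings into 64-bit ids collides with probability about 1 - exp(-n^2 / 2^65).

import math

def collision_probability(n: int, bits: int = 64) -> float:
    # Birthday-problem approximation: P(collision) ~= 1 - exp(-n^2 / 2^(bits+1))
    return 1 - math.exp(-(n * n) / 2 ** (bits + 1))

print(f"{collision_probability(10**9):.3f}")  # ~0.027 for a billion strings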
