
ref(sentry-metrics): Add MetricsKeyIndexer table #28914

Merged · 14 commits · Oct 5, 2021

Conversation

@MeredithAnya (Member) commented Sep 28, 2021

Context
This PR added the metrics consumer (it takes messages from Relay, translates them, and passes them along to Snuba), but just used Redis as a dummy indexer to store the string -> int (and vice versa) translations. The next phase is to add persistence, which will be Postgres.

MetricsKeyIndexer
This table will ultimately be on its own Postgres instance. My understanding is that for the sake of development I can run this migration as normal. If that's not the case, I'll need to figure out what I should be doing instead.

Schema
From my understanding, the basic columns for this table are:

string = str      // the metric name, tag key, or tag value
id     = int64    // the int version of the metric name, tag key, or tag value

pk          --> (id)
btree index --> (string)

BTree index on string
We are going to want to look up the string from the id as well as the id from the string, so the table needs to be indexed both ways: the primary key covers the id lookups, and a BTree index on string covers the string lookups.

Sequence vs UUID
We want the ids to be 64-bit integers (UUIDs would be twice as long at 128 bits) so that ClickHouse is more efficient when querying. Additionally, the UUID type does not work with the bloom filter index [1]. Ultimately we are going to want to pre-allocate multiple values at a time in Redis so that we don't have to hit Postgres every time we record a new metric (or tag key or value). For these reasons we can use a sequence (for now; this could change later).
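
For illustration, the pre-allocation could look something like this (a minimal sketch, assuming a plain redis-py client; the Redis key name and block size here are made up):

# Sketch: reserve a block of ids from the Postgres sequence and park them
# in Redis so recording a new string doesn't have to hit Postgres each time.
# "metrics-indexer:id-pool" is a placeholder key name.
import redis
from django.db import connection

client = redis.Redis()

def refill_id_pool(num: int = 100) -> None:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT nextval('sentry_metricskeyindexer_id_seq') FROM generate_series(1, %s)",
            [num],
        )
        ids = [row[0] for row in cursor.fetchall()]
    client.rpush("metrics-indexer:id-pool", *ids)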


[EDITED]: Keeping the below as historical context, but it is outdated

It seems to me there are three main ways to use a sequence:

  1. created independently from a table, literally just CREATE SEQUENCE sequence_name
  2. created independently from a table, but owned by a column in the table: CREATE SEQUENCE sequence_name OWNED BY table.column_name
  3. created behind the scenes when using an auto increment field (what all our id fields do with BoundedBigAutoField)
    • Django doesn't seem to let you use an auto increment field that isn't a primary key, even if you specify another field as the primary key. We could probably make a new Field to override this behavior, but I wasn't sure it was worth it.
    • If we used an auto increment field and tried to INSERT a different value (one that wasn't generated by Postgres), the auto increment isn't aware of that. E.g. you are at value 5, you INSERT value = 10, and then go back to using the increment: you'd be at 6 (and once the sequence reached 10 you'd get an IntegrityError).
    • I don't think there is a way to change the INCREMENT value from 1, so if we wanted it to use an INCREMENT of 100, I don't know how we'd do that (maybe we'd just have to ALTER SEQUENCE). Both of these caveats are sketched right after this list.
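
A quick sketch of those two caveats (run against a throwaway table; demo_table and its sequence are placeholders, not real migration objects):

from django.db import connection

with connection.cursor() as cursor:
    # The sequence doesn't track out-of-band inserts: with the sequence at 5,
    # a manual INSERT of id = 10 leaves nextval() returning 6, and the
    # sequence will eventually hand out 10 again -> IntegrityError.
    cursor.execute("INSERT INTO demo_table (id) VALUES (10)")
    cursor.execute("SELECT nextval('demo_table_id_seq')")  # returns 6, not 11

    # Changing the step size isn't exposed as a Django field option, so it
    # would have to be done directly in SQL:
    cursor.execute("ALTER SEQUENCE demo_table_id_seq INCREMENT BY 100")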

For the reasons listed above I started with option 2, but there are some caveats/issues:

  • Unlike the auto increment, you must already have the value when you save the record. Since this is still in dev (and we aren't pre-allocating values yet), I've implemented that myself in the save() method for now.
  • For some reason, the test db doesn't seem to have the sequence created even though the MetricsKeyIndexer table is in there. I haven't figured out why, but in the meantime I have to create the sequence in setUp (and drop it in tearDown) for the tests, as sketched below. I don't have this problem accessing Postgres locally, so I'm not sure what's going on.
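
Roughly what that test workaround looks like (a sketch; the test class is hypothetical and the sequence name mirrors the one used in this PR):

from django.db import connection
from django.test import TestCase

class MetricsKeyIndexerTest(TestCase):
    def setUp(self) -> None:
        with connection.cursor() as cursor:
            cursor.execute("CREATE SEQUENCE IF NOT EXISTS sentry_metricskeyindexer_seq")

    def tearDown(self) -> None:
        with connection.cursor() as cursor:
            cursor.execute("DROP SEQUENCE IF EXISTS sentry_metricskeyindexer_seq")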

Footnotes

  1. needs to be verified, but heard from a reputable source: @fpacifici

github-actions bot (Contributor) commented Sep 28, 2021

This PR has a migration; here is the generated SQL

BEGIN;
--
-- Create model MetricsKeyIndexer
--
CREATE TABLE "sentry_metricskeyindexer" ("id" bigserial NOT NULL PRIMARY KEY, "string" varchar(200) NOT NULL, "date_added" timestamp with time zone NOT NULL);
--
-- Create constraint unique_string on model metricskeyindexer
--
ALTER TABLE "sentry_metricskeyindexer" ADD CONSTRAINT "unique_string" UNIQUE ("string");
COMMIT;

@@ -1357,7 +1358,7 @@ def create_partitioned_queues(name):
 SENTRY_METRICS_SKIP_INTERNAL_PREFIXES = []  # Order this by most frequent prefixes.

 # Metrics product
-SENTRY_METRICS_INDEXER = "sentry.sentry_metrics.indexer.mock.MockIndexer"
+SENTRY_METRICS_INDEXER = "sentry.sentry_metrics.indexer.postgres.PGStringIndexer"
@MeredithAnya (Member, Author) commented:

I don't really like the naming; I just wanted to make it different from the Redis indexer, and I wasn't sure if we were ready to use this as the actual string indexer yet.

@@ -37,7 +37,9 @@ def _check_db_routing(migration):
 def _check_operations(operations):
     failed_ops = []
     for operation in operations:
-        if isinstance(operation, (FieldOperation, ModelOperation, RenameContentType)):
+        if isinstance(
+            operation, (FieldOperation, ModelOperation, RenameContentType, IndexOperation)
+        ):
@MeredithAnya (Member, Author) commented:

I had to add IndexOperation here because otherwise I got the missing "hints={'tables':..} argument" error for AddIndex. It seemed to me that since the AddIndex operation is model-specific, I could put this here. cc @wedamija

Member commented:

That looks good to me

    ]

    @classmethod
    def get_next_values(cls, num: int):
@MeredithAnya (Member, Author) commented:

This is not used right now, but it was my thought process for later: to test out getting n next values. This does load all the integers, which may not be ideal, so the other option would be to make the increment larger and just compute the range from the last value to the next value, e.g. if currval is 100 and the next value is 200, we'd own 101-200. A sketch of that alternative is below.
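
A sketch of that range-based alternative (assuming the sequence had been altered to INCREMENT BY num beforehand; get_next_range is a hypothetical name):

from django.db import connections, router

def get_next_range(model_cls, num: int) -> range:
    # One nextval() call reserves a whole block; the individual ids are
    # derived locally instead of materializing one row per id.
    using = router.db_for_write(model_cls)
    with connections[using].cursor() as cursor:
        cursor.execute("SELECT nextval('sentry_metricskeyindexer_id_seq')")
        last_val = cursor.fetchone()[0]
    # e.g. currval 100, nextval returns 200 -> we own 101..200
    return range(last_val - num + 1, last_val + 1)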

@jjbayer (Member) left a comment:

This is way more advanced db design than anything I've ever done, but I wonder if we could meet the same requirements with the following model:

class MetricsStringIndex(sentry.db.models.Model):
    organization_id = BoundedBigIntegerField()
    string = models.CharField(max_length=200)
    class Meta:
        unique_together = (("organization_id", "string"),)

The id field inherited from Model could serve as primary key and integer value. This would provide a sequence and efficient int -> string lookups out of the box. The unique_together prevents duplicates and adds an index under the hood, enabling efficient (org_id, string) -> int lookups.

You would still be able to do

SELECT nextval('sentry_metricsstringindex_id_seq') FROM generate_series(1, 100);

for pre-allocated values.

Maybe I'm completely missing some constraints though.

@MeredithAnya (Member, Author) replied:

> This is way more advanced db design than anything I've ever done, but I wonder if we could meet the same requirements with the following model: […]

If we want to make the value or id (whatever we want to call it) the primary key, then what you've outlined makes sense to me. Maybe the best thing for now is to use an auto increment field.

Comment on lines +23 to +30
def get_next_values(cls, num: int) -> Any:
    using = router.db_for_write(cls)
    connection = connections[using].cursor()

    connection.execute(
        "SELECT nextval('sentry_metricskeyindexer_id_seq') from generate_series(1,%s)", [num]
    )
    return connection.fetchall()
@wedamija (Member) commented:

I don't have a lot of context on this project, how will we use the values from this sequence?

It vaguely looks like you want to reserve a range of ids, and then use those ids later on to create new rows. Is that the general idea?

@MeredithAnya (Member, Author) replied:

@wedamija Sorry for not giving enough context in the PR description, I can go back and update it in a bit, but yeah:

> It vaguely looks like you want to reserve a range of ids, and then use those ids later on to create new rows. Is that the general idea?

That's basically it. Eventually we want to have Postgres off the critical path, but in order to do that we need to know the ids ahead of time. What I am unsure about is what kind of ranges we are talking about: is it 100, 1000, 10000? Since this metrics indexer will be used for metric names, tag keys, and tag values, it could be a lot of writes for high-cardinality tags.

@wedamija (Member) commented Oct 2, 2021:

Looks good for now. Once we know how many ids we're allocating per second we can decide whether we need to do something more complex here.

I'm not sure if there's a performance hit to calling nextval 10k times, as opposed to doing something like https://www.depesz.com/2008/03/20/getting-multiple-values-from-sequences/. Something we can possibly benchmark in the future.

Contributor commented:

I bet it will be desirable to avoid calling nextval 10k times (that would be 10k writes, I believe). We may not strictly need a sequence at that point, just a counter.
I think we could discuss the solutions then. At this point I am not sure it is very useful to have this method; probably better to remove it for now in case somebody decides to start depending on it for some reason.

@fpacifici (Contributor) left a comment:

Mostly some clarification on how the model is built. Otherwise seems good to me.

Comment on lines 52 to 54
constraint=models.UniqueConstraint(
    fields=("organization_id", "string"), name="unique_org_string"
),
@fpacifici (Contributor) commented:

Does this also add the index?

@MeredithAnya (Member, Author) replied:

Yeah, and the auto increment also happens for the id because its default is set to nextval() on the sequence:

sentry=# \d sentry_metricskeyindexer
                                     Table "public.sentry_metricskeyindexer"
     Column      |          Type          |                               Modifiers
-----------------+------------------------+-----------------------------------------------------------------------
 id              | bigint                 | not null default nextval('sentry_metricskeyindexer_id_seq'::regclass)
 organization_id | bigint                 | not null
 string          | character varying(200) | not null
Indexes:
    "sentry_metricskeyindexer_pkey" PRIMARY KEY, btree (id)
    "unique_org_string" UNIQUE CONSTRAINT, btree (organization_id, string)

def _bulk_record(self, org_id: int, unmapped_strings: Set[str]) -> List[Record]:
    records = []
    for string in unmapped_strings:
        obj = MetricsKeyIndexer.objects.create(organization_id=org_id, string=string)
@fpacifici (Contributor) commented:

Do we have a way to create objects in batch? It should be more efficient for Postgres.

@MeredithAnya (Member, Author) replied:

I wasn't using bulk_create before because it doesn't call save() (which I was using to set the value), but now that I've gone back to using an auto increment field I think we can use it; a sketch is below.
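
Something like this, presumably (a hypothetical rewrite, not the code in this PR; on Postgres, bulk_create fills in the auto-increment ids on the returned objects):

from typing import List, Set

def _bulk_record(self, org_id: int, unmapped_strings: Set[str]) -> List[MetricsKeyIndexer]:
    # One INSERT for all new strings instead of one round trip per string.
    return MetricsKeyIndexer.objects.bulk_create(
        MetricsKeyIndexer(organization_id=org_id, string=string)
        for string in unmapped_strings
    )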

@wedamija (Member) commented Oct 1, 2021:

@MeredithAnya I'm curious about whether we need to use this sequence at all. What are the requirements for these ids? I'm wondering whether we could use UUIDs here

@MeredithAnya (Member, Author) replied:

> @MeredithAnya I'm curious about whether we need to use this sequence at all. What are the requirements for these ids? I'm wondering whether we could use UUIDs here

@wedamija the ids have to be 64-bit integers, and according to the maths (not done by me, though) we can't use a hashing scheme because the probability of collisions is too high for comfort.
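
For a rough sense of why (my own back-of-envelope birthday bound, not the team's actual math): hashing n strings into 64-bit ids collides with probability about 1 - exp(-n^2 / 2^65).

import math

def collision_probability(n: int, bits: int = 64) -> float:
    # Birthday-problem approximation: P(collision) ~= 1 - exp(-n^2 / 2^(bits+1))
    return 1 - math.exp(-(n * n) / 2 ** (bits + 1))

print(f"{collision_probability(10**9):.3f}")  # ~0.027 for a billion strings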
