ref(sentry-metrics): Add MetricsKeyIndexer table #28914
Conversation
This PR has a migration; here is the generated SQL:

BEGIN;
--
-- Create model MetricsKeyIndexer
--
CREATE TABLE "sentry_metricskeyindexer" ("id" bigserial NOT NULL PRIMARY KEY, "string" varchar(200) NOT NULL, "date_added" timestamp with time zone NOT NULL);
--
-- Create constraint unique_string on model metricskeyindexer
--
ALTER TABLE "sentry_metricskeyindexer" ADD CONSTRAINT "unique_string" UNIQUE ("string");
COMMIT;
src/sentry/conf/server.py (Outdated)
@@ -1357,7 +1358,7 @@ def create_partitioned_queues(name):
 SENTRY_METRICS_SKIP_INTERNAL_PREFIXES = []  # Order this by most frequent prefixes.

 # Metrics product
-SENTRY_METRICS_INDEXER = "sentry.sentry_metrics.indexer.mock.MockIndexer"
+SENTRY_METRICS_INDEXER = "sentry.sentry_metrics.indexer.postgres.PGStringIndexer"
I don't really like the naming; I just wanted to make it different from the redis indexer, and I wasn't sure if we were ready to use this as the actual string indexer yet.
@@ -37,7 +37,9 @@ def _check_db_routing(migration):
 def _check_operations(operations):
     failed_ops = []
     for operation in operations:
-        if isinstance(operation, (FieldOperation, ModelOperation, RenameContentType)):
+        if isinstance(
+            operation, (FieldOperation, ModelOperation, RenameContentType, IndexOperation)
+        ):
I had to add IndexOperation here because otherwise I got the missing "hints={'tables':..} argument" error for AddIndex. It seemed to me that since the AddIndex operation is model specific, I could put this here. cc @wedamija
That looks good to me
    ]

    @classmethod
    def get_next_values(cls, num: int):
This is not used right now, but it was my thought process for later: to test out getting n next values. This does load all the integers, which may not be ideal, so the other option would be to make the increment larger and just take the range from the last value to the next value, e.g. currval is 100 and nextval is 200, so we own 101-200 (or whatever).
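A rough sketch of that larger-increment variant (nothing like this is in the PR; reserve_id_block is a hypothetical helper, and it assumes the sequence was created with INCREMENT BY 100):

    from typing import List

    from django.db import connections, router

    BLOCK_SIZE = 100  # must match the sequence's INCREMENT BY

    def reserve_id_block() -> List[int]:
        # MetricsKeyIndexer is the model from this PR.
        using = router.db_for_write(MetricsKeyIndexer)
        with connections[using].cursor() as cursor:
            # One round trip: nextval() advances the sequence by BLOCK_SIZE,
            # so we never have to materialize every integer via generate_series.
            cursor.execute("SELECT nextval('sentry_metricskeyindexer_id_seq')")
            top = cursor.fetchone()[0]
        # e.g. the previous value was 100 and nextval returns 200: we own 101-200.
        return list(range(top - BLOCK_SIZE + 1, top + 1))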
Force-pushed from 56d0416 to 3a316c0.
This is way more advanced db design than anything I've ever done, but I wonder if we could meet the same requirements with the following model:
    class MetricsStringIndex(sentry.db.models.Model):
        organization_id = BoundedBigIntegerField()
        string = models.CharField(max_length=200)

        class Meta:
            unique_together = (("organization_id", "string"),)
The id field inherited from Model could serve as both primary key and integer value. This would provide a sequence and efficient int -> string lookups out of the box. The unique_together prevents duplicates and adds an index under the hood, enabling efficient (org_id, string) -> int lookups.
You would still be able to do

    SELECT nextval('sentry_metricsstringindex_id_seq') FROM generate_series(1, 100);

for pre-allocated values.
Maybe I'm completely missing some constraints though.
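To make the "out of the box" lookups concrete, a minimal sketch against the model above (the helper names are made up):

    # Sketch only: MetricsStringIndex is the model suggested above.

    # (org_id, string) -> int, served by the unique_together index.
    def resolve(org_id: int, string: str) -> int:
        obj, _created = MetricsStringIndex.objects.get_or_create(
            organization_id=org_id, string=string
        )
        return obj.id  # the auto-increment id doubles as the integer value

    # int -> string, served by the primary key index.
    def reverse_resolve(value: int) -> str:
        return MetricsStringIndex.objects.get(id=value).string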
    def get_next_values(cls, num: int) -> Any:
        using = router.db_for_write(cls)
        connection = connections[using].cursor()

        connection.execute(
            "SELECT nextval('sentry_metricskeyindexer_id_seq') from generate_series(1,%s)", [num]
        )
        return connection.fetchall()
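(For reference, fetchall() on that query returns one single-element tuple per generated row, so a caller would see something like the following; the exact values depend on the sequence state.)

    values = MetricsKeyIndexer.get_next_values(3)  # e.g. [(1,), (2,), (3,)]
    ids = [row[0] for row in values]               # flatten to plain ints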
I don't have a lot of context on this project; how will we use the values from this sequence?
It vaguely looks like you want to reserve a range of ids, and then use those ids later on to create new rows. Is that the general idea?
@wedamija Sorry for not giving enough context in the PR description, I can go back and update it in a bit, but yeah:
It vaguely looks like you want to reserve a range of ids, and then use those ids later on to create new rows. Is that the general idea?
That's basically it. Eventually we want to have postgres be off the critical path, but in order to do that we need to know the ids ahead of time. What I am unsure about is what kind of ranges we are talking about: is it 100, 1000, 10000? Since this metrics indexer will be used for metric names, tag keys, and tag values, it could be a lot of writes for high-cardinality tags.
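A rough sketch of that flow, to make the idea concrete (all names hypothetical; a redis list is just one possible holding area):

    RESERVE_BATCH = 1000  # 100? 10000? exactly the open question above

    def refill_id_pool(client) -> None:
        # One postgres hit reserves a batch of ids for later use.
        rows = MetricsKeyIndexer.get_next_values(RESERVE_BATCH)
        client.rpush("metrics-indexer:id-pool", *[row[0] for row in rows])

    def next_id(client) -> int:
        value = client.lpop("metrics-indexer:id-pool")
        if value is None:  # pool exhausted: slow path back to postgres
            refill_id_pool(client)
            value = client.lpop("metrics-indexer:id-pool")
        return int(value)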
Looks good for now. Once we know how many ids we're allocating per second we can decide whether we need to do something more complex here.
I'm not sure if there's a performance hit to calling nextval 10k times, as opposed to doing something like https://www.depesz.com/2008/03/20/getting-multiple-values-from-sequences/. Something we can possibly benchmark in the future.
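A benchmark for that would essentially compare n round trips against one (a sketch; it assumes a raw DB-API cursor and doesn't reproduce the linked post's technique):

    import time

    def time_single_nextvals(cursor, n: int) -> float:
        start = time.perf_counter()
        for _ in range(n):  # n statements, n round trips
            cursor.execute("SELECT nextval('sentry_metricskeyindexer_id_seq')")
            cursor.fetchone()
        return time.perf_counter() - start

    def time_generate_series(cursor, n: int) -> float:
        start = time.perf_counter()  # one statement, one round trip
        cursor.execute(
            "SELECT nextval('sentry_metricskeyindexer_id_seq') FROM generate_series(1, %s)",
            [n],
        )
        cursor.fetchall()
        return time.perf_counter() - start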
I bet it will be desirable to avoid calling nextval 10k times (that would be 10k writes, I believe). We may not strictly need a sequence at that point, just a counter.
I think we could discuss the solutions then. At this point I am not sure it is very useful to have this method; probably better to remove it for now in case somebody decides to start depending on it for some reason.
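For what it's worth, the counter variant could be as small as one atomic INCRBY (hypothetical sketch, not something in this PR):

    # Hypothetical: a redis counter instead of a postgres sequence. INCRBY is
    # atomic, so concurrent consumers always get disjoint blocks. (A real
    # version would need a durability story: losing the counter re-issues ids.)
    def reserve_block(client, size: int = 1000) -> range:
        top = client.incrby("metrics-indexer:id-counter", size)
        return range(top - size + 1, top + 1)  # this process owns these ids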
Mostly some clarification on how the model is built. Otherwise seems good to me.
    constraint=models.UniqueConstraint(
        fields=("organization_id", "string"), name="unique_org_string"
    ),
Does this also add the index?
Yeah, and the auto-increment also happens on the id because its default is set to nextval() for the sequence:
sentry=# \d sentry_metricskeyindexer
Table "public.sentry_metricskeyindexer"
Column | Type | Modifiers
-----------------+------------------------+-----------------------------------------------------------------------
id | bigint | not null default nextval('sentry_metricskeyindexer_id_seq'::regclass)
organization_id | bigint | not null
string | character varying(200) | not null
Indexes:
"sentry_metricskeyindexer_pkey" PRIMARY KEY, btree (id)
"unique_org_string" UNIQUE CONSTRAINT, btree (organization_id, string)
    def _bulk_record(self, org_id: int, unmapped_strings: Set[str]) -> List[Record]:
        records = []
        for string in unmapped_strings:
            obj = MetricsKeyIndexer.objects.create(organization_id=org_id, string=string)
Do we have a way to create objects in batch? It should be more efficient for postgres.
I wasn't using bulk_create before because it doesn't call save() (which I was using to set the value), but now that I've gone back to using an auto-increment field I think we can use it.
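Something like this, presumably (a sketch only; the Record construction is guessed from the surrounding loop, which isn't fully shown here):

    def _bulk_record(self, org_id: int, unmapped_strings: Set[str]) -> List[Record]:
        # One INSERT for the whole batch; on postgres, bulk_create populates
        # the auto-increment ids on the returned objects.
        objs = MetricsKeyIndexer.objects.bulk_create(
            [MetricsKeyIndexer(organization_id=org_id, string=s) for s in unmapped_strings]
        )
        return [Record(obj.string, obj.id) for obj in objs]  # shape assumed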
@MeredithAnya I'm curious about whether we need to use this sequence at all. What are the requirements for these ids? I'm wondering whether we could use UUIDs here.
@wedamija The ids have to be 64-bit integers, and according to the maths (not done by me, though) we can't use a hashing scheme because the probability of collisions is too high for comfort.
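(For context, the usual birthday-bound arithmetic behind that kind of conclusion; my own back-of-envelope, not the referenced maths: with n strings hashed into a 64-bit space, the collision probability is roughly n^2 / 2^65.)

    # Birthday approximation: p ~ n^2 / (2 * 2^64) for n strings, 64-bit hashes.
    def collision_probability(n: float, bits: int = 64) -> float:
        return n * n / (2 * 2**bits)

    print(collision_probability(1e6))  # ~2.7e-08: probably fine
    print(collision_probability(1e9))  # ~2.7e-02: ~3%, too high for comfort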
Force-pushed from 0617619 to 7b552e8.
Context
This PR added the metrics consumer (take messages from relay, translate them, and pass them along to snuba) but just used redis as a dummy indexer to store the string -> int (and vice versa) translations. The next phase is to add persistence, which will be postgres.
MetricsKeyIndexer
This table will ultimately be on its own postgres instance. My understanding is that for the sake of development I can run this migration as normal. If that's not the case, well, I'll need to figure out what I should be doing.
Schema
From my understanding, the basic columns for this table are id and string, with a BTree index on string. We are going to want to be able to look up the string from the id as well as the id from the string, so it needs to be indexed both ways.

Sequence vs UUID
We want the ids to be 64-bit integers (UUIDs would be twice as long at 128 bits) so that ClickHouse is more efficient when querying. Additionally, the UUID type does not work with the bloom filter index [1]. Ultimately we are going to want to pre-allocate multiple values at a time in redis so that we don't have to hit postgres every time we record a new metric (or tag key or value). For these reasons we can use a sequence (for now; this could change later).
[EDITED]: Keeping the below as historical context, but it is outdated.
It seems to me there are three main ways to use a sequence:

1. CREATE SEQUENCE sequence_name
2. CREATE SEQUENCE sequence_name OWNED BY table.column_name
3. A serial/bigserial column (what Django's id fields do with BoundedBigAutoField)

With option 3, manually inserted values can trip up the sequence: if the sequence is at 5 and you INSERT value = 10 and then go back to using the increment, you'd be at 6 (and then at 10 you'd get an IntegrityError). A serial column also uses an INCREMENT value of 1, so if we wanted it to use an INCREMENT of 100, I don't know how we'd do that (maybe we'd just have to ALTER SEQUENCE).

For the reasons listed above I
started with option 2, but there are some caveats/issues:

- The sequence doesn't automatically set the value when you save the record. Since this is still in dev (and we aren't pre-allocating values yet), I've kind of implemented that myself in the save() method for now.
- In the test database the sequence doesn't seem to get created, even though MetricsIndexerTable is in there. I haven't figured out why, but in the meantime I have to create the sequence in setUp (and then drop it in tearDown) for the tests. I don't have this problem just accessing postgres locally, so I don't know.
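The setUp/tearDown workaround described above looks roughly like this (a sketch; the test class name is made up, the sequence name is from this PR):

    from django.db import connection
    from django.test import TestCase

    class MetricsKeyIndexerTest(TestCase):  # hypothetical test class
        def setUp(self) -> None:
            with connection.cursor() as cursor:
                # The test DB is somehow missing the sequence, so create it by hand.
                cursor.execute("CREATE SEQUENCE IF NOT EXISTS sentry_metricskeyindexer_id_seq")

        def tearDown(self) -> None:
            with connection.cursor() as cursor:
                cursor.execute("DROP SEQUENCE IF EXISTS sentry_metricskeyindexer_id_seq")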
Footnotes

[1] Needs to be verified, but heard from a reputable source: @fpacifici