ref(sentry-metrics): Add basic caching layer to indexer #29115
Conversation
I think you should be able to use the cache provided by the default model manager at this stage (even though later, if we remove postgres from the ingestion path, we will need something like this).
```diff
-from typing import Any, Dict, List, Optional, Set
+from typing import Any, List, Mapping, MutableMapping, Optional, Set, Union
+
+from django.core.cache import cache
```
I think you are looking for `sentry.utils.cache`
That's the same cache though, right?

```python
# sentry/src/sentry/utils/cache.py, line 3 (at 237406b)
from django.core.cache import cache
```
```python
def bulk_record(self, strings: List[str]) -> Mapping[str, int]:
    # first and foremost, check the cache
    keys = [self._build_indexer_cache_key(string) for string in strings]
    cache_results: Mapping[str, int] = cache.get_many(keys)

    uncached = set(strings).difference(cache_results.keys())

    if not uncached:
        return cache_results

    # next, look up any values that have been created but aren't in the cache
    records = MetricsKeyIndexer.objects.filter(string__in=uncached)
    result: MutableMapping[str, int] = {**cache_results}

    to_cache: MutableMapping[str, Any] = {}
```
I think this is simpler at this stage.
It is true that, if/when we completely remove postgres from the critical path, we will need to deal with the cache manually. But right now, since you are using the standard Django model directly on the critical path, you can rely on the cache that is already provided by the model itself:
https://github.com/getsentry/sentry/blob/master/src/sentry/db/models/manager/base.py#L296
Specifically `get_many_from_cache`.
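As a minimal sketch, assuming `get_many_from_cache(values, key=...)` behaves as on the `BaseManager` linked above, `bulk_record` could lean on the manager's cache like this:

```python
# Sketch only: route the lookup through the model manager's cache rather
# than calling django.core.cache directly. key="string" assumes "string"
# is one of the manager's cache_fields.
records = MetricsKeyIndexer.objects.get_many_from_cache(list(strings), key="string")
result = {record.string: record.id for record in records}
```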
So we would do something like this in the `MetricsKeyIndexer` model?

```python
objects = BaseManager(cache_fields=("pk", "string"), cache_ttl=something_long)
```
@fpacifici so because we are using `bulk_create`, the `post_save` signal doesn't get called on those inserts. Additionally, `bulk_create` doesn't return the `pk` for auto-incrementing fields if `ignore_conflicts=True`, so even if I tried to send the `post_save` signal, the cache would store the `pk` values as `None`, because the objects returned by `bulk_create` don't have the `pk` set on the model (sketched below).
I don't know the best way around this, because it seems like if you use `bulk_create` the way we are, the cache will be populated when we query (using `get_many_from_cache` or `get_from_cache`) and not when we create the values. Is that reasonable though? Maybe it is?
edit: okay, actually I think maybe I can just use the objects returned by `MetricsKeyIndexer.objects.filter(string__in=unmapped_strings)` instead for the `post_save` calls
another edit: I'm not sure if doing ^ could result in calling `post_save` twice for a record if there was a race condition when calling `bulk_create`. I don't think that would affect populating the cache or anything, just not sure if it would be a problem for anything else.
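To make the `pk` problem concrete, here is a hypothetical illustration (not the PR's code; `unmapped_strings` is assumed to be a set of new strings):

```python
# With ignore_conflicts=True, Django skips RETURNING, so auto-increment pks
# are not set on the instances bulk_create hands back.
created = MetricsKeyIndexer.objects.bulk_create(
    [MetricsKeyIndexer(string=s) for s in unmapped_strings],
    ignore_conflicts=True,
)
# created[0].pk is None here, so caching these instances would cache None ids.
# Re-querying returns rows with real pks:
records = MetricsKeyIndexer.objects.filter(string__in=unmapped_strings)
```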
```python
if to_cache:
    # caches both newly created indexer values as well as previously created
    # indexer records that weren't in the cache already
    cache.set_many(to_cache)
```
You will want to set the TTL manually (likely it will be a long one); we should have the TTL as a setting.
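For example, a minimal sketch using the `SENTRY_METRICS_INDEXER_CACHE_TTL` setting introduced later in this PR:

```python
from django.conf import settings

# django's cache.set_many accepts a single timeout (in seconds) for all keys
cache.set_many(to_cache, timeout=settings.SENTRY_METRICS_INDEXER_CACHE_TTL)
```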
```diff
@@ -12,6 +14,8 @@ class MetricsKeyIndexer(Model):  # type: ignore
     string = models.CharField(max_length=200)
     date_added = models.DateTimeField(default=timezone.now)

+    objects = BaseManager(cache_fields=("pk", "string"), cache_ttl=settings.SENTRY_METRICS_INDEXER_CACHE_TTL)  # type: ignore
```
not sure how to type this
What is the issue actually? I don't see it. Is it because it doesn't know that settings.SENTRY_METRICS_INDEXER_CACHE_TTL is an int?
no, it wanted the type for `objects`:

```
src/sentry/sentry_metrics/indexer/models.py:17: error: Need type annotation for 'objects'
```

I mean, I could have typed it as `Any`, I guess, but I don't know if that's any better.
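One possible way around the error, sketched on the assumption that an explicit attribute annotation is enough to satisfy mypy here:

```python
# Annotating `objects` directly avoids both `Any` and `# type: ignore`
# (assumes BaseManager is importable where the model is defined).
objects: BaseManager = BaseManager(
    cache_fields=("pk", "string"),
    cache_ttl=settings.SENTRY_METRICS_INDEXER_CACHE_TTL,
)
```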
src/sentry/conf/server.py (outdated)

```diff
@@ -1377,6 +1377,7 @@ def create_partitioned_queues(name):
 # Metrics product
 SENTRY_METRICS_INDEXER = "sentry.sentry_metrics.indexer.mock.MockIndexer"
 SENTRY_METRICS_INDEXER_OPTIONS = {}
+SENTRY_METRICS_INDEXER_CACHE_TTL = 3600 * 24
```
arbitrary number, I don't really know what we want this to be
Does this apply to all the indexer backends or just the Postgres one? If just Postgres, then should it be in `SENTRY_METRICS_INDEXER_OPTIONS`?
hmm, well the mock indexer doesn't use a cache, and I think I can probably get rid of the redis indexer now since that was just temporary, so I guess I could put it in the options if we wanted — maybe something like the sketch below.
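Hypothetically, the backend-scoped version might look like this (the `cache_ttl` key is illustrative, not an existing option):

```python
# Sketch: scope the TTL to the backend-specific options dict instead of a
# top-level setting. The "cache_ttl" key is hypothetical.
SENTRY_METRICS_INDEXER_OPTIONS = {"cache_ttl": 3600 * 2}
```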
Let's start with a shorter TTL. 2 hours will be enough. Just in case we start creating tons of keys.
This looks good! See comment on populating the cache.
```python
# possible concerns:
#   could this be sending `post_save` for an obj that is already created?
#   and if so, does that matter?
post_save.send(sender=type(indexer_obj), instance=indexer_obj, created=True)
```
I wonder if it is even necessary to populate the cache here:

- `bulk_record` should run as fast as possible, and populating the cache here requires an extra db query.
- There is already a call to `get_many_from_cache` at the beginning of `bulk_record` which populates the cache for existing entries, so the chance that an entry created here will be cached in the next call to `bulk_record` is pretty high.
- At the latest, the entry will be cached by the first call to `resolve`, which might be good enough.
> `bulk_record` should run as fast as possible, and populating the cache here requires an extra db query.

well, I think it's either an extra query here after it's created, OR it's an extra query when we call `get_many_from_cache` on the next batch, assuming we query for the same metric name/tag that was just created (but not saved in the cache). right?
also, I think `get_many_from_cache` does kind of the same thing I'm doing, in that it does a filter and then calls post_save (its own `__post_save` on each instance):

```python
# sentry/src/sentry/db/models/manager/base.py, lines 407-409 (at 03fe65c)
# XXX: Should use set_many here, but __post_save code is too complex
for instance in cache_writes:
    self.__post_save(instance=instance)
```

so I mean, I'm not sure for now if it makes a huge difference, especially since things will change when we need to pre-allocate the ids. I can take out the calls to `post_save` here and just let things be cached when they come in again or, like you said, at the very latest when we call `resolve`.
cc @jjbayer (@fpacifici any thoughts?)
I would not add a post_save here. The number of queries in the end is comparable, so that is likely to be an irrelevant difference; but in terms of code clarity, here we are calling into something that is internal to the model manager. There is no reason for this coupling here.
Also, if you replace the filter with `get_many_from_cache` at line 26, the problem does not exist, as the post_save will be called there.
```diff
@@ -50,39 +23,33 @@ def _bulk_record(self, unmapped_strings: Set[str]) -> Any:
     # Using `ignore_conflicts=True` prevents the pk from being set on the model
     # instances. Re-query the database to fetch the rows, they should all exist at this
     # point.
-    return MetricsKeyIndexer.objects.filter(string__in=unmapped_strings)
+    indexer_objs = MetricsKeyIndexer.objects.filter(string__in=unmapped_strings)
```
What if we called `get_many_from_cache` here? This would automatically populate the cache, making the individual calls to `post_save` unnecessary.
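A sketch of that suggestion, assuming `get_many_from_cache` takes a `key=` kwarg naming the cached field (as on `BaseManager` above):

```python
# Sketch: fetching through the manager's cache populates it as a side
# effect of the lookup, so no manual post_save calls are needed.
indexer_objs = MetricsKeyIndexer.objects.get_many_from_cache(
    list(unmapped_strings), key="string"
)
```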
yes please
Please replace the filter with `get_many_from_cache`, then it looks good to me.
Context
More information about what the metrics indexer is used for can be found in #28914 (and the PRs linked from that PR).

This PR
This adds a basic caching layer to the indexer. Since we'll want to use the indexer to look up both `string` -> `id` and `id` -> `string`, we'll cache both of those relationships. To keep things simpler, this uses the default caching on Django models. In the future we will probably have to handle the caching manually.
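For illustration, the two cached relationships amount to something like this; the key format here is hypothetical, not the model cache's actual scheme:

```python
# Sketch of the two lookup directions; key names are illustrative only.
cache.set_many(
    {
        f"indexer:string:{obj.string}": obj.id,  # string -> id (resolve)
        f"indexer:id:{obj.id}": obj.string,      # id -> string (reverse lookup)
    },
    timeout=settings.SENTRY_METRICS_INDEXER_CACHE_TTL,
)
```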