
ref(sentry-metrics): Add basic caching layer to indexer #29115

Merged: 4 commits into master, Oct 8, 2021

Conversation

@MeredithAnya (Member) commented Oct 6, 2021

Context
More information about what the metrics indexer is used for can be found in #28914 (and the PRs linked from that PR).

This PR
This adds a basic caching layer to the indexer. Since we'll want to use the indexer to look up both string -> id and id -> string, we'll cache both of those relationships.

To keep things simpler, this uses the default caching on Django models. In the future we will probably have to handle the caching manually.
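
As a minimal sketch (not the PR's actual code), caching both directions with Django's cache could look like the following; the key format here is hypothetical:

    from django.core.cache import cache

    def cache_both_directions(string: str, id: int, ttl: int) -> None:
        # Store both mappings so string -> id (resolve) and id -> string
        # (reverse_resolve) lookups can each be served from the cache.
        cache.set_many(
            {f"indexer:str:{string}": id, f"indexer:int:{id}": string},
            timeout=ttl,
        )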

@fpacifici (Contributor) left a comment


I think you should be able to use the cache provided by the default model manager at this stage (even though later, if we remove Postgres from the ingestion path, we will need something like this).

-from typing import Any, Dict, List, Optional, Set
+from typing import Any, List, Mapping, MutableMapping, Optional, Set, Union

+from django.core.cache import cache
Contributor:

I think you are looking for sentry.utils.cache.

Member:

That's the same cache though, right?

from django.core.cache import cache

Comment on lines 55 to 69
def bulk_record(self, strings: List[str]) -> Mapping[str, int]:
    # first and foremost, check the cache
    keys = [self._build_indexer_cache_key(string) for string in strings]
    cache_results: Mapping[str, int] = cache.get_many(keys)

    uncached = set(strings).difference(cache_results.keys())

    if not uncached:
        return cache_results

    # next, look up any values that have been created but aren't in the cache
    records = MetricsKeyIndexer.objects.filter(string__in=uncached)
    result: MutableMapping[str, int] = {**cache_results}

    to_cache: MutableMapping[str, Any] = {}
Contributor:

I think this is simpler at this stage.
It is true that, if/when we completely remove Postgres from the critical path, we will need to deal with the cache manually; but right now, since you are using the standard Django model directly on the critical path, you can rely on the cache that is already provided by the model itself.

https://github.com/getsentry/sentry/blob/master/src/sentry/db/models/manager/base.py#L296

Specifically, get_many_from_cache.

Member:

So we would do something like this in the MetricsKeyIndexer model?

objects = BaseManager(cache_fields=("pk", "string"), cache_ttl=something_long)

@MeredithAnya (Member, Author) commented Oct 7, 2021:

@fpacifici: because we are using bulk_create, the post_save signal doesn't get called on those inserts. Additionally, bulk_create doesn't return the pk for auto-incrementing fields if ignore_conflicts=True, so even if I tried to send the post_save signal, the cache would store the pk values as None, because the objects returned by bulk_create don't have the pk set on the model.

I don't know the best way around this, because it seems like if you use bulk_create the way we are, the cache will be populated when we query (using get_many_from_cache or get_from_cache) and not when we create the values. Is that reasonable, though? Maybe it is?

edit: okay, actually I think I can just use the objects returned by MetricsKeyIndexer.objects.filter(string__in=unmapped_strings) for the post_save calls instead.

another edit: I'm not sure if doing ^ could end up calling post_save twice for some records if there was a race condition during bulk_create. I don't think that would affect populating the cache; I'm just not sure if it would be a problem for anything else.
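
For context, the create-then-re-query pattern under discussion (taken from the diff later in this PR) is roughly:

    # `ignore_conflicts=True` prevents the pk from being set on the returned
    # instances, so the rows are re-queried afterwards.
    MetricsKeyIndexer.objects.bulk_create(
        [MetricsKeyIndexer(string=string) for string in unmapped_strings],
        ignore_conflicts=True,
    )
    indexer_objs = MetricsKeyIndexer.objects.filter(string__in=unmapped_strings)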

if to_cache:
    # caches both newly created indexer values as well as previously created
    # indexer records that weren't in the cache already
    cache.set_many(to_cache)
Contributor:

You will want to set the TTL manually (likely it will be a long one); we should have the TTL as a setting.
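
For example, once the TTL setting exists (it is added later in this PR as SENTRY_METRICS_INDEXER_CACHE_TTL), the call could pass it explicitly:

    from django.conf import settings

    # cache.set_many accepts a timeout; the backend default would apply otherwise
    cache.set_many(to_cache, timeout=settings.SENTRY_METRICS_INDEXER_CACHE_TTL)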

@@ -12,6 +14,8 @@ class MetricsKeyIndexer(Model):  # type: ignore
     string = models.CharField(max_length=200)
     date_added = models.DateTimeField(default=timezone.now)

+    objects = BaseManager(cache_fields=("pk", "string"), cache_ttl=settings.SENTRY_METRICS_INDEXER_CACHE_TTL)  # type: ignore
MeredithAnya (Author):

not sure how to type this

Member:

What is the issue actually? I don't see it. Is it because it doesn't know that settings.SENTRY_METRICS_INDEXER_CACHE_TTL is an int?

MeredithAnya (Author):

No, it wanted the type for objects:

src/sentry/sentry_metrics/indexer/models.py:17: error: Need type annotation for 'objects'

I could have typed it as Any, I guess, but I don't know if that's any better.

@MeredithAnya MeredithAnya requested a review from a team October 7, 2021 19:16
@@ -1377,6 +1377,7 @@ def create_partitioned_queues(name):
 # Metrics product
 SENTRY_METRICS_INDEXER = "sentry.sentry_metrics.indexer.mock.MockIndexer"
 SENTRY_METRICS_INDEXER_OPTIONS = {}
+SENTRY_METRICS_INDEXER_CACHE_TTL = 3600 * 24
MeredithAnya (Author):

Arbitrary number; I don't really know what we want this to be.

Member:

Does this apply to all the indexer backends, or just the Postgres one? If just Postgres, should it be in SENTRY_METRICS_INDEXER_OPTIONS?

MeredithAnya (Author):

Hmm, the mock indexer doesn't use a cache, and I think I can probably get rid of the Redis indexer now since that was just temporary, so I guess I could put it in the options if we wanted.

Contributor:

Let's start with a shorter TTL; 2 hours will be enough, just in case we start creating tons of keys.
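
In settings, that suggestion would read:

    SENTRY_METRICS_INDEXER_CACHE_TTL = 3600 * 2  # two hours instead of 24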

@jjbayer (Member) left a comment

This looks good! See comment on populating the cache.

# possible concerns:
# could this be sending `post_save` for an obj that is already created?
# and if so, does that matter?
post_save.send(sender=type(indexer_obj), instance=indexer_obj, created=True)
Member:

I wonder if it is even necessary to populate the cache here:

  • bulk_record should run as fast as possible, and populating the cache here requires an extra db query.
  • There is already a call to get_many_from_cache at the beginning of bulk_record which populates the cache for existing entries, so the chance that an entry created here will be cached in the next call to bulk_record is pretty high.
  • At the latest, the entry will be cached by the first call to resolve, which might be good enough.

MeredithAnya (Author):

> bulk_record should run as fast as possible, and populating the cache here requires an extra db query.

Well, I think it's either an extra query here after the record is created, or an extra query when we call get_many_from_cache on the next batch, assuming we query for the same metric name/tag that was just created (but not saved in the cache). Right?

Also, I think get_many_from_cache does roughly the same thing I'm doing, in that it does a filter and then calls post_save (its own __post_save on each instance):

# XXX: Should use set_many here, but __post_save code is too complex
for instance in cache_writes:
    self.__post_save(instance=instance)

So I'm not sure it makes a huge difference for now, especially since things will change when we need to pre-allocate the ids. I can take out the calls to post_save here and just let things be cached when they come in again or, like you said, at the very latest when we call resolve. cc @jjbayer (@fpacifici, any thoughts?)

Contributor:

I would not add a post_save here. The number of queries in the end is comparable, so that is likely an irrelevant difference; but in terms of code clarity, here we are calling into the internals of the model manager, and there is no reason for that coupling.

Also, if you replace the filter with get_many_from_cache at line 26, the problem does not exist, as post_save will be called there.
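
A sketch of that replacement, assuming get_many_from_cache(values, key=...) from sentry/db/models/manager/base.py:

    # Fetching through the manager's cache helper populates the cache as a
    # side effect of the read, so no manual post_save calls are needed.
    indexer_objs = MetricsKeyIndexer.objects.get_many_from_cache(
        list(unmapped_strings), key="string"
    )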

@@ -50,39 +23,33 @@ def _bulk_record(self, unmapped_strings: Set[str]) -> Any:
     # Using `ignore_conflicts=True` prevents the pk from being set on the model
     # instances. Re-query the database to fetch the rows, they should all exist at this
     # point.
-    return MetricsKeyIndexer.objects.filter(string__in=unmapped_strings)
+    indexer_objs = MetricsKeyIndexer.objects.filter(string__in=unmapped_strings)
Member:

What if we called get_many_from_cache here? This would automatically populate the cache, making the individual calls to post_save unnecessary.

Contributor:

yes please

@fpacifici (Contributor) left a comment

Please replace the filter with get_many_from_cache; then it looks good to me.


@MeredithAnya MeredithAnya merged commit 7decd33 into master Oct 8, 2021
@MeredithAnya MeredithAnya deleted the sentry-metrics/SNS-399 branch October 8, 2021 22:04