
feat: Set redis maxmemory and memory policy for docker compose #9573

Merged
merged 4 commits on May 5, 2022

Conversation

tiina303
Contributor

@tiina303 tiina303 commented Apr 28, 2022

Problem

We're starting to add a lot more data to Redis for persons on events.

We need to make sure we don't run out of memory. Discussion here

If you turn maxmemory off, Redis will start using virtual memory (i.e. swap), and performance will drop tremendously.

It is also worth noting that setting an expire value to a key costs memory, so using a policy like allkeys-lru is more memory efficient since there is no need for an expire configuration for the key to be evicted under memory pressure.

Picked 100M based on https://redis.io/docs/getting-started/faq/#whats-the-redis-memory-footprint

1 Million small Keys -> String Value pairs use ~ 85MB of memory.
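
(As a rough sanity check of that figure, the per-key overhead can be measured against a local dev Redis. The commands below are an illustrative sketch, assume an otherwise empty instance, and flushall will wipe it.)

redis-cli flushall
redis-cli info memory | grep '^used_memory:'
for i in $(seq 1 10000); do redis-cli set "k:$i" "v" > /dev/null; done
redis-cli info memory | grep '^used_memory:'
# the difference between the two readings, divided by 10000, approximates the bytes used per small key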

Changes


Set the max memory for Redis in all of our docker compose files to 100M and the eviction policy to allkeys-lru.
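
Concretely, the redis service in each compose file ends up with something along these lines (a sketch only; the image tag, ports, and surrounding service definition are assumptions rather than a copy of the diff):

redis:
    image: redis:6-alpine
    command: redis-server --maxmemory 100mb --maxmemory-policy allkeys-lru
    ports:
        - '6379:6379'

Passing the settings via the redis-server command line keeps the stock image usable without mounting a custom redis.conf.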

How did you test this code?

Before:

docker compose -f docker-compose.dev.yml up redis
redis-cli
127.0.0.1:6379> info memory
...
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
...

After:

docker compose -f docker-compose.dev.yml up redis
redis-cli
127.0.0.1:6379> info memory
...
maxmemory:104857600
maxmemory_human:100.00M
maxmemory_policy:allkeys-lru
...

Contributor

@yakkomajuri yakkomajuri left a comment


looks good overall but I'd put the limit higher than that.

We cache things like insight results which take a lot more memory.

let's maybe do 200mb

@hazzadous
Contributor

This might be fine, but it's worth thinking about whether we are using Redis for anything other than caching (caching being the case where we don't need a guarantee that keys will stick around).

When looking at e.g. the snowflake exporter, it looked like we were using it for things where we cared about durability, but I could be wrong. If we do care, it would make sense to have one thing that is explicitly a cache, and another that has the durability requirement.

Is there anywhere we expect durability @tiina303 @yakkomajuri @guidoiaquinti ?

@guidoiaquinti
Contributor

another that has the durability requirement.

🎯 we should expect the data in Redis to be ephemeral. If we want data persistence we should use another datastore.

@tiina303
Contributor Author

And we already have postgres, so it's not like it would be hard to use another non-ephemeral data source. That said, we'll look into what we're currently putting into Redis and the assumptions around that.

@yakkomajuri
Contributor

Yeah Redis data should be ephemeral. In the Snowflake case the usability of the Redis APIs makes up for the fact that we might lose a batch here or there if Redis goes down. In any case the user will have the files in S3 and can manually import them.

It's a tradeoff

@hazzadous
Contributor

Yeah Redis data should be ephemeral. In the Snowflake case the usability of the Redis APIs makes up for the fact that we might lose a batch here or there if Redis goes down. In any case the user will have the files in S3 and can manually import them.

It's a tradeoff

I would not trade correct functionality for ease of use, with our future selves or others scratching their heads in #community-support; we should be kind to the support hero 🥰

But I don’t have context on how important snowflake is and what the failure cases are, so I will defer to others to make the call.

@tiina303
Contributor Author

tiina303 commented May 2, 2022

Looked at the usage:

  1. https://github.com/PostHog/posthog/blob/master/plugin-server/src/utils/db/db.ts#L1223 - export events could get a bit less efficient if expired much faster, but no problem
  2. ⚠️ https://github.com/PostHog/posthog/blob/master/plugin-server/src/main/services/mmdb.ts#L26 - looks like we're relying on the expiry here to decide whether the next worker should try to download or not; we already store the attempts in postgres, so we should extend the redis lookup to fall back to postgres if the value isn't in redis anymore, and add a timestamp to that table.
  3. ⚠️ is_plugin_server_alive & version rely on a value in Redis that's set every 5 seconds (60 sec expiry) https://github.com/PostHog/posthog/blob/master/posthog/utils.py#L539 so if we get into a state where even recently set keys are being evicted, this one could be too. Though it looks like this is only used in preflight and the instance status page, so it would just lead to misleading information there.
  4. ⚠️ The plugin server's enabled job queues, as far as I can tell, are also just used for the instance status page. This key is set and never expires. Lookup: https://github.com/PostHog/posthog/blob/master/posthog/utils.py#L554 We should move this to postgres.
  5. ❓ Celery broker uses redisLPush https://github.com/PostHog/posthog/blob/master/plugin-server/src/utils/celery/broker.ts#L83
  6. ❓ Plugins usage

I couldn't see the other functions being used in code, unless we do some trickery with the way we call these functions, e.g. redisLLen https://github.com/search?q=repo%3Aposthog%2Fposthog+redisLLen&type=code - only the definition atm.

@yakkomajuri
Contributor

  1. 👍
  2. Yeah
  3. This is such a bad check 🤦 I guess it does check if a plugin server is alive which is maybe useful. We can move it to be a pub/sub PING
  4. Can ignore this for now, no biggie. Might as well also hardcode Graphile for now given we don't have any other queues available.
  5. We stopped using Celery for ingestion but still use it for public jobs. However, I believe we should never produce to it from the plugin server, although it seems we copy-pasted some code that allows us to both consume and produce. Should be fine to ignore + maybe we should move public jobs to Kafka later on.
  6. Search for cache.set at the org level. Most plugins should take this as ephemeral but some (Snowflake) could use a refactor for sure.

@tiina303
Contributor Author

tiina303 commented May 3, 2022

  1. As in we should make the change now before turning on the key expiry and writing person/groups info?
  2. What about using the _health endpoint?

Plugins:

  1. ⚠️ https://github.com/PostHog/currency-normalization-plugin/blob/main/index.js#L28-L29 - needs a minimal update so that fetchRatesIfNeeded returns the rates, as otherwise the key could expire in between. The impact if keys expire fast is that we'd potentially query openexchangerates more, but that should be fine.
  2. ⚠️ https://github.com/PostHog/ingestion-alert-plugin/blob/main/index.js#L5 - we might never trigger the resolve if the key expires; we should store active_alert in the cache as triggered or not, and if the key isn't there, always either trigger or resolve depending on the state. This would potentially lead to more webhook usage if keys expire faster than 1 minute.
  3. ⚠️ https://github.com/PostHog/posthog-pagerduty-plugin/blob/main/index.js#L4 - same as above; on the pagerduty side we should use a dedupe key (see) and we can run into API limits, which could mean the alert would get cleared on follow-up retries, which should be fine.
  4. ⚠️ https://github.com/PostHog/session-tracker-plugin/blob/main/index.ts#L34-L35 - relies on the key not expiring until its specified expiry minutes later. In https://github.com/PostHog/session-tracker-plugin/blob/main/index.ts#L42 it looks like we set it to how much inactivity we tolerate, and then look it up in https://github.com/PostHog/session-tracker-plugin/blob/main/index.ts#L53
  5. https://github.com/PostHog/github-star-sync-plugin/blob/main/index.js#L38 - probably fine, we'll just continue to query and hit rate limiting more
  6. ⚠️ https://github.com/PostHog/posthog-hello-world-plugin/blob/main/index.js#L16 & https://github.com/PostHog/posthog-plugin-starter-kit/blob/main/index.js & https://github.com/PostHog/posthog-plugin-advanced-kit/blob/main/src/index.ts#L23 - breaks, which is bad as those are our examples
  7. https://github.com/PostHog/google-ad-conversions-plugin/blob/master/index.ts#L34 we'll just fetch the access token more frequently
  8. https://github.com/PostHog/posthog-plugin-geoip/blob/main/index.ts#L85 we'll just end up setting person props more frequently
  9. ⚠️ https://github.com/PostHog/postgres-import-events-plugin/blob/30a63a75cf1fb5f11b0f449520359d7a075b80c5/index.ts#L59 - this would break
  10. https://github.com/PostHog/posthog-plugin-migrator3000/blob/d5e0b094946cff50fa2cd5f5e4cc97537bd10736/index.ts#L101 - seems fine we'll just try to run more frequently
  11. https://github.com/PostHog/github-release-tracking-plugin/blob/a94d37d779d094cd641da4424bf07d227e2a5c23/index.js#L47 - we just run more frequently
  12. ⚠️ https://github.com/PostHog/posthog-redshift-import-plugin/blob/03a6039f49a3c01a1c5e58d51b8004474443d5f9/index.ts#L105-L107 could break, i.e. ingesting twice
  13. ⚠️ https://github.com/PostHog/snowflake-export-plugin/blob/f69c3efdc6fb5adbd85c5f891ca1ac384cdae861/index.ts#L382 - breaks; we might miss files

^ should we change all of these now, or where can I look up plugin usage to see which ones we care about?

Others found from our code:
20. ✅ https://github.com/PostHog/posthog/blob/master/posthog/tasks/cohorts_in_feature_flag.py#L23 - that's probably not great if it expires too fast and we recompute the cohorts all the time, but the cache is used reasonably.
21. ✅ https://github.com/PostHog/posthog/blob/master/posthog/decorators.py#L55 & https://github.com/PostHog/posthog/blob/master/posthog/tasks/update_cache.py#L72 - we might be querying insights more if keys expire fast, but the cache is used reasonably.
22. ✅ https://github.com/PostHog/posthog-cloud/blob/7f9e3f9b41370f213125f57fe9d7653e4720d801/multi_tenancy/utils.py#L99 - we'll just be querying CH more
23. ⚠️ https://github.com/PostHog/posthog.com/blob/d2f8d0a9c053cd3cf17d05a171e93ef39ba30749/contents/docs/plugins/build/reference.md#cache - we might want to clarify expiry behaviour in our docs

Plugin usage: https://app.posthog.com/insights/kqU7e7dr shows only currency normalizer & hello-world from the ⚠️ items above.

@tiina303
Contributor Author

tiina303 commented May 3, 2022

Based on the plugins currently used on cloud: https://metabase.posthog.net/question/310-plugins-currently-enabled-on-cloud
In conclusion (combining that with the insight above), we care about:

7. ⚠️  currency normalizer
11. ✅  github star sync
12. ⚠️ hello world
13. ✅ google ad conversion sync
14. ✅ geoip
16. ✅ migrator
17. ✅ github release tracker
19. ⚠️  snowflake export

@yakkomajuri
Contributor

Ah sorry, didn't write the rest on 2 🤦

So from a practical perspective, we have a few options:

1. Easiest

Start with the initial suggestion of volatile-lru. In this case, we should guarantee that a lot of this stuff doesn't break with the new cache usage. Given we set expiries on persons/groups stuff, we'll make sure that any additional memory we're using can be cleared.

Problem: Our Cloud cluster currently operates on allkeys-lru so we might blow it up by flipping this.

2. Better

Fix the relevant stuff out of what you've mentioned.

I think this is a good exercise for us in infra upkeep. Will pay dividends in the long run to do these refactors. Also a great opportunity for you to work with plugins (I love that you've been reading plugin code to compile this!).


Thus, let's probably go with 2. Here's my opinion for what needs fixing, in some order of priority:

  1. MMDB: a little circuit breaker should do. I'm not super worried about keys expiring too fast. At the end of the day it's an LRU algorithm, and the key is being queried often. But it's still worth preventing us from blowing up here.
  2. Snowflake plugin
  3. Currency converter: could be worth having fetchRatesIfNeeded just return the rates given it has them, but otherwise not worried here. Hell of an edge case if that key is always expiring within milliseconds of being set
  4. Plugins using incr to determine whether a given worker should be the one to operate. We have cursor utils now that should be used here too, to keep the code up to date. Probably also worth adding a mechanism for expiring cursors easily, without plugin devs having to do this themselves.

1 and 2 are very important. Everything else is a lot less important. Happy to discuss why on individual cases if you like.

But e.g. Postgres import is not live anywhere + doesn't work, hello world doesn't matter. Redshift import is Alpha, etc.

@yakkomajuri
Contributor

As for docs, we can make things clearer, but in general one should expect the cache to be ephemeral, although some of us (myself included!) have cut corners and relied on the cache a bit too heavily.

@tiina303
Contributor Author

tiina303 commented May 3, 2022

Start with the initial suggestion of volatile-lru. In this case, we should guarantee that a lot of this stuff doesn't break with the new cache usage. Given we set expiries on persons/groups stuff, we'll make sure that any additional memory we're using can be cleared.

Actually this won't help us, because most of the usages here do set expiry times, which means those keys can still be evicted early under volatile-lru; only keys without an expiry would never be evicted. So I'd rather just fix things to work like a cache should, rather than remove the expiries and adjust to that assumption. So agreed, let's go with approach 2.

Problem: Our Cloud cluster currently operates on allkeys-lru so we might blow it up by flipping this.

How did we verify this?
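
(For reference, the policy a running instance actually uses can be read back with redis-cli, and the behavioural difference is easy to reproduce on a throwaway local Redis. The commands below are an illustrative sketch, not the actual verification; key names are made up.)

redis-cli config get maxmemory-policy
redis-cli config get maxmemory
# on a throwaway instance: under allkeys-lru, keys without a TTL are evicted once maxmemory
# is hit, while under volatile-lru only keys with a TTL are eviction candidates and writes
# start failing with OOM errors when nothing is evictable
redis-cli config set maxmemory 3mb
redis-cli config set maxmemory-policy allkeys-lru
for i in $(seq 1 50000); do redis-cli set "key:$i" "value" > /dev/null; done
redis-cli info stats | grep evicted_keys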

Contributor

@yakkomajuri yakkomajuri left a comment


this seems sane for local + hobby

@tiina303
Contributor Author

tiina303 commented May 4, 2022

Looked into mmdb again: we actually use postgres only at the end of the download, so we can't use that for the lookup. So there isn't an easy way to delay downloading if the key expired faster than the 120s expiry. But this should be fine.
