
feat: Set redis maxmemory and memory policy for docker compose #9573

Merged
merged 4 commits on May 5, 2022

Conversation

tiina303
Contributor

@tiina303 tiina303 commented Apr 28, 2022

Problem

We're starting to add a lot more data to Redis for persons on events.

We need to make sure we don't run out of memory. Discussion here

If you turn maxmemory off, Redis will start using virtual memory (i.e. swap), and performance will drop tremendously.

It is also worth noting that setting an expire value to a key costs memory, so using a policy like allkeys-lru is more memory efficient since there is no need for an expire configuration for the key to be evicted under memory pressure.

Picked 100M based on https://redis.io/docs/getting-started/faq/#whats-the-redis-memory-footprint

1 Million small Keys -> String Value pairs use ~ 85MB of memory.
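
(As a rough sanity check of that figure, the per-key overhead can be measured against a local dev Redis. The commands below are an illustrative sketch, assume an otherwise empty instance, and flushall will wipe it.)

redis-cli flushall
redis-cli info memory | grep '^used_memory:'
for i in $(seq 1 10000); do redis-cli set "k:$i" "v" > /dev/null; done
redis-cli info memory | grep '^used_memory:'
# the difference between the two readings, divided by 10000, approximates the bytes used per small key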

Changes


Set the max memory for Redis in all of our docker compose files to 100M and the eviction policy to allkeys-lru.
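
Concretely, the redis service in each compose file ends up with something along these lines (a sketch only; the image tag, ports, and surrounding service definition are assumptions rather than a copy of the diff):

redis:
    image: redis:6-alpine
    command: redis-server --maxmemory 100mb --maxmemory-policy allkeys-lru
    ports:
        - '6379:6379'

Passing the settings via the redis-server command line keeps the stock image usable without mounting a custom redis.conf.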

How did you test this code?

Before:

docker compose -f docker-compose.dev.yml up redis
redis-cli
127.0.0.1:6379> info memory
...
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
...

After:

docker compose -f docker-compose.dev.yml up redis
redis-cli
127.0.0.1:6379> info memory
...
maxmemory:104857600
maxmemory_human:100.00M
maxmemory_policy:allkeys-lru
...

Contributor

@yakkomajuri yakkomajuri left a comment


looks good overall but I'd put the limit higher than that.

We cache things like insight results which take a lot more memory.

let's maybe do 200mb

@hazzadous
Contributor

This might be fine, but it's worth thinking about whether we are using Redis for anything other than caching (caching being the case where we don't need a guarantee that keys will stick around).

When looking at e.g. the snowflake exporter, it looked like we were using it for things where we cared about durability, but I could be wrong. If we do care, it would make sense to have one thing that is explicitly a cache, and another that has the durability requirement.

Is there anywhere we expect durability @tiina303 @yakkomajuri @guidoiaquinti ?

@guidoiaquinti
Contributor

another that has the durability requirement.

🎯 we should expect the data in Redis to be ephemeral. If we want data persistence we should use another datastore.

@tiina303
Contributor Author

And we already have postgres, so it's not like it would be hard to use another non-ephemeral data source. That said, we'll look into what we're currently putting into Redis and the assumptions around that.

@yakkomajuri
Contributor

Yeah Redis data should be ephemeral. In the Snowflake case the usability of the Redis APIs makes up for the fact that we might lose a batch here or there if Redis goes down. In any case the user will have the files in S3 and can manually import them.

It's a tradeoff

@hazzadous
Contributor

Yeah Redis data should be ephemeral. In the Snowflake case the usability of the Redis APIs makes up for the fact that we might lose a batch here or there if Redis goes down. In any case the user will have the files in S3 and can manually import them.

It's a tradeoff

I would not trade correct functionality for ease of use, with our future selves or others scratching their heads in #community-support; we should be kind to the support hero 🥰

But I don’t have context on how important snowflake is and what the failure cases are, so I will defer to others to make the call.

@tiina303
Contributor Author

tiina303 commented May 2, 2022

Looked at the usage:

  1. https://github.com/PostHog/posthog/blob/master/plugin-server/src/utils/db/db.ts#L1223 - export events could get a bit less efficient if expired much faster, but no problem
  2. ⚠️ https://github.com/PostHog/posthog/blob/master/plugin-server/src/main/services/mmdb.ts#L26 - looks like we're relying on the expiry here to decide whether the next worker should try to download or not; we already store the attempts in postgres, so we should extend the redis lookup to fall back to postgres if the value isn't in redis anymore, and add a timestamp to that table.
  3. ⚠️ is_plugin_server_alive & version rely on a value in Redis that's set every 5 seconds (60 sec expiry) https://github.com/PostHog/posthog/blob/master/posthog/utils.py#L539 so if we get into a state where even recently set keys are being evicted, this one could be too. Though it looks like this is only used in preflight and the instance status page, so it would just lead to misleading information there.
  4. ⚠️ The plugin server's enabled job queues, as far as I can tell, are also just used for the instance status page. This key is set and never expires. Lookup: https://github.com/PostHog/posthog/blob/master/posthog/utils.py#L554 We should move this to postgres.
  5. ❓ Celery broker uses redisLPush https://github.com/PostHog/posthog/blob/master/plugin-server/src/utils/celery/broker.ts#L83
  6. ❓ Plugins usage

I couldn't see the other functions being used in code, unless we do some trickery with the way we call these functions, e.g. redisLLen https://github.com/search?q=repo%3Aposthog%2Fposthog+redisLLen&type=code - only the definition atm.

@yakkomajuri
Contributor

  1. 👍
  2. Yeah
  3. This is such a bad check 🤦 I guess it does check if a plugin server is alive which is maybe useful. We can move it to be a pub/sub PING
  4. Can ignore this for now, no biggie. Might as well also hardcode Graphile for now given we don't have any other queues available.
  5. We stopped using Celery for ingestion but still use it for public jobs. However, I believe we should never produce to it from the plugin server, although it seems we copy-pasted some code that allows us to both consume and produce. Should be fine to ignore + maybe we should move public jobs to Kafka later on.
  6. Search for cache.set at the org level. Most plugins should take this as ephemeral but some (Snowflake) could use a refactor for sure.

@tiina303
Contributor Author

tiina303 commented May 3, 2022

  1. As in we should make the change now before turning on the key expiry and writing person/groups info?
  2. What about using the _health endpoint?

Plugins:

  1. ⚠️ https://github.com/PostHog/currency-normalization-plugin/blob/main/index.js#L28-L29 - needs a minimal update so that fetchRatesIfNeeded returns the rates, as otherwise the key could expire in between. The impact if keys expire fast is that we'd potentially query openexchangerates more, but that should be fine.
  2. ⚠️ https://github.com/PostHog/ingestion-alert-plugin/blob/main/index.js#L5 - we might never trigger the resolve if the key expires; we should store active_alert in the cache as triggered or not, and if the key isn't there, always either trigger or resolve depending on the state. This would potentially lead to more webhook usage if keys expire faster than 1 minute.
  3. ⚠️ https://github.com/PostHog/posthog-pagerduty-plugin/blob/main/index.js#L4 - same as above; on the pagerduty side we should use a dedupe key (see) and we can run into API limits, which could mean the alert would get cleared on follow-up retries, which should be fine.
  4. ⚠️ https://github.com/PostHog/session-tracker-plugin/blob/main/index.ts#L34-L35 - relies on the key not expiring until its specified expiry minutes later. In https://github.com/PostHog/session-tracker-plugin/blob/main/index.ts#L42 it looks like we set it to how much inactivity we tolerate, and then look it up in https://github.com/PostHog/session-tracker-plugin/blob/main/index.ts#L53
  5. https://github.com/PostHog/github-star-sync-plugin/blob/main/index.js#L38 - probably fine, we'll just continue to query and hit rate limiting more
  6. ⚠️ https://github.com/PostHog/posthog-hello-world-plugin/blob/main/index.js#L16 & https://github.com/PostHog/posthog-plugin-starter-kit/blob/main/index.js & https://github.com/PostHog/posthog-plugin-advanced-kit/blob/main/src/index.ts#L23 - breaks, which is bad as those are our examples
  7. https://github.com/PostHog/google-ad-conversions-plugin/blob/master/index.ts#L34 we'll just fetch the access token more frequently
  8. https://github.com/PostHog/posthog-plugin-geoip/blob/main/index.ts#L85 we'll just end up setting person props more frequently
  9. ⚠️ https://github.com/PostHog/postgres-import-events-plugin/blob/30a63a75cf1fb5f11b0f449520359d7a075b80c5/index.ts#L59 - this would break
  10. https://github.com/PostHog/posthog-plugin-migrator3000/blob/d5e0b094946cff50fa2cd5f5e4cc97537bd10736/index.ts#L101 - seems fine we'll just try to run more frequently
  11. https://github.com/PostHog/github-release-tracking-plugin/blob/a94d37d779d094cd641da4424bf07d227e2a5c23/index.js#L47 - we just run more frequently
  12. ⚠️ https://github.com/PostHog/posthog-redshift-import-plugin/blob/03a6039f49a3c01a1c5e58d51b8004474443d5f9/index.ts#L105-L107 could break, i.e. ingesting twice
  13. ⚠️ https://github.com/PostHog/snowflake-export-plugin/blob/f69c3efdc6fb5adbd85c5f891ca1ac384cdae861/index.ts#L382 - breaks; we might miss files

^ should we change all of these now, or where can I look up plugin usage to see which ones we care about?

Others found from our code:
20. ✅ https://github.com/PostHog/posthog/blob/master/posthog/tasks/cohorts_in_feature_flag.py#L23 - that's probably not great if it expires too fast and we recompute the cohorts all the time, but the cache is used reasonably.
21. ✅ https://github.com/PostHog/posthog/blob/master/posthog/decorators.py#L55 & https://github.com/PostHog/posthog/blob/master/posthog/tasks/update_cache.py#L72 - we might be querying insights more if keys expire fast, but the cache is used reasonably.
22. ✅ https://github.com/PostHog/posthog-cloud/blob/7f9e3f9b41370f213125f57fe9d7653e4720d801/multi_tenancy/utils.py#L99 - we'll just be querying CH more
23. ⚠️ https://github.com/PostHog/posthog.com/blob/d2f8d0a9c053cd3cf17d05a171e93ef39ba30749/contents/docs/plugins/build/reference.md#cache - we might want to clarify expiry behaviour in our docs

Plugin usage: https://app.posthog.com/insights/kqU7e7dr shows only currency normalizer & hello-world from the ⚠️ items above.

@tiina303
Contributor Author

tiina303 commented May 3, 2022

Based on the plugins currently used on cloud: https://metabase.posthog.net/question/310-plugins-currently-enabled-on-cloud
In conclusion (combining that with the insight above), we care about:

7. ⚠️  currency normalizer
11. ✅  github star sync
12. ⚠️ hello world
13. ✅ google ad conversion sync
14. ✅ geoip
16. ✅ migrator
17. ✅ github release tracker
19. ⚠️  snowflake export

@yakkomajuri
Contributor

Ah sorry, didn't write the rest on 2 🤦

So from a practical perspective, we have a few options:

1. Easiest

Start with the initial suggestion of volatile-lru. In this case, we should guarantee that a lot of this stuff doesn't break with the new cache usage. Given we set expiries on persons/groups stuff, we'll make sure that any additional memory we're using can be cleared.

Problem: Our Cloud cluster currently operates on allkeys-lru so we might blow it up by flipping this.

2. Better

Fix the relevant stuff out of what you've mentioned.

I think this is a good exercise for us in infra upkeep. Will pay dividends in the long run to do these refactors. Also a great opportunity for you to work with plugins (I love that you've been reading plugin code to compile this!).


Thus, let's probably go with 2. Here's my opinion for what needs fixing, in some order of priority:

  1. MMDB: a little circuit breaker should do. I'm not super worried about keys expiring too fast. At the end of the day it's an LRU algorithm, and the key is being queried often. But it's still worth preventing us from blowing up here.
  2. Snowflake plugin
  3. Currency converter: could be worth having fetchRatesIfNeeded just return the rates given it has them, but otherwise not worried here. Hell of an edge case if that key is always expiring within milliseconds of being set
  4. Plugins using incr to determine whether a given worker should be the one to operate. We have cursor utils now that should be used here too, to keep the code up to date. Probably also worth adding a mechanism for expiring cursors easily, without plugin devs having to do this themselves.

1 and 2 are very important. Everything else is a lot less important. Happy to discuss why on individual cases if you like.

But e.g. Postgres import is not live anywhere + doesn't work, hello world doesn't matter. Redshift import is Alpha, etc.

@yakkomajuri
Contributor

As for docs, we can make things clearer, but in general one should expect the cache to be ephemeral, although some of us (myself included!) have cut corners and relied on the cache a bit too heavily.

@tiina303
Contributor Author

tiina303 commented May 3, 2022

Start with the initial suggestion of volatile-lru. In this case, we should guarantee that a lot of this stuff doesn't break with the new cache usage. Given we set expiries on persons/groups stuff, we'll make sure that any additional memory we're using can be cleared.

Actually this won't help us, because most of the usages here do set expiry times, which means those keys can still be evicted early under volatile-lru; only keys without an expiry would never be evicted. So I'd rather just fix things to work like a cache should, rather than remove the expiries and adjust to that assumption. So agreed, let's go with approach 2.

Problem: Our Cloud cluster currently operates on allkeys-lru so we might blow it up by flipping this.

How did we verify this?
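
(For reference, the policy a running instance actually uses can be read back with redis-cli, and the behavioural difference is easy to reproduce on a throwaway local Redis. The commands below are an illustrative sketch, not the actual verification; key names are made up.)

redis-cli config get maxmemory-policy
redis-cli config get maxmemory
# on a throwaway instance: under allkeys-lru, keys without a TTL are evicted once maxmemory
# is hit, while under volatile-lru only keys with a TTL are eviction candidates and writes
# start failing with OOM errors when nothing is evictable
redis-cli config set maxmemory 3mb
redis-cli config set maxmemory-policy allkeys-lru
for i in $(seq 1 50000); do redis-cli set "key:$i" "value" > /dev/null; done
redis-cli info stats | grep evicted_keys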

Contributor

@yakkomajuri yakkomajuri left a comment


this seems sane for local + hobby

@tiina303
Contributor Author

tiina303 commented May 4, 2022

Looked into mmdb again: we actually use postgres only at the end of the download, so we can't use that for the lookup. So there isn't an easy way to delay downloading if the key expired faster than the 120s expiry. But this should be fine.
