Fix array-cache-bypass - in FastArrays on has() #24156

Merged: 1 commit into civicrm:master on Aug 17, 2022

Conversation

@eileenmcnaughton (Contributor) commented Aug 4, 2022

Overview

Fix array-cache-bypass in FastArrays on has()

When running the tokenProcessor (notably in processGreetings when creating a contact) it uses the metadata cache. This cache stores .. erm ... metadata using the site's preferred cache method, aided by an in-process cache in the CRM_Utils_Cache_FastArrayDecorator class. If the cache key has already been hit in the current php process it is loaded from this php array cache; otherwise it falls back to the preferred cache method (in our case Redis).

However, calling the has() method bypassed the php array cache and went straight to Redis, which was causing a significant performance hit for us.
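
For readers less familiar with the class, here is a minimal, hypothetical sketch of the decorator shape described above (illustrative only, not the actual CRM_Utils_Cache_FastArrayDecorator source): get() consults the in-process php array before the delegate cache, while the pre-fix has() skipped that array entirely.

```php
<?php
// Illustrative sketch only - names and details are simplified, not CiviCRM's real code.
// $delegate stands in for the site's preferred cache (e.g. Redis); $values is the
// per-php-process array cache that makes repeat lookups cheap.
class FastArrayCacheSketch {
  private $delegate;
  private $values = [];

  public function __construct($delegate) {
    $this->delegate = $delegate;
  }

  public function get($key, $default = NULL) {
    // Served from the in-process array if the key has already been hit this process.
    if (array_key_exists($key, $this->values)) {
      return $this->values[$key];
    }
    // Otherwise fall back to the delegate; a unique sentinel distinguishes a stored
    // NULL from a genuine miss, and only real hits are cached locally.
    $miss = new \stdClass();
    $value = $this->delegate->get($key, $miss);
    if ($value === $miss) {
      return $default;
    }
    $this->values[$key] = $value;
    return $value;
  }

  public function has($key) {
    // Pre-fix behaviour: every call goes straight to the delegate (e.g. Redis),
    // even when the key is already sitting in $this->values.
    return $this->delegate->has($key);
  }
}
```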

Before

Calling Civi::cache('metadata')->get() accesses the array cache & only falls back to (eg) Redis if the key is not there. However, Civi::cache('metadata')->has() always bypassed the array cache & hit (eg) Redis.

After

Calling Civi::cache('metadata')->has() also accesses the array cache, where possible.
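
As a usage-level illustration of the change (the cache key below is made up for the example, and this assumes a bootstrapped CiviCRM environment), repeated has() checks on the same key within one php process can now be answered from memory instead of producing a Redis round-trip each time:

```php
<?php
// Illustrative only: 'example_metadata_key' is a made-up key, not one CiviCRM actually uses.
$cache = Civi::cache('metadata');

if (!$cache->has('example_metadata_key')) {
  // First check may fall through to the underlying cache (e.g. Redis).
  $cache->set('example_metadata_key', ['built_at' => time()]);
}

// After this fix, subsequent has() calls in the same php process are answered
// from the in-process array cache where possible, not by another trip to Redis.
$cache->has('example_metadata_key');
```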

Technical Details

This appears to just be a coding oversight, as the CRM_Utils_Cache_ArrayDecorator class was not affected.

Comments

I feel like the performance hit I was seeing on our staging server using Redis was surprisingly high (higher than using the sqlCache would have been), so there is also an investigation to do at our end to see whether it was higher than it should be.

@civibot (bot) commented Aug 4, 2022

(Standard links)

@civibot (bot) added the master label on Aug 4, 2022
@eileenmcnaughton changed the title from "Fix cache miss in FastArrays on has()" to "Fix array-cache-bypass - in FastArrays on has()" on Aug 4, 2022
@seamuslee001 (Contributor)

@johntwyman @andrew-cormick-dockery you might be interested in this given you also use Redis.

@eileenmcnaughton (Contributor, Author)

@johntwyman @andrew-cormick-dockery I am VERY interested in your experience - the impact of this change on our staging server was to double our throughput, whereas removing 60% of the sql queries in my work last year didn't move the needle. I'm going to open an internal ticket to see if our Redis is slower than our other components for some reason, but would love to hear whether this helps you.

@johntwyman (Contributor)

@eileenmcnaughton we've had a lot of real problems with mail construction/delivery of late, so I'm definitely keen to try this out. I'll discuss this with @andrew-cormick-dockery today.

In our instance Redis is doing an enormous amount of work but we're not even touching the sides of what the Redis instance could be doing. At the peak of a mailing job, Redis is handling ~80-100 connections, 15k keys and pushing out ~60Mb/s of data. But CPU and RAM utilisation is minimal. So if Redis is a bottleneck I think it's only by virtue of how much Civi is relying on it.

@johntwyman (Contributor)

@eileenmcnaughton we're using New Relic at the moment and have a growing number of traces that might be of interest too. I'm not sure how easily I can share these traces but if you do want to have a look, let me know. I'm sure we can figure something out.

@eileenmcnaughton (Contributor, Author)

@johntwyman let's connect next week when you have tested this out & I have some input from our tech-ops team. The big open question for me at the moment is whether the performance improvement I saw with this will be replicated on production, or if the staging environment has some additional latency. This is our WMF ticket on the issue: https://phabricator.wikimedia.org/T314619

@eileenmcnaughton (Contributor, Author)

We just deployed this - you can see clearly in the graph excerpt below when it went out. The action tracked is basically updating a contact, including their greetings.

[image: graph excerpt showing throughput for the tracked action around the deployment]

@colemanw (Member)

I've reviewed the code and it looks good. Merging based on review and testing by WMF.

@colemanw merged commit ebe2be7 into civicrm:master on Aug 17, 2022
@colemanw deleted the cache_miss branch on August 17, 2022 at 22:14
@johntwyman (Contributor)

We haven't seen as dramatic an improvement in our testing but it's definitely made a difference. Thanks @eileenmcnaughton

@eileenmcnaughton (Contributor, Author)

@johntwyman two possible reasons:

  • our Redis is different & there is something going on there, and
  • the above chart is very targeted on a particular bit of code with already-warm caches

@johntwyman (Contributor)

We had a recent mailing with ~280,000 recipients. Prior to this patch it would have sent at ~1,000 emails per minute or worse. With this patch it sent at ~4,000 emails per minute.

That being said, last year a mailing of this size for us would have sent at 8-10,000 emails per minute. The big changes in that time that I can point to are Civi version upgrades and making FlexMailer the default mailing subsystem.

I haven't had time to fully dig into what exactly is going on but I know this much:

  • Redis calls are through the roof (hence the improvement with this patch)
  • The payload of the cache calls seems fairly large
  • Legacy token evaluation is one bottleneck

I won't keep flooding this merged PR with comments but over the next few weeks we'll be looking into this in more detail. Specifically we'll be examining New Relic traces to see the tree of function calls and the data that's being stored in Redis. Redis itself doesn't seem bothered (ie. no CPU, RAM, etc., constraints) but it's certainly handling a large volume of calls relating to mailings.

@eileenmcnaughton (Contributor, Author)

@johntwyman any chance you could test d36378f? We are testing it now - with unclear results, but I think something else is in play. In the staging environment I got another small improvement (only maybe ten percent) from this one.

@eileenmcnaughton (Contributor, Author)

Also see this follow-up: #24292

@eileenmcnaughton (Contributor, Author)

#24294 is #24156 (comment)
