
Unwarranted large QPcache sizes because of not hashing the operation name for queryPlanStoreKey #2309

Closed
AneethAnand opened this issue Jan 3, 2023 · 6 comments · Fixed by #2310
Labels: P2 An issue that needs to be addressed on a reasonable timescale. · status/needs-review

AneethAnand commented Jan 3, 2023

Description

We are using the Federation Gateway in production and recently added the operation name to the request along with the query. After the change, we noticed an increase in the time taken to generate the query plan and cache misses on every request. Our analysis showed that the operation name is not hashed in the query plan store key.

const queryPlanStoreKey = queryHash + (request.operationName || '');

https://github.com/apollographql/federation/blob/main/gateway-js/src/index.ts#L758

Impact

The increased cache size required to store the query plans, and the resulting cache misses, are impacting the performance of our application.

Possible Solution

We could hash the operation name along with the query to properly incorporate the operation name into the key.
Hashing both the query and the operation name keeps the key length constant and saves cache space.

For example:

          const queryPlanKey = print(document) + (request.operationName || '');
          const queryPlanStoreKey = computeQueryPlanKeyHash(queryPlanKey);

This would generate a unique hash for the query and operation name, which could be used as the query plan store key.
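As a minimal sketch of the idea, the hypothetical `computeQueryPlanKeyHash` helper from the example above could be a SHA-256 digest over the printed document plus the operation name (the actual gateway implementation may hash differently; this only illustrates that the resulting key has a fixed length):

```typescript
import { createHash } from "crypto";

// Hypothetical helper: hash the combined query text + operation name.
function computeQueryPlanKeyHash(queryPlanKey: string): string {
  return createHash("sha256").update(queryPlanKey).digest("hex");
}

const query = "query MyOperation { me { id } }";
const operationName = "MyOperation";

const queryPlanStoreKey = computeQueryPlanKeyHash(query + (operationName || ""));

// A SHA-256 hex digest is always 64 characters, no matter how long the
// operation name is.
console.log(queryPlanStoreKey.length); // 64
```

With this scheme the store key length is independent of both the query text and the operation name.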

@AneethAnand (Contributor Author)

Created a PR:
#2310


korinne commented Feb 1, 2023

Thanks for raising this issue and submitting a pull request! Our team will review and provide feedback shortly.


trevor-scheer commented Feb 17, 2023

Hey @AneethAnand, can you help me understand the problem a bit better or provide a reproduction of the issue you're seeing? Based on your explanation I feel like I'm missing something. Here's my understanding:

TL;DR, the only effect I see in this PR is changing the cache key from one unique ID to a slightly different one.

For simplicity I'll assume your documents contain only one operation. This is how I imagine the scenario you've described.

  1. You send an operation with no operationName. A query plan is cached with some key: abc123 (hash only).
  2. You send the same operation with operationName. A duplicate query plan is cached with a new key: abc123MyOperation (hash + operationName).

For a long-running gateway, this results in 1 cache miss and 2 cache entries for the same query plan. But the next time (2) happens you shouldn't see another new cache entry or a cache miss. And once your gateway is restarted or your cache starts evicting old plans I wouldn't expect duplicate cache entries.
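The two-step scenario above can be sketched with an in-memory map (the key values are stand-ins, not the gateway's real hashes):

```typescript
// Sketch: the same query plan cached twice under two different keys,
// once without and once with an operationName suffix.
const cache = new Map<string, string>();

const queryHash = "abc123"; // stand-in for the real document hash
const plan = "<query plan>"; // stand-in for the cached query plan

// (1) Request sent with no operationName.
cache.set(queryHash + "", plan);

// (2) Same document, now with operationName: one cache miss, a second entry.
cache.set(queryHash + "MyOperation", plan);

console.log(cache.size); // 2 entries for the same plan
```

Repeating either request afterwards hits its existing entry, so the duplication is bounded at one extra entry per operation.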

All of what I've said above would be expected whether or not the key included the operationName in the hash, but those are the consequences I would anticipate going from no operationName to including it.

we noticed an increase in the time taken to generate the query plan

I don't understand how this could be a side effect of the cache key we use. Are there possibly some other side effects of including operationName that have an effect on performance which this PR doesn't address?


AneethAnand commented Feb 18, 2023

Hi @trevor-scheer,

Thank you for the detailed explanation and the scenario you've described.
Based on the scenario you've presented, I can see how inconsistently adding the operation name to the request would result in a significant increase in cache size and cache misses. I am not questioning the uniqueness of the query plan store key; I am saying that hashing both the query and the operation name avoids growth in the queryPlanCache size, because the length of the key remains constant.

Background

Most of our input queries have only one operation, but in a few test scenarios we add multiple operations. I understand that adding the operation name to the key mainly helps with multiple-operations-per-document scenarios.

Once we added the operation name to all requests along with the query, we saw a consistent, sharp increase in the time taken for gateway.plan. This could only happen if there were cache misses all the time, so I also considered increasing the cache size through config. But the real issue is that adding the operation name increases the cache size, because the operation name is not hashed into the cache key. The moment we reverted the PR that included request.operationName, the numbers returned to normal.


We will also keep track of cache misses and hits; that's the reason I created the other PR, to add cache config to the QueryPlannerConfig.

Overall, by hashing both the query and operation name to generate the cache key,

  • the length of the key remains constant (a 256-bit digest) and does not depend on the length of the operation name,
  • lookups on the key (cache.get or cache.contains) operate on a fixed-length string rather than a variable-length one, and
  • most importantly, the cache size does not grow when operation names are huge.

The proposed change to hash both the query and the operation name is intended to address these issues.

Thank you again for your insights and guidance in resolving this issue!

@trevor-scheer (Member)

Ok, so what you're suggesting is this:

  1. Cache keys are longer than they need to be (and variable in length)
  2. This results in the cache growing larger and holding fewer query plans, causing more evictions
  3. This causes more query planning to happen due to cache misses

Thanks for the explanation, I've got a much better understanding of the problem now. I'll have a closer look at your PR. Just out of curiosity, how long are your operation names? Seems like typical names wouldn't be such a major contributor to the size of a cache entry unless they're massive. The size of the key would definitely pale in comparison to the size of a query plan (generally speaking).

@AneethAnand (Contributor Author)

> Ok, so what you're suggesting is this:
>
>   1. Cache keys are longer than they need to be (and variable in length)
>   2. This results in the cache growing larger and holding fewer query plans, causing more evictions
>   3. This causes more query planning to happen due to cache misses
>
> Thanks for the explanation, I've got a much better understanding of the problem now. I'll have a closer look at your PR. Just out of curiosity, how long are your operation names? Seems like typical names wouldn't be such a major contributor to the size of a cache entry unless they're massive. The size of the key would definitely pale in comparison to the size of a query plan (generally speaking).

The average length of an operation name is 30-40 characters, and there are 2,500 unique operations running across the instances. Including the operation name in the key therefore adds roughly 87,500 bytes (2500 × 35) to the cache.
However, to better understand the impact of these numbers, I created the other pull request to add the cache configuration to the query planner. This would help gauge cache hits and misses and provide more insight into the performance impact of including the operation name. We could also persist the cache in Redis and load it during deploy.
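The back-of-envelope arithmetic above, using the quoted figures (2,500 unique operations, ~35-character names), works out as follows:

```typescript
// Estimate of the extra key bytes from unhashed operation names,
// using the figures quoted above (assumptions: 2500 unique operations,
// an average operation name length of 35 characters, 1 byte per character).
const uniqueOperations = 2500;
const avgOperationNameLength = 35;

const extraBytesUnhashed = uniqueOperations * avgOperationNameLength;
console.log(extraBytesUnhashed); // 87500
```

Hashing the combined key removes this per-name overhead, since every key collapses to the same fixed digest length.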

@korinne korinne added the P2 An issue that needs to be addressed on a reasonable timescale. label Feb 22, 2023