
Unwarranted large QPcache sizes because of not hashing the operation name for queryPlanStoreKey #2309

Closed
AneethAnand opened this issue Jan 3, 2023 · 6 comments · Fixed by #2310
Labels: P2 An issue that needs to be addressed on a reasonable timescale. · status/needs-review

AneethAnand commented Jan 3, 2023

Description

We are using the Federation Gateway in production and recently added the operation name to the request along with the query. After the change, we noticed an increase in the time taken to generate the query plan and cache misses on every request. Our analysis showed that the operation name is not hashed in the query plan store key.

const queryPlanStoreKey = queryHash + (request.operationName || '');

https://github.com/apollographql/federation/blob/main/gateway-js/src/index.ts#L758

Impact

The increased cache size required to store the query plans, and the resulting cache misses, are impacting the performance of our application.

Possible Solution

We could hash the operation name along with the query to properly incorporate the operation name into the key.
Hashing both the query and the operation name keeps the key length constant and saves cache space.

For example:

          const queryPlanKey = print(document) + (request.operationName || '');
          const queryPlanStoreKey = computeQueryPlanKeyHash(queryPlanKey);

This would generate a unique hash for the query and operation name, which could be used as the query plan store key.
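As a minimal sketch of the idea, the hypothetical `computeQueryPlanKeyHash` helper from the example above could be a SHA-256 digest over the printed document plus the operation name (the actual gateway implementation may hash differently; this only illustrates that the resulting key has a fixed length):

```typescript
import { createHash } from "crypto";

// Hypothetical helper: hash the combined query text + operation name.
function computeQueryPlanKeyHash(queryPlanKey: string): string {
  return createHash("sha256").update(queryPlanKey).digest("hex");
}

const query = "query MyOperation { me { id } }";
const operationName = "MyOperation";

const queryPlanStoreKey = computeQueryPlanKeyHash(query + (operationName || ""));

// A SHA-256 hex digest is always 64 characters, no matter how long the
// operation name is.
console.log(queryPlanStoreKey.length); // 64
```

With this scheme the store key length is independent of both the query text and the operation name.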

@AneethAnand (Contributor Author)

Created a PR:
#2310


korinne commented Feb 1, 2023

Thanks for raising this issue and submitting a pull request! Our team will review and provide feedback shortly.


trevor-scheer commented Feb 17, 2023

Hey @AneethAnand, can you help me understand the problem a bit better or provide a reproduction of the issue you're seeing? Based on your explanation I feel like I'm missing something. Here's my understanding:

TL;DR, the only effect I see in this PR is changing the cache key from one unique ID to a slightly different one.

For simplicity I'll assume your documents contain only one operation. This is how I imagine the scenario you've described.

  1. You send an operation with no operationName. A query plan is cached with some key: abc123 (hash only).
  2. You send the same operation with operationName. A duplicate query plan is cached with a new key: abc123MyOperation (hash + operationName).

For a long-running gateway, this results in 1 cache miss and 2 cache entries for the same query plan. But the next time (2) happens you shouldn't see another new cache entry or a cache miss. And once your gateway is restarted or your cache starts evicting old plans I wouldn't expect duplicate cache entries.
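The two-step scenario above can be sketched with an in-memory map (the key values are stand-ins, not the gateway's real hashes):

```typescript
// Sketch: the same query plan cached twice under two different keys,
// once without and once with an operationName suffix.
const cache = new Map<string, string>();

const queryHash = "abc123"; // stand-in for the real document hash
const plan = "<query plan>"; // stand-in for the cached query plan

// (1) Request sent with no operationName.
cache.set(queryHash + "", plan);

// (2) Same document, now with operationName: one cache miss, a second entry.
cache.set(queryHash + "MyOperation", plan);

console.log(cache.size); // 2 entries for the same plan
```

Repeating either request afterwards hits its existing entry, so the duplication is bounded at one extra entry per operation.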

All of what I've said above would be expected whether or not the key included the operationName in the hash, but those are the consequences I would anticipate going from no operationName to including it.

we noticed an increase in the time taken to generate the query plan

I don't understand how this could be a side effect of the cache key we use. Are there possibly some other side effects of including operationName that have an effect on performance which this PR doesn't address?


AneethAnand commented Feb 18, 2023

Hi @trevor-scheer,

Thank you for the detailed explanation and the scenario you've described.
Based on the scenario you've presented, I can see how inconsistently adding the operation name to the request would result in a significant increase in cache size and cache misses. I am not questioning the uniqueness of the query plan store key; I am saying that hashing both the query and the operation name avoids growth in the queryPlanCache size, because the length of the key remains constant.

Background

Most of our input queries have only one operation, but in a few test scenarios we add multiple operations. I understand that adding the operation name to the key mainly helps with multiple-operations-per-document scenarios.

Once we added the operation name to all requests along with the query, we saw a consistent, sharp increase in the time taken for gateway.plan. This could only happen if there were cache misses all the time, so I also considered increasing the cache size through config. But the real issue is that adding the operation name increases the cache size, because the operation name is not hashed into the cache key. The moment we reverted the PR that included request.operationName, the numbers returned to normal.


We will also keep track of cache misses and hits; that's the reason I created the other PR, to add cache config to the QueryPlannerConfig.

Overall, by hashing both the query and operation name to generate the cache key,

  • the length of the key remains constant (a 256-bit digest) and does not depend on the length of the operation name,
  • lookups on the key (cache.get or cache.contains) operate on a fixed-length string rather than a variable-length one, and
  • most importantly, the cache size does not grow when operation names are huge.

The proposed change to hash both the query and the operation name is intended to address these issues.

Thank you again for your insights and guidance in resolving this issue!

@trevor-scheer (Member)

Ok, so what you're suggesting is this:

  1. Cache keys are longer than they need to be (and variable in length)
  2. This results in the cache growing larger and holding fewer query plans, causing more evictions
  3. This causes more query planning to happen due to cache misses

Thanks for the explanation, I've got a much better understanding of the problem now. I'll have a closer look at your PR. Just out of curiosity, how long are your operation names? Seems like typical names wouldn't be such a major contributor to the size of a cache entry unless they're massive. The size of the key would definitely pale in comparison to the size of a query plan (generally speaking).

@AneethAnand (Contributor Author)

> Ok, so what you're suggesting is this:
>
>   1. Cache keys are longer than they need to be (and variable in length)
>   2. This results in the cache growing larger and holding fewer query plans, causing more evictions
>   3. This causes more query planning to happen due to cache misses
>
> Thanks for the explanation, I've got a much better understanding of the problem now. I'll have a closer look at your PR. Just out of curiosity, how long are your operation names? Seems like typical names wouldn't be such a major contributor to the size of a cache entry unless they're massive. The size of the key would definitely pale in comparison to the size of a query plan (generally speaking).

The average length of an operation name is 30-40 characters, and there are 2,500 unique operations running across the instances. Including the operation name in the key therefore adds roughly 87,500 bytes (2500 × 35) to the cache.
However, to better understand the impact of these numbers, I created the other pull request to add the cache configuration to the query planner. This would help gauge cache hits and misses and provide more insight into the performance impact of including the operation name. We could also persist the cache in Redis and load it during deploy.
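The back-of-envelope arithmetic above, using the quoted figures (2,500 unique operations, ~35-character names), works out as follows:

```typescript
// Estimate of the extra key bytes from unhashed operation names,
// using the figures quoted above (assumptions: 2500 unique operations,
// an average operation name length of 35 characters, 1 byte per character).
const uniqueOperations = 2500;
const avgOperationNameLength = 35;

const extraBytesUnhashed = uniqueOperations * avgOperationNameLength;
console.log(extraBytesUnhashed); // 87500
```

Hashing the combined key removes this per-name overhead, since every key collapses to the same fixed digest length.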

@korinne korinne added the P2 An issue that needs to be addressed on a reasonable timescale. label Feb 22, 2023