Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datastore: Using new datastore aggregation queries for count #1781

Closed
its-snorlax opened this issue Apr 28, 2023 · 7 comments · Fixed by #1808
Closed

Datastore: Using new datastore aggregation queries for count #1781

its-snorlax opened this issue Apr 28, 2023 · 7 comments · Fixed by #1808

Comments

@its-snorlax
Copy link
Contributor

In the current implementation, the library performs client-side aggregations by fetching all the keys on the client side and then figuring out the count. In theory, the new aggregation queries and COUNT aggregation should be faster than the current one, as the current implementation makes use of the lazy iterator from the underlying client library that initiates multiple backend calls in cases where the number of keys is greater than the page size specified. And on top of that getting all the keys on the client side will have more egress cost compared to getting a count value.

Reference: COUNT aggregation and Aggregation queries in datastore

@meltsufin
Copy link
Member

@kolea2 Would you mind giving us some feedback on this? This looks like a very reasonable and seamless improvement to the way we do count() in the DatastoreTemplate. Should we be aware of any caveats or incompatibility with the current findAllKeys(entityClass).length aproach?

@meltsufin meltsufin added datastore type: enhancement New feature or request labels Apr 28, 2023
@kolea2
Copy link

kolea2 commented Apr 28, 2023

@meltsufin I'm unfamiliar with findAllKeys (seems like custom code in this library), but using count() from the Datastore API directly sounds great to me! CC @jainsahab as fyi

@meltsufin
Copy link
Member

@kolea2 Thanks for feedback. The way we did the count before is this:

private Key[] findAllKeys(Class entityClass) {
Iterable<Key> keysFound =
queryKeys(
Query.newKeyQueryBuilder()
.setKind(getPersistentEntity(entityClass).kindName())
.build());
return StreamSupport.stream(keysFound.spliterator(), false).toArray(Key[]::new);
}

@jainsahab
Copy link
Contributor

jainsahab commented May 4, 2023

Indeed, current implementation triggers multiple calls to fetch all the entities in a paginated way, the link given below points to the code responsible for making subsequent calls when calling QueryResults#next.
https://github.com/googleapis/java-datastore/blob/11cef9ffb4737886aa24a70e8fc2577330f3e50a/google-cloud-datastore/src/main/java/com/google/cloud/datastore/QueryResultsImpl.java#L92-L106


I would like to focus on the cost aspect of a key only query and aggregation query. There is a small cost (though negligible) associated with running the aggregation query whereas the key only query is categorised under Small operations which are free.

Regarding the performance aspect, I agree Aggregation Query should be faster, as it offloads the heavy lifting of calculating the count value to the backend and egress traffic (response of aggregation query) is bare minimum (just an aggregated value). It would be interesting to prove the performance of Aggregation Query through some numbers by code profiling.

Also the memory footprint of realizing the iterator in the last statement would be O(n) on the client side (where n is the number of total entities of that kind. 😨

StreamSupport.stream(keysFound.spliterator(), false).toArray(Key[]::new); 

@its-snorlax
Copy link
Contributor Author

Hi @meltsufin , @kolea2 and @jainsahab I've tried to measure the performance of both implementations by measuring the time it takes to get the count value using this simple java program , and here are the results.

# Entities Key Only Query Count duration (ms) Aggregation Query Count duration (ms)
~5k 829 221
~10k 1318 262
~20k 2140 297
~50k 4845 372
~100k 8532 417
~200k 18414 549
~500k 45527 663
~1m 81831 1055

@meltsufin
Copy link
Member

I think the performance advantage is pretty striking. @jainsahab Thanks for pointing out the cost aspect, but I don't actually see a difference. Both methods seem to incur a cost of one entity read, for which there is a free tier as well. Can you clarify the cost difference?

@jainsahab
Copy link
Contributor

This is what doc says:
A keys-only query is counted as a single entity read for the query itself. The individual results are counted as small operations.
I got confused and was under the impression that for keys only queries users will only be charged only 1 entity read regardless the number of keys (millions or billions) returned by that key only query, whereas aggregation query cost will grow linearly Math.ceil(NUMBER_OF_INDEX_ENTERIES_SCANNED / 1000)
For ex: count() operations that match between 0 and 1000 index entries are billed for one entity read. For a count() operation that matches 1500 index entries, you are billed 2 entity reads.

But yeah, cost should actually be same, as the a key only query will get 1000 results per response and will be charged 1 entity read, and underlying client library will make multiple requests to satisfy the query and that's how pricing will end up being the same.

In a nutshell since aggregation queries do not have any cost implications, they are definitely better over keys only queries (less egress cost and runtime efficient).

renovate-bot pushed a commit to renovate-bot/GoogleCloudPlatform-_-spring-cloud-gcp that referenced this issue May 8, 2023
)

Modifying the implementation of DatastoreTemplate#count to use recently introduced [COUNT aggregation and Aggregation queries in datastore](https://cloud.google.com/datastore/docs/aggregation-queries).

Fixes. GoogleCloudPlatform#1781
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants