Datastore: Using new datastore aggregation queries for count #1781

its-snorlax · 2023-04-28T12:00:18Z

In the current implementation, the library performs the count on the client side: it fetches all the keys and then counts them. In theory, the new aggregation queries with the COUNT aggregation should be faster, because the current implementation relies on the lazy iterator from the underlying client library, which issues multiple backend calls whenever the number of keys exceeds the specified page size. On top of that, transferring all the keys to the client incurs more egress cost than returning a single count value.

Reference: COUNT aggregation and Aggregation queries in datastore

meltsufin · 2023-04-28T13:58:32Z

@kolea2 Would you mind giving us some feedback on this? This looks like a very reasonable and seamless improvement to the way we do count() in the DatastoreTemplate. Should we be aware of any caveats or incompatibilities with the current findAllKeys(entityClass).length approach?

kolea2 · 2023-04-28T14:38:48Z

@meltsufin I'm unfamiliar with findAllKeys (seems like custom code in this library), but using count() from the Datastore API directly sounds great to me! CC @jainsahab as fyi

meltsufin · 2023-04-28T15:06:15Z

@kolea2 Thanks for the feedback. This is how we did the count before:

spring-cloud-gcp/spring-cloud-gcp-data-datastore/src/main/java/com/google/cloud/spring/data/datastore/core/DatastoreTemplate.java

Lines 865 to 872 in a478264

    
    private Key[] findAllKeys(Class entityClass) {
      Iterable<Key> keysFound =
          queryKeys(
              Query.newKeyQueryBuilder()
                  .setKind(getPersistentEntity(entityClass).kindName())
                  .build());
      return StreamSupport.stream(keysFound.spliterator(), false).toArray(Key[]::new);
    }

jainsahab · 2023-05-04T12:29:53Z

Indeed, the current implementation triggers multiple calls to fetch all the entities in a paginated way. The link below points to the code responsible for making the subsequent calls when QueryResults#next is invoked.
https://github.com/googleapis/java-datastore/blob/11cef9ffb4737886aa24a70e8fc2577330f3e50a/google-cloud-datastore/src/main/java/com/google/cloud/datastore/QueryResultsImpl.java#L92-L106
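
To make the pagination effect concrete, here is a minimal, self-contained sketch of a lazily paginated iterator. It is a toy model only, not the real QueryResultsImpl: the class name `PagedResults`, the `callsToCount` helper, and the page size are all assumptions made for illustration, and the real client may issue extra calls (for example, to confirm that no results remain).

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Toy model of a lazily paginated result set (hypothetical, not the real client class). */
class PagedResults implements Iterator<Integer> {
    private final int total;       // total number of keys matching the query
    private final int pageSize;    // keys returned per simulated backend call
    private int served = 0;        // keys handed to the caller so far
    private int loaded = 0;        // keys fetched from the simulated backend so far
    private int backendCalls = 0;  // number of simulated round trips

    PagedResults(int total, int pageSize) {
        if (total < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("total must be >= 0 and pageSize > 0");
        }
        this.total = total;
        this.pageSize = pageSize;
    }

    @Override
    public boolean hasNext() {
        return served < total;
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        if (served == loaded) {
            // Local buffer is exhausted: simulate one more round trip to the backend.
            loaded = Math.min(total, loaded + pageSize);
            backendCalls++;
        }
        return served++;
    }

    int backendCalls() {
        return backendCalls;
    }

    /** Drains the iterator, as a client-side count must, and reports the round trips made. */
    static int callsToCount(int total, int pageSize) {
        PagedResults results = new PagedResults(total, pageSize);
        while (results.hasNext()) {
            results.next();
        }
        return results.backendCalls();
    }
}
```

In this model, `PagedResults.callsToCount(2500, 1000)` returns 3: the number of round trips grows with the number of keys, whereas an aggregation query returns a single value however many keys match.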

I would like to focus on the cost aspect of a key only query and aggregation query. There is a small cost (though negligible) associated with running the aggregation query whereas the key only query is categorised under Small operations which are free.

Regarding the performance aspect, I agree Aggregation Query should be faster, as it offloads the heavy lifting of calculating the count value to the backend and egress traffic (response of aggregation query) is bare minimum (just an aggregated value). It would be interesting to prove the performance of Aggregation Query through some numbers by code profiling.

Also, the memory footprint of realizing the iterator in the last statement would be O(n) on the client side (where n is the total number of entities of that kind). 😨

StreamSupport.stream(keysFound.spliterator(), false).toArray(Key[]::new);
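
As a rough illustration of the memory point, the sketch below contrasts materializing an Iterable into an array with counting while iterating. The class `KeyCounting`, its methods, and the `range` stand-in for a key iterator are hypothetical names for this example and are not part of the library. Note that counting while iterating only removes the O(n) array; every key still has to travel to the client, which is the part an aggregation query avoids.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.stream.StreamSupport;

/** Hypothetical helper contrasting two client-side ways of counting an Iterable. */
class KeyCounting {

    /** Materializes every element into an array first: O(n) extra memory. */
    static <T> long countByArray(Iterable<T> items) {
        return StreamSupport.stream(items.spliterator(), false).toArray().length;
    }

    /** Counts while iterating and keeps nothing: O(1) extra memory. */
    static <T> long countByIteration(Iterable<T> items) {
        long count = 0;
        for (T ignored : items) {
            count++;
        }
        return count;
    }

    /** Stand-in for a lazy key iterator: yields 0, 1, ..., n - 1 on demand. */
    static Iterable<Integer> range(int n) {
        return () -> new Iterator<Integer>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < n;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return i++;
            }
        };
    }
}
```

Both methods return the same count for the same input; they differ only in how much the client holds in memory while computing it.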

its-snorlax · 2023-05-04T15:53:09Z

Hi @meltsufin, @kolea2 and @jainsahab, I've tried to compare the performance of both implementations by measuring the time each takes to return the count value, using this simple Java program. Here are the results.

| # Entities | Key-only query count duration (ms) | Aggregation query count duration (ms) |
| --- | --- | --- |
| ~5k | 829 | 221 |
| ~10k | 1318 | 262 |
| ~20k | 2140 | 297 |
| ~50k | 4845 | 372 |
| ~100k | 8532 | 417 |
| ~200k | 18414 | 549 |
| ~500k | 45527 | 663 |
| ~1m | 81831 | 1055 |

meltsufin · 2023-05-04T16:49:35Z

I think the performance advantage is pretty striking. @jainsahab Thanks for pointing out the cost aspect, but I don't actually see a difference. Both methods seem to incur a cost of one entity read, for which there is a free tier as well. Can you clarify the cost difference?

jainsahab · 2023-05-05T02:29:50Z

This is what the doc says:
A keys-only query is counted as a single entity read for the query itself. The individual results are counted as small operations.
I got confused and was under the impression that for keys-only queries users would be charged only 1 entity read regardless of the number of keys (millions or billions) returned by the query, whereas the aggregation query cost grows linearly as Math.ceil(NUMBER_OF_INDEX_ENTRIES_SCANNED / 1000).
For example: count() operations that match between 0 and 1000 index entries are billed for one entity read. For a count() operation that matches 1500 index entries, you are billed 2 entity reads.
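
The billing rule quoted above can be sketched as a small function. This is a minimal illustration under the assumption that billing follows exactly the quoted formula, with a floor of one read since the doc states that 0 to 1000 entries are billed as one entity read. The class and method names are hypothetical.

```java
/** Hypothetical estimator for the entity reads billed to a count() aggregation. */
class CountBillingEstimate {
    // One entity read is billed per batch of up to 1000 index entries, per the quoted doc.
    static final long ENTRIES_PER_READ = 1000;

    static long billedEntityReads(long indexEntriesScanned) {
        if (indexEntriesScanned < 0) {
            throw new IllegalArgumentException("indexEntriesScanned must be >= 0");
        }
        // Integer form of Math.ceil(indexEntriesScanned / 1000.0).
        long reads = (indexEntriesScanned + ENTRIES_PER_READ - 1) / ENTRIES_PER_READ;
        // The doc bills 0 to 1000 matched entries as one read, so never return less than 1.
        return Math.max(1, reads);
    }
}
```

With this sketch, `billedEntityReads(1000)` returns 1 and `billedEntityReads(1500)` returns 2, matching the two cases in the quoted documentation.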

But yeah, the cost should actually be the same: a keys-only query returns 1000 results per response and is charged 1 entity read for each, and the underlying client library makes multiple requests to satisfy the query, so the pricing ends up being the same.

In a nutshell, since aggregation queries do not have any cost implications, they are definitely better than keys-only queries (lower egress cost and more efficient at runtime).

Modifying the implementation of DatastoreTemplate#count to use recently introduced [COUNT aggregation and Aggregation queries in datastore](https://cloud.google.com/datastore/docs/aggregation-queries). Fixes GoogleCloudPlatform#1781

its-snorlax mentioned this issue Apr 28, 2023

fix: Implementing count with aggregation query #1782

Closed

meltsufin added datastore type: enhancement New feature or request labels Apr 28, 2023

lqiu96 added the priority: p2 label May 1, 2023

release-please bot mentioned this issue May 9, 2023

chore(main): release 4.3.1 #1808

Merged

alicejli closed this as completed in #1808 May 17, 2023

release-please bot mentioned this issue Dec 18, 2023

chore(main): release 5.0.1 #2467

Merged
