Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(datastore): query limit more than 300 #5592

Merged

Conversation

vishwarajanand
Copy link
Contributor

@vishwarajanand vishwarajanand commented Nov 7, 2022

Fixes: #5039

Changes

Datastore loops through all pages via EntityPageIterator which uses PageIteratorTrait.
The trait expects resultLimit to be set while doing the subsequent calls, which was missing so any limit value of more than 300 was leading to all results getting fetched. Following bugs were fixed:

  1. Under Grpc mode, when a Gql query string goes through runQuery, and a parsed query object is returned with correct limit array, it doesn't need to be modified in subsequent fetches. Added a check to skip that.
  2. We need to pass resultLimit in Operation->runQuery(...) via the EntityPageIterator instance, so that the iterator can loop through all pages.
  3. Under EntityPageIterator, when we fetch the current page only then we need to save the limit value because in case of GqlQuery we do not have the limit parsed in client library.
  4. System Tests to cover for these scenarios.

Testing

  1. System tests run:
google-cloud-php/Datastore % XDEBUG_MODE=debug vendor/bin/phpunit -c phpunit-system.xml.dist --filter="WithLimit"
PHPUnit 8.5.31 by Sebastian Bergmann and contributors.

......                                                              6 / 6 (100%)

Time: 12.1 seconds, Memory: 14.00 MB

OK (6 tests, 10 assertions)
google-cloud-php/Datastore %

Notes

  1. PR in Ruby: fix(datastore): fixed request limit google-cloud-ruby#19456

@product-auto-label product-auto-label bot added the api: datastore Issues related to the Datastore API. label Nov 7, 2022
@vishwarajanand vishwarajanand added do not merge Indicates a pull request not ready for merge, due to either quality or timing. and removed api: datastore Issues related to the Datastore API. labels Nov 7, 2022
@product-auto-label product-auto-label bot added the api: datastore Issues related to the Datastore API. label Nov 7, 2022
	modified:   tests/System/QueryResultPaginationTest.php
@vishwarajanand vishwarajanand force-pushed the feat_datastore_query_limit branch from 6d4591c to 63655f5 Compare November 9, 2022 06:25
@vishwarajanand vishwarajanand removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Nov 9, 2022
@vishwarajanand vishwarajanand marked this pull request as ready for review November 9, 2022 10:01
@vishwarajanand vishwarajanand requested review from a team as code owners November 9, 2022 10:01
@vishwarajanand vishwarajanand force-pushed the feat_datastore_query_limit branch 2 times, most recently from 2377993 to 59a73da Compare November 9, 2022 15:14
@vishwarajanand vishwarajanand force-pushed the feat_datastore_query_limit branch from 59a73da to f902098 Compare November 9, 2022 15:35
@dwsupplee
Copy link
Contributor

While researching this I came across a similar issue: googleapis/google-cloud-python#1763

It looks like the solution they landed on is to update the limit on subsequent calls in a paged set and utilize offsets. This approach could provide some benefit over using resultLimit, as it looks like it should actually limit how much data is sent over the wire. With resultLimit we could have a response come back with 300 items in it, but only actually send 150 to the user as it is only a client side limiter (more intended to prevent further paging).

WDYT about using that approach here, too? I did some quick testing and it seems like we should be able to support it.

@vishwarajanand
Copy link
Contributor Author

@dwsupplee our doc suggests that we should rather move away from offsets to save costs, so we should continue with cursors (current approach is similar to Ruby, etc).

CGC Screenshot 2022-11-21 at 19 02 15

To prevent over-fetching of records in last page, I switched to managing limits via a new remainingLimit variable passed by reference across callable of determineNextResultToken and runQuery.

I am keeping resultLimit for safety. Are you in favor of removing it?

@dwsupplee
Copy link
Contributor

@vishwarajanand Just a heads up this is on my radar! I'll be reviewing shortly.

@vishwarajanand
Copy link
Contributor Author

@dwsupplee Sure, thanks for the headsup. I have merged the latest main to pull in all changes from multiple-db feature.

@dwsupplee
Copy link
Contributor

@vishwarajanand and I discussed the PR offline a bit, looking good for the most part just a few minor updates to be expected.

Datastore/src/EntityPageIterator.php Outdated Show resolved Hide resolved
Datastore/src/Operation.php Outdated Show resolved Hide resolved
if (isset($res['query']['limit'])) {
$remainingLimit = $res['query']['limit'];
}
if (isset($remainingLimit['value'])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks redundant - ['value'] could be coming from $res['query']['limit'], or from $query->queryObject()['limit']. So why set it above just to set it again here? It seems to me like what you'd want to do is set $remainingLimit to null, above, and then simply make this:

            if (isset($res['query']['limit']['value'])) {
                $remainingLimit = $res['query']['limit']['value'];
            }

But maybe there's something I'm missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several cases here:

  1. In a standard Query, (either GRPC or REST) we don't get a query key in the response. So, we set $remainingLimit using $query->queryObject()['limit']:
    $remainingLimit = $query->queryObject()['limit'];
  2. In GqlQuery, we don't have limit from the query. So while using REST, we get limit in $res['query']['limit']:
    $remainingLimit = $res['query']['limit'];
  3. In GqlQuery while using GRPC, we get limit in $res['query']['limit']['value']:
    if (isset($remainingLimit['value'])) {
    $remainingLimit = $remainingLimit['value'];

Datastore/src/Operation.php Outdated Show resolved Hide resolved
@vishwarajanand vishwarajanand merged commit b272644 into googleapis:main Dec 19, 2022
evan-burrell pushed a commit to evan-burrell/google-cloud-php that referenced this pull request Jan 3, 2023
@@ -242,7 +242,7 @@ public function runQuery(array $args)
$query['filter'] = $this->convertFilterProps($query['filter']);
}

if (isset($query['limit'])) {
if (isset($query['limit']) && is_int($query['limit'])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishwarajanand why was this added? This seems like it could break existing implementations if they supplied something like $query['limit'] = '1'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bshaffer I mentioned this in point #1 in PR description > Changes:

Under Grpc mode, when a Gql query string goes through runQuery, and a parsed query object is returned with correct limit array, it doesn't need to be modified in subsequent fetches. Added a check to skip that.

In Grpc mode, the non-Gql query provides$query['limit'] as an int which needs to be parsed into an array, but your comment suggests we support int-like string also.

If so, we should change the is_int to !is_array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bshaffer I raised another PR to fix this bug: #5826

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: datastore Issues related to the Datastore API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Datastore API: Limit the result not work if the result set has more then 300 rows
3 participants