
Datastore query limits are silently truncated by the v1beta3 API #1763

Closed
Bogdanp opened this issue Apr 28, 2016 · 19 comments
Assignees
Labels
api: datastore Issues related to the Datastore API.

Comments


Bogdanp commented Apr 28, 2016

This bug manifests itself in a couple of "fun" ways:

len(list(client.query(kind="Foo").fetch(limit=300)))

Will yield 300, assuming there are 300 or more entities of that kind.

len(list(client.query(kind="Foo").fetch(limit=500)))

Assuming there are exactly 500 entities of that kind, this will yield 500. But if there are n entities where n > 500, it will yield n, silently exceeding the requested limit.

This all seems to be caused by the v1beta3 API truncating the limit to 300:

entities, more, cursor = client.query(kind="Foo").fetch(limit=500).next_page()

Here, len(entities) will be 300, not 500, even if there are 500 or more entities of that kind.

We're currently using the following snippet to work around this issue:

def _all_entities(iterator):
    """Drain a fetch() iterator page by page while honoring its limit.

    Works around the 300-entity batch truncation by decrementing the
    iterator's (private) _limit after every page.
    """
    entities, cursor = [], None
    while iterator._limit > 0:
        xs, more, cursor = iterator.next_page()
        iterator._limit -= len(xs)
        entities.extend(xs)
        if not more:
            break

    return entities, cursor
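
For reference, a minimal usage sketch of that workaround (assuming the 2016-era gcloud-python API, where fetch() returns an iterator exposing next_page() and a private _limit attribute):

from gcloud import datastore  # package name at the time; "google.cloud" in later releases

client = datastore.Client()

# Cap the query at 500 results, then drain it page by page so the
# ~300-entity batch truncation no longer matters.
iterator = client.query(kind="Foo").fetch(limit=500)
entities, cursor = _all_entities(iterator)
assert len(entities) <= 500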
@tseaver tseaver added the api: datastore Issues related to the Datastore API. label Apr 28, 2016

dhermes commented Apr 28, 2016

@Bogdanp This is a "feature", not a bug. The API doesn't promise to return your full fetch limit; it only promises not to exceed it. The backend typically stops a request after a preset amount of time. Is the more value not useful? (The actual value returned by the backend is a bit different.)

@pcostell can say more, though I am preemptively closing this since there isn't anything we can do about it in the library.


Bogdanp commented Apr 28, 2016

it just promises not to exceed it

The library does exceed it in the second example I gave, though: requesting 500 entities ends up returning more than the limit whenever more entities than that limit exist.


dhermes commented Apr 28, 2016

Oh wow, my bad! Good thing I forgot to click the close button. Reproducing now on my end.


Bogdanp commented Apr 28, 2016

No worries! Regarding the "feature", is it possible to get that bumped up somehow or is that set in stone? The old API used to allow much higher values for limits.


Bogdanp commented Apr 28, 2016

The issue is with the way __iter__ works, btw: it naively calls next_page without adjusting its limit based on the results it has already received, so it eventually exceeds the limit if there are enough entities. This used to work fine because, like I mentioned above, the old API didn't truncate limits (or, if it did, it did so at a much higher number than 300). The fix would probably look a lot like my _all_entities function above.
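
To make the failure mode concrete, here is a rough sketch (not the library's actual code) of the naive loop, assuming next_page() returns (entities, more, cursor) and the backend caps each batch at roughly 300 results:

# Naive iteration: the original limit (e.g. 500) is re-sent with every
# request, each batch comes back truncated to ~300 entities, and the
# loop only stops once the backend reports no more results -- so the
# caller can end up with far more than 500 entities in total.
def _iter_naive(iterator):
    while True:
        entities, more, cursor = iterator.next_page()
        for entity in entities:
            yield entity
        if not more:
            break

The fix is to decrement the remaining limit after every page and stop once it reaches zero, exactly as the _all_entities workaround above does.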


dhermes commented Apr 28, 2016

OK, this is totally a bug in our implementation, essentially identical to #1467. It occurs because limit=500 is sent on every request without taking into account the results already returned. Bugs like this are one reason we are hesitant to implement an __iter__ that makes more than one API call.

As for 300 vs. 500, that's completely out of gcloud-python's control, though maybe @pcostell can weigh in. You could also file an issue on https://github.com/GoogleCloudPlatform/google-cloud-datastore


dhermes commented Apr 28, 2016

Jinx!

@pcostell

The Cloud Datastore API has to reserve the right not to return all the results -- we need to protect both our servers and other users on them. This is why we expose our API with support for batching -- but the client libraries need to take advantage of this in order to expose a nicer API to our users.

I think __iter__ needs to make more than one API call -- it should return exactly as many items as the user asks for (or fewer if that many don't exist). Also, I think next_page should probably be implemented in terms of __iter__, not the other way around. This is very similar to gcloud-node#1260; I'm just going to quote something I said there:

The way I see it is that our desired behavior is a streaming API. If a user asks for a limit of N, we will give them a stream of up to N entities. We can then add some helpers on top of that to convert the stream to a list.

Cloud Datastore then offers a paging feature, which provides the user a cursor they can use to get extra pages. On top of that, Cloud Datastore offers best-effort end-condition tracking (more_results). This is very useful for developers showing pages to their users, since it tells them whether or not to show a "Next Page" button (this is MORE_RESULTS_AFTER_LIMIT). This feature is conservative (MORE_RESULTS_AFTER_LIMIT may be returned even if the correct result is NO_MORE_RESULTS).

Importantly, Cloud Datastore's paging feature is completely separate from the batching API, which is really just an implementation detail because prior to gRPC we couldn't offer a streaming API.
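
Sketched in Python, that streaming model might look roughly like this (a hand-rolled illustration, not the client library's implementation; it assumes the next_page()-style batch call used elsewhere in this thread and a fetch() that accepts a start_cursor):

def stream_entities(query, limit=None):
    # Yield up to `limit` entities, issuing as many batched RPCs as needed.
    remaining = limit
    cursor = None
    while remaining is None or remaining > 0:
        # Ask for at most `remaining` results, resuming from the last
        # cursor; the backend may return fewer per batch (e.g. ~300).
        iterator = query.fetch(limit=remaining, start_cursor=cursor)
        entities, more, cursor = iterator.next_page()
        for entity in entities:
            yield entity
        if remaining is not None:
            remaining -= len(entities)
        if not more:
            break

A list is then just list(stream_entities(query, limit=500)), and paging falls out of exposing the cursor between calls.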


dhermes commented Apr 29, 2016

Thanks @pcostell.


pcostell commented May 4, 2016

@dhermes can we get this bug assigned to someone? The Cloud Datastore batch size shouldn't be something client library users have to worry about.

@dhermes dhermes self-assigned this May 4, 2016
dhermes added a commit to dhermes/google-cloud-python that referenced this issue May 11, 2016
dhermes added a commit to dhermes/google-cloud-python that referenced this issue May 12, 2016
dhermes added a commit to dhermes/google-cloud-python that referenced this issue May 17, 2016
dhermes added a commit to dhermes/google-cloud-python that referenced this issue May 17, 2016
dhermes added a commit to dhermes/google-cloud-python that referenced this issue May 17, 2016
dhermes added a commit to dhermes/google-cloud-python that referenced this issue May 17, 2016
In the process, also keeping track of the number of skipped results (so that we can update the offset).

Fixes googleapis#1763.

Rafff commented Jun 6, 2017

Hey.
So there is no way to get more than 300 results at once?
I can see that this issue has been closed for a long time, but it seems this "problem" (or feature?) still exists.


dhermes commented Jun 6, 2017

@Rafff How do you mean? The API has limits based on payload size, and there isn't anything we can do about that in the client. However, we do provide an iterator so that you don't have to worry about it.
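
For illustration, with the fix referenced above the iterator re-issues batched requests under the hood, so plain iteration is enough (a sketch assuming the post-fix google-cloud-datastore client; the kind name is made up):

from google.cloud import datastore

client = datastore.Client()
query = client.query(kind="Foo")

# Iterating drains every batch the backend returns; the per-batch size
# is an implementation detail the caller never sees.
all_entities = list(query.fetch())           # everything that matches
first_1000 = list(query.fetch(limit=1000))   # at most 1000, across batches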


ibrahim-abuelalaa commented Dec 11, 2017

Hi @dhermes,
I have an issue that I believe is somehow related to this one. My datastore has more than 8000 entities, but when I try to use offset > 350 with limit=2000 it returns fewer than 1900 entities, and if I increase the offset to 1000 it returns 0 entities.

Hint: all my entities match the query


dhermes commented Dec 11, 2017

@ibrahim-abuelalaa Would you mind sharing some code so we can try to reproduce?


ibrahim-abuelalaa commented Dec 11, 2017

@dhermes

from google.cloud import datastore

client = datastore.Client()
query = client.query(kind=kind)  # `kind`, `projectKey`, `limit`, and `offset` are defined elsewhere
query.add_filter('projectKey', '=', projectKey)
len(list(query.fetch(limit=limit, offset=offset)))


dhermes commented Dec 11, 2017

So you have something like:

>>> query_iter1 = query.fetch()
>>> len(list(query_iter1))
8001
>>>
>>> query_iter2 = query.fetch(offset=351, limit=2000)
>>> len(list(query_iter2))
1899
>>>
>>> query_iter3 = query.fetch(offset=1000, limit=2000)
>>> len(list(query_iter3))
0


dhermes commented Dec 11, 2017

@ibrahim-abuelalaa Is the distinction relevant? (I.e. is "almost" good enough?) I just want to be able to reproduce this issue.

@ibrahim-abuelalaa

@dhermes yes, it is exactly like that; sorry for the confusion.
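
A related note, not raised in the thread but consistent with Datastore's documented behavior: large offsets are applied by skipping results on the server, so cursor-based paging is usually the safer way to walk deep into a result set. A sketch with the same query, assuming the later client where the fetch() iterator exposes pages and next_page_token:

from google.cloud import datastore

client = datastore.Client()
query = client.query(kind=kind)
query.add_filter('projectKey', '=', projectKey)

# Page through results with cursors instead of a large offset.
cursor = None
while True:
    iterator = query.fetch(limit=1000, start_cursor=cursor)
    page = next(iterator.pages)          # consume one batch
    entities = list(page)
    if not entities:
        break
    # ... process entities ...
    cursor = iterator.next_page_token    # resume point for the next batch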

@edwardfung123

Hello. Is there any way to work around the 300-entity limit? Is there any documentation about it? I searched the library source code and found nothing related to this 300 limit. Is it enforced on the server side? What is the best practice for iterating over all the entities: should we set the limit to None, or use cursors?
