This repository has been archived by the owner on Mar 20, 2018. It is now read-only.

Support sized pages in Page Streaming #86

Closed
tbetbetbe opened this issue Apr 19, 2016 · 17 comments

@tbetbetbe (Contributor) commented Apr 19, 2016

Early feedback suggests that in addition to simple iteration, a batch mode of operation should also be supported for page-streamed calls, where the user is able to iterate over batches of results.

Currently, page streaming supports the following:

>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> api = PublisherApi()
>>> topic = api.topic_path('google.com:pubsub-demo', 'my-topic')
>>> all_subs = list(api.list_topic_subscriptions(topic))
>>> all_subs
['sub_1', 'sub_2', 'sub_3', 'sub_2']

There should be a way to support batching:

>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> api = PublisherApi()
>>> topic = api.topic_path('google.com:pubsub-demo', 'my-topic')
>>> all_subs = list(??? with batchsize=2)
>>> all_subs
[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]

Proposals

  1. Add a helper to google.gax to support batch iteration. Users will import it and use it when they need it.
  2. Add a helper to google.gax as in 1. Also add a field to CallOptions, batch_size, which users set to iterate in batches of pages

Add a helper to google.gax to support batch iteration

>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> from google.gax import batch_iter
>>> api = PublisherApi()
>>> topic, batch_size = api.topic_path('google.com:pubsub-demo', 'my-topic'), 2
>>> all_subs = list(batch_iter(api.list_topic_subscriptions(topic), batch_size))
>>> all_subs
[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]
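The helper in proposal 1 could be sketched roughly as below. The name batch_iter comes from this proposal, but the body is only an illustrative assumption, similar in spirit to the grouping recipes in itertools, not the actual google.gax implementation:

```python
from itertools import islice

def batch_iter(iterable, batch_size):
    """Yield tuples of up to batch_size consecutive items from iterable.

    Illustrative sketch only; not the real google.gax helper.
    """
    iterator = iter(iterable)
    while True:
        batch = tuple(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With this sketch, `list(batch_iter(['sub_1', 'sub_2', 'sub_3', 'sub_2'], 2))` produces `[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]`, matching the example above.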

Pros

  • the generated code stays simple; no changes are required to the generator
  • the new batch_iter feature is straightforward, and is reminiscent of the helper functions in the itertools library

Cons

  • users have more of the gax-python surface to learn; the batch_iter function is a new feature to be learned

Add a helper to google.gax, also add batch_size to CallOptions

>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> from google.gax import CallOptions
>>> api = PublisherApi()
>>> topic, batch_size = api.topic_path('google.com:pubsub-demo', 'my-topic'), 2
>>> all_subs = list(api.list_topic_subscriptions(topic,
...    options=CallOptions(batch_size=batch_size)))
>>> all_subs
[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]

Pros

  • keeps the change to the surface elements the user needs to know about to a minimum; i.e., to use this feature, the user only needs to know about an additional field that we will document in CallOptions

Cons

  • the generated code for methods that support page-streaming is slightly more complex

@jmuk, @geigerj, @bjwatson, @anthmgoogle PTAL and discuss

Decision

  • the surface will look like this
>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> from google.gax import CallOptions
>>> api = PublisherApi()
>>> topic = api.topic_path('google.com:pubsub-demo', 'my-topic')
>>> all_subs = list(api.list_topic_subscriptions(topic,
...    options=CallOptions(is_page_streaming=False)))
>>> all_subs  # page sizes are whatever the server returns
[('sub_1', 'sub_2'), ('sub_3',), ('sub_4',)]

There's an open question that is OK to resolve after implementation begins

@geigerj (Contributor) commented Apr 19, 2016

I'm confused by this request. The GAX implementation of page streaming already constructs the result lazily as a Python generator, so there's no performance benefit to this (except perhaps that we can use the batch size to set the page_size field) -- the difference seems to be only in the depth of nesting of the returned result.

What is the use case for this?

@tbetbetbe (Contributor, Author)

What's the use case for this?

There is a request to allow the page streaming iterator yield batches of results instead of individual items. @anthmgoogle can provide more details.

@geigerj (Contributor) commented Apr 19, 2016

Is this feature

(a) an abstraction on top of paging, where each element of the response contains a fixed number of instances of an API resource, and where each element may not correspond to a single request?

(b) a way to hide paging concepts like page_token, next_page_token, and page_size, while still exposing the contents of the page itself?

The examples you gave seem more like (a), but I don't see the benefit over the current version of page streaming that we have. (b) seems like a nice way not to hide the API requests under a Python generator like we currently do. It also doesn't require us to define additional concepts like batch_size, since we're just exposing what API already makes available.

In (b), I mean something like:

>>> api = PublisherApi()
>>> topic = api.topic_path('google.com:pubsub-demo', 'my-topic')
>>> pages = list(api.list_topic_subscriptions(topic, page_size=4))
>>> pages
[('sub_1', 'sub_2', 'sub_3'), ('sub_4', 'sub_5', 'sub_6', 'sub_7'), ('sub_8',)]

where each tuple corresponds to exactly one API response, and contains a maximum of, but not necessarily exactly, four elements, depending on what the response was.
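Interpretation (b) -- one tuple per API response -- could be sketched as a generator over raw list calls. Here fetch_page is a hypothetical stand-in for a single list RPC, assumed to return (items, next_page_token) with a falsy token once the listing is exhausted:

```python
def iter_pages(fetch_page):
    """Yield one tuple per underlying API response.

    fetch_page(token) is a hypothetical stand-in for one list RPC;
    it is assumed to return (items, next_page_token), where
    next_page_token is falsy on the last page.
    """
    token = None
    while True:
        items, token = fetch_page(token)
        yield tuple(items)
        if not token:
            return
```

Each yielded tuple then mirrors exactly one server response, however many elements the server chose to return.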

@jmuk (Contributor) commented Apr 19, 2016

Correct me if I'm wrong, but the request here is to allow accessing the page data underlying the stream, so that users can see the page token (or remember it for later use), for example.

That means grouping items by a fixed number is not sufficient, and could be wrong, since the server may not fill pages to the requested page size.

@anthmgoogle

It's definitely (b), a way of making going page by page also have a nice idiomatic surface rather than mucking around with tokens.

The terms "Pagination" and "Batched" are both problematic descriptions of this. We already rejected "Pagination" because of the disconnect with the English meaning. "Batching" is a very different feature in gRPC. I think "Paging" or "Paged" is clearer.

@geigerj (Contributor) commented Apr 19, 2016

@jmuk Users can just save a Python generator that maintains the page token in its state; they shouldn't need to use the token itself. If they do need to see the token, you can always turn off the feature with api.list_topic_subscriptions(topic, options=CallOptions(is_page_streaming=False)).

@tbetbetbe changed the title from "Support Batched Pagination" to "Support sized pages in Page Streaming" on Apr 19, 2016
@tbetbetbe (Contributor, Author)

@anthmgoogle: I just updated the issue name, please adjust if necessary

@anthmgoogle

Agree this is a good name for the feature.

@jmuk (Contributor) commented Apr 19, 2016

If it's (b), I still think that specifying the size of the batch (or page, or whatever) isn't sufficient, because it doesn't correspond to the actual page data behind the scenes. It's rather a boolean flag for whether to iterate over resources or over pages.

@jmuk (Contributor) commented Apr 19, 2016

Also -- I think this is similar to files (or string streams) in a sense: someone wants to look through individual characters, and someone else wants to iterate over lines. In Python, files are iterable but have utility methods to support other types of iteration.

Thus I suggest a different idea: list_topic_subscriptions() will return an iterable object -- which iterates over resources by default -- but will have a method, pages(), for iterating over pages, as follows:

>>> all_subs = list(api.list_topic_subscriptions(topic))
>>> all_subs
['sub_1', 'sub_2', 'sub_3', 'sub_2']
>>> all_sub_pages = list(api.list_topic_subscriptions(topic).pages())
>>> all_sub_pages
[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]
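This suggestion could look roughly like the wrapper below. The class name and the fetch_page callback are assumptions for illustration only, not the actual gax-python surface:

```python
class PageStreamedResult:
    """Iterable over resources by default, with a pages() view.

    Hypothetical sketch. fetch_page(token) stands in for one list
    RPC, assumed to return (items, next_page_token) with a falsy
    token on the last page.
    """
    def __init__(self, fetch_page):
        self._fetch_page = fetch_page

    def pages(self):
        # Iterate page by page: one tuple per server response.
        token = None
        while True:
            items, token = self._fetch_page(token)
            yield tuple(items)
            if not token:
                return

    def __iter__(self):
        # Default iteration flattens the pages into single resources.
        for page in self.pages():
            for item in page:
                yield item
```

Note that each call to pages() (or to iter()) starts a fresh traversal from the first page; a real implementation would need to decide whether the result is single-use.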

@tbetbetbe (Contributor, Author) commented Apr 23, 2016

@jgeewax can you comment on (a) vs. (b)?

I got the impression that specifying page_size at the surface should guarantee that all pages of response data seen by the user (except perhaps the last) will have the same size, even if the underlying API responses don't guarantee that.

@anthmgoogle

For reference, gcloud-java does something very much like what jmuk@ described. If that is also idiomatic to Python it sounds like a good approach.

I was expecting this would present a nicer surface over the paging of the underlying API, not present a new paging view separate and on top of that. I think this may be important for cases of actual UI paging or map-reduce to optimize round-trips. It also seems more in keeping with the principle of making the calls nicer while keeping the relationship between client and server calls simple and clear.

@jgeewax to confirm if this has been the case for paging implementations in the other gcloud languages.

@tbetbetbe (Contributor, Author)
For reference, gcloud-java does something very much like what jmuk@ described. If that is also idiomatic to Python it sounds like a good approach.

It's not. Generators in Python do not have extra methods attached like this.

@geigerj (Contributor) commented Apr 25, 2016

We could modify CallOptions to change the "is_page_streaming" flag to a "pagination" flag that takes three values: page streaming, paging, or None.

Alternatively, we could keep the CallOptions the same, and when is_page_streaming=False, then we can return a generator over pages, rather than the proto response. We could attach any other fields of the proto response (e.g., next_page_token) to the generator as attributes so that the generator contains the same information as the proto response. (EDIT: not sure that this is possible.)

@tbetbetbe (Contributor, Author)

Please begin implementing the option you just described:

when is_page_streaming=False, then we can return a generator over pages

I think this issue has fulfilled its role in outlining alternatives.

This option is best as it is most consistent with the existing design, and it should require only documentation changes to the generated code. Let's continue the design discussion on the implementation PR. In particular, the question of whether to return the server pages or a view on them can continue there. I will update the starting comment to note this decision.

@tbetbetbe (Contributor, Author)

PTAL @tseaver

@geigerj (Contributor) commented Apr 26, 2016

Let's continue the design discussion on the implementation PR.

Please see #94.
