This repository has been archived by the owner on Mar 20, 2018. It is now read-only.

Support sized pages in Page Streaming #86

Closed
tbetbetbe opened this issue Apr 19, 2016 · 17 comments

@tbetbetbe (Contributor) commented Apr 19, 2016

Early feedback suggests that in addition to simple iteration, a batch mode of operation should also be supported for page-streamed calls, where the user is able to iterate over batches of results.

Currently, page streaming supports the following:

>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> api = PublisherApi()
>>> topic = api.topic_path('google.com:pubsub-demo', 'my-topic')
>>> all_subs = list(api.list_topic_subscriptions(topic))
>>> all_subs
['sub_1', 'sub_2', 'sub_3', 'sub_2']

There should be a way to support batching:

>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> api = PublisherApi()
>>> topic = api.topic_path('google.com:pubsub-demo', 'my-topic')
>>> all_subs = list(??? with batchsize=2)
>>> all_subs
[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]

Proposals

  1. Add a helper to google.gax to support batch iteration. Users will import it and use it when they need it.
  2. Add a helper to google.gax as in 1. Also add a field to CallOptions, batch_size, which users set to iterate in batches of pages

Add a helper to google.gax to support batch iteration

>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> from google.gax import batch_iter
>>> api = PublisherApi()
>>> topic, batch_size = api.topic_path('google.com:pubsub-demo', 'my-topic'), 2
>>> all_subs = list(batch_iter(api.list_topic_subscriptions(topic), batch_size))
>>> all_subs
[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]
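The helper in proposal 1 could be sketched roughly as below. The name batch_iter comes from this proposal, but the body is only an illustrative assumption, similar in spirit to the grouping recipes in itertools, not the actual google.gax implementation:

```python
from itertools import islice

def batch_iter(iterable, batch_size):
    """Yield tuples of up to batch_size consecutive items from iterable.

    Illustrative sketch only; not the real google.gax helper.
    """
    iterator = iter(iterable)
    while True:
        batch = tuple(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With this sketch, `list(batch_iter(['sub_1', 'sub_2', 'sub_3', 'sub_2'], 2))` produces `[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]`, matching the example above.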

Pros

  • the generated code stays simple; no changes are required to the generator
  • the new batch_iter feature is straightforward, and is reminiscent of the helper functions in the itertools library

Cons

  • users have more of the gax-python surface to learn; the batch_iter function is a new feature to be learned

Add a helper to google.gax, also add batch_size to CallOptions

>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> from google.gax import CallOptions
>>> api = PublisherApi()
>>> topic, batch_size = api.topic_path('google.com:pubsub-demo', 'my-topic'), 2
>>> all_subs = list(api.list_topic_subscriptions(topic,
...    options=CallOptions(batch_size=batch_size)))
>>> all_subs
[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]

Pros

  • keeps the change to the surface elements the user needs to know about to a minimum; i.e., to use this feature, the user only needs to know about an additional field that we will document in CallOptions

Cons

  • the generated code for methods that support page-streaming is slightly more complex

@jmuk, @geigerj, @bjwatson, @anthmgoogle PTAL and discuss

Decision

  • the surface will look like this
>>> from google.pubsub.v1.publisher_api import PublisherApi
>>> from google.gax import CallOptions
>>> api = PublisherApi()
>>> topic = api.topic_path('google.com:pubsub-demo', 'my-topic')
>>> all_subs = list(api.list_topic_subscriptions(topic,
...    options=CallOptions(is_page_streaming=False)))
>>> all_subs  # page sizes are whatever the server returns
[('sub_1', 'sub_2'), ('sub_3',), ('sub_4',)]

There's an open question that is OK to resolve after implementation begins

@geigerj (Contributor) commented Apr 19, 2016

I'm confused by this request. The GAX implementation of page streaming already constructs the result lazily as a Python generator, so there's no performance benefit to this (except perhaps that we can use the batch size to set the page_size field) -- the difference seems to be only in the depth of nesting of the returned result.

What is the use case for this?

@tbetbetbe (Contributor, Author)

What's the use case for this?

There is a request to allow the page streaming iterator yield batches of results instead of individual items. @anthmgoogle can provide more details.

@geigerj (Contributor) commented Apr 19, 2016

Is this feature

(a) an abstraction on top of paging, where each element of the response contains a fixed number of instances of an API resource, and where each element may not correspond to a single request?

(b) a way to hide paging concepts like page_token, next_page_token, and page_size, while still exposing the contents of the page itself?

The examples you gave seem more like (a), but I don't see the benefit over the current version of page streaming that we have. (b) seems like a nice way not to hide the API requests under a Python generator like we currently do. It also doesn't require us to define additional concepts like batch_size, since we're just exposing what API already makes available.

In (b), I mean something like:

>>> api = PublisherApi()
>>> topic = api.topic_path('google.com:pubsub-demo', 'my-topic')
>>> pages = list(api.list_topic_subscriptions(topic, page_size=4))
>>> pages
[('sub_1', 'sub_2', 'sub_3'), ('sub_4', 'sub_5', 'sub_6', 'sub_7'), ('sub_8',)]

where each tuple corresponds to exactly one API response, and contains a maximum of, but not necessarily exactly, four elements, depending on what the response was.
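Interpretation (b) -- one tuple per API response -- could be sketched as a generator over raw list calls. Here fetch_page is a hypothetical stand-in for a single list RPC, assumed to return (items, next_page_token) with a falsy token once the listing is exhausted:

```python
def iter_pages(fetch_page):
    """Yield one tuple per underlying API response.

    fetch_page(token) is a hypothetical stand-in for one list RPC;
    it is assumed to return (items, next_page_token), where
    next_page_token is falsy on the last page.
    """
    token = None
    while True:
        items, token = fetch_page(token)
        yield tuple(items)
        if not token:
            return
```

Each yielded tuple then mirrors exactly one server response, however many elements the server chose to return.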

@jmuk (Contributor) commented Apr 19, 2016

Correct me if I'm wrong, but the request here is to allow accessing the page data underlying the stream, so that users can see the page token (or remember it for later use), for example.

That means grouping items by a fixed number is not sufficient, and could be wrong, since the server may not fill pages to the requested page size.

@anthmgoogle

It's definitely (b), a way of making going page by page also have a nice idiomatic surface rather than mucking around with tokens.

The terms "Pagination" and "Batched" are both problematic descriptions of this. We already rejected "Pagination" because of the disconnect with the English meaning. "Batching" is a very different feature in gRPC. I think "Paging" or "Paged" is clearer.

@geigerj (Contributor) commented Apr 19, 2016

@jmuk Users can just save a Python generator that maintains the page token in its state; they shouldn't need to use the token itself. If they do need to see the token, you can always turn off the feature with api.list_topic_subscriptions(topic, options=CallOptions(is_page_streaming=False)).

@tbetbetbe changed the title from "Support Batched Pagination" to "Support sized pages in Page Streaming" on Apr 19, 2016
@tbetbetbe (Contributor, Author)

@anthmgoogle: I just updated the issue name, please adjust if necessary

@anthmgoogle

Agree this is a good name for the feature.

@jmuk (Contributor) commented Apr 19, 2016

If it's (b), I still think that specifying the size of the batch (or page, or whatever) isn't sufficient, because it doesn't correspond to the actual page data behind the scenes. It's rather a boolean flag for whether to iterate over resources or over pages.

@jmuk (Contributor) commented Apr 19, 2016

Also -- I think this is similar to files (or string streams) in a sense: someone wants to look through individual characters, and someone else wants to iterate over lines. In Python, files are iterable but have utility methods to support other types of iteration.

Thus I suggest a different idea: list_topic_subscriptions() will return an iterable object -- which iterates over resources by default -- but will have a method, pages(), for iterating over pages, as follows:

>>> all_subs = list(api.list_topic_subscriptions(topic))
>>> all_subs
['sub_1', 'sub_2', 'sub_3', 'sub_2']
>>> all_sub_pages = list(api.list_topic_subscriptions(topic).pages())
>>> all_sub_pages
[('sub_1', 'sub_2'), ('sub_3', 'sub_2')]
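This suggestion could look roughly like the wrapper below. The class name and the fetch_page callback are assumptions for illustration only, not the actual gax-python surface:

```python
class PageStreamedResult:
    """Iterable over resources by default, with a pages() view.

    Hypothetical sketch. fetch_page(token) stands in for one list
    RPC, assumed to return (items, next_page_token) with a falsy
    token on the last page.
    """
    def __init__(self, fetch_page):
        self._fetch_page = fetch_page

    def pages(self):
        # Iterate page by page: one tuple per server response.
        token = None
        while True:
            items, token = self._fetch_page(token)
            yield tuple(items)
            if not token:
                return

    def __iter__(self):
        # Default iteration flattens the pages into single resources.
        for page in self.pages():
            for item in page:
                yield item
```

Note that each call to pages() (or to iter()) starts a fresh traversal from the first page; a real implementation would need to decide whether the result is single-use.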

@tbetbetbe (Contributor, Author) commented Apr 23, 2016

@jgeewax can you comment on (a) vs. (b)?

I got the impression that specifying page_size at the surface should guarantee that all pages of response data seen by the user (except perhaps the last) will have the same size, even if the underlying API responses don't guarantee that.

@anthmgoogle

For reference, gcloud-java does something very much like what jmuk@ described. If that is also idiomatic to Python it sounds like a good approach.

I was expecting this would present a nicer surface over the paging of the underlying API, not present a new paging view separate and on top of that. I think this may be important for cases of actual UI paging or map-reduce to optimize round-trips. It also seems more in keeping with the principle of making the calls nicer while keeping the relationship between client and server calls simple and clear.

@jgeewax to confirm if this has been the case for paging implementations in the other gcloud languages.

@tbetbetbe (Contributor, Author)
For reference, gcloud-java does something very much like what jmuk@ described. If that is also idiomatic to Python it sounds like a good approach.

It's not. Generators in Python do not have extra methods attached like this.

@geigerj (Contributor) commented Apr 25, 2016

We could modify CallOptions to change the "is_page_streaming" flag to a "pagination" flag that takes three values: page streaming, paging, or None.

Alternatively, we could keep the CallOptions the same, and when is_page_streaming=False, then we can return a generator over pages, rather than the proto response. We could attach any other fields of the proto response (e.g., next_page_token) to the generator as attributes so that the generator contains the same information as the proto response. (EDIT: not sure that this is possible.)

@tbetbetbe (Contributor, Author)

Please begin implementing the option you just described:

when is_page_streaming=False, then we can return a generator over pages

I think this issue has fulfilled its role in outlining alternatives.

This option is best as it is most consistent with the existing design, and it should require only documentation changes to the generated code. Let's continue the design discussion on the implementation PR. In particular, the question of whether to return the server pages or a view on them can continue there. I will update the starting comment to note this decision.

@tbetbetbe (Contributor, Author)

PTAL @tseaver

@geigerj (Contributor) commented Apr 26, 2016

Let's continue the design discussion on the implementation PR.

Please see #94.
