pagination: scans require an extra request to fetch an empty page of items #20

davepacheco · 2020-08-06T15:43:43Z

Currently, the next_page token on ResultsPage is populated whenever the page that we're returning to the client has any items in it. This works, but it means that for any scan through a collection, the client needs to make an extra request at the end to fetch an empty page in order to see that it's the end of the scan. Ideally, the last page with any results on it would indicate that there were no more results and we could skip this extra request. The problem is that right now, Dropshot doesn't know this information. One simple interface would be for the consumer to provide the limit that they used, and Dropshot could say that if they provided fewer items than the limit, then it's the end of the collection. (This would still require a theoretically request that's theoretically unnecessary if the scan ends with a full page of items, but this seems a lot less likely.) We could also ask the consumer to provide an explicit boolean indicating whether there are more items.

Another consideration: I've worked with systems that would fetched N records from the database, fail to process 1-2 of them, and send the results back to the client. If a consumer worked this way, using the interface I described above, we would erroneously conclude that because the page was non-full, we've finished the scan -- having omitted an arbitrarily large number of results after that. Obviously that's just a bug in the consumer, but it'd be nice to make that impossible if we can.

The text was updated successfully, but these errors were encountered:

david-crespo · 2021-06-30T03:36:58Z

I vote for letting the consumer explicitly indicate the end of results.

ahl · 2021-07-09T06:25:22Z

@davepacheco are you imagining that ResultsPage::new would take a boolean and then we'd do:

    let next_page = if has_more {
        items
            .last()
            .map(|last_item| {
                let selector = get_page_selector(last_item, scan_params);
                serialize_page_token(selector)
            })
            .transpose()?
    } else {
        None
    };

Alternatively it seems like ResultsPage::new could do either more or less for the consumer.

Less: the signature could be like this:

    pub fn new<F, ScanParams, PageSelector>(
        items: Vec<ItemType>,
        scan_params: &ScanParams,
        next_page: Option<PageSelector>,
    ) -> Result<ResultsPage<ItemType>, HttpError>
        PageSelector: Serialize,

i.e. the consumer could be required to do all the work of figuring out the next page

Or more: the signature could be like this:

    pub fn new<F, ScanParams, PageSelector>(
        items: dyn Interator<Item=ItemType>,
        pag_params: &PaginationParams<ScanParams, PageSelector>,
        scan_params: &ScanParams,
        get_page_selector: F,
    ) -> Result<ResultsPage<ItemType>, HttpError>
    where
        F: Fn(&ItemType, &ScanParams) -> PageSelector,
        PageSelector: Serialize,

In this case the consumer would look like this:

    let items = iter.map(|p| (*p).clone());

    Ok(HttpResponseOk(ResultsPage::new(items, &pag_params, &scan_params, page_selector)?))

and new() would look like this:

    let some_items = items.limit(limit).collect(); // <- I guess we'd need the request context to derive the limit
    let next_page = items.next().map(|_| some_items.last().map(...));

i.e. have new() take the Iterator<Item = ItemType>, have it peek off the first limit items and then it would be able to use next() to see if there were any left.

We could potentially use a std::iter::Peekable if we wanted to allow clients to cache and reuse Interators across successive invocations.

davepacheco · 2021-07-23T18:48:48Z

@davepacheco are you imagining that ResultsPage::new would take a boolean and then we'd do:

    let next_page = if has_more {
        items
            .last()
            .map(|last_item| {
                let selector = get_page_selector(last_item, scan_params);
                serialize_page_token(selector)
            })
            .transpose()?
    } else {
        Non
    };

Yes, that is what I was envisioning, but I don't love it because of how easy it is to get wrong.

Alternatively it seems like ResultsPage::new could do either more or less for the consumer.

Less: the signature could be like this:
    pub fn new<F, ScanParams, PageSelector>(
        items: Vec<ItemType>,
        scan_params: &ScanParams,
        next_page: Option<PageSelector>,
    ) -> Result<ResultsPage<ItemType>, HttpError>
        PageSelector: Serialize,
i.e. the consumer could be required to do all the work of figuring out the next page

Specifically, they're doing the part that today looks at the last item and constructs a page selector for it using the given scan params? I was a little worried it would be easy to do the wrong thing here (e.g., make a page selector for the first item, or to just look at the interface and not understand which item you were supposed to use).

Or more: the signature could be like this:

    pub fn new<F, ScanParams, PageSelector>(
        items: dyn Interator<Item=ItemType>,
        pag_params: &PaginationParams<ScanParams, PageSelector>,
        scan_params: &ScanParams,
        get_page_selector: F,
    ) -> Result<ResultsPage<ItemType>, HttpError>
    where
        F: Fn(&ItemType, &ScanParams) -> PageSelector,
        PageSelector: Serialize,

In this case the consumer would look like this:

    let items = iter.map(|p| (*p).clone());

    Ok(HttpResponseOk(ResultsPage::new(items, &pag_params, &scan_params, page_selector)?))

and new() would look like this:

    let some_items = items.limit(limit).collect(); // <- I guess we'd need the request context to derive the limit
    let next_page = items.next().map(|_| some_items.last().map(...));

i.e. have new() take the Iterator<Item = ItemType>, have it peek off the first limit items and then it would be able to use next() to see if there were any left.

We could potentially use a std::iter::Peekable if we wanted to allow clients to cache and reuse Interators across successive invocations.

Is this an Iterator over all the items in the scan, or just the one page of results that the client is getting back? With pagination, I usually think of the scan being backed by a traditional relational database, where you'll want to include the limit in the SQL query and fetch only one page of results. (This can affect the query plan and avoid having database do a bunch of extra throwaway work. I think it's also a safer programming pattern: if you don't do this, and somebody forgets to call limit(), you'll wind up doing a whole table scan in the context of the request.) So you won't have an iterator over all the items in the scan -- just one page. And you can't tell from how many items you have whether there are any more.

The simplest efficient way I can think for an RDBMS-based consumer to know whether there are more results is to apply a limit of $limit + 1 to the SQL query and see if it got back $limit + 1 results. We could provide this information to Dropshot either with a boolean (which I think is easy to get wrong) or using a Vec or Iterator -- but it feels kind of weird to explain how to use this thing. ("You should provide an iterator for the items in the scan starting at the page selector. It should contain at least $limit + 1 items if they exist, or else all of the remaining items in the scan. It's also fine if it contains more than $limit + 1 items. In all cases, at most $limit items will be returned.")

This is all easier if you're willing to accept incorrect behavior when the last page of results contains limit items. I said in my original comment that this was okay, but the more I think about it: it doesn't seem worth it to do a bunch of work, have a more complicated interface, and still get it wrong sometimes. I'm inclined to keep punting on this. Thoughts?

davepacheco · 2021-07-27T18:02:49Z

Another idea discussed in today's tactical huddle: instead of using a boolean to communicate to dropshot that there are no more results, it could be an enum of DontKnow, HaveMore, and DontHaveMore. This could be a new ResultsPage constructor that takes the new enum. This would make it a little harder to accidentally do the wrong thing I was worried about above because if anything, the default would be DontKnow (i.e., if you just use ResultsPage::new), but you can still opt into the slightly better behavior.

smklein · 2021-07-27T18:05:31Z

Another idea discussed in today's tactical huddle: instead of using a boolean to communicate to dropshot that there are no more results, it could be an enum of DontKnow, HaveMore, and DontHaveMore.

This is an extremely nitty nit, but IMO MightBeMore is slightly better than DontKnow, as it communicates to the client that "if you want to be exhaustive, you should ask" rather than implying a value which could be interpreted as an error condition.

davepacheco · 2021-07-27T18:19:13Z

Another idea discussed in today's tactical huddle: instead of using a boolean to communicate to dropshot that there are no more results, it could be an enum of DontKnow, HaveMore, and DontHaveMore.

This is an extremely nitty nit, but IMO MightBeMore is slightly better than DontKnow, as it communicates to the client that "if you want to be exhaustive, you should ask" rather than implying a value which could be interpreted as an error condition.

Sure. Maybe even a two-state NoMore and MightHaveMore. (These names aren't great, I admit.) To be clear, the HTTP client wouldn't see this enum. I'm suggesting that Dropshot would get this enum from the Dropshot consumer (e.g., Nexus) and Dropshot would provide the page token in the response in both the DontKnow and HaveMore cases, but not the DontHaveMore case. All this is really doing over the boolean option is some combination of: making the programmer look up what these values mean so we can warn them not to do the obvious, wrong thing here; and making it easier for them to not bother with any of this so long as they don't mind the problem described in this issue.

zephraph · 2022-06-03T16:47:47Z

This came back up in the 5/27 control plane demo. I'm a little bit wary of the NoMore/MightHaveMore/DontHaveMore api additions. It feels rather non-standard and pushes extra complexity down to api consumers which I'd love to avoid if at all possible.

Is there any updated thoughts on this?

david-crespo · 2022-06-03T16:50:38Z

On the ternary model, is it correct to say that if an endpoint implements the query in such a way that it checks if there are more, the middle "maybe" option would not have to be exposed to the external consumer of that endpoint? (because that endpoint always knows whether it has more or not.) And therefore the maybe option is more about preserving the option for the Dropshot consumer to avoid implementing that check if necessary.

davepacheco · 2022-06-10T16:44:55Z

@zephraph:

It feels rather non-standard and pushes extra complexity down to api consumers which I'd love to avoid if at all possible

Can you say more about this? You mean dropshot consumers are made more complicated by the proposed NoMore/MightHaveMore enum? Is there a standard way to express this?

davepacheco · 2022-06-10T16:46:35Z

@david-crespo:

On the ternary model, is it correct to say that if an endpoint implements the query in such a way that it checks if there are more, the middle "maybe" option would not have to be exposed to the external consumer of that endpoint? (because that endpoint always knows whether it has more or not.) And therefore the maybe option is more about preserving the option for the Dropshot consumer to avoid implementing that check if necessary.

To be clear, in no case is the enum ever exposed to the external (endpoint) consumer. I think I initially suggested three variants because of the three cases where an endpoint implementor (1) went out of their way to check and knows there are more results, (2) went out of their way to check and knows there aren't more results, and (3) didn't go out of their way and doesn't know. The behavior for (1) and (3) is the same, so it could just be DefinitelyNoMore/MaybeMoreWhoKnows.

davepacheco · 2022-06-10T16:50:07Z

I think it's worth considering what we would want to do in our implementations. I think it's most standard that when somebody asks for a page of 500 results, you fetch 500 items from the database, in which case your answer here will be DontKnow. There are also standard patterns (that I think don't scale) where when somebody asks for a page of 500 results, you fetch 500 items and count the number of items that match the query, but that doesn't tell you whether there's another page of results without more information. We could built a standard pattern internally (and at one point had one, I think) where when somebody asks for a page of 500 items, we ask the database for 501 items, check how many we got, and return the first 500. I haven't seen this before but it seems correct, scalable, and adds almost no runtime cost. I'm not sure how easy it is to implement in the current code -- depends how many layers need to know about this distinction between the requested limit and the actual limit.

david-crespo · 2022-06-10T16:52:50Z

I agree the fetch N+1 version is only practical middle ground. Knowing precisely how many things there are in total is rarely useful if N is greater than some small humane number like 20 or 30, and I expect most of our lists will be bigger than that.

This was referenced Aug 6, 2020

pagination: identify when consumers provide the wrong item count #21

Open

Pagination #14

Merged

david-crespo closed this as completed Jul 23, 2021

david-crespo reopened this Jul 23, 2021

david-crespo mentioned this issue Sep 8, 2022

Mechanism for fetching previous page #436

Open

david-crespo mentioned this issue Feb 28, 2023

Tracking: Pagination oxidecomputer/omicron#2449

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pagination: scans require an extra request to fetch an empty page of items #20

pagination: scans require an extra request to fetch an empty page of items #20

davepacheco commented Aug 6, 2020

david-crespo commented Jun 30, 2021

ahl commented Jul 9, 2021 •

edited

Loading

davepacheco commented Jul 23, 2021

davepacheco commented Jul 27, 2021

smklein commented Jul 27, 2021

davepacheco commented Jul 27, 2021

zephraph commented Jun 3, 2022

david-crespo commented Jun 3, 2022

davepacheco commented Jun 10, 2022

davepacheco commented Jun 10, 2022

davepacheco commented Jun 10, 2022

david-crespo commented Jun 10, 2022

pagination: scans require an extra request to fetch an empty page of items #20

pagination: scans require an extra request to fetch an empty page of items #20

Comments

davepacheco commented Aug 6, 2020

david-crespo commented Jun 30, 2021

ahl commented Jul 9, 2021 • edited Loading

davepacheco commented Jul 23, 2021

davepacheco commented Jul 27, 2021

smklein commented Jul 27, 2021

davepacheco commented Jul 27, 2021

zephraph commented Jun 3, 2022

david-crespo commented Jun 3, 2022

davepacheco commented Jun 10, 2022

davepacheco commented Jun 10, 2022

davepacheco commented Jun 10, 2022

david-crespo commented Jun 10, 2022

ahl commented Jul 9, 2021 •

edited

Loading