
Proposed changes to search results retrieval api #187

Closed
philvarner opened this issue May 27, 2022 · 4 comments
philvarner commented May 27, 2022

Motivation:

The motivation behind these changes is to provide a better balance in the API between the desire to hide search pagination from a user and the real-world latency in executing these methods.

For example, a user who naively executes this call:

pystac_client.Client.from_file("https://planetarycomputer.microsoft.com/api/stac/v1").search().get_all_items()

will attempt to load 50 million Items into memory. Fortunately, this will likely never happen, as they won't wait for the 50,000 sequential requests required to pull them page by page, 1,000 at a time.

Proposed Changes:

  • set the default of max_items to the same value as limit, so that (ideally) only one page of results will be fetched, instead of an unlimited number. (limit is only a suggestion to the server, so it may take multiple requests to get max_items even if it's set to the same value as limit.)
  • rename get_item_collections to item_collections and get_items to items
  • deprecate ItemSearch.get_all_items - this has the potential to consume a large amount of memory, and has no benefit over the iterable get_items method.
  • deprecate ItemSearch.get_all_items_as_dict - the point of this was to return the dict structure instead of going through the (expensive) pydantic unmarshalling to get objects, but it suffers from the same problem as get_all_items of potentially pulling a large number of objects into memory
  • add ItemSearch.get_items_as_dicts() -> Iterable[Dict[str, Any]] - this will operate the same as get_items() but will return the (never unmarshalled) dict representation of the Item rather than unmarshalled pydantic Item objects
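The proposed get_items_as_dicts could be sketched roughly as below. This is a hypothetical illustration, not pystac-client code: the page-fetching machinery is stubbed out as an iterable of pages, and the function name and max_items semantics are taken from the proposal above.

```python
from typing import Any, Dict, Iterable, List, Optional

def get_items_as_dicts(
    pages: Iterable[List[Dict[str, Any]]],
    max_items: Optional[int] = None,
) -> Iterable[Dict[str, Any]]:
    """Yield raw Item dicts lazily, page by page, stopping once max_items
    have been produced -- no pydantic unmarshalling, no full list in memory."""
    yielded = 0
    for page in pages:
        for item in page:
            yield item
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return  # stop before requesting another page

# With max_items equal to the page size (the proposed default), only the
# first page of this lazy 1000-page sequence is ever consumed:
fake_pages = ([{"id": f"item-{p}-{i}"} for i in range(3)] for p in range(1000))
first = list(get_items_as_dicts(fake_pages, max_items=3))
```

Because the generator returns as soon as the quota is met, no request for the second page is ever issued, which is the latency behavior the proposal is after.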
@matthewhanson (Member)

These changes all seem reasonable to me.

Changing max_items = limit is fine, but I think we should make sure that, when using the CLI, it indicates it only fetched X of numberMatched (if available).

It makes sense to not have any function that actually gets all items, since a user can do the same thing by casting the iterator to a list.
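The point about casting can be shown in a couple of lines. The items() generator here is a hypothetical stand-in for the lazy iterator the proposal renames get_items to items:

```python
from typing import Any, Dict, Iterator

def items() -> Iterator[Dict[str, Any]]:
    # Hypothetical stand-in for ItemSearch.items(): a lazy iterator of Items.
    for i in range(5):
        yield {"id": f"item-{i}"}

# A user who really wants everything in memory opts in explicitly,
# rather than relying on a dedicated get_all_items():
all_items = list(items())
```

The memory cost is the same as get_all_items, but the user has written the materialization themselves, so the trade-off is visible at the call site.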

@gadomski (Member)

I'm 👍🏽 for all proposed solutions (and Matt's amendments), except that I'm not sure I agree with

> an ItemCollection is a proper name and shouldn't be split

I'm used to PascalCase mapping to pascal_case. To me, get_item_collections scans fine -- I would expect something like get_items_and_collections or get_items_or_collections if you were getting both somehow.

@philvarner (Collaborator, Author)

In that example, the term "Pascal case" (two words, with a space) is being stylized as PascalCase or pascal_case. I feel like there's a difference between words that are stylized like this in specs vs. multiple words that are stylized to distinguish them when concatenated in code.

How would you pascal_case the geojson geometries? MultiPolygon to multi_polygon or multipolygon, LineString to line_string or linestring, MultiLineString to multi_line_string or multilinestring?
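A purely mechanical split-on-uppercase rule, applied to the names in question, shows what the "stylized in specs" side of the argument produces. This is an illustrative sketch, not anything from pystac-client:

```python
import re

def pascal_to_snake(name: str) -> str:
    # Insert an underscore before each interior uppercase letter, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

examples = ["MultiPolygon", "LineString", "MultiLineString", "ItemCollection"]
converted = [pascal_to_snake(n) for n in examples]
```

The mechanical rule yields multi_polygon, line_string, multi_line_string, and item_collection; treating any of these as a single proper name is a judgment call the rule can't make.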

@gadomski (Member)

Fair. I'll convert to :+0: on the name change, curious if anyone else has thoughts.
