Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve inconsistency between dplyr and galah implementations of collapse() and compute() #217

Closed
mjwestgate opened this issue Dec 4, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@mjwestgate
Copy link
Collaborator

As of version 2.0, you can use collapse() to convert a data_request into a query_set. A query_set often contains >1 query, with the final query being the one requested by the user, and earlier querys being needed to check that it is valid. Finally, calling compute() evaluates the query_set into a single, valid URL:

x <- galah_call() |>
    filter(year == 2010) |>
    count() |>
    collapse()
x
Object of class `query_set` containing 3 queries:
• metadata/fields data: galah:::check_internal_cache()$fields
• metadata/assertions data: galah:::check_internal_cache()$assertions
• data/occurrences-count url: https://biocache-ws.ala.org.au/ws/occurrences/sea...

compute(x)$url
[1] "https://biocache-ws.ala.org.au/ws/occurrences/search?fq=%28year%3A%222010%22%29&disableAllQualityFilters=true&pageSize=0"

For occurrence queries, however, compute() is actually doing three things:

  • evaluating (some) API calls to check the request is valid
  • building a valid url
  • sending that API call to the selected atlas

This is not ideal, because it means that for occurrence queries, it is impossible to interrogate the URL before sending, or to see it once it has been sent. Fundamentally, this problem occurs because of a conflict between the principle that collapse() shouldn't ping an API, and the need in galah to run checks before building a URL. I see two possible solutions:

  • remove the limitation on collapse() to not hit any APIs; then collapse() would always return a valid query (or error if needed), but would also run checks
  • add a function between collapse() and compute() (perhaps precompute()?) to build a single URL, but not push the request to the selected atlas

My instinct is that the former is neater, though it does change some behaviour from version 2.0. For context, this is what dplyr says about these functions:

compute() stores results in a remote temporary table. collect() retrieves data 
into a local tibble. collapse() is slightly different: it doesn't force computation, 
but instead forces generation of the SQL query. This is sometimes needed to 
work around bugs in dplyr's SQL generation.
@mjwestgate mjwestgate added the enhancement New feature or request label Dec 4, 2023
@mjwestgate
Copy link
Collaborator Author

Decision on 17-01-2024:

  • collapse() will now return a query object, which by default has slots type and url
    • i.e. previous behaviour of compute()
  • collapse() also gains a new .expand (?) argument
    • when TRUE, this appends the query_set to an additional slot (possible called calls?) to support debugging
  • compute() has reduced functionality under this model
    • mainly acts as a pass-trough function between collapse() and compute()
    • compute() is still critical for type = "occurrences", where it submits the query (but doesn't wait for a response)

mjwestgate added a commit that referenced this issue Jan 23, 2024
- move evaluation of lazy data and metadata APIs to `parse_` prefix
- add new class `evaluated_query()` to distinguish between outputs from `collapse()` and `compute()`
@mjwestgate
Copy link
Collaborator Author

Implemented in v 2.0.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant