As a user, by default, I want to search only for the latest versions of all products on the /products/{identifier}/member-of/member-of endpoint #487

Closed
jordanpadams opened this issue Jul 11, 2024 · 10 comments · Fixed by NASA-PDS/registry#299
Assignees
Labels
B15.0 p.should-have requirement the current issue is a requirement

Comments

@jordanpadams
Member

Checked for duplicates

No - I haven't checked

πŸ§‘β€πŸ”¬ User Persona(s)

No response

💪 Motivation

...so that I can [why do you want to do this?]

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

βš™οΈ Engineering Details

No response

🎉 I&T

No response

@alexdunnjpl
Contributor

alexdunnjpl commented Jul 18, 2024

@jordanpadams could you clarify the intended behaviour when each of the following is provided as identifier?

  • a LID
    • the {latest | metadata-specified} version of the bundle referenced by {the latest | any} version of the LID?
  • a LIDVID, which is the latest extant version
    • the {latest | metadata-specified} version(s) of the bundles referenced by the LIDVID?
  • a LIDVID, which is not the latest extant version
    • the {latest | metadata-specified} version(s) of the bundles referenced by {the LIDVID | the latest version of the LIDVID}?

Could you please also provide equivalent clarification for #486 #485 #484?

@tloubrieu-jpl
Member

Maybe to keep things simple (until a user asks otherwise), we can handle the all-versions/only-latest query parameter and apply it to every level of the hierarchy where it can apply. For example:

  • latest and lid : we only get the latest bundle, the latest collections and the latest observational products
  • all and lid: we get all bundles, all collections and all observational products
  • latest and lidvid: we get the latest lidvid for the bundle, and the latest for all members
  • all and lidvid: the exact lidvid and all the versions of all the members

In this case, the all/latest parameter should apply to the identifier as it does in the /products/{identifier} end-point. To be consistent from a user's point of view and to make the behavior easier to read:

  • lid+latest --> latest lidvid
  • lid+all --> all lidvid matching
  • lidvid+latest --> latest of the related lid (@alexdunnjpl I think commit 1155708 broke this behavior, I am not too sure how matchField can work on PdsProductIdentifier, I missed it before we merged)
  • lidvid+all --> all has no effect.
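
As an illustration only, the four identifier rules in the list above could be sketched as follows. The names `resolve_identifier`, `all_versions` and `known_lidvids` are hypothetical and not taken from the API code; the only assumption about the data is the `lid::major.minor` form of a LIDVID.

```python
from typing import List, Optional, Tuple


def split_identifier(identifier: str) -> Tuple[str, Optional[str]]:
    """Split 'lid::vid' into (lid, vid); vid is None for a bare LID."""
    lid, sep, vid = identifier.partition("::")
    return lid, (vid if sep else None)


def vid_key(lidvid: str) -> Tuple[int, int]:
    """Sort key that compares versions numerically, so 1.10 sorts after 1.9."""
    major, minor = lidvid.split("::")[1].split(".")
    return int(major), int(minor)


def resolve_identifier(
    identifier: str, known_lidvids: List[str], all_versions: bool = False
) -> List[str]:
    """Hypothetical resolution of the input identifier under the proposed rules."""
    lid, vid = split_identifier(identifier)
    # Every known version of the same LID, oldest first.
    family = sorted(
        (lv for lv in known_lidvids if lv.split("::")[0] == lid), key=vid_key
    )
    if vid is not None and all_versions:
        # lidvid+all: the flag has no effect, the exact LIDVID is returned.
        return [identifier] if identifier in family else []
    if all_versions:
        # lid+all: every matching LIDVID.
        return family
    # lid+latest and lidvid+latest: the latest LIDVID of the related LID.
    return family[-1:]
```

Under this sketch a superseded LIDVID passed without the all-versions flag resolves to the newest version of its LID, which is the behavior the third bullet describes.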

@alexdunnjpl
Contributor

alexdunnjpl commented Jul 19, 2024

@tloubrieu-jpl @jordanpadams the issue here is that there isn't really a "simplest" from either an implementation standpoint or a user-intuition standpoint - there are drawbacks no matter what and an arbitrary decision is to be avoided.

If we can't state authoritatively how an endpoint should behave and why, users don't have any hope of understanding.

If we want "latest" (applied either via query parameter or lack thereof by default) to mean both

  • resolve input LID/LIDVID to the latest LIDVID, and
  • resolve result products to the latest LIDVID

that's fine, but the two behaviours aren't intuitively linked and may be desired independently, so they shouldn't be the same query parameter.

Then there's the question of support for keywords and querystrings - if you say "latest-mode" resolves outputs to the latest version of any matching product, a user can and will attempt to filter on a condition which is true for an earlier version of a product, then be confused when their results don't satisfy the condition(s) they provided. Worse yet, the current implementation of latest (subset results to products which aren't superseded) will actively hide the presence of matches if they aren't the latest version of a product. And you'll need to actually resolve every product to its latest version, which is very expensive (one query per LID in the result set).

So now your default either changes in the presence of keyword/querystring queries, or those queries are only compatible if all-version mode is specified.
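
A toy sketch of the hidden-match problem described above, using in-memory records rather than OpenSearch. The record fields (`title`, `superseded`) and the `search` helper are made up for illustration and do not reflect the registry schema.

```python
from typing import Callable, Dict, List

# Two versions of one hypothetical product; only the older one matches
# the user's filter below.
products: List[Dict] = [
    {"lidvid": "urn:nasa:pds:example::1.0", "title": "calibrated", "superseded": True},
    {"lidvid": "urn:nasa:pds:example::2.0", "title": "recalibrated", "superseded": False},
]


def search(
    docs: List[Dict], predicate: Callable[[Dict], bool], latest_only: bool = True
) -> List[str]:
    """Apply the user's filter, then optionally drop superseded versions."""
    hits = [d for d in docs if predicate(d)]
    if latest_only:
        # "Latest" implemented as "not superseded": a match on an older
        # version silently disappears from the results.
        hits = [d for d in hits if not d["superseded"]]
    return [d["lidvid"] for d in hits]
```

Here a filter on `title == "calibrated"` returns nothing in latest-only mode even though version 1.0 matches, and returns that version once `latest_only=False` is passed.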

Then, there's the fact that if the latest version of a product is orphaned (present in OpenSearch while its parents are not, which is currently the case for 2.5k products in en-prod), it will break, for example, the ancestry-based endpoints for its entire LID family unless all-versions mode is active, and will do so silently by just yielding zero hits.

I can make suggestions and attempt to justify them, but only if we're all on the same page about this being a significant problem to begin with.

The specification above doesn't really answer the question unless I'm misunderstanding something simple (it's unclear whether a given comment about LID/LIDVID refers to identifier or to the output result set, and I'm not sure what the first four bullets are supposed to describe compared to the last four bullets).

@jordanpadams
Member Author

@alexdunnjpl @tloubrieu-jpl I guess to take a step back for a second, and more generally, the purpose of all these new requirements is really just to accomplish one thing: ALL endpoints should return the latest versions of all products, by default. 99.9% of searches in the PDS will only care about the latest versions. As far as the end user is concerned, our database only contains latest versions of products, unless explicitly requested otherwise. In PDS terms, a new version of a product means "this product SUPERSEDES all past versions", in other words "all past versions are trash, use this one if you are performing science today".

The only reason we are giving access to past versions of products is for historical provenance purposes. Here are some examples of the 0.1% of use cases where provenance is needed:

  • all versions given a LID - a user has a LID from a paper, goes to /product/{identifier}, gets the latest product, sees something is wrong (hmmm this creation date is last week but this paper was written last year), and asks for all versions.
  • all versions given a LIDVID - as an archivist 25 years from now, I have a LIDVID of a data set and go to /product/{identifier}?all-versions=true because I want to see the provenance of this product.
  • latest version given a LIDVID - a user has a LIDVID for a specific version of a product, and wants to know if a newer version exists. we could just document how to do this with a LID instead of introducing more complication to the API

These are all product-specific use cases. In general, I would almost prefer we keep this as simple as possible and only support all-versions and/or latest-only from the /products endpoint. the members and members-of do not need to support this. If someone wants to know other versions of a product, they can perform a second step in their code to go looking for it.

> present in OpenSearch while its parents are not, which is currently the case for 2.5k products in en-prod

this is an outlier. we would definitely like to know about orphaned products. something we should provide monitoring for in kibana at some point. that being said, en-prod is a unique case, and we have not sufficiently trained our Operations Team to do this correctly. we can create a ticket to figure out how to track this corner case in the API, but I don't think we should design to this.

@jordanpadams jordanpadams changed the title As a user, by default, I want to search for the latest versions of all products on the /products/{identifier}/member-of/member-of endpoint unless explicitly requested As a user, by default, I want to search only for the latest versions of all products on the /products/{identifier}/member-of/member-of endpoint Jul 19, 2024
@jordanpadams
Member Author

jordanpadams commented Jul 19, 2024

@alexdunnjpl @tloubrieu-jpl I updated the requirement to remove the opportunity for all-versions at this time. We can come back and revisit at a later date if it is determined someone needs this. As I noted above, this use case is so rare that if a user really cares about versions of a specific product, they can use the /products/{identifier} endpoint.

@jordanpadams
Member Author

also apologies for not filling out this ticket in its entirety yet. I only got through a few of them to include acceptance criteria

@jordanpadams
Member Author

@alexdunnjpl @tloubrieu-jpl created #503 and #504 to clarify or muddy the waters a bit. I will fill them out with more details sometime this weekend.

For the MVP we need to get the API online ASAP, the most important tickets we implement are:

If we can support returning only the latest products across the other endpoints, then great. It seems like the implementation would be similar across endpoints (not exists(superseded_by)), but I'm sure I'm missing something.
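
For reference, the `not exists(superseded_by)` idea could look roughly like this as an OpenSearch query DSL body built in Python. This is a sketch: the field name `superseded_by` is copied from the shorthand above and may not match the real index field, and the `latest_only` helper is hypothetical.

```python
def latest_only(query: dict, superseded_field: str = "superseded_by") -> dict:
    """Wrap an existing query so that superseded documents are excluded.

    The wrapped query still has to match; any document that has a value in
    `superseded_field` is dropped by the `must_not` / `exists` clause.
    """
    return {
        "bool": {
            "must": [query],
            "must_not": [{"exists": {"field": superseded_field}}],
        }
    }
```

Wrapping an endpoint's existing query this way would keep the latest-only logic in one place, though it inherits the drawbacks raised earlier in the thread about matches on superseded versions being hidden.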

@alexdunnjpl
Contributor

alexdunnjpl commented Jul 19, 2024

LID identifier resolves to latest
LIDVID identifier resolves to specified version

products/{identifier}/versions is the only way to discover superseded versions

Superseded products are omitted from API responses by default (except /product/{lidvid} and /product/{identifier}/versions, obviously)
products/{lidvid}/*/?include-superseded=true is the only way to view superseded results in result-sets

There are no subroutes for "all versions" functionality - if that data is ever desired, users must hit products/{identifier}/versions/ to discover extant versions, then loop through those versions, making a request (ex. to ../members) for each version.
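
A minimal sketch of the client-side loop described above. The two routes come from this comment; the response shape (a `data` list whose items carry an `id`) and the injected `get_json` callable are assumptions made for illustration, not a statement about the real API responses.

```python
from typing import Callable, Dict, List


def members_for_all_versions(
    identifier: str, get_json: Callable[[str], dict]
) -> Dict[str, List[str]]:
    """Collect the members of every extant version of a product.

    `get_json` is any callable that takes a route and returns parsed JSON,
    for example a thin wrapper around an HTTP client.
    """
    # Step 1: discover extant versions of the product.
    versions = get_json(f"/products/{identifier}/versions")
    result: Dict[str, List[str]] = {}
    # Step 2: one request per version against the members route.
    for product in versions["data"]:
        lidvid = product["id"]
        members = get_json(f"/products/{lidvid}/members")
        result[lidvid] = [m["id"] for m in members["data"]]
    return result
```

The cost is one extra request per version, which seems acceptable given how rarely the all-versions case is expected to come up.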

@jordanpadams
Member Author

per some offline discussions, I think we have settled on

For all versions:

/products/{identifier}/versions

For past versions, legacy can mean a lot of things to a lot of people, maybe superseded-data=true? Per this functionality, let's table that as a separate requirement. I'm not sure how prevalent that use case will be, so I don't want us to waste too much time implementing / testing it right now. Something we can bring to the WG when it bubbles up.

@alexdunnjpl
Contributor

Updated this comment to reflect latest understanding of requirements.

@jordanpadams @tloubrieu-jpl please check and correct any errors

Projects
Status: 🏁 Done

3 participants