Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

summary does not contain the property values #171

Closed
tloubrieu-jpl opened this issue Jul 27, 2022 · 7 comments
Closed

summary does not contain the property values #171

tloubrieu-jpl opened this issue Jul 27, 2022 · 7 comments
Assignees
Labels
B13.0 bug Something isn't working s.high High severity

Comments

@tloubrieu-jpl
Copy link
Member

🐛 Describe the bug

The list of properties is missing in the summary.

📜 To Reproduce

curl --request GET 'http://localhost:8080/collections/urn:nasa:pds:insight_rad:data_derived::7.0/products?summary-only=true'

Result:
{
"summary": {
"q": "",
"hits": 7,
"took": 25,
"start": 0,
"limit": 100,
"sort": [],
"properties": []
}
}

🕵️ Expected behavior

The properties used by the products of this collection should be listed.

📚 Version of Software Used

all

🩺 Test Data / Additional context

🏞Screenshots

🖥 System Info

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

🦄 Related requirements

⚙️ Engineering Details

@tloubrieu-jpl
Copy link
Member Author

blocks NASA-PDS/search-api-notebook#6

@al-niessner
Copy link
Contributor

@tloubrieu-jpl @jordanpadams

Filling in the properties is a problem. In order to correctly fill them in, the code would have to request all of the data from opensearch then collect only those that exist. It will make summary only inefficient in the sense that more data has to be exchanged between the Java code and opensearch.

An alternative is to return what the user asked for. In the case given, it would be a correct list since no fields were asked for only the defaults would be returned.

However, consider the case where the user literally asks for ?fields=foobar. Which options is the desired:

  1. leave properties blank because verifying them is expensive [efficient and error free]
  2. fill properties with what would be requested but not verify they are what is actually found (no foobar in any document but can still request it) [efficient but erroneous]
  3. process every document despite none of them are to be returned to the user just to build a properties list [inefficient and error free]

@jordanpadams
Copy link
Member

@al-niessner
Copy link
Contributor

@jordanpadams @tloubrieu-jpl

Yes and no. I am tired of that answer as I am sure everyone else is.

As best I understand the documentation and limited testing. Yes, the mapping can be changed on the fly but nothing is re-indexed. You have to use the reindex API for that but it also comes with caveats. You cannot reindex into an existing index. Not really a problem if one had the foresight to know that things would change or used dynamic (first instance starts the indexing which is pretty clever). If one plans on index changes, then assign an alias and use it. For instance, registry would be the alias for r1 say. Then when we update the mapping we put it into r2 then reindex r1 into r2. When all that work is done, change the alias so that registry points at r2. I am sure that comes with caveats as well like no document (old db record) inserts during reindex or something along those lines.

If you like, I can look at adding an alias to the index and use it so that we can update and move around a smidge easier in the future but if you think this is the last change we will ever need to make to the index mapping then I can keep it as is.

@tloubrieu-jpl
Copy link
Member Author

Hi @al-niessner @jordanpadams ,

What we had before
in a previous version of the code the properties were filled from the results of the current page, by building a set of unique properties as we read the products from the results.

When summary-only=true, we read the results anyway and sent the unique properties of the page.

This was inaccurate and costly, in addition this will not work if we replace summary-only with limit=0.

What we need
The content of the properties should be:

  • if fields is not set in the request, all the properties available for the request, beyond the pagination limit.
  • if fields is set, the intersection of the fields value and the properties available for the request, beyond the pagination limit.

This is a requirement to be able to get the list of properties before returning the results since otherwise the user will have no clue how to apply request criteria on the registry.

Other thoughts
Could we have the list of properties as a keyword attributes when we run harvest ? Opensearch would then build an index of the existing properties.

Note that eventually we also want faceting which is the ability to see statistics on the values of the properties. One level higher.

@al-niessner
Copy link
Contributor

@tloubrieu-jpl @jordanpadams

I do not understand the requirements:

  • if fields is not set in the request, all the properties available for the request, beyond the pagination limit.

All the fields available to the request? That would mean walking every document to build that list since not all keywords are indexed and many documents have unique elements. If dynamic were true, we could just use the index. Since dynamic cannot be true, then must build it manually at ingestion. Separate index or stuff it in registry as yet another document with a special lidvid like 1.

Do they really want the default fields that the output would use anyway? Long list for json and short list for kvp+json?

What is the pagination limit for in the requirement? Is that when summary-only existed and find the unique fields in the first limit of matching documents?

  • if fields is set, the intersection of the fields value and the properties available for the request, beyond the pagination limit.

Why intersection? We return the union. If a user asks for field=foobar with json output then get all the default fields plus foobar not just foobar. Is the requirement wrong or our implementation wrong?


Independent of your answers to the requirements, it seems like you want efficient and accurate which means adding something new during ingestion. Is this correct?

@jordanpadams
Copy link
Member

fixed by #166

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B13.0 bug Something isn't working s.high High severity
Projects
None yet
Development

No branches or pull requests

3 participants