Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

As an API user, I want to know in the response how many hits are returned for an API query. #68

Closed
jordanpadams opened this issue Apr 1, 2021 · 8 comments
Assignees
Labels
B12.0 Epic p.must-have requirement the current issue is a requirement
Milestone

Comments

@jordanpadams
Copy link
Member

jordanpadams commented Apr 1, 2021

Motivation

...so that I can use that information to efficiently paginate or understand at a glance what the query returns

Additional Details

Similar to other search responses, we want to include a hits field with the number of matching products.

Acceptance Criteria

General

Given a deployed pds-registry-app with X number of products ingested
When I perform a query against any of the endpoints
Then I expect a hits field in the response with the value equal to the number of matching products returned by the registry

Specific

Given a deployed pds-registry-app with X number of products ingested
When I perform a query against the /products endpoint for all products
Then I expect a hits field in the response with value equal to X

Engineering Details

Similar to ESDIS CMR response, something like:

{
  "hits" : 2,
  "took" : 11,
  "items" : [ {
...

may vary slightly depending on the response format (e.g. JSON vs. pds4+json)

@jordanpadams jordanpadams added requirement the current issue is a requirement needs:triage labels Apr 1, 2021
@jordanpadams jordanpadams self-assigned this Apr 1, 2021
@jordanpadams jordanpadams changed the title As an API user, I want to know how many hits are returned for a query As an API user, I want to know in the response how many hits are returned for a query Apr 1, 2021
@jordanpadams jordanpadams changed the title As an API user, I want to know in the response how many hits are returned for a query As an API user, I want to know in the response how many hits are returned for an API query. Apr 1, 2021
@jordanpadams jordanpadams added this to the 06.Mary.Decker.Slaney milestone Apr 1, 2021
@jordanpadams jordanpadams removed their assignment May 19, 2021
@jordanpadams jordanpadams removed this from the 09.Valerie.Brisco milestone Jun 3, 2021
@jordanpadams jordanpadams added this to the 12.Roger.Bannister milestone Aug 11, 2021
@jordanpadams jordanpadams assigned tdddblog and unassigned tdddblog Aug 11, 2021
@jordanpadams
Copy link
Member Author

@tdddblog FYI the hits field you were talking about yesterday

@al-niessner
Copy link
Contributor

@jordanpadams @tloubrieu-jpl

Here is a problem. Some requests like /bundle/lidvid/products and /product/lidvid/bundles require double de-referencing to obtain the result. In the case of /product/lidvid/bundles it probably does not matter but for /bundle/lidvid/products the double de-referencing is going to make calculating hits expensive. Let me explain:

The first look up is the bundle from the lidvid. This gives a list of collection lids from ref_collection_lid in the bundle. These are then converted to lidvids and looked up in the registry_ref_index to get a list of results with each result containing the product_lidvids list of the products that collection contains. Therefore, to compute hits, one would have to sum up all the lengths of product_lidvids which requires traversing the entire list of results. Since each collection can have a different number of products, there are no shortcuts to walking the entire list. In turn, walking the entire list will undo the efficiency gains of registry-api-service#13.

Now, let me turn that explanation into a concrete example using pds-gamma and our notebook pds-api-client-ovirs-part1-explore-a-collection, the bundle urn:nasa:pds:orex.ovirs has some set of collection of which urn:nasa:pds:orex.ovirs:data_calibrated is one and contains a total of 279500 products. Interestingly enough, these are not returned in a single query. Instead, the query for the collection lidvid in ref_registry_index turns up a list of 559 results with each product_lidvids list being 500 lidvids in length. Remember, 500 each is not required so must walk all 559 results to compute total sum. But it gets worse. The user has to paginate to get all 279500 products so the loop is optimized to only load what it needs and stop when it has its items. For instance, if a user requests products [650, 700) then the first result product_lidvids is basically skipped and move on to the second where the 50 products are looked up and returned. Having hits means that would have to continue the iteration until the end to get full length. The same would then have to be done for each collection in a bundle. Hence, the functions will become constant time at the maximum breaking the desired requirement in registry-api-service#13.

This is a battle of conflicting requirements and will probably take some time to reach a consensus on which requirement wins. In the interim, I will return -1 hits for those places where extra time is required to compute the hits -- aka, cannot be extracted directly from the elasticsearch query.

@jordanpadams
Copy link
Member Author

jordanpadams commented Aug 13, 2021

@al-niessner copy. I think I made some sense of that. would the following help this?

@tdddblog when we ingest data into the registry via harvest, how easy/hard would it be for us to add a product count to the ref_registry_index documents?

@al-niessner
Copy link
Contributor

@jordanpadams @tloubrieu-jpl

Adding product_count to ref_registry_index does not help the problem as stated. You still have nested loops to do the computation that re not needed to process the request from start to start+limit.

@jordanpadams
Copy link
Member Author

@al-niessner copy that. so I guess I am reading the above as the problem statement, but how would you propose we change/update the registry indexes in ES to fix it? or are you saying it is un-fixable?

@tdddblog
Copy link

tdddblog commented Aug 16, 2021

@jordanpadams @al-niessner
Right now there is only "batch_id" (1, ... N) and "batch_size" (actual number of docs in a batch). There are separate counts for primary and secondary refs. I can add more fields / counts to refs index.

The new fields can look like this:

  • batch_id - batch (aka page) counter starting from 0 (I would rather start it from 0 than 1). Also can call it "batch_index"
  • max_batch_size - max number of docs per batch. I think it is 200 now.
  • batch_size - actual number of docs in a batch
  • first_doc_index - index of the first document in a batch
  • last_doc_index - index of the last document in a batch
  • total_doc_count - total product count of a given type, primary or secondary
  • reference_type - "primary" / "secondary" (this field is already there)

Examples
(1) Collection with 300 primary product refs, 2 ES docs in registry-refs index:

_id reference_type batch_id max_batch_size batch_size first_doc_index last_doc_index total_doc_count
urn:nasa:pds:test::1.0::P1 primary 0 200 200 0 199 300
urn:nasa:pds:test::1.0::P2 primary 1 200 100 200 299 300

(2) Collection with 100 primary and 300 secondary product refs, 3 ES docs in registry-refs index:

_id reference_type batch_id max_batch_size batch_size first_doc_index last_doc_index total_doc_count
urn:nasa:pds:test::1.0::P1 primary 0 200 100 0 99 100
urn:nasa:pds:test::1.0::S1 secondary 0 200 200 0 199 300
urn:nasa:pds:test::1.0::S2 secondary 1 200 100 200 299 300

I can also add total primary and secondary product refs counts to collection document in registry index.
Will that help?

@al-niessner
Copy link
Contributor

@jordanpadams

I would recommend one of two choices:

  1. Make hits as an endpoint not a parameter that is returned from a request because it requires different work than fulfilling the request in most cases.
  2. Make hits part of summary and nothing else. Summary returns no fieldsd so can use that to have minimal cost at running over all nested loops at lowest cost.

@jordanpadams
Copy link
Member Author

closed per NASA-PDS/registry-api-service#61

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B12.0 Epic p.must-have requirement the current issue is a requirement
Projects
None yet
Development

No branches or pull requests

4 participants