Add documentation for star tree index feature #8598

Merged
merged 34 commits into from
Nov 1, 2024

Conversation

bharath-techie
Contributor

@bharath-techie bharath-techie commented Oct 22, 2024

Description

This PR adds documentation for the star-tree index feature. OpenSearch RFC / Meta

Issues Resolved

Closes #8131

Version

List the OpenSearch version to which this PR applies, e.g. 2.14, 2.12--2.14, or all.

Frontend features

If you're submitting documentation for an OpenSearch Dashboards feature, add a video that shows how a user will interact with the UI step by step. A voiceover is optional.

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

@bharath-techie bharath-techie marked this pull request as draft October 22, 2024 17:11
@Naarcha-AWS Naarcha-AWS added v2.18.0 4 - Doc review PR: Doc review in progress labels Oct 22, 2024
@kolchfa-aws kolchfa-aws added the release-notes PR: Include this PR in the automated release notes label Oct 22, 2024
@bharath-techie bharath-techie force-pushed the startree branch 3 times, most recently from 1ac0829 to f3833a0 Compare October 23, 2024 09:10
@bharath-techie bharath-techie marked this pull request as ready for review October 23, 2024 09:10
Signed-off-by: Bharathwaj G <bharath78910@gmail.com>
Signed-off-by: Bharathwaj G <bharath78910@gmail.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Collaborator

@natebower natebower left a comment

@Naarcha-AWS Final comments/changes. I'd like to read lines 59-62 in star-tree-index.md before approving. Thanks!

_field-types/supported-field-types/star-tree.md (outdated, resolved)
_search-plugins/star-tree-index.md (outdated, resolved)
_search-plugins/star-tree-index.md (outdated, resolved)
_search-plugins/star-tree-index.md (outdated, resolved)
_search-plugins/star-tree-index.md (resolved)
_search-plugins/star-tree-index.md (outdated, resolved)
_search-plugins/star-tree-index.md (outdated, resolved)
_search-plugins/star-tree-index.md (outdated, resolved)
_search-plugins/star-tree-index.md (outdated, resolved)
Naarcha-AWS and others added 2 commits November 1, 2024 12:25
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Naarcha-AWS and others added 2 commits November 1, 2024 14:44
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Collaborator

@natebower natebower left a comment

@Naarcha-AWS LGTM!

@Naarcha-AWS Naarcha-AWS merged commit faa328a into opensearch-project:main Nov 1, 2024
5 checks passed

Define star-tree index mappings in the `composite` section in `mappings`.

The following example API request creates a corresponding star-tree index for all `request_aggs`. To compute metric aggregations for `request_size` and `latency` fields with queries on `port` and `status` fields, configure the following mappings:
Contributor Author

@bharath-techie bharath-techie Nov 4, 2024

How about the following:

The following example API request creates a corresponding star-tree index configuration under `request_aggs`.

"All `request_aggs`" sounds a bit confusing to me.


| Parameter | Required/Optional | Description |
| :--- | :--- | :--- |
| `name` | Required | The name of the field. The field name should be present in the `properties` section as part of the index `mapping`. Ensure that the `doc_values` setting is `enabled` for any associated fields. |
Contributor Author

I'm confused by this: `config` itself doesn't have a `name` property. Can we remove this?

Under `config`, the user can specify `ordered_dimensions`, `metrics`, `max_leaf_docs`, and `skip_star_node_creation_for_dimensions`.

Collaborator

I'll remove this. We have the definitions for max_leaf_docs and skip_star_node_creation_for_dimensions on line 193.
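
For readers following this thread, here is a minimal sketch of how the four `config` settings might fit together in a mapping body. The `composite` section, the `request_aggs` name, the field names, and the four `config` keys come from the text above. The `type` value, the `stats` key, and the field types under `properties` are assumptions made for illustration, and `missing_fields` is a hypothetical helper, not part of OpenSearch.

```python
import json

# Key names under "config" come from this thread; "type", "stats", and the
# field types under "properties" are assumptions made for illustration.
star_tree_mapping = {
    "mappings": {
        "composite": {
            "request_aggs": {
                "type": "star_tree",
                "config": {
                    "max_leaf_docs": 10000,
                    "skip_star_node_creation_for_dimensions": ["port"],
                    "ordered_dimensions": [{"name": "status"}, {"name": "port"}],
                    "metrics": [
                        {"name": "request_size", "stats": ["sum"]},
                        {"name": "latency", "stats": ["sum"]},
                    ],
                },
            }
        },
        "properties": {
            "status": {"type": "integer"},
            "port": {"type": "integer"},
            "request_size": {"type": "integer"},
            "latency": {"type": "float"},
        },
    }
}


def missing_fields(mapping, star_tree_name):
    """Return dimension or metric names that are not declared under `properties`."""
    config = mapping["mappings"]["composite"][star_tree_name]["config"]
    declared = set(mapping["mappings"]["properties"])
    referenced = [entry["name"] for entry in config["ordered_dimensions"] + config["metrics"]]
    return [name for name in referenced if name not in declared]


print(json.dumps(star_tree_mapping["mappings"]["composite"], indent=2))
print("undeclared fields:", missing_fields(star_tree_mapping, "request_aggs"))
```

The helper only checks the rule quoted in the table row above, that every referenced field name is present in `properties`.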


| Parameter | Description |
| :--- | :--- |
| `max_leaf_docs` | The maximum number of star-tree documents that a leaf node can point to. After the maximum number of documents is reached, the nodes will be split based on the value of the next dimension. Default is `10000`. A lower value will use more storage but result in faster query performance. Inversely, a higher value will use less storage but result in slower query performance. For more information, see [Star-tree indexing structure]({{site.url}}{{site.baseurl}}/search-plugins/star-tree-index/#star-tree-index-structure). |
Contributor Author

the nodes will be split based on the value of the next dimension.

How about: once a node crosses the `max_leaf_docs` threshold, child nodes are created based on the unique values of the next dimension, or something similar.
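
The splitting behavior discussed here can be sketched with a toy model. This is not the OpenSearch implementation: `build_node` and its dictionary layout are hypothetical, and the sketch only assumes that a node stays a leaf until it holds more than `max_leaf_docs` documents, after which one child is created per unique value of the next dimension.

```python
from collections import defaultdict


def build_node(docs, dimensions, max_leaf_docs, depth=0):
    """Toy model: split a node on the next dimension once it exceeds max_leaf_docs."""
    if len(docs) <= max_leaf_docs or depth == len(dimensions):
        return {"leaf": True, "docs": docs}

    # Group the documents by the unique values of the next dimension.
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[dimensions[depth]]].append(doc)

    return {
        "leaf": False,
        "split_on": dimensions[depth],
        "children": {
            value: build_node(group, dimensions, max_leaf_docs, depth + 1)
            for value, group in groups.items()
        },
    }


docs = [
    {"status": 200, "port": 8443},
    {"status": 200, "port": 5600},
    {"status": 500, "port": 8443},
]

# A low threshold produces a deeper tree; a high threshold keeps a single leaf.
deep = build_node(docs, ["status", "port"], max_leaf_docs=1)
shallow = build_node(docs, ["status", "port"], max_leaf_docs=10000)
print(deep["split_on"], sorted(deep["children"]))
print(shallow["leaf"])
```

This also shows the trade-off described in the table: a lower `max_leaf_docs` creates more nodes (more storage), while a higher value leaves more documents to scan in each leaf.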


<img src="{{site.url}}{{site.baseurl}}/images/star-tree-index.png" alt="A star-tree index containing two dimensions and two metrics" width="700">

Sorted and aggregated star-tree documents are backed by `doc_values` in an index. `doc_values` use the following pattern:
Contributor Author

How about:

"Values stored in `doc_values` use the following pattern"

`doc_values` itself is a single column, and end users generally might not understand it well either, so we need to expand on it a bit, similar to the above.


Sorted and aggregated star-tree documents are backed by `doc_values` in an index. `doc_values` use the following pattern:

- The values are sorted based on the order of their `ordered_dimension`. In the preceding image, the dimensions are determined by the `status` setting and then by the `port` for each status.
Contributor Author

The values across fields are sorted primarily by the first field in `ordered_dimensions` and then secondarily by the subsequent fields listed in `ordered_dimensions`.

Contributor Author

This isn't the exact wording, but we need to call out that the sort is based on the fields specified in `ordered_dimensions`.
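
A minimal sketch of the sort order described in this thread, assuming `status` is the first ordered dimension and `port` the second, as in the image. The sample documents and the `sum_size` metric name are made up for illustration. Documents are sorted by the dimensions in order, and rows with identical dimension values are then merged into one aggregated document.

```python
from itertools import groupby

ordered_dimensions = ["status", "port"]

# Hypothetical sample documents.
docs = [
    {"status": 500, "port": 8443, "size": 10},
    {"status": 200, "port": 8443, "size": 4},
    {"status": 200, "port": 5600, "size": 1},
    {"status": 200, "port": 8443, "size": 6},
]


def dimension_key(doc):
    """Sort key: primary sort on the first dimension, secondary on the next ones."""
    return tuple(doc[dim] for dim in ordered_dimensions)


sorted_docs = sorted(docs, key=dimension_key)

# Merge rows that share identical dimension values into one aggregated row.
star_tree_docs = [
    {**dict(zip(ordered_dimensions, key)), "sum_size": sum(d["size"] for d in group)}
    for key, group in groupby(sorted_docs, key=dimension_key)
]

for row in star_tree_docs:
    print(row)
```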


### Star nodes

Star nodes are children of non-leaf nodes that contain preaggregated records for data split after dimension removal, aggregating metrics for rows containing dimensions with identical values. These aggregated documents are then appended to the end of star-tree documents. If a document does contain a dimension with identical values, it traverses through the star node.
Contributor Author

@bharath-techie bharath-techie Nov 4, 2024

This is inaccurate because star nodes can be leaf or non-leaf nodes. Let's reword it as follows:

"""
Star nodes are special nodes that hold the aggregated data of all the other nodes in the same dimension.

  1. This helps when we need to query the aggregated value of a particular field without traversing all the nodes of that field (dimension) in the star tree.
  2. This also helps in skipping a dimension that is not part of the query.
"""

Contributor Author

The original wording itself looks good to me:

"""
There are special nodes called star nodes (*) that help in skipping non-competitive nodes and in fetching aggregated documents wherever applicable at query time.
"""


Star nodes are children of non-leaf nodes that contain preaggregated records for data split after dimension removal, aggregating metrics for rows containing dimensions with identical values. These aggregated documents are then appended to the end of star-tree documents. If a document does contain a dimension with identical values, it traverses through the star node.

The star-tree index structure diagram contains the following three examples demonstrating how a document does or does not traverse star-tree nodes (indicated by the `*` symbol in the diagram) during a `Term` query, based on the average request size of the query and whether the document contains matching dimensions:
Contributor Author

We should be talking about the query algorithm here:

"""
The star-tree index structure diagram contains the following three examples demonstrating how the query algorithm traverses the star tree to get the results:
"""


The star-tree index structure diagram contains the following three examples demonstrating how a document does or does not traverse star-tree nodes (indicated by the `*` symbol in the diagram) during a `Term` query, based on the average request size of the query and whether the document contains matching dimensions:

- When the port equals `8443` and the status equals `200`. Because the status equals `200`, the query does not traverse through a star node, and the aggregated metric is stored at the end of a star-tree document.
Contributor Author

Let's not talk about how a star node is or isn't traversed. Let's keep the wording about how the query algorithm traverses the star tree as a whole for a particular query. The image clearly shows the path taken by the query.

Something similar to the original wording:

 - Compute the average request size aggregation with a Terms query where port equals 8443 and status equals 200. The query visits the actual nodes for the 8443 and 200 values. (Support for the Terms query will be added in an upcoming release; see https://github.com/opensearch-project/OpenSearch/issues/15257)
 - Compute the count of requests aggregation with a Term query where status equals 200. The query traverses the * node of the port dimension because port is not part of the query.
 - Compute the average request size aggregation with a Term query where port equals 5600. The query traverses the * node of the status dimension because status is not part of the query.

The second and third examples use star nodes.

- When the status equals `200`. The query traverses through a star node in the `port` dimension because `port` is not present as part of the query.
- When the port equals `5600`. The query traverses through a star node in the `status` dimension because `status` is not present as part of the query.
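
The three traversal examples above can be sketched with a toy lookup table. This is a simplification and not how OpenSearch stores the tree: instead of real nodes, `build_index` (a hypothetical helper) pre-aggregates every `(status, port)` path, using `*` for a dimension that is skipped. A query then substitutes `*` for any dimension it does not filter on. The sample documents are made up.

```python
from collections import defaultdict

STAR = "*"
DIMENSIONS = ["status", "port"]


def build_index(docs):
    """Pre-aggregate sum and count of `size` for every path, including star paths."""
    index = defaultdict(lambda: {"sum": 0, "count": 0})
    for doc in docs:
        for status_key in (doc["status"], STAR):
            for port_key in (doc["port"], STAR):
                cell = index[(status_key, port_key)]
                cell["sum"] += doc["size"]
                cell["count"] += 1
    return dict(index)


def lookup(index, **filters):
    """Follow the path for the query, taking the star node for unfiltered dimensions."""
    path = tuple(filters.get(dim, STAR) for dim in DIMENSIONS)
    return index.get(path, {"sum": 0, "count": 0})


docs = [
    {"status": 200, "port": 8443, "size": 4},
    {"status": 200, "port": 8443, "size": 6},
    {"status": 200, "port": 5600, "size": 1},
    {"status": 500, "port": 8443, "size": 10},
    {"status": 500, "port": 5600, "size": 3},
]
index = build_index(docs)

# 1. port=8443 and status=200: visits the concrete nodes only.
both = lookup(index, status=200, port=8443)
# 2. status=200: port is not in the query, so the port star node is used.
by_status = lookup(index, status=200)
# 3. port=5600: status is not in the query, so the status star node is used.
by_port = lookup(index, port=5600)

print(both["sum"] / both["count"], by_status["count"], by_port["sum"] / by_port["count"])
```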

Support for the `Term` query will be added in a future version. For more information, see [GitHub issue #15257](https://github.com/opensearch-project/OpenSearch/issues/15257).
Contributor Author

No, this should be the Terms query. The Term query is already supported.

Contributor Author

Maybe we can say that an individual term query is already supported and that support for a multiple-terms query will be added, to avoid confusing the user.

Collaborator

We can just switch to terms here.


## Example mapping

In the following example, index mappings define the star-tree configuration. This star-tree index precomputes aggregations in the `log` index. The aggregations are calculated using the `size` and `latency` fields for all the combinations of values indexed in the `port` and `status` fields:
Contributor Author

nit: `logs` index


### Aggregation example

The following example gets the sum of the `size` field for all error logs with `status=500`, using the [example mapping](#example-mapping):
Contributor Author

Can we reword this to something similar to:

The following example gets the sum of all the values in the `size` field for all error logs with `status=500`, using the example mapping:


With the star-tree index, the result will be retrieved from a single aggregated document as it traverses to the `status=500` node, as opposed to scanning through all of the matching documents. This results in lower query latency.
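
A minimal sketch of the request body this example describes, built as a Python dictionary. The `logs` index name, the `size` and `status` fields, and the `500` filter come from the thread above. The aggregation name `sum_request_size` and the helper `sum_size_for_status` are hypothetical, and the snippet only prints the request rather than sending it to a cluster.

```python
import json


def sum_size_for_status(status):
    """Build a search body: filter on one status value and sum the `size` field."""
    return {
        "size": 0,  # return only the aggregation result, no hits
        "query": {"term": {"status": status}},
        "aggs": {"sum_request_size": {"sum": {"field": "size"}}},
    }


body = sum_size_for_status(500)
print("POST /logs/_search")
print(json.dumps(body, indent=2))
```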

## Using queries with a star-tree index
Contributor Author

Should this be "without a star-tree index"?


Star-tree indexes can be used to optimize queries and aggregations.

### Supported queries

This section gives the impression that these query shapes are supported independently.
Ideally, this should be a subsection within Supported Aggregations itself.

Basically, with the supported aggregations below, one can also add a term query to a search request.

epugh pushed a commit to o19s/documentation-website that referenced this pull request Nov 23, 2024
* Adding documentation for star tree index feature

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* addressing comments

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* addressing comments

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* fixes and addressing comments

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* addressing comments

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* addressing comments

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* addressing comments

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* fixing json

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* fixing json

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* addressing comments

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* addressing comments

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>

* Add edits for star tree field page

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Add index edit

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update improving-search-performance.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update star-tree-index.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update star-tree.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _field-types/supported-field-types/star-tree.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update star-tree-index.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: Bharathwaj G <bharath78910@gmail.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Eric Pugh <epugh@opensourceconnections.com>
Labels
5 - Editorial review PR: Editorial review in progress release-notes PR: Include this PR in the automated release notes v2.18.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DOC] Star tree index feature documentation
8 participants