Commit

Merge branch 'main' into issue-6278-add-new-param

Dharin-shah authored Feb 10, 2024
2 parents 1ab4eb9 + f482849 commit 0867996
Showing 118 changed files with 5,745 additions and 580 deletions.
8 changes: 0 additions & 8 deletions .github/vale/styles/OpenSearch/AdverbsOfTime.yml

This file was deleted.

5 changes: 3 additions & 2 deletions .github/vale/styles/OpenSearch/SubstitutionsError.yml
Original file line number Diff line number Diff line change
@@ -23,7 +23,8 @@ swap:
'Huggingface': Hugging Face
'indices': indexes
'ingestion pipeline': ingest pipeline
'keystore': key store
'key store': keystore
'key/value': key-value
'kmeans': k-means
'kNN': k-NN
'machine-learning': machine learning
@@ -46,7 +47,7 @@ swap:
'time stamp': timestamp
'timezone': time zone
'tradeoff': trade-off
'truststore': trust store
'trust store': truststore
'U.S.': US
'web page': webpage
'web site': website
1 change: 1 addition & 0 deletions .github/vale/styles/Vocab/OpenSearch/Plugins/accept.txt
@@ -4,6 +4,7 @@ Asynchronous Search plugin
Crypto plugin
Cross-Cluster Replication plugin
Custom Codecs plugin
Flow Framework plugin
Maps plugin
Notebooks plugin
Notifications plugin
5 changes: 5 additions & 0 deletions .github/vale/styles/Vocab/OpenSearch/Words/accept.txt
@@ -20,6 +20,7 @@ Boolean
[Dd]eallocate
[Dd]eduplicates?
[Dd]eduplication
[Dd]eprovision(s|ed|ing)?
[Dd]eserialize
[Dd]eserialization
Dev
@@ -75,7 +76,9 @@ Levenshtein
[Mm]ultivalued
[Mm]ultiword
[Nn]amespace
[Oo]versamples?
pebibyte
[Pp]erformant
[Pp]luggable
[Pp]reconfigure
[Pp]refetch
@@ -103,6 +106,7 @@ pebibyte
[Ss]erverless
[Ss]harding
[Ss]ignificand
[Ss]napshott(ed|ing)
stdout
[Ss]temmers?
[Ss]ubaggregation
@@ -130,6 +134,7 @@ tebibyte
[Uu]nigram
[Uu]nnesting
[Uu]nrecovered
[Uu]nregister(s|ed|ing)?
[Uu]pdatable
[Uu]psert
[Ww]alkthrough
2 changes: 0 additions & 2 deletions .github/vale/tests/test-style-neg.md
@@ -2,8 +2,6 @@

This sentence tests Advanced Placement (AP). We should define AP before using.

Then this sentence tests adverbs of time.

This sentence tests cybersecurity.

This sentence tests dash---spacing.
2 changes: 0 additions & 2 deletions .github/vale/tests/test-style-pos.md
@@ -2,8 +2,6 @@

This sentence tests AP. AP should be defined before using.

Then, this sentence tests adverbs of time.

This sentence tests cyber security.

This sentence tests dash --- spacing.
1 change: 0 additions & 1 deletion .vale.ini
@@ -19,7 +19,6 @@ Vale.Spelling = NO
Vale.Repetition = NO
Vale.Terms = YES
OpenSearch.AcronymParentheses = YES
OpenSearch.AdverbsOfTime = YES
OpenSearch.Ampersand = YES
OpenSearch.Cyber = YES
OpenSearch.DashSpacing = YES
4 changes: 1 addition & 3 deletions STYLE_GUIDE.md
@@ -45,8 +45,7 @@ Use lowercase when referring to features, unless you are referring to a formally
* “The Notifications plugin provides a central location for all of your *notifications* from OpenSearch plugins.”
* “*Remote-backed storage* is an experimental feature. Therefore, we do not recommend the use of *remote-backed storage* in a production environment.”
* “You can take and restore *snapshots* using the snapshot API.”
* “You can use the *VisBuilder* visualization type in OpenSearch Dashboards to create data visualizations by using a drag-and-drop gesture.” (You can refer to VisBuilder alone or qualify the term with “visualization type”.)
* “As of OpenSearch 2.4, the *ML framework* only supports text-embedding models without GPU acceleration.”
* “You can use the *VisBuilder* visualization type in OpenSearch Dashboards to create data visualizations by using a drag-and-drop gesture” (You can refer to VisBuilder alone or qualify the term with “visualization type”).

#### Plugin names

@@ -344,7 +343,6 @@ We follow a slightly modified version of the _Microsoft Writing Style Guide_ gui
- Independent clauses separated by coordinating conjunctions (but, or, yet, for, and, nor, so).
- Introductory clauses, phrases, words that precede the main clause.
- Words, clauses, and phrases listed in a series. Also known as the Oxford comma.
- Skip the comma after single-word adverbs of time at the beginning of a sentence, such as *afterward*, *then*, *later*, or *subsequently*.

- An em dash (—) is the width of an uppercase M. Do not include spacing on either side. Use an em dash to set off parenthetical phrases within a sentence or set off phrases or clauses at the end of a sentence for restatement or emphasis.

10 changes: 8 additions & 2 deletions TERMS.md
@@ -291,6 +291,8 @@ Exception: *Execution* is unavoidable for third-party terms for which no alterna

**fail over (v.), failover (n.)**

**Faiss**

**file name**

**frontend (n., adj.)**
@@ -399,7 +401,11 @@ Use *just* in the sense of *just now* (as in "the resources that you just create

## K

**key store**
**keystore**

**key-value**

Not _key/value_.

**kill**

@@ -716,7 +722,7 @@ Data that's provided as part of a metric. The time value is assumed to be when t

Avoid using as a verb to refer to an action that precipitates a subsequent action. It is OK to use when referring to a feature name, such as a *trigger function* or *time-triggered architecture*. As a verb, use an alternative, such as *initiate*, *invoke*, *launch*, or *start*.

**trust store**
**truststore**

**turn on, turn off**

16 changes: 8 additions & 8 deletions _about/quickstart.md
@@ -52,9 +52,9 @@ You'll need a special file, called a Compose file, that Docker Compose uses to d
opensearch-node1 "./opensearch-docker…" opensearch-node1 running 0.0.0.0:9200->9200/tcp, 9300/tcp, 0.0.0.0:9600->9600/tcp, 9650/tcp
opensearch-node2 "./opensearch-docker…" opensearch-node2 running 9200/tcp, 9300/tcp, 9600/tcp, 9650/tcp
```
1. Query the OpenSearch REST API to verify that the service is running. You should use `-k` (also written as `--insecure`) to disable host name checking because the default security configuration uses demo certificates. Use `-u` to pass the default username and password (`admin:admin`).
1. Query the OpenSearch REST API to verify that the service is running. You should use `-k` (also written as `--insecure`) to disable hostname checking because the default security configuration uses demo certificates. Use `-u` to pass the default username and password (`admin:<custom-admin-password>`).
```bash
curl https://localhost:9200 -ku admin:admin
curl https://localhost:9200 -ku admin:<custom-admin-password>
```
Sample response:
```json
@@ -76,7 +76,7 @@ You'll need a special file, called a Compose file, that Docker Compose uses to d
"tagline" : "The OpenSearch Project: https://opensearch.org/"
}
```
1. Explore OpenSearch Dashboards by opening `http://localhost:5601/` in a web browser on the same host that is running your OpenSearch cluster. The default username is `admin` and the default password is `admin`.
1. Explore OpenSearch Dashboards by opening `http://localhost:5601/` in a web browser on the same host that is running your OpenSearch cluster. The default username is `admin` and the default password is set in your `docker-compose.yml` file in the `OPENSEARCH_INITIAL_ADMIN_PASSWORD=<custom-admin-password>` setting.

## Create an index and field mappings using sample data

@@ -100,18 +100,18 @@ Create an index and define field mappings using a dataset provided by the OpenSe
```
1. Define the field mappings with the mapping file.
```bash
curl -H "Content-Type: application/x-ndjson" -X PUT "https://localhost:9200/ecommerce" -ku admin:admin --data-binary "@ecommerce-field_mappings.json"
curl -H "Content-Type: application/x-ndjson" -X PUT "https://localhost:9200/ecommerce" -ku admin:<custom-admin-password> --data-binary "@ecommerce-field_mappings.json"
```
1. Upload the index to the bulk API.
```bash
curl -H "Content-Type: application/x-ndjson" -X PUT "https://localhost:9200/ecommerce/_bulk" -ku admin:admin --data-binary "@ecommerce.json"
curl -H "Content-Type: application/x-ndjson" -X PUT "https://localhost:9200/ecommerce/_bulk" -ku admin:<custom-admin-password> --data-binary "@ecommerce.json"
```
1. Query the data using the search API. The following command submits a query that will return documents where `customer_first_name` is `Sonya`.
```bash
curl -H 'Content-Type: application/json' -X GET "https://localhost:9200/ecommerce/_search?pretty=true" -ku admin:admin -d' {"query":{"match":{"customer_first_name":"Sonya"}}}'
curl -H 'Content-Type: application/json' -X GET "https://localhost:9200/ecommerce/_search?pretty=true" -ku admin:<custom-admin-password> -d' {"query":{"match":{"customer_first_name":"Sonya"}}}'
```
Queries submitted to the OpenSearch REST API generally return flat JSON by default. For a human-readable response body, use the query parameter `pretty=true`. For more information about `pretty` and other useful query parameters, see [Common REST parameters]({{site.url}}{{site.baseurl}}/opensearch/common-parameters/).
1. Access OpenSearch Dashboards by opening `http://localhost:5601/` in a web browser on the same host that is running your OpenSearch cluster. The default username is `admin` and the default password is `admin`.
1. Access OpenSearch Dashboards by opening `http://localhost:5601/` in a web browser on the same host that is running your OpenSearch cluster. The default username is `admin` and the password is set in your `docker-compose.yml` file in the `OPENSEARCH_INITIAL_ADMIN_PASSWORD=<custom-admin-password>` setting.
1. On the top menu bar, go to **Management > Dev Tools**.
1. In the left pane of the console, enter the following:
```json
@@ -162,4 +162,4 @@ OpenSearch will fail to start if your host's `vm.max_map_count` is too low. Revi
opensearch-node1 | ERROR: [1] bootstrap checks failed
opensearch-node1 | [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
opensearch-node1 | ERROR: OpenSearch did not exit normally - check the logs at /usr/share/opensearch/logs/opensearch-cluster.log
```
```
45 changes: 40 additions & 5 deletions _aggregations/bucket/terms.md
@@ -58,16 +58,51 @@ GET opensearch_dashboards_sample_data_logs/_search
The values are returned with the key `key`.
`doc_count` specifies the number of documents in each bucket. By default, the buckets are sorted in descending order of `doc_count`.


## Size and shard size parameters

The number of buckets returned by the `terms` aggregation is controlled by the `size` parameter, which is 10 by default.

Additionally, the coordinating node responsible for the aggregation will prompt each shard for its top unique terms. The number of buckets returned by each shard is controlled by the `shard_size` parameter. This parameter is distinct from the `size` parameter and exists as a mechanism to increase the accuracy of the bucket document counts.

For example, imagine a scenario in which the `size` and `shard_size` parameters both have a value of 3. The `terms` aggregation prompts each shard for its top three unique terms. The coordinating node aggregates the results to compute the final result. If a shard contains an object that is not included in the top three, then it won't show up in the response. However, increasing the `shard_size` value for this request will allow each shard to return a larger number of unique terms, increasing the likelihood that the coordinating node will receive all relevant results.

By default, the `shard_size` parameter is set to `size * 1.5 + 10`.

When using concurrent segment search, the `shard_size` parameter is also applied to each segment slice.

The `shard_size` parameter serves as a way to balance the performance and document count accuracy of the `terms` aggregation. Higher `shard_size` values will ensure higher document count accuracy but will result in higher memory and compute usage. Lower `shard_size` values will be more performant but will result in lower document count accuracy.
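As an illustrative sketch (the field name and values are assumptions, not part of this commit), a `terms` aggregation that raises `shard_size` to improve count accuracy might look like the following:

```json
GET opensearch_dashboards_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "top_responses": {
      "terms": {
        "field": "response.keyword",
        "size": 3,
        "shard_size": 25
      }
    }
  }
}
```

Here each shard returns its top 25 terms, even though only the top 3 buckets appear in the final response, trading memory for more accurate document counts.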

## Document count error

The response also includes two keys named `doc_count_error_upper_bound` and `sum_other_doc_count`.

The `terms` aggregation returns the top unique terms. So, if the data has many unique terms, then some of them might not appear in the results. The `sum_other_doc_count` field is the sum of the documents that are left out of the response. In this case, the number is 0 because all the unique values appear in the response.
The `terms` aggregation returns the top unique terms. Therefore, if the data contains many unique terms, then some of them might not appear in the results. The `sum_other_doc_count` field represents the sum of the documents that are excluded from the response. In this case, the number is 0 because all of the unique values appear in the response.

The `doc_count_error_upper_bound` field represents the maximum possible count for a unique value that is excluded from the final results. Use this field to estimate the margin of error for the count.

The `doc_count_error_upper_bound` value and the concept of accuracy are only applicable to aggregations using the default sort order---by document count, descending. This is because when you sort by descending document count, any terms that were not returned are guaranteed to include equal or fewer documents than those terms that were returned. Based on this, you can compute the `doc_count_error_upper_bound`.

If the `show_term_doc_count_error` parameter is set to `true`, then the `terms` aggregation will show the `doc_count_error_upper_bound` computed for each unique bucket in addition to the overall value.
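For example, a hypothetical request asking for per-bucket error bounds might be sketched as follows (the index and field are illustrative):

```json
GET opensearch_dashboards_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "top_responses": {
      "terms": {
        "field": "response.keyword",
        "show_term_doc_count_error": true
      }
    }
  }
}
```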

## The `min_doc_count` and `shard_min_doc_count` parameters

You can use the `min_doc_count` parameter to filter out any unique terms with fewer than `min_doc_count` results. The `min_doc_count` threshold is applied only after merging the results retrieved from all of the shards. Each shard is unaware of the global document count for a given term. If there is a significant difference between the top `shard_size` globally frequent terms and the top terms local to a shard, you may receive unexpected results when using the `min_doc_count` parameter.

Separately, the `shard_min_doc_count` parameter filters out unique terms with fewer than `shard_min_doc_count` results before each shard returns its terms to the coordinating node.

When using concurrent segment search, the `shard_min_doc_count` parameter is not applied to each segment slice. For more information, see the [related GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/11847).
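A sketch combining both thresholds in one request (the field name and threshold values are illustrative assumptions):

```json
GET opensearch_dashboards_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "top_responses": {
      "terms": {
        "field": "response.keyword",
        "min_doc_count": 5,
        "shard_min_doc_count": 2
      }
    }
  }
}
```

With these settings, each shard drops terms it has seen fewer than 2 times, and the coordinating node then drops merged terms with fewer than 5 total documents.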

## Collect mode

There are two collect modes available: `depth_first` and `breadth_first`. The `depth_first` collection mode expands all branches of the aggregation tree in a depth-first manner and only performs pruning after the expansion is complete.

However, when using nested `terms` aggregations, the number of buckets returned is multiplied by the cardinality of the field at each level of nesting, which can quickly produce a combinatorial explosion in the bucket count as you nest aggregations.

The `doc_count_error_upper_bound` field represents the maximum possible count for a unique value that's left out of the final results. Use this field to estimate the error margin for the count.
You can use the `breadth_first` collection mode to address this issue. In this case, pruning will be applied to the first level of the aggregation tree before it is expanded to the next level, potentially greatly reducing the number of buckets computed.

The count might not be accurate. A coordinating node that’s responsible for the aggregation prompts each shard for its top unique terms. Imagine a scenario where the `size` parameter is 3.
The `terms` aggregation requests each shard for its top 3 unique terms. The coordinating node takes each of the results and aggregates them to compute the final result. If a shard has an object that’s not part of the top 3, then it won't show up in the response.
Additionally, there is memory overhead associated with performing `breadth_first` collection, which is linearly related to the number of matching documents. This is because `breadth_first` collection works by caching and replaying the pruned set of buckets from the parent level.

This is especially true if `size` is set to a low number. Because the default size is 10, an error is unlikely to happen. If you don’t need high accuracy and want to increase the performance, you can reduce the size.
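As a minimal sketch of the collect mode described above (field names are illustrative), a nested `terms` aggregation can opt in to `breadth_first` collection on the outer level:

```json
GET opensearch_dashboards_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "top_agents": {
      "terms": {
        "field": "agent.keyword",
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "top_responses": {
          "terms": {
            "field": "response.keyword"
          }
        }
      }
    }
  }
}
```

The outer aggregation is pruned to its top buckets before the inner `terms` aggregation is expanded, avoiding the cross-product of all agent and response values.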

## Account for pre-aggregated data

5 changes: 4 additions & 1 deletion _api-reference/cluster-api/cluster-stats.md
@@ -127,7 +127,10 @@ Parameter | Type | Description
"max_bytes" : 0
},
"max_refresh_time_lag_in_millis" : 0,
"total_time_spent_in_millis" : 516
"total_time_spent_in_millis" : 516,
"pressure" : {
"total_rejections" : 0
}
},
"download" : {
"total_download_size" : {
4 changes: 3 additions & 1 deletion _api-reference/index-apis/force-merge.md
@@ -15,6 +15,8 @@ The force merge API operation forces a merge on the shards of one or more indexe

In OpenSearch, a shard is a Lucene index, which consists of _segments_ (or segment files). Segments store the indexed data. Periodically, smaller segments are merged into larger ones and the larger segments become immutable. Merging reduces the overall number of segments on each shard and frees up disk space.

OpenSearch performs background segment merges that produce segments no larger than `index.merge.policy.max_merged_segment` (the default is 5 GB).
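For illustration only, the merge ceiling is governed by an index setting; a hedged sketch of adjusting it on a hypothetical index follows (verify the setting name and default for your OpenSearch version before relying on this):

```json
PUT /my-index/_settings
{
  "index": {
    "merge": {
      "policy": {
        "max_merged_segment": "10gb"
      }
    }
  }
}
```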

## Deleted documents

When a document is deleted from an OpenSearch index, it is not deleted from the Lucene segment but is rather only marked to be deleted. When the segment files are merged, deleted documents are removed (or _expunged_). Thus, merging also frees up space occupied by documents marked as deleted.
@@ -69,7 +71,7 @@ The following table lists the available query parameters. All query parameters a
| `flush` | Boolean | Performs a flush on the indexes after the force merge. A flush ensures that the files are persisted to disk. Default is `true`. |
| `ignore_unavailable` | Boolean | If `true`, OpenSearch ignores missing or closed indexes. If `false`, OpenSearch returns an error if the force merge operation encounters missing or closed indexes. Default is `false`. |
| `max_num_segments` | Integer | The number of larger segments into which smaller segments are merged. Set this parameter to `1` to merge all segments into one segment. The default behavior is to perform the merge as necessary. |
| `only_expunge_deletes` | Boolean | If `true`, the merge operation only expunges segments containing a certain percentage of deleted documents. The percentage is 10% by default and is configurable in the `index.merge.policy.expunge_deletes_allowed` setting. Using `only_expunge_deletes` may produce segments larger than `index.merge.policy.max_merged_segment`, and those large segments may not participate in future merges. For more information, see [Deleted documents](#deleted-documents). Default is `false`. |
| `only_expunge_deletes` | Boolean | If `true`, the merge operation only expunges segments containing a certain percentage of deleted documents. The percentage is 10% by default and is configurable in the `index.merge.policy.expunge_deletes_allowed` setting. Prior to OpenSearch 2.12, `only_expunge_deletes` ignored the `index.merge.policy.max_merged_segment` setting. Starting with OpenSearch 2.12, using `only_expunge_deletes` does not produce segments larger than `index.merge.policy.max_merged_segment` (by default, 5 GB). For more information, see [Deleted documents](#deleted-documents). Default is `false`. |

#### Example request: Force merge a specific index
