From 303f5325f93ba30367da92fc86e3a7f97c29b32e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Wed, 17 Jul 2019 08:04:11 +0200
Subject: [PATCH] [DOCS] Updates data frame limitations for 7.3. (#407)

This PR updates the data frame limitations for version 7.3.
---
 .../en/stack/data-frames/limitations.asciidoc | 242 ++++++++++++++----
 1 file changed, 195 insertions(+), 47 deletions(-)

diff --git a/docs/en/stack/data-frames/limitations.asciidoc b/docs/en/stack/data-frames/limitations.asciidoc
index fa9f198f9..0bdae21af 100644
--- a/docs/en/stack/data-frames/limitations.asciidoc
+++ b/docs/en/stack/data-frames/limitations.asciidoc
@@ -7,74 +7,222 @@
 
 beta[]
 
-The following limitations and known problems apply to the 7.2 release of
+The following limitations and known problems apply to the 7.3 release of
 the Elastic {dataframe} feature:
 
+[float]
+[[df-compatibility-limitations]]
+=== Beta {dataframe-transforms} do not have guaranteed backwards or forwards compatibility
+
+While {dataframe-transforms} are beta, a {dataframe-transform} created in a
+previous version of the {stack} is not guaranteed to start and operate in a
+future version. Nor can support be provided for {dataframe-transform} tasks
+operating in a cluster with mixed node versions.
+Note that the output of a {dataframe-transform} is persisted to a destination
+index. This is a normal {es} index and is not affected by the beta status.
+
+[float]
+[[df-ui-limitation]]
+=== {dataframes} UI will not work during a rolling upgrade from 7.2 to 7.3
+
+If your cluster contains mixed version nodes, for example during a rolling
+upgrade from 7.2 to 7.3, and {dataframe-transforms} have been created in 7.2,
+the {dataframe} UI will not work. Wait until all nodes have been upgraded to
+7.3 before using the {dataframe} UI.
+
+
 [float]
 [[df-datatype-limitations]]
 === {dataframe-cap} data type limitation
 
 {dataframes-cap} do not (yet) support fields containing arrays – in the UI or
-the API. If you try to create one, the UI will fail to show the source index table.
+the API. If you try to create one, the UI will fail to show the source index
+table.
 
 [float]
 [[df-ccs-limitations]]
-=== {ccs-cap} limitation
+=== {ccs-cap} is not supported
 
-{ccs-cap} is not supported in 7.2 for {dataframe-transforms}.
+{ccs-cap} is not supported in 7.3 for {dataframe-transforms}.
 
 [float]
 [[df-kibana-limitations]]
-=== {kib} only displays up to 100 {dataframe-transforms}
+=== Up to 1,000 {dataframe-transforms} are supported
 
-The {kib} *Machine Learning* > *Data Frames* page lists up to 100
-{dataframe-transforms}. You can work-around this limitation by calling the
-{ref}/get-data-frame-transform.html[GET {dataframe-transforms} API]
-with the `size` parameter.
+A single cluster supports up to 1,000 {dataframe-transforms}. When you use the
+{ref}/get-data-frame-transform.html[GET {dataframe-transforms} API], the total
+`count` of transforms is returned. Use the `size` and `from` parameters to
+page through the full list, as shown in the example below.
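+
+For example, the following requests (a sketch only; adjust the page size to
+your needs) enumerate the configured transforms 100 at a time:
+
+[source, js]
+------------------------------------------------------------
+GET _data_frame/transforms?from=0&size=100
+GET _data_frame/transforms?from=100&size=100
+------------------------------------------------------------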
 
 [float]
-[[df-dateformat-limitations]]
-=== Date histogram limitation
+[[df-aggresponse-limitations]]
+=== Aggregation responses may be incompatible with destination index mappings
 
-If you use a {ref}/search-aggregations-bucket-datehistogram-aggregation.html[date
-histogram] in the `group_by` object in the create or preview {dataframe-transform}
-APIs, the defined interval and time format must have the same time fidelity.
-Otherwise, it might cause issues in the {dataframe}.
+When a {dataframe-transform} is first started, it deduces the mappings required
+for the destination index. This process is based on the field types of the
+source index and the aggregations used. If the fields are derived from
+{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[`scripted_metrics`]
+or {ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_scripts`],
+{ref}/dynamic-mapping.html[dynamic mappings] will be used. In some instances the
+deduced mappings may be incompatible with the actual data. For example, numeric
+overflows might occur or dynamically mapped fields might contain both numbers
+and strings. Check the {es} logs if you think this may have occurred. As a
+workaround, you may define custom mappings prior to starting the
+{dataframe-transform}. For example,
+{ref}/indices-create-index.html[create a custom destination index] or
+{ref}/indices-templates.html[define an index template].
+
+[float]
+[[df-batch-limitations]]
+=== Batch {dataframe-transforms} may not account for changed documents
 
-For example, if you set the `calendar_interval` of the date histogram to one minute
-(`1m`), then make sure that the `format` is `yyyy-MM-dd HH:mm` instead of
-`yyyy-MM-dd HH:00`.
+A batch {dataframe-transform} uses a
+{ref}/search-aggregations-bucket-composite-aggregation.html[composite aggregation],
+which allows efficient pagination through all buckets. Composite aggregations
+do not yet support a search context, so if the source data is changed (deleted,
+updated, or added) while the batch {dataframe} is in progress, the results may
+not include these changes.
 
 [float]
-=== Date format limitation in {dataframe-transform} destination index
-
-When you create a {dataframe-transform} that uses a `date_histogram` as a `group-by`
-and set the `interval` to `1y`, the date could be interpreted incorrectly
-in the generated date field of the destination index. The reason is that the `yyyy`
-value can be identified incorrectly as `epoch_millis`. As a workaround, using the
-API, you may define a custom destination index data format mapping prior to starting
-the {dataframe-transform}. For example:
-
-[source, json]
-------------------------------------------------------------
-"mappings" : {
-  "properties" : {
-    "custom_date" : {
-      "type" : "date",
-      "format": "yyyy"
-    }
-  }
- }
-------------------------------------------------------------
+[[df-consistency-limitations]]
+=== {cdataframe-cap} consistency does not account for deleted or updated documents
+
+While the process for {cdataframe-transforms} allows the continual recalculation
+of the {dataframe-transform} as new data is being ingested, it also has some
+limitations.
+
+Changed entities are identified only if their time field has also been updated
+and falls within the range of the check for changes. This design is suited, in
+principle, to the use case where new data is given a timestamp at the time of
+ingest.
+
+If the indices that fall within the scope of the source index pattern are
+removed, for example when deleting historical time-based indices, the composite
+aggregation performed in consecutive checkpoint processing will search over
+different source data, and entities that only existed in the deleted index will
+not be removed from the {dataframe} destination index.
+
+Depending on your use case, you may wish to recreate the {dataframe-transform}
+entirely after deletions. Alternatively, if your use case is tolerant to
+historical archiving, you may wish to include a max ingest timestamp in your
+aggregation. This allows you to exclude results that have not been recently
+updated when viewing the {dataframe} destination index, as sketched below.
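+
+A minimal sketch of such an aggregation, assuming a hypothetical
+`ingest_timestamp` field that is set when each document is ingested; it would
+sit alongside your other aggregations in the `pivot.aggregations` object of the
+transform configuration:
+
+[source, json]
+------------------------------------------------------------
+"last_ingest": {
+  "max": {
+    "field": "ingest_timestamp"
+  }
+}
+------------------------------------------------------------
+
+When querying the destination index, you can then ignore entities whose
+`last_ingest` value predates the deletion of the historical indices.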
 [float]
+[[df-deletion-limitations]]
+=== Deleting a {dataframe-transform} does not delete the {dataframe} destination index or {kib} index pattern
+
+When you delete a {dataframe-transform} using `DELETE _data_frame/transforms/<transform_id>`,
+neither the {dataframe} destination index nor the {kib} index pattern, should
+one have been created, is deleted. These objects must be deleted separately.
+
+[float]
+[[df-aggregation-page-limitations]]
+=== Handling dynamic adjustment of aggregation page size
+
+In the design of {dataframe-transforms}, control was favoured over performance:
+it is preferable for a {dataframe-transform} to take longer to complete quietly
+in the background rather than to finish quickly and take precedence in resource
+consumption.
+
+Composite aggregations are well suited for high cardinality data and enable
+pagination through the results. If a {ref}/circuit-breaker.html[circuit breaker]
+memory exception occurs when performing the composite aggregated search, the
+search is retried with a smaller number of buckets requested. This circuit
+breaker is calculated based upon all activity within the cluster, not just
+activity from {dataframe-transforms}, so the exception may be only a temporary
+resource availability issue.
+
+For a batch {dataframe-transform}, the number of buckets requested is only ever
+adjusted downwards. Lowering the value may result in a longer duration for the
+transform checkpoint to complete. For {cdataframes}, the number of buckets
+requested is reset back to its default at the start of every checkpoint and it
+is possible for circuit breaker exceptions to occur repeatedly in the {es}
+logs.
+
+The {dataframe-transform} retrieves data in batches, which means it calculates
+several buckets at once. By default, this is 500 buckets per search/index
+operation. The default can be changed using `max_page_search_size`; the minimum
+value is 10. If failures still occur once the number of buckets requested has
+been reduced to its minimum, the {dataframe-transform} will be set to a failed
+state.
+
+[float]
+[[df-dynamic-adjustments-limitations]]
+=== Handling dynamic adjustments for many terms
+
+For each checkpoint, entities are identified that have changed since the last
+time the check was performed. This list of changed entities is supplied as a
+{ref}/query-dsl-terms-query.html[terms query] to the {dataframe-transform}
+composite aggregation, one page at a time. Then updates are applied to the
+destination index for each page of entities.
+
+The page `size` is defined by `max_page_search_size`, which is also used to
+define the number of buckets returned by the composite aggregation search. The
+default value is 500; the minimum is 10.
+
+The index setting
+{ref}/index-modules.html#dynamic-index-settings[`index.max_terms_count`] defines
+the maximum number of terms that can be used in a terms query. The default value
+is 65536. If `max_page_search_size` exceeds `index.max_terms_count`, the
+transform will fail.
+
+Using smaller values for `max_page_search_size` may result in a longer duration
+for the transform checkpoint to complete. A sketch of where this value is set
+is shown below.
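+
+A minimal sketch of lowering the page size when creating a transform. The
+transform, index, and field names here are illustrative, and the example
+assumes the 7.3 create {dataframe-transform} API, where `max_page_search_size`
+sits inside the `pivot` object:
+
+[source, js]
+------------------------------------------------------------
+PUT _data_frame/transforms/example_transform
+{
+  "source": { "index": "example-source-index" },
+  "dest": { "index": "example-dest-index" },
+  "pivot": {
+    "group_by": {
+      "customer_id": { "terms": { "field": "customer_id" } }
+    },
+    "aggregations": {
+      "total_spend": { "sum": { "field": "taxful_total_price" } }
+    },
+    "max_page_search_size": 100
+  }
+}
+------------------------------------------------------------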
+
+[float]
+[[df-update-limitations]]
+=== Cannot update a {dataframe-transform}
+
+{dataframe-transform-cap} configurations cannot be updated. Delete the existing
+{dataframe-transform} and create a new one instead.
 
-{dataframes-cap} use composite aggregations to transform data. In some cases,
-composite aggregations may return responses which are not compatible with the
-mappings set for the destination index. For example "NaN", "Infinity" or possibly
-a numeric overflow. Where possible, a null response has been substituted. Please,
-check {es} logs if you think this may have occurred. As a workaround,
-using the API, you may define custom destination index mappings prior to starting
-the {dataframe-transform}.
+[float]
+[[df-scheduling-limitations]]
+=== {cdataframe-cap} scheduling limitations
+
+A {cdataframe} periodically checks for changes to source data. The functionality
+of the scheduler is currently limited to a basic periodic timer whose
+`frequency` can range from 1s to 1h. The default is 1m. This is designed to run
+little and often. When choosing a `frequency` for this timer, consider your
+ingest rate along with the impact that the {dataframe-transform} search/index
+operations have on other users in your cluster. Also note that retries occur at
+the `frequency` interval.
+
+[float]
+[[df-failed-limitations]]
+=== Handling of failed {dataframe-transforms}
+
+A failed {dataframe-transform} remains as a persistent task and should be
+handled appropriately, either by deleting it or by resolving the root cause of
+the failure and restarting it.
+
+When using the API to delete a failed {dataframe-transform}, first stop it using
+`_stop?force=true`, then delete it.
+
+When starting a failed {dataframe-transform} after the root cause has been
+resolved, the `_start?force=true` parameter must be specified.
+
+[float]
+[[df-availability-limitations]]
+=== {cdataframes-cap} may give incorrect results if documents are not yet available to search
+
+After a document is indexed, there is a very small delay until it is available
+to search.
+
+A {cdataframe-transform} periodically checks for changed entities between the
+time it last checked and `now` minus `sync.time.delay`. This time window moves
+without overlapping. If the timestamp of a recently indexed document falls
+within this time window but the document is not yet available to search, then
+the entity will not be updated.
+
+If you use a `sync.time.field` that represents the data ingest time and a zero
+second or very small `sync.time.delay`, it is more likely that this issue will
+occur.
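+
+A minimal sketch of a `sync` configuration that uses an ingest time field with
+a one minute delay, as part of the create {dataframe-transform} request body
+(the field name is illustrative and `60s` is an example value, not a
+recommendation):
+
+[source, json]
+------------------------------------------------------------
+"sync": {
+  "time": {
+    "field": "ingest_timestamp",
+    "delay": "60s"
+  }
+}
+------------------------------------------------------------
+
+A larger delay makes it less likely that recently indexed documents are missed,
+at the cost of the destination index lagging further behind the source data.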