From 303f5325f93ba30367da92fc86e3a7f97c29b32e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Wed, 17 Jul 2019 08:04:11 +0200
Subject: [PATCH] [DOCS] Updates data frame limitations for 7.3. (#407)

This PR updates the data frame limitations for version 7.3.
---
 .../en/stack/data-frames/limitations.asciidoc | 242 ++++++++++++++----
 1 file changed, 195 insertions(+), 47 deletions(-)

diff --git a/docs/en/stack/data-frames/limitations.asciidoc b/docs/en/stack/data-frames/limitations.asciidoc
index fa9f198f9..0bdae21af 100644
--- a/docs/en/stack/data-frames/limitations.asciidoc
+++ b/docs/en/stack/data-frames/limitations.asciidoc
@@ -7,74 +7,222 @@
 
 beta[]
 
-The following limitations and known problems apply to the 7.2 release of
+The following limitations and known problems apply to the 7.3 release of
 the Elastic {dataframe} feature:
 
+[float]
+[[df-compatibility-limitations]]
+=== Beta {dataframe-transforms} do not have guaranteed backwards or forwards compatibility
+
+While {dataframe-transforms} are beta, a {dataframe-transform} created in a
+previous version of the {stack} is not guaranteed to start and operate in a
+future version. Nor can support be provided for {dataframe-transform} tasks
+operating in a cluster with mixed node versions.
+Note that the output of a {dataframe-transform} is persisted to a destination
+index. This is a normal {es} index and is not affected by the beta status.
+
+[float]
+[[df-ui-limitation]]
+=== {dataframes} UI will not work during a rolling upgrade from 7.2 to 7.3
+
+If your cluster contains mixed version nodes, for example during a rolling
+upgrade from 7.2 to 7.3, and {dataframe-transforms} have been created in 7.2,
+the {dataframe} UI will not work. Wait until all nodes have been upgraded to
+7.3 before using the {dataframe} UI.
+
+
 [float]
 [[df-datatype-limitations]]
 === {dataframe-cap} data type limitation
 
 {dataframes-cap} do not (yet) support fields containing arrays – in the UI or
-the API. If you try to create one, the UI will fail to show the source index table.
+the API. If you try to create one, the UI will fail to show the source index
+table.
 
 [float]
 [[df-ccs-limitations]]
-=== {ccs-cap} limitation
+=== {ccs-cap} is not supported
 
-{ccs-cap} is not supported in 7.2 for {dataframe-transforms}.
+{ccs-cap} is not supported in 7.3 for {dataframe-transforms}.
 
 [float]
 [[df-kibana-limitations]]
-=== {kib} only displays up to 100 {dataframe-transforms}
+=== Up to 1,000 {dataframe-transforms} are supported
 
-The {kib} *Machine Learning* > *Data Frames* page lists up to 100
-{dataframe-transforms}. You can work-around this limitation by calling the
-{ref}/get-data-frame-transform.html[GET {dataframe-transforms} API]
-with the `size` parameter.
+A single cluster supports up to 1,000 {dataframe-transforms}. When you use the
+{ref}/get-data-frame-transform.html[GET {dataframe-transforms} API], the total
+`count` of transforms is returned. Use the `size` and `from` parameters to
+page through the full list, as shown in the example below.
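+
+For example, the following requests (a sketch only; adjust the page size to
+your needs) enumerate the configured transforms 100 at a time:
+
+[source, js]
+------------------------------------------------------------
+GET _data_frame/transforms?from=0&size=100
+GET _data_frame/transforms?from=100&size=100
+------------------------------------------------------------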
 
 [float]
-[[df-dateformat-limitations]]
-=== Date histogram limitation
+[[df-aggresponse-limitations]]
+=== Aggregation responses may be incompatible with destination index mappings
 
-If you use a {ref}/search-aggregations-bucket-datehistogram-aggregation.html[date
-histogram] in the `group_by` object in the create or preview {dataframe-transform}
-APIs, the defined interval and time format must have the same time fidelity.
-Otherwise, it might cause issues in the {dataframe}.
+When a {dataframe-transform} is first started, it deduces the mappings required
+for the destination index. This process is based on the field types of the
+source index and the aggregations used. If the fields are derived from
+{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[`scripted_metrics`]
+or {ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_scripts`],
+{ref}/dynamic-mapping.html[dynamic mappings] will be used. In some instances the
+deduced mappings may be incompatible with the actual data. For example, numeric
+overflows might occur or dynamically mapped fields might contain both numbers
+and strings. Check the {es} logs if you think this may have occurred. As a
+workaround, you may define custom mappings prior to starting the
+{dataframe-transform}. For example,
+{ref}/indices-create-index.html[create a custom destination index] or
+{ref}/indices-templates.html[define an index template].
+
+[float]
+[[df-batch-limitations]]
+=== Batch {dataframe-transforms} may not account for changed documents
 
-For example, if you set the `calendar_interval` of the date histogram to one minute
-(`1m`), then make sure that the `format` is `yyyy-MM-dd HH:mm` instead of
-`yyyy-MM-dd HH:00`.
+A batch {dataframe-transform} uses a
+{ref}/search-aggregations-bucket-composite-aggregation.html[composite aggregation],
+which allows efficient pagination through all buckets. Composite aggregations
+do not yet support a search context, so if the source data is changed (deleted,
+updated, or added) while the batch {dataframe} is in progress, the results may
+not include these changes.
 
 [float]
-=== Date format limitation in {dataframe-transform} destination index
-
-When you create a {dataframe-transform} that uses a `date_histogram` as a `group-by`
-and set the `interval` to `1y`, the date could be interpreted incorrectly
-in the generated date field of the destination index. The reason is that the `yyyy`
-value can be identified incorrectly as `epoch_millis`. As a workaround, using the
-API, you may define a custom destination index data format mapping prior to starting
-the {dataframe-transform}. For example:
-
-[source, json]
-------------------------------------------------------------
-"mappings" : {
-  "properties" : {
-    "custom_date" : {
-      "type" : "date",
-      "format": "yyyy"
-    }
-  }
- }
-------------------------------------------------------------
+[[df-consistency-limitations]]
+=== {cdataframe-cap} consistency does not account for deleted or updated documents
+
+While the process for {cdataframe-transforms} allows the continual recalculation
+of the {dataframe-transform} as new data is being ingested, it also has some
+limitations.
+
+Changed entities are identified only if their time field has also been updated
+and falls within the range of the check for changes. This design is suited, in
+principle, to the use case where new data is given a timestamp at the time of
+ingest.
+
+If the indices that fall within the scope of the source index pattern are
+removed, for example when deleting historical time-based indices, the composite
+aggregation performed in consecutive checkpoint processing will search over
+different source data, and entities that only existed in the deleted index will
+not be removed from the {dataframe} destination index.
+
+Depending on your use case, you may wish to recreate the {dataframe-transform}
+entirely after deletions. Alternatively, if your use case is tolerant to
+historical archiving, you may wish to include a max ingest timestamp in your
+aggregation. This allows you to exclude results that have not been recently
+updated when viewing the {dataframe} destination index, as sketched below.
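+
+A minimal sketch of such an aggregation, assuming a hypothetical
+`ingest_timestamp` field that is set when each document is ingested; it would
+sit alongside your other aggregations in the `pivot.aggregations` object of the
+transform configuration:
+
+[source, json]
+------------------------------------------------------------
+"last_ingest": {
+  "max": {
+    "field": "ingest_timestamp"
+  }
+}
+------------------------------------------------------------
+
+When querying the destination index, you can then ignore entities whose
+`last_ingest` value predates the deletion of the historical indices.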
 [float]
+[[df-deletion-limitations]]
+=== Deleting a {dataframe-transform} does not delete the {dataframe} destination index or {kib} index pattern
+
+When you delete a {dataframe-transform} using `DELETE _data_frame/transforms/<transform_id>`,
+neither the {dataframe} destination index nor the {kib} index pattern, should
+one have been created, is deleted. These objects must be deleted separately.
+
+[float]
+[[df-aggregation-page-limitations]]
+=== Handling dynamic adjustment of aggregation page size
+
+In the design of {dataframe-transforms}, control was favoured over performance:
+it is preferable for a {dataframe-transform} to take longer to complete quietly
+in the background rather than to finish quickly and take precedence in resource
+consumption.
+
+Composite aggregations are well suited for high cardinality data and enable
+pagination through the results. If a {ref}/circuit-breaker.html[circuit breaker]
+memory exception occurs when performing the composite aggregated search, the
+search is retried with a smaller number of buckets requested. This circuit
+breaker is calculated based upon all activity within the cluster, not just
+activity from {dataframe-transforms}, so the exception may be only a temporary
+resource availability issue.
+
+For a batch {dataframe-transform}, the number of buckets requested is only ever
+adjusted downwards. Lowering the value may result in a longer duration for the
+transform checkpoint to complete. For {cdataframes}, the number of buckets
+requested is reset back to its default at the start of every checkpoint and it
+is possible for circuit breaker exceptions to occur repeatedly in the {es}
+logs.
+
+The {dataframe-transform} retrieves data in batches, which means it calculates
+several buckets at once. By default, this is 500 buckets per search/index
+operation. The default can be changed using `max_page_search_size`; the minimum
+value is 10. If failures still occur once the number of buckets requested has
+been reduced to its minimum, the {dataframe-transform} will be set to a failed
+state.
+
+[float]
+[[df-dynamic-adjustments-limitations]]
+=== Handling dynamic adjustments for many terms
+
+For each checkpoint, entities are identified that have changed since the last
+time the check was performed. This list of changed entities is supplied as a
+{ref}/query-dsl-terms-query.html[terms query] to the {dataframe-transform}
+composite aggregation, one page at a time. Then updates are applied to the
+destination index for each page of entities.
+
+The page `size` is defined by `max_page_search_size`, which is also used to
+define the number of buckets returned by the composite aggregation search. The
+default value is 500; the minimum is 10.
+
+The index setting
+{ref}/index-modules.html#dynamic-index-settings[`index.max_terms_count`] defines
+the maximum number of terms that can be used in a terms query. The default value
+is 65536. If `max_page_search_size` exceeds `index.max_terms_count`, the
+transform will fail.
+
+Using smaller values for `max_page_search_size` may result in a longer duration
+for the transform checkpoint to complete. A sketch of where this value is set
+is shown below.
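+
+A minimal sketch of lowering the page size when creating a transform. The
+transform, index, and field names here are illustrative, and the example
+assumes the 7.3 create {dataframe-transform} API, where `max_page_search_size`
+sits inside the `pivot` object:
+
+[source, js]
+------------------------------------------------------------
+PUT _data_frame/transforms/example_transform
+{
+  "source": { "index": "example-source-index" },
+  "dest": { "index": "example-dest-index" },
+  "pivot": {
+    "group_by": {
+      "customer_id": { "terms": { "field": "customer_id" } }
+    },
+    "aggregations": {
+      "total_spend": { "sum": { "field": "taxful_total_price" } }
+    },
+    "max_page_search_size": 100
+  }
+}
+------------------------------------------------------------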
+
+[float]
+[[df-update-limitations]]
+=== Cannot update a {dataframe-transform}
+
+{dataframe-transform-cap} configurations cannot be updated. Delete the existing
+{dataframe-transform} and create a new one instead.
 
-{dataframes-cap} use composite aggregations to transform data. In some cases,
-composite aggregations may return responses which are not compatible with the
-mappings set for the destination index. For example "NaN", "Infinity" or possibly
-a numeric overflow. Where possible, a null response has been substituted. Please,
-check {es} logs if you think this may have occurred. As a workaround,
-using the API, you may define custom destination index mappings prior to starting
-the {dataframe-transform}.
+[float]
+[[df-scheduling-limitations]]
+=== {cdataframe-cap} scheduling limitations
+
+A {cdataframe} periodically checks for changes to source data. The functionality
+of the scheduler is currently limited to a basic periodic timer whose
+`frequency` can range from 1s to 1h. The default is 1m. This is designed to run
+little and often. When choosing a `frequency` for this timer, consider your
+ingest rate along with the impact that the {dataframe-transform} search/index
+operations have on other users in your cluster. Also note that retries occur at
+the `frequency` interval.
+
+[float]
+[[df-failed-limitations]]
+=== Handling of failed {dataframe-transforms}
+
+A failed {dataframe-transform} remains as a persistent task and should be
+handled appropriately, either by deleting it or by resolving the root cause of
+the failure and restarting it.
+
+When using the API to delete a failed {dataframe-transform}, first stop it using
+`_stop?force=true`, then delete it.
+
+When starting a failed {dataframe-transform} after the root cause has been
+resolved, the `_start?force=true` parameter must be specified.
+
+[float]
+[[df-availability-limitations]]
+=== {cdataframes-cap} may give incorrect results if documents are not yet available to search
+
+After a document is indexed, there is a very small delay until it is available
+to search.
+
+A {cdataframe-transform} periodically checks for changed entities between the
+time it last checked and `now` minus `sync.time.delay`. This time window moves
+without overlapping. If the timestamp of a recently indexed document falls
+within this time window but the document is not yet available to search, then
+the entity will not be updated.
+
+If you use a `sync.time.field` that represents the data ingest time and a zero
+second or very small `sync.time.delay`, it is more likely that this issue will
+occur.
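+
+A minimal sketch of a `sync` configuration that uses an ingest time field with
+a one minute delay, as part of the create {dataframe-transform} request body
+(the field name is illustrative and `60s` is an example value, not a
+recommendation):
+
+[source, json]
+------------------------------------------------------------
+"sync": {
+  "time": {
+    "field": "ingest_timestamp",
+    "delay": "60s"
+  }
+}
+------------------------------------------------------------
+
+A larger delay makes it less likely that recently indexed documents are missed,
+at the cost of the destination index lagging further behind the source data.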