[DOCS] Adds data frame analytics overview (#383)

elastic · Jul 17, 2019 · 42d24ce
1 parent 303f532
Showing 4 changed files with 98 additions and 1 deletion.
diff --git a/docs/en/stack/ml/df-analytics/dfa-outlierdetection.asciidoc b/docs/en/stack/ml/df-analytics/dfa-outlierdetection.asciidoc
@@ -0,0 +1,83 @@
+[role="xpack"]
+[[dfa-outlier-detection]]
+== {oldetection-cap}
+
+
+{oldetection-cap} is an analysis for identifying data points (outliers) whose 
+feature values are different from those of the normal data points in a 
+particular data set. Outliers may denote errors or unusual behavior.
+
+We use unsupervised {oldetection}, which means there is no need to provide a 
+training data set to teach {oldetection} to recognize outliers. Unsupervised 
+{oldetection} uses various machine learning techniques to find which data points 
+are unusual compared to the majority of the data points.
+
+In the {stack}, we use an ensemble of four different distance- and density-based 
+{oldetection} methods. By default, you don't need to select the methods or 
+provide any parameters, but you can override the default behavior if you like. 
+The basic assumption of the **distance-based methods** is that normal data 
+points – in other words, points that are not outliers – have a lot of neighbors 
+nearby, because we expect that in a population the majority of the data points 
+have similar feature values, while the minority of the data points – the 
+outliers – have different feature values and will, therefore, be far away from 
+the normal points.
+
+//FIGURE ON DISTANCE BASED METHOD
+
+The distance of K^th^ nearest neighbor method (`distance_kth_nn`) computes the 
+distance of the data point to its K^th^ nearest neighbor where K is a small 
+number and usually independent of the total number of data points. The higher 
+this distance, the more the data point is an outlier.
+
+The distance of K-nearest neighbors method (`distance_knn`) calculates the 
+average distance of each data point to its K nearest neighbors. Points with the 
+largest average distance are the most outlying.
+
+While the results of the distance-based methods are easy to interpret, their 
+drawback is that they don't take into account the density variations of a 
+data set. This is where **density-based methods** come into the picture: they 
+mitigate this problem by taking into account not only the distance of the 
+points to their K nearest neighbors but also the distance of these neighbors 
+to their neighbors.
+
+//[role="screenshot"]
+//image::ml/images/ml-densitybm.jpg["Density based method – By Chire - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=10423954"]
+
+Based on this approach, a metric called the local outlier factor (`lof`) is 
+computed for each data point. The higher the local outlier factor, the more 
+outlying the data point is.
+
+The other density-based method that {oldetection} uses is the local 
+distance-based outlier factor (`ldof`). The `ldof` is a ratio of two measures: 
+the first is the average distance of the data point to its K nearest 
+neighbors; the second is the average of the pairwise distances of the 
+neighbors themselves. Again, the higher the value, the more the data point is 
+an outlier.
+
+As you can see, these four algorithms work differently, so they don't always 
+agree on which points are outliers. By default, we use all these methods during 
+{oldetection}, then normalize and combine their results and give every data point 
+in the index an {olscore}. The {olscore} ranges from 0 to 1, where a higher 
+score represents a greater chance that the data point is an outlier compared to 
+the other data points in the index.
+
+IMPORTANT: {oldetection-cap} is a batch analysis: it runs against your data 
+once. If new data comes into the index, you need to run the analysis again on 
+the altered data.
+
+[discrete]
+[[dfa-feature-influence]]
+=== Feature influence
+
+Besides the {olscore}, another value is calculated during {oldetection}: 
+the feature influence score. As we mentioned, there are multiple features of a 
+data point that are analyzed during {oldetection}. An influential feature is a 
+feature of a data point that is responsible for the point being an outlier. The 
+value of feature influence provides a relative ranking of features by their 
+contribution to a point being an outlier. Therefore, while {olscore} tells us 
+whether a data point is an outlier, feature influence shows which features make 
+the point an outlier. In this way, feature influence provides context that 
+helps explain why the data point is unusual, and it can drive 
+visualizations.
+
+//FIGURE ON FEATURE INFLUENCE
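
Reviewer note (not part of the commit): the distance-based measures and `ldof` described in `dfa-outlierdetection.asciidoc` above can be illustrated with a minimal sketch. This is plain Python written only from the prose definitions in this file; it is not the {stack} implementation. The helper names, the Euclidean metric, the toy data, and `K = 2` are all assumptions made for illustration. `lof` and the normalization and combination into an {olscore} are omitted because the text does not specify them.

```python
import math


def pairwise_distances(points):
    """Return the full Euclidean distance matrix for a list of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def nearest_neighbors(dist, i, k):
    """Indices of the k points closest to point i, excluding i itself."""
    others = [j for j in range(len(dist)) if j != i]
    return sorted(others, key=lambda j: dist[i][j])[:k]


def distance_kth_nn(dist, i, k):
    """Distance from point i to its k-th nearest neighbor."""
    return dist[i][nearest_neighbors(dist, i, k)[-1]]


def distance_knn(dist, i, k):
    """Average distance from point i to its k nearest neighbors."""
    neighbors = nearest_neighbors(dist, i, k)
    return sum(dist[i][j] for j in neighbors) / len(neighbors)


def ldof(dist, i, k):
    """Average distance to the k neighbors divided by the average
    pairwise distance among those neighbors (requires k >= 2)."""
    neighbors = nearest_neighbors(dist, i, k)
    pairs = [(a, b) for a in neighbors for b in neighbors if a < b]
    inner = sum(dist[a][b] for a, b in pairs) / len(pairs)
    return distance_knn(dist, i, k) / inner


# Toy data (hypothetical): four points on a unit square plus one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
dist = pairwise_distances(points)
K = 2

for name, method in [("distance_kth_nn", distance_kth_nn),
                     ("distance_knn", distance_knn),
                     ("ldof", ldof)]:
    scores = [method(dist, i, K) for i in range(len(points))]
    most_outlying = max(range(len(points)), key=scores.__getitem__)
    print(name, [round(s, 3) for s in scores],
          "most outlying:", points[most_outlying])
```

On this toy data all three measures rank the isolated point `(5, 5)` highest, which matches the intuition in the text that outliers lie far from their neighbors. The sketch computes every pairwise distance, so it is only suitable for small examples.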
diff --git a/docs/en/stack/ml/df-analytics/index.asciidoc b/docs/en/stack/ml/df-analytics/index.asciidoc
@@ -5,10 +5,23 @@
 [partintro]
 --
 {dfanalytics-cap} enable you to perform different analyses of your data and 
-annotate it with the results.
+annotate it with the results. Essentially, as part of its output, {dfanalytics} 
+appends the results of the analysis to the source data. By doing this, it 
+provides additional insights into the data. The process leaves the source index 
+intact; it creates a new index that contains a copy of the source data and the 
+annotated data. You can slice and dice the data extended with the results as you 
+normally do with any other data set.
 
+IMPORTANT: Using {dfanalytics} requires source data to be structured as a 
+two-dimensional "tabular" data structure, in other words a 
+{stack-ov}/ml-dataframes.html[{dataframe}]. 
+{ref}/data-frame-apis.html[{dataframe-transforms-cap}] allow you to create 
+{dataframes} which can be used as the source for {dfanalytics}.
+
+* <<dfa-outlier-detection>>
 * <<ml-dfanalytics-apis>>
 
 --
 
+include::dfa-outlierdetection.asciidoc[]
 include::api-quickref.asciidoc[]
diff --git a/docs/en/stack/ml/images/ml-densitybm.jpg b/docs/en/stack/ml/images/ml-densitybm.jpg
diff --git a/docs/en/stack/ml/overview.asciidoc b/docs/en/stack/ml/overview.asciidoc
@@ -10,3 +10,4 @@ include::buckets.asciidoc[]
 include::calendars.asciidoc[]
 include::rules.asciidoc[]
 include::architecture.asciidoc[]
+