Skip to content

Commit

Permalink
[DOCS] Adds data frame analytics overview (#383)
Browse files Browse the repository at this point in the history
  • Loading branch information
szabosteve committed Jul 17, 2019
1 parent 303f532 commit 42d24ce
Show file tree
Hide file tree
Showing 4 changed files with 98 additions and 1 deletion.
83 changes: 83 additions & 0 deletions docs/en/stack/ml/df-analytics/dfa-outlierdetection.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
[role="xpack"]
[[dfa-outlier-detection]]
== {oldetection-cap}


{oldetection-cap} is an analysis for identifying data points (outliers) whose
feature values are different from those of the normal data points in a
particular data set. Outliers may denote errors or unusual behavior.

We use unsupervised {oldetection} which means there is no need to provide a
training data set to teach {oldetection} to recognize outliers. Unsupervised
{oldetection} uses various machine learning techniques to find which data points
are unusual compared to the majority of the data points.

In the {stack}, we use an ensemble of four different distance and density based
{oldetection} methods. By default, you don't need to select the methods or
provide any parameters, but you can override the default behavior if you like.
The basic assumption of the **distance based methods** is that normal data
points – in other words, points that are not outliers – have a lot of neighbors
nearby, because we expect that in a population the majority of the data points
have similar feature values, while the minority of the data points – the
outliers – have different feature values and will, therefore, be far away from
the normal points.

//FIGURE ON DISTANCE BASED METHOD

The distance of K^th^ nearest neighbor method (`distance_kth_nn`) computes the
distance of the data point to its K^th^ nearest neighbor where K is a small
number and usually independent of the total number of data points. The higher
this distance the more the data point is an outlier.

The distance of K-nearest neighbors method (`distance_knn`) calculates the
average distance of the data points to their nearest neighbors. Points with the
largest average distance will be the most outlying.

While the results of the distance based methods are easy to interpret, their
drawback is that they don't take into account the density variations of a
data set. This is the point where **density based methods** come into the
picture, they are used for mitigating this problem. These methods take into
account not only the distance of the points to their K nearest neighbors but
also the distance of these neighbors to their neighbors.

//[role="screenshot"]
//image::ml/images/ml-densitybm.jpg["Density based method – By Chire - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=10423954"]

Based on this approach, a metric is computed called local outlier factor
(`lof`) for each data point. The higher the local outlier factor, the more
outlying is the data point.

The other density based method that {oldetection} uses is the local
distance-based outlier factor (`ldof`). Ldof is a ratio of two measures: the
first computes the average distance of the data point to its K nearest
neighbors; the second computes the average of the pairwise distances of the
neighbors themselves. Again, the higher the value the more the data point is an
outlier.

As you can see, these four algorithms work differently, so they don't always
agree on which points are outliers. By default, we use all these methods during
{oldetection}, then normalize and combine their results and give every datapoint
in the index an {olscore}. The {olscore} ranges from 0 to 1, where the higher
number represents the chance that the data point is an outlier compared to the
other data points in the index.

IMPORTANT: {oldetection-cap} is a batch analysis, it runs against your data
once. If new data comes into the index, you need to do the analysis again on the
altered data.

[discrete]
[[dfa-feature-influence]]
=== Feature influence

Besides the {olscore}, another value is calculated during {oldetection}:
the feature influence score. As we mentioned, there are multiple features of a
data point that are analyzed during {oldetection}. An influential feature is a
feature of a data point that is responsible for the point being an outlier. The
value of feature influence provides a relative ranking of features by their
contribution to a point being an outlier. Therefore, while {olscore} tells us
whether a data point is an outlier, feature influence shows which features make
the point an outlier. By doing this, this value provides context to help
understand more about the reasons for the data point being unusual and can drive
visualizations.

//FIGURE ON FEATURE INFLUENCE
15 changes: 14 additions & 1 deletion docs/en/stack/ml/df-analytics/index.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,23 @@
[partintro]
--
{dfanalytics-cap} enable you to perform different analyses of your data and
annotate it with the results.
annotate it with the results. Essentially, as part of its output, {dfanalytics}
appends the results of the analysis to the source data. By doing this, it
provides additional insights into the data. The process leaves the source index
intact, it creates a new index that contains a copy of the source data and the
annotated data. You can slice and dice the data extended with the results as you
normally do with any other data set.

IMPORTANT: Using {dfanalytics} requires source data to be structured as a two
dimensional "tabular" data structure, in other words a
{stack-ov}/ml-dataframes.html[{dataframe}].
{ref}/data-frame-apis.html[{dataframe-transforms-cap}] allow you to create
{dataframes} which can be used as the source for {dfanalytics}.

* <<dfa-outlier-detection>>
* <<ml-dfanalytics-apis>>

--

include::dfa-outlierdetection.asciidoc[]
include::api-quickref.asciidoc[]
Binary file added docs/en/stack/ml/images/ml-densitybm.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
1 change: 1 addition & 0 deletions docs/en/stack/ml/overview.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -10,3 +10,4 @@ include::buckets.asciidoc[]
include::calendars.asciidoc[]
include::rules.asciidoc[]
include::architecture.asciidoc[]

0 comments on commit 42d24ce

Please sign in to comment.