-
Notifications
You must be signed in to change notification settings - Fork 257
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[DOCS] Adds data frame analytics overview (#383)
- Loading branch information
1 parent
303f532
commit 42d24ce
Showing
4 changed files
with
98 additions
and
1 deletion.
There are no files selected for viewing
83 changes: 83 additions & 0 deletions
83
docs/en/stack/ml/df-analytics/dfa-outlierdetection.asciidoc
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
[role="xpack"] | ||
[[dfa-outlier-detection]] | ||
== {oldetection-cap} | ||
|
||
|
||
{oldetection-cap} is an analysis for identifying data points (outliers) whose | ||
feature values are different from those of the normal data points in a | ||
particular data set. Outliers may denote errors or unusual behavior. | ||
|
||
We use unsupervised {oldetection} which means there is no need to provide a | ||
training data set to teach {oldetection} to recognize outliers. Unsupervised | ||
{oldetection} uses various machine learning techniques to find which data points | ||
are unusual compared to the majority of the data points. | ||
|
||
In the {stack}, we use an ensemble of four different distance and density based | ||
{oldetection} methods. By default, you don't need to select the methods or | ||
provide any parameters, but you can override the default behavior if you like. | ||
The basic assumption of the **distance based methods** is that normal data | ||
points – in other words, points that are not outliers – have a lot of neighbors | ||
nearby, because we expect that in a population the majority of the data points | ||
have similar feature values, while the minority of the data points – the | ||
outliers – have different feature values and will, therefore, be far away from | ||
the normal points. | ||
|
||
//FIGURE ON DISTANCE BASED METHOD | ||
|
||
The distance of K^th^ nearest neighbor method (`distance_kth_nn`) computes the | ||
distance of the data point to its K^th^ nearest neighbor where K is a small | ||
number and usually independent of the total number of data points. The higher | ||
this distance the more the data point is an outlier. | ||
|
||
The distance of K-nearest neighbors method (`distance_knn`) calculates the | ||
average distance of the data points to their nearest neighbors. Points with the | ||
largest average distance will be the most outlying. | ||
|
||
While the results of the distance based methods are easy to interpret, their | ||
drawback is that they don't take into account the density variations of a | ||
data set. This is the point where **density based methods** come into the | ||
picture, they are used for mitigating this problem. These methods take into | ||
account not only the distance of the points to their K nearest neighbors but | ||
also the distance of these neighbors to their neighbors. | ||
|
||
//[role="screenshot"] | ||
//image::ml/images/ml-densitybm.jpg["Density based method – By Chire - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=10423954"] | ||
|
||
Based on this approach, a metric is computed called local outlier factor | ||
(`lof`) for each data point. The higher the local outlier factor, the more | ||
outlying is the data point. | ||
|
||
The other density based method that {oldetection} uses is the local | ||
distance-based outlier factor (`ldof`). Ldof is a ratio of two measures: the | ||
first computes the average distance of the data point to its K nearest | ||
neighbors; the second computes the average of the pairwise distances of the | ||
neighbors themselves. Again, the higher the value the more the data point is an | ||
outlier. | ||
|
||
As you can see, these four algorithms work differently, so they don't always | ||
agree on which points are outliers. By default, we use all these methods during | ||
{oldetection}, then normalize and combine their results and give every datapoint | ||
in the index an {olscore}. The {olscore} ranges from 0 to 1, where the higher | ||
number represents the chance that the data point is an outlier compared to the | ||
other data points in the index. | ||
|
||
IMPORTANT: {oldetection-cap} is a batch analysis, it runs against your data | ||
once. If new data comes into the index, you need to do the analysis again on the | ||
altered data. | ||
|
||
[discrete] | ||
[[dfa-feature-influence]] | ||
=== Feature influence | ||
|
||
Besides the {olscore}, another value is calculated during {oldetection}: | ||
the feature influence score. As we mentioned, there are multiple features of a | ||
data point that are analyzed during {oldetection}. An influential feature is a | ||
feature of a data point that is responsible for the point being an outlier. The | ||
value of feature influence provides a relative ranking of features by their | ||
contribution to a point being an outlier. Therefore, while {olscore} tells us | ||
whether a data point is an outlier, feature influence shows which features make | ||
the point an outlier. By doing this, this value provides context to help | ||
understand more about the reasons for the data point being unusual and can drive | ||
visualizations. | ||
|
||
//FIGURE ON FEATURE INFLUENCE |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters