[DOCS] Updates anomaly detection terminology in Stack Overview #44888

Merged · 1 commit · Jul 26, 2019
6 changes: 3 additions & 3 deletions docs/reference/ml/anomaly-detection/aggregations.asciidoc
@@ -4,7 +4,7 @@

By default, {dfeeds} fetch data from {es} using search and scroll requests.
It can be significantly more efficient, however, to aggregate data in {es}
and to configure your jobs to analyze aggregated data.
and to configure your {anomaly-jobs} to analyze aggregated data.

One of the benefits of aggregating data this way is that {es} automatically
distributes these calculations across your cluster. You can then feed this
@@ -19,8 +19,8 @@ of the last record in the bucket. If you use a terms aggregation and the
cardinality of a term is high, then the aggregation might not be effective and
you might want to just use the default search and scroll behavior.

When you create or update a job, you can include the names of aggregations, for
example:
When you create or update an {anomaly-job}, you can include the names of
aggregations, for example:

[source,js]
----------------------------------
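For illustration (not part of this change), a {dfeed} that supplies aggregated data to an {anomaly-job} might be sketched as follows. The job, index, and field names are assumptions; the nested `max` aggregation on the time field is what lets the job line up its buckets with the source data.

[source,js]
----------------------------------
PUT _ml/datafeeds/datafeed-my-aggregated-job
{
  "job_id": "my-aggregated-job",
  "indices": ["my-metrics-index"],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "time",
        "fixed_interval": "360s",
        "time_zone": "UTC"
      },
      "aggregations": {
        "time": {
          "max": { "field": "time" }
        },
        "responsetime": {
          "avg": { "field": "responsetime" }
        }
      }
    }
  }
}
----------------------------------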
6 changes: 3 additions & 3 deletions docs/reference/ml/anomaly-detection/categories.asciidoc
@@ -68,8 +68,8 @@ we do not want the detailed SQL to be considered in the message categorization.
This particular categorization filter removes the SQL statement from the categorization
algorithm.

If your data is stored in {es}, you can create an advanced job with these same
properties:
If your data is stored in {es}, you can create an advanced {anomaly-job} with
these same properties:

[role="screenshot"]
image::images/ml-category-advanced.jpg["Advanced job configuration options related to categorization"]
@@ -209,7 +209,7 @@ letters in tokens whereas the `ml_classic` tokenizer does, although that could
be fixed by using more complex regular expressions.

For more information about the `categorization_analyzer` property, see
{ref}/ml-job-resource.html#ml-categorizationanalyzer[Categorization Analyzer].
{ref}/ml-job-resource.html#ml-categorizationanalyzer[Categorization analyzer].

NOTE: To add the `categorization_analyzer` property in {kib}, you must use the
**Edit JSON** tab and copy the `categorization_analyzer` object from one of the
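As a rough sketch (not taken from this change, with illustrative names), the same categorization setup can be expressed directly in an {anomaly-job} configuration: a categorization filter strips the SQL statement from each message before the messages are categorized.

[source,js]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "description": "Hypothetical categorization job",
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "categorization_filters": ["\\[statement:.*\\]"],
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory"
      }
    ]
  },
  "data_description": {
    "time_field": "time"
  }
}
----------------------------------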
4 changes: 2 additions & 2 deletions docs/reference/ml/anomaly-detection/configuring.asciidoc
@@ -7,8 +7,8 @@ your cluster and all master-eligible nodes must have {ml} enabled. By default,
all nodes are {ml} nodes. For more information about these settings, see
{ref}/modules-node.html#ml-node[{ml} nodes].

To use the {ml-features} to analyze your data, you must create a job and
send your data to that job.
To use the {ml-features} to analyze your data, you can create an {anomaly-job}
and send your data to that job.

* If your data is stored in {es}:

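A bare-bones sketch of that workflow with the APIs (all names are hypothetical): create the {anomaly-job}, then point a {dfeed} or the post data API at it.

[source,js]
----------------------------------
PUT _ml/anomaly_detectors/my-first-job
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "mean",
        "field_name": "responsetime"
      }
    ]
  },
  "data_description": {
    "time_field": "timestamp"
  }
}
----------------------------------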
23 changes: 12 additions & 11 deletions docs/reference/ml/anomaly-detection/customurl.asciidoc
@@ -2,17 +2,17 @@
[[ml-configuring-url]]
=== Adding custom URLs to machine learning results

When you create an advanced job or edit any job in {kib}, you can optionally
attach one or more custom URLs.
When you create an advanced {anomaly-job} or edit any {anomaly-jobs} in {kib},
you can optionally attach one or more custom URLs.

The custom URLs provide links from the anomalies table in the *Anomaly Explorer*
or *Single Metric Viewer* window in {kib} to {kib} dashboards, the *Discovery*
page, or external websites. For example, you can define a custom URL that
provides a way for users to drill down to the source data from the results set.

When you edit a job in {kib}, it simplifies the creation of the custom URLs for
{kib} dashboards and the *Discover* page and it enables you to test your URLs.
For example:
When you edit an {anomaly-job} in {kib}, it simplifies the creation of the
custom URLs for {kib} dashboards and the *Discover* page and it enables you to
test your URLs. For example:

[role="screenshot"]
image::images/ml-customurl-edit.jpg["Edit a job to add a custom URL"]
@@ -29,7 +29,8 @@ As in this case, the custom URL can contain
are populated when you click the link in the anomalies table. In this example,
the custom URL contains `$earliest$`, `$latest$`, and `$service$` tokens, which
pass the beginning and end of the time span of the selected anomaly and the
pertinent `service` field value to the target page. If you were interested in the following anomaly, for example:
pertinent `service` field value to the target page. If you were interested in
the following anomaly, for example:

[role="screenshot"]
image::images/ml-customurl.jpg["An example of the custom URL links in the Anomaly Explorer anomalies table"]
@@ -43,8 +44,8 @@ image::images/ml-customurl-discover.jpg["An example of the results on the Discov
Since we specified a time range of 2 hours, the time filter restricts the
results to the time period two hours before and after the anomaly.

You can also specify these custom URL settings when you create or update jobs by
using the {ml} APIs.
You can also specify these custom URL settings when you create or update
{anomaly-jobs} by using the APIs.
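As a sketch, attaching a custom URL with the update job API could look like the following; the job name, target URL, and `time_range` value are placeholders.

[source,js]
----------------------------------
POST _ml/anomaly_detectors/my-job/_update
{
  "custom_settings": {
    "custom_urls": [
      {
        "url_name": "Service details",
        "url_value": "http://my.webserver/queries?service=$service$&from=$earliest$&to=$latest$",
        "time_range": "2h"
      }
    ]
  }
}
----------------------------------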

[float]
[[ml-configuring-url-strings]]
@@ -74,9 +75,9 @@ time as the earliest and latest times. The same is also true if the interval is
set to `Auto` and a one hour interval was chosen. You can override this behavior
by using the `time_range` setting.

The `$mlcategoryregex$` and `$mlcategoryterms$` tokens pertain to jobs where you
are categorizing field values. For more information about this type of analysis,
see <<ml-configuring-categories>>.
The `$mlcategoryregex$` and `$mlcategoryterms$` tokens pertain to {anomaly-jobs}
where you are categorizing field values. For more information about this type of
analysis, see <<ml-configuring-categories>>.

The `$mlcategoryregex$` token passes the regular expression value of the
category of the selected anomaly, as identified by the value of the `mlcategory`
@@ -22,8 +22,8 @@ functions are not really affected. In these situations, it all comes out okay in
the end as the delayed data is distributed randomly. An example would be a `mean`
metric for a field in a large collection of data. In this case, checking for
delayed data may not provide much benefit. If data are consistently delayed,
however, jobs with a `low_count` function may provide false positives. In this
situation, it would be useful to see if data comes in after an anomaly is
however, {anomaly-jobs} with a `low_count` function may provide false positives.
In this situation, it would be useful to see if data comes in after an anomaly is
recorded so that you can determine a next course of action.

==== How do we detect delayed data?
@@ -35,11 +35,11 @@ Every 15 minutes or every `check_window`, whichever is smaller, the datafeed
triggers a document search over the configured indices. This search looks over a
time span with a length of `check_window` ending with the latest finalized bucket.
That time span is partitioned into buckets, whose length equals the bucket span
of the associated job. The `doc_count` of those buckets are then compared with
the job's finalized analysis buckets to see whether any data has arrived since
the analysis. If there is indeed missing data due to their ingest delay, the end
user is notified. For example, you can see annotations in {kib} for the periods
where these delays occur.
of the associated {anomaly-job}. The `doc_count` of those buckets are then
compared with the job's finalized analysis buckets to see whether any data has
arrived since the analysis. If there is indeed missing data due to their ingest
delay, the end user is notified. For example, you can see annotations in {kib}
for the periods where these delays occur.
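The check is driven by the datafeed's `delayed_data_check_config`. A minimal sketch of enabling it on an existing datafeed (the datafeed name is assumed) looks like this; `check_window` should be long enough to cover your typical ingest delay.

[source,js]
----------------------------------
POST _ml/datafeeds/datafeed-my-job/_update
{
  "delayed_data_check_config": {
    "enabled": true,
    "check_window": "2h"
  }
}
----------------------------------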

==== What to do about delayed data?

41 changes: 21 additions & 20 deletions docs/reference/ml/anomaly-detection/detector-custom-rules.asciidoc
@@ -16,17 +16,18 @@ Let us see how those can be configured by examples.

==== Specifying custom rule scope

Let us assume we are configuring a job in order to detect DNS data exfiltration.
Our data contain fields "subdomain" and "highest_registered_domain".
We can use a detector that looks like `high_info_content(subdomain) over highest_registered_domain`.
If we run such a job it is possible that we discover a lot of anomalies on
frequently used domains that we have reasons to trust. As security analysts, we
are not interested in such anomalies. Ideally, we could instruct the detector to
skip results for domains that we consider safe. Using a rule with a scope allows
us to achieve this.
Let us assume we are configuring an {anomaly-job} in order to detect DNS data
exfiltration. Our data contain fields "subdomain" and "highest_registered_domain".
We can use a detector that looks like
`high_info_content(subdomain) over highest_registered_domain`. If we run such a
job, it is possible that we discover a lot of anomalies on frequently used
domains that we have reasons to trust. As security analysts, we are not
interested in such anomalies. Ideally, we could instruct the detector to skip
results for domains that we consider safe. Using a rule with a scope allows us
to achieve this.

First, we need to create a list of our safe domains. Those lists are called
_filters_ in {ml}. Filters can be shared across jobs.
_filters_ in {ml}. Filters can be shared across {anomaly-jobs}.

We create our filter using the {ref}/ml-put-filter.html[put filter API]:

@@ -41,8 +42,8 @@ PUT _ml/filters/safe_domains
// CONSOLE
// TEST[skip:needs-licence]

Now, we can create our job specifying a scope that uses the `safe_domains`
filter for the `highest_registered_domain` field:
Now, we can create our {anomaly-job} specifying a scope that uses the
`safe_domains` filter for the `highest_registered_domain` field:

[source,js]
----------------------------------
@@ -139,8 +140,8 @@ example, 0.02. Given our knowledge about how CPU utilization behaves we might
determine that anomalies with such small actual values are not interesting for
investigation.

Let us now configure a job with a rule that will skip results where CPU
utilization is less than 0.20.
Let us now configure an {anomaly-job} with a rule that will skip results where
CPU utilization is less than 0.20.

[source,js]
----------------------------------
@@ -214,18 +215,18 @@ PUT _ml/anomaly_detectors/rule_with_range
==== Custom rules in the life-cycle of a job

Custom rules only affect results created after the rules were applied.
Let us imagine that we have configured a job and it has been running
Let us imagine that we have configured an {anomaly-job} and it has been running
for some time. After observing its results we decide that we can employ
rules in order to get rid of some uninteresting results. We can use
the {ref}/ml-update-job.html[update job API] to do so. However, the rule we
added will only be in effect for any results created from the moment we added
the rule onwards. Past results will remain unaffected.
the {ref}/ml-update-job.html[update {anomaly-job} API] to do so. However, the
rule we added will only be in effect for any results created from the moment we
added the rule onwards. Past results will remain unaffected.
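As a sketch, adding such a rule to an existing detector with the update API might look like the following; the job name, detector index, and threshold are illustrative.

[source,js]
----------------------------------
POST _ml/anomaly_detectors/my_cpu_job/_update
{
  "detectors": [
    {
      "detector_index": 0,
      "custom_rules": [
        {
          "actions": ["skip_result"],
          "conditions": [
            {
              "applies_to": "actual",
              "operator": "lt",
              "value": 0.20
            }
          ]
        }
      ]
    }
  ]
}
----------------------------------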

==== Using custom rules VS filtering data
==== Using custom rules vs. filtering data

It might appear like using rules is just another way of filtering the data
that feeds into a job. For example, a rule that skips results when the
partition field value is in a filter sounds equivalent to having a query
that feeds into an {anomaly-job}. For example, a rule that skips results when
the partition field value is in a filter sounds equivalent to having a query
that filters out such documents. But it is not. There is a fundamental
difference. When the data is filtered before reaching a job it is as if they
never existed for the job. With rules, the data still reaches the job and
12 changes: 6 additions & 6 deletions docs/reference/ml/anomaly-detection/functions.asciidoc
@@ -5,10 +5,10 @@
The {ml-features} include analysis functions that provide a wide variety of
flexible ways to analyze data for anomalies.

When you create jobs, you specify one or more detectors, which define the type of
analysis that needs to be done. If you are creating your job by using {ml} APIs,
you specify the functions in
{ref}/ml-job-resource.html#ml-detectorconfig[Detector Configuration Objects].
When you create {anomaly-jobs}, you specify one or more detectors, which define
the type of analysis that needs to be done. If you are creating your job by
using {ml} APIs, you specify the functions in
{ref}/ml-job-resource.html#ml-detectorconfig[Detector configuration objects].
If you are creating your job in {kib}, you specify the functions differently
depending on whether you are creating single metric, multi-metric, or advanced
jobs.
@@ -24,8 +24,8 @@ You can specify a `summary_count_field_name` with any function except `metric`.
When you use `summary_count_field_name`, the {ml} features expect the input
data to be pre-aggregated. The value of the `summary_count_field_name` field
must contain the count of raw events that were summarized. In {kib}, use the
**summary_count_field_name** in advanced jobs. Analyzing aggregated input data
provides a significant boost in performance. For more information, see
**summary_count_field_name** in advanced {anomaly-jobs}. Analyzing aggregated
input data provides a significant boost in performance. For more information, see
<<ml-configuring-aggregation>>.
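A sketch of an `analysis_config` that uses pre-aggregated input (the job and field names are assumptions; with an aggregated {dfeed} the count field is typically `doc_count`):

[source,js]
----------------------------------
PUT _ml/anomaly_detectors/pre_summarized_example
{
  "analysis_config": {
    "bucket_span": "10m",
    "summary_count_field_name": "doc_count",
    "detectors": [
      {
        "function": "mean",
        "field_name": "responsetime"
      }
    ]
  },
  "data_description": {
    "time_field": "time"
  }
}
----------------------------------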

If your data is sparse, there may be gaps in the data which means you might have
37 changes: 19 additions & 18 deletions docs/reference/ml/anomaly-detection/functions/count.asciidoc
@@ -40,7 +40,7 @@ These functions support the following properties:
* `partition_field_name` (optional)

For more information about those properties,
see {ref}/ml-job-resource.html#ml-detectorconfig[Detector Configuration Objects].
see {ref}/ml-job-resource.html#ml-detectorconfig[Detector configuration objects].

.Example 1: Analyzing events with the count function
[source,js]
@@ -65,8 +65,9 @@ This example is probably the simplest possible analysis. It identifies
time buckets during which the overall count of events is higher or lower than
usual.

When you use this function in a detector in your job, it models the event rate
and detects when the event rate is unusual compared to its past behavior.
When you use this function in a detector in your {anomaly-job}, it models the
event rate and detects when the event rate is unusual compared to its past
behavior.

.Example 2: Analyzing errors with the high_count function
[source,js]
Expand All @@ -89,7 +90,7 @@ PUT _ml/anomaly_detectors/example2
// CONSOLE
// TEST[skip:needs-licence]

If you use this `high_count` function in a detector in your job, it
If you use this `high_count` function in a detector in your {anomaly-job}, it
models the event rate for each error code. It detects users that generate an
unusually high count of error codes compared to other users.

@@ -117,9 +118,9 @@ PUT _ml/anomaly_detectors/example3
In this example, the function detects when the count of events for a
status code is lower than usual.

When you use this function in a detector in your job, it models the event rate
for each status code and detects when a status code has an unusually low count
compared to its past behavior.
When you use this function in a detector in your {anomaly-job}, it models the
event rate for each status code and detects when a status code has an unusually
low count compared to its past behavior.

.Example 4: Analyzing aggregated data with the count function
[source,js]
@@ -168,7 +169,7 @@ These functions support the following properties:
* `partition_field_name` (optional)

For more information about those properties,
see {ref}/ml-job-resource.html#ml-detectorconfig[Detector Configuration Objects].
see {ref}/ml-job-resource.html#ml-detectorconfig[Detector configuration objects].

For example, if you have the following number of events per bucket:

@@ -206,10 +207,10 @@ PUT _ml/anomaly_detectors/example5
// CONSOLE
// TEST[skip:needs-licence]

If you use this `high_non_zero_count` function in a detector in your job, it
models the count of events for the `signaturename` field. It ignores any buckets
where the count is zero and detects when a `signaturename` value has an
unusually high count of events compared to its past behavior.
If you use this `high_non_zero_count` function in a detector in your
{anomaly-job}, it models the count of events for the `signaturename` field. It
ignores any buckets where the count is zero and detects when a `signaturename`
value has an unusually high count of events compared to its past behavior.

NOTE: Population analysis (using an `over_field_name` property value) is not
supported for the `non_zero_count`, `high_non_zero_count`, and
@@ -238,7 +239,7 @@ These functions support the following properties:
* `partition_field_name` (optional)

For more information about those properties,
see {ref}/ml-job-resource.html#ml-detectorconfig[Detector Configuration Objects].
see {ref}/ml-job-resource.html#ml-detectorconfig[Detector configuration objects].

.Example 6: Analyzing users with the distinct_count function
[source,js]
@@ -261,9 +262,9 @@ PUT _ml/anomaly_detectors/example6
// TEST[skip:needs-licence]

This `distinct_count` function detects when a system has an unusual number
of logged in users. When you use this function in a detector in your job, it
models the distinct count of users. It also detects when the distinct number of
users is unusual compared to the past.
of logged in users. When you use this function in a detector in your
{anomaly-job}, it models the distinct count of users. It also detects when the
distinct number of users is unusual compared to the past.

.Example 7: Analyzing ports with the high_distinct_count function
[source,js]
@@ -287,6 +288,6 @@ PUT _ml/anomaly_detectors/example7
// TEST[skip:needs-licence]

This example detects instances of port scanning. When you use this function in a
detector in your job, it models the distinct count of ports. It also detects the
`src_ip` values that connect to an unusually high number of different
detector in your {anomaly-job}, it models the distinct count of ports. It also
detects the `src_ip` values that connect to an unusually high number of different
`dst_ports` values compared to other `src_ip` values.
16 changes: 8 additions & 8 deletions docs/reference/ml/anomaly-detection/functions/geo.asciidoc
@@ -7,9 +7,9 @@ input data.

The {ml-features} include the following geographic function: `lat_long`.

NOTE: You cannot create forecasts for jobs that contain geographic functions.
You also cannot add rules with conditions to detectors that use geographic
functions.
NOTE: You cannot create forecasts for {anomaly-jobs} that contain geographic
functions. You also cannot add rules with conditions to detectors that use
geographic functions.

[float]
[[ml-lat-long]]
@@ -26,7 +26,7 @@ This function supports the following properties:
* `partition_field_name` (optional)

For more information about those properties,
see {ref}/ml-job-resource.html#ml-detectorconfig[Detector Configuration Objects].
see {ref}/ml-job-resource.html#ml-detectorconfig[Detector configuration objects].

.Example 1: Analyzing transactions with the lat_long function
[source,js]
@@ -49,15 +49,15 @@ PUT _ml/anomaly_detectors/example1
// CONSOLE
// TEST[skip:needs-licence]

If you use this `lat_long` function in a detector in your job, it
If you use this `lat_long` function in a detector in your {anomaly-job}, it
detects anomalies where the geographic location of a credit card transaction is
unusual for a particular customer’s credit card. An anomaly might indicate fraud.

IMPORTANT: The `field_name` that you supply must be a single string that contains
two comma-separated numbers of the form `latitude,longitude`, a `geo_point` field,
a `geo_shape` field that contains point values, or a `geo_centroid` aggregation.
The `latitude` and `longitude` must be in the range -180 to 180 and represent a point on the
surface of the Earth.
The `latitude` and `longitude` must be in the range -180 to 180 and represent a
point on the surface of the Earth.

For example, JSON data might contain the following transaction coordinates:

Expand All @@ -75,6 +75,6 @@ In {es}, location data is likely to be stored in `geo_point` fields. For more
information, see {ref}/geo-point.html[Geo-point datatype]. This data type is
supported natively in {ml-features}. Specifically, {dfeed} when pulling data from
a `geo_point` field, will transform the data into the appropriate `lat,lon` string
format before sending to the {ml} job.
format before sending to the {anomaly-job}.

For more information, see <<ml-configuring-transform>>.
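A sketch of a `lat_long` detector on a `geo_point` field (the job and field names are placeholders):

[source,js]
----------------------------------
PUT _ml/anomaly_detectors/unusual_locations
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "lat_long",
        "field_name": "geoip.location",
        "by_field_name": "user_name"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
----------------------------------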