[DOCS] Adds screenshots to regression example (#842)
lcawl committed Feb 7, 2020
1 parent 3fd3043 commit 2a933ee
Showing 4 changed files with 135 additions and 93 deletions.
228 changes: 135 additions & 93 deletions docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc
@@ -10,20 +10,21 @@ distance and carrier to predict the number of minutes delayed for each flight.
As it is a continuous numeric variable, we'll use {reganalysis} to make the
prediction.

We have chosen this dataset as an example because it is easily accessible for
We have chosen this data set as an example because it is easily accessible for
{kib} users and the use case is relevant. However, the data has been manually
created and contains some inconsistencies. For example, a flight can be both
delayed and canceled. Please remember that the quality of your input data will
affect the quality of results.

Each document in the dataset contains details for a single flight, so this data
Each document in the data set contains details for a single flight, so this data
is ready for analysis as it is already in a two-dimensional entity-based data
structure (_{dataframe}_). In general, you often need to
{ref}/transforms.html[transform] the data into an entity-centric index before
you analyze the data.

This is an example source document from the dataset:

.Example source document
[%collapsible]
====
```
{
"_index": "kibana_sample_data_flights",
Expand Down Expand Up @@ -70,6 +71,7 @@ This is an example source document from the dataset:
}
}
```
====


{regression-cap} is a supervised machine learning analysis and therefore needs
@@ -96,18 +98,41 @@ To predict the number of minutes delayed for each flight:
. Create a {dfanalytics-job}.
+
--
Use the {ref}/put-dfanalytics.html[create {dfanalytics-jobs}] API as you can see
in the following example:
You can use the wizard on the *Machine Learning* > *Data Frame Analytics* tab
in {kib} or the {ref}/put-dfanalytics.html[create {dfanalytics-jobs}] API.

image::images/flights-regression-job.jpg[alt="Creating a {dfanalytics-job} in {kib}",width="50%",role="screenshot left",align="text-left"]

.. Choose `regression` as the job type.
.. Choose `kibana_sample_data_flights` as the source index.
.. Add the name of the destination index that will contain the results of the
analysis. It will contain a copy of the source index data where each document is
annotated with the results. If the index does not exist, it will be created
automatically.
.. Choose `FlightDelayMin` as the dependent variable, which is the field that we
want to predict with the {reganalysis}.
.. Choose a training percent of `90` which means it randomly selects 90% of the
source data for training.
.. Add `Cancelled`, `FlightDelay`, and `FlightDelayType` to the list of excluded
fields. These fields will be excluded from the analysis. It is recommended to
exclude fields that either contain erroneous data or describe the
`dependent_variable`.
.. Use the default memory limit for the job. If the job requires more than this
amount of memory, it fails to start. If the available memory on the node is
limited, this setting makes it possible to prevent job execution.
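
The training percent is an approximate proportion, which is easier to see with a small illustration. The following Python sketch is not how {es} selects documents; it assumes a simple per-document random draw, and the function name `assign_training` and the seed are hypothetical. It only shows why a setting of `90` yields roughly, not exactly, 90% training documents.

```python
import random


def assign_training(doc_count, training_percent, seed=0):
    """Mark each document as training data with probability training_percent/100.

    Illustration only: the real sampling strategy is an assumption here.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    threshold = training_percent / 100.0
    return [rng.random() < threshold for _ in range(doc_count)]


flags = assign_training(doc_count=10000, training_percent=90)
training_share = sum(flags) / len(flags)
print(f"{training_share:.1%} of documents marked for training")
```

The remaining documents are the ones that can later be used to estimate how the model performs on data it has not seen.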

.API example
[%collapsible]
====
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/model-flight-delays
{
"source": {
"index": [
"kibana_sample_data_flights" <1>
"kibana_sample_data_flights"
],
"query": { <2>
"query": { <1>
"range": {
"DistanceKilometers": {
"gt": 0
@@ -116,74 +141,72 @@ PUT _ml/data_frame/analytics/model-flight-delays
}
},
"dest": {
"index": "df-flight-delays" <3>
"index": "df-flight-delays"
},
"analysis": {
"regression": {
"dependent_variable": "FlightDelayMin", <4>
"training_percent": 90 <5>
"dependent_variable": "FlightDelayMin",
"training_percent": 90
}
},
"analyzed_fields": {
"includes": [],
"excludes": [ <6>
"excludes": [
"Cancelled",
"FlightDelay",
"FlightDelayType"
]
},
"model_memory_limit": "100mb" <7>
"model_memory_limit": "100mb"
}
--------------------------------------------------
// TEST[skip:setup kibana sample data]
<1> The source index to analyze.
<2> This query removes erroneous data from the analysis to improve its quality.
<3> The index that will contain the results of the analysis; it will consist of
a copy of the source index data where each document is annotated with the
results.
<4> Specifies the continuous variable we want to predict with the {reganalysis}.
<5> Specifies the approximate proportion of data that is used for training. In
this example we randomly select 90% of the source data for training.
<6> Specifies fields to be excluded from the analysis. It is recommended to
exclude fields that either contain erroneous data or describe the
`dependent_variable`.
<7> Specifies a memory limit for the job. If the job requires more than this
amount of memory, it fails to start. This makes it possible to prevent job
execution if the available memory on the node is limited.
<1> This optional query removes erroneous data from the analysis to improve its
quality.
====
--

. Start the job.
. Start the job in {kib} or use the
{ref}/start-dfanalytics.html[start {dfanalytics-jobs}] API.
+
--
Use the {ref}/start-dfanalytics.html[start {dfanalytics-jobs}] API to start the
job. It will stop automatically when the analysis is complete.
The job takes a few minutes to run. Runtime depends on the local hardware and
also on the number of documents and fields that are analyzed. The more fields
and documents, the longer the job runs. It stops automatically when the analysis
is complete.

.API example
[%collapsible]
====
[source,console]
--------------------------------------------------
POST _ml/data_frame/analytics/model-flight-delays/_start
--------------------------------------------------
// TEST[skip:TBD]


The job takes a few minutes to run. Runtime depends on the local hardware and
also on the number of documents and fields that are analyzed. The more fields and
documents, the longer the job runs.
====
--

. Check the job stats to follow the progress by using the
. Check the job stats to follow the progress in {kib} or use the
{ref}/get-dfanalytics-stats.html[get {dfanalytics-jobs} statistics API].
+
--
[role="screenshot"]
image::images/flights-regression-details.jpg["Statistics for a {dfanalytics-job} in {kib}"]

The job has four phases (reindexing, loading data, analyzing, and writing
results). When all the phases have completed, the job stops and the results are
ready to view and evaluate.
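
If you poll the statistics from a script instead of watching {kib}, a completion check might look like the following minimal Python sketch. It assumes that the statistics for a job list each phase in a `progress` array with `phase` and `progress_percent` fields. The sample values and the helper name `all_phases_complete` are hypothetical.

```python
def all_phases_complete(job_stats):
    """Return True when every reported phase has reached 100 percent."""
    phases = job_stats.get("progress", [])
    # An empty list means no progress was reported, so treat it as not finished.
    return bool(phases) and all(p["progress_percent"] >= 100 for p in phases)


# Hypothetical statistics for a job that is still in the analyzing phase.
sample_stats = {
    "id": "model-flight-delays",
    "progress": [
        {"phase": "reindexing", "progress_percent": 100},
        {"phase": "loading_data", "progress_percent": 100},
        {"phase": "analyzing", "progress_percent": 40},
        {"phase": "writing_results", "progress_percent": 0},
    ],
}

print(all_phases_complete(sample_stats))
```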

.API example
[%collapsible]
====
[source,console]
--------------------------------------------------
GET _ml/data_frame/analytics/model-flight-delays/_stats
--------------------------------------------------
// TEST[skip:TBD]

The API call returns the following response:
[source,console-result]
Expand Down Expand Up @@ -215,29 +238,42 @@ The API call returns the following response:
}
]
}
----


The job has four phases. When all the phases have completed, the job stops and
the results are ready to view and evaluate.
----
====
--


[[flightdata-regression-results]]
==== Viewing {regression} results

Now you have a new index that contains a copy of your source data with
predictions for your dependent variable. Use the standard {es} search command to
view the results in the destination index:
predictions for your dependent variable.

When you view the {regression} results in {kib}, it shows the contents of the
destination index in a tabular format:

[role="screenshot"]
image::images/flights-regression-results.jpg["Results for a {dfanalytics-job} in {kib}"]

In this example, the table shows a column for the dependent variable
(`FlightDelayMin`), which contains the ground truth values that we are trying to
predict with the {reganalysis}. It also shows a column for the prediction values
(`ml.FlightDelayMin_prediction`) and a column that indicates whether the
document was used in the training set (`ml.is_training`). You can filter the
table to show only testing or training data and you can select which fields are
shown in the table.
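
The same filtering can be done on documents read from the destination index. The following Python sketch is an illustration only: it assumes each document is a dictionary with the `FlightDelayMin` field and an `ml` object, as in this example. The second sample document and the helper names are hypothetical.

```python
def split_by_training(docs):
    """Separate documents used for training from those held back for testing."""
    training = [d for d in docs if d["ml"]["is_training"]]
    testing = [d for d in docs if not d["ml"]["is_training"]]
    return training, testing


def absolute_error(doc):
    """Distance in minutes between the actual delay and the predicted delay."""
    return abs(doc["FlightDelayMin"] - doc["ml"]["FlightDelayMin_prediction"])


docs = [
    # Values taken from the annotated document in this example.
    {"FlightDelayMin": 66,
     "ml": {"FlightDelayMin_prediction": 62.527, "is_training": False}},
    # Hypothetical training document, added for illustration.
    {"FlightDelayMin": 0,
     "ml": {"FlightDelayMin_prediction": 4.5, "is_training": True}},
]

training, testing = split_by_training(docs)
for doc in testing:
    print(f"testing document is off by {absolute_error(doc):.3f} minutes")
```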

If you do not use {kib}, you can see the same information by using the standard
{es} search command to view the results in the destination index.

.API example
[%collapsible]
====
[source,console]
--------------------------------------------------
GET df-flight-delays/_search
--------------------------------------------------
// TEST[skip:TBD]


The snippet below shows a part of a document with the annotated results:
[source,console-result]
@@ -246,33 +282,43 @@ The snippet below shows a part of a document with the annotated results:
"DestRegion" : "UK",
"OriginAirportID" : "LHR",
"DestCityName" : "London",
"FlightDelayMin" : 66, <1>
"FlightDelayMin" : 66,
"ml" : {
"FlightDelayMin_prediction" : 62.527, <2>
"is_training" : false <3>
"FlightDelayMin_prediction" : 62.527,
"is_training" : false
}
...
----

<1> The `dependent_variable` with the ground truth value. This is what we are
trying to predict with the {reganalysis}.
<2> The prediction. The field name is suffixed with `_prediction`.
<3> Indicates that this document was not used in the training set.
====


[[flightdata-regression-evaluate]]
==== Evaluating results
==== Evaluating {regression} results

The results can be evaluated for documents which contain both the ground truth
field and the prediction. In the example below, `FlightDelayMin` contains the
ground truth and the prediction is stored as `ml.FlightDelayMin_prediction`.
Though you can look at individual results and compare the predicted value
(`ml.FlightDelayMin_prediction`) to the actual value (`FlightDelayMin`), you
typically need to evaluate the success of the {regression} model as a whole.

. Use the {dfanalytics} evaluate API to evaluate the results.
+
--
First, we want to know the training error that represents how well the model
performed on the training dataset:
{kib} provides _training error_ metrics, which represent how well the model
performed on the training data set. It also provides _generalization error_
metrics, which represent how well the model performed on testing data.

An MSE of zero means that the model predicts the dependent variable with
perfect accuracy. This is the ideal, but is typically not possible. Likewise, an
R-squared value of 1 indicates that all of the variance in the dependent variable
can be explained by the feature variables. Typically, you compare the MSE and
R-squared values from multiple {regression} models to find the best balance or
fit for your data.
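
To make the two metrics concrete, the following Python sketch computes them from a handful of actual and predicted values. It uses the standard textbook definitions of mean squared error and R-squared, which are assumed here and may differ in detail from what {es} calculates. The sample values and function names are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual / total


# Hypothetical delays in minutes.
actual = [0, 30, 66, 120]
imperfect = [10, 20, 60, 100]

# A perfect model reproduces the actual values: MSE is 0 and R-squared is 1.
print(mean_squared_error(actual, actual), r_squared(actual, actual))

# An imperfect model has a larger MSE and an R-squared below 1.
print(mean_squared_error(actual, imperfect), r_squared(actual, imperfect))
```

A lower MSE and an R-squared closer to 1 on the testing data suggest a better fit, which is the comparison you make between candidate models.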

For more information about interpreting the evaluation metrics, see
<<ml-dfanalytics-regression-evaluation>>.

You can alternatively generate these metrics with the
{ref}/evaluate-dfanalytics.html[{dfanalytics} evaluate API].

.API example
[%collapsible]
====
[source,console]
--------------------------------------------------
POST _ml/data_frame/_evaluate
@@ -297,13 +343,28 @@ POST _ml/data_frame/_evaluate
--------------------------------------------------
// TEST[skip:TBD]
<1> The destination index which is the output of the analysis job.
<2> We calculate the training error by only evaluating the training data.
<3> The ground truth label.
<4> Predicted value.
<1> The destination index which is the output of the {dfanalytics-job}.
<2> We calculate the training error by evaluating only the training data.
<3> The field that contains the actual (ground truth) value.
<4> The field that contains the predicted value.
Next, we calculate the generalization error that represents how well the model
performed on previously unseen data:
The API returns a response like this:
[source,console-result]
----
{
"regression" : {
"mean_squared_error" : {
"error" : 3006.517622042659
},
"r_squared" : {
"value" : 0.6794200914263231
}
}
}
----
Next, we calculate the generalization error:
[source,console]
--------------------------------------------------
@@ -329,28 +390,9 @@ POST _ml/data_frame/_evaluate
--------------------------------------------------
// TEST[skip:TBD]
<1> We evaluate only the documents that are not part of the training data.
====


The evaluate {dfanalytics} API returns the following response:

[source,console-result]
----
{
"regression" : {
"mean_squared_error" : {
"error" : 3759.7242253334207
},
"r_squared" : {
"value" : 0.5853159777330623
}
}
}
----

For more information about the evaluation metrics, see
<<dfa-regression-evaluation>>.

If you don't want to keep the {dfanalytics-job}, you can delete it by using the
{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When you delete
{dfanalytics-jobs}, the destination indices remain intact.
--
If you don't want to keep the {dfanalytics-job}, you can delete it. For example,
use {kib} or the {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API].
When you delete {dfanalytics-jobs}, the destination indices remain intact.