[DOCS] Adds screenshots to regression example (#842)

elastic · Feb 7, 2020 · 2a933ee
1 parent 3fd3043
commit 2a933ee

Showing 4 changed files with 135 additions and 93 deletions.
diff --git a/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc b/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc
@@ -10,20 +10,21 @@ distance and carrier to predict the number of minutes delayed for each flight.
 As it is a continuous numeric variable, we'll use {reganalysis} to make the 
 prediction.
 
-We have chosen this dataset as an example because it is easily accessible for 
+We have chosen this data set as an example because it is easily accessible for 
 {kib} users and the use case is relevant. However, the data has been manually 
 created and contains some inconsistencies. For example, a flight can be both 
 delayed and canceled. Please remember that the quality of your input data will 
 affect the quality of results.
 
-Each document in the dataset contains details for a single flight, so this data 
+Each document in the data set contains details for a single flight, so this data 
 is ready for analysis as it is already in a two-dimensional entity-based data 
 structure (_{dataframe}_). In general, you often need to 
 {ref}/transforms.html[transform] the data into an entity-centric index before 
 you analyze the data.
 
-This is an example source document from the dataset:
-
+.Example source document
+[%collapsible]
+====
 ```
 {
   "_index": "kibana_sample_data_flights",
@@ -70,6 +71,7 @@ This is an example source document from the dataset:
   }
 }
 ```
+====
 
 
 {regression-cap} is a supervised machine learning analysis and therefore needs 
@@ -96,18 +98,41 @@ To predict the number of minutes delayed for each flight:
 . Create a {dfanalytics-job}.
 +
 --
-Use the {ref}/put-dfanalytics.html[create {dfanalytics-jobs}] API as you can see 
-in the following example:
+You can use the wizard on the *Machine Learning* > *Data Frame Analytics* tab
+in {kib} or the {ref}/put-dfanalytics.html[create {dfanalytics-jobs}] API.
+
+image::images/flights-regression-job.jpg[alt="Creating a {dfanalytics-job} in {kib}",width="50%",role="screenshot left",align="text-left"]
+
+.. Choose `regression` as the job type.
+.. Choose `kibana_sample_data_flights` as the source index.
+.. Add the name of the destination index that will contain the results of the
+analysis. It will contain a copy of the source index data where each document is
+annotated with the results. If the index does not exist, it will be created
+automatically.
+.. Choose `FlightDelayMin` as the dependent variable, which is the field that we
+want to predict with the {reganalysis}.
+.. Choose a training percent of `90`, which means that 90% of the source data
+is randomly selected for training.
+.. Add `Cancelled`, `FlightDelay`, and `FlightDelayType` to the list of excluded
+fields. These fields will be excluded from the analysis. It is recommended to 
+exclude fields that either contain erroneous data or describe the 
+`dependent_variable`.
+.. Use the default memory limit for the job. If the job requires more than this 
+amount of memory, it fails to start. If the available memory on the node is
+limited, this setting makes it possible to prevent job execution.
 
+.API example
+[%collapsible]
+====
 [source,console]
 --------------------------------------------------
 PUT _ml/data_frame/analytics/model-flight-delays
 {
   "source": {
     "index": [
-      "kibana_sample_data_flights" <1>
+      "kibana_sample_data_flights"
     ],
-    "query": { <2>
+    "query": { <1>
       "range": {
         "DistanceKilometers": { 
           "gt": 0
@@ -116,74 +141,72 @@ PUT _ml/data_frame/analytics/model-flight-delays
     }
   },
   "dest": {
-    "index": "df-flight-delays"  <3>
+    "index": "df-flight-delays"
   },
   "analysis": {
     "regression": {
-      "dependent_variable": "FlightDelayMin",  <4>
-      "training_percent": 90  <5>
+      "dependent_variable": "FlightDelayMin",
+      "training_percent": 90
     }
   },
   "analyzed_fields": {
     "includes": [],
-    "excludes": [    <6>
+    "excludes": [
       "Cancelled",
       "FlightDelay",
       "FlightDelayType"
     ]
   },
-  "model_memory_limit": "100mb" <7>
+  "model_memory_limit": "100mb"
 }
 --------------------------------------------------
 // TEST[skip:setup kibana sample data]
 
-<1> The source index to analyze.
-<2> This query removes erroneous data from the analysis to improve its quality.
-<3> The index that will contain the results of the analysis; it will consist of 
-a copy of the source index data where each document is annotated with the 
-results.
-<4> Specifies the continuous variable we want to predict with the {reganalysis}.
-<5> Specifies the approximate proportion of data that is used for training. In 
-this example we randomly select 90% of the source data for training.
-<6> Specifies fields to be excluded from the analysis. It is recommended to 
-exclude fields that either contain erroneous data or describe the 
-`dependent_variable`.
-<7> Specifies a memory limit for the job. If the job requires more than this 
-amount of memory, it fails to start. This makes it possible to prevent job 
-execution if the available memory on the node is limited.
+<1> This optional query removes erroneous data from the analysis to improve its
+quality.
+====
 --
 
-. Start the job.
+. Start the job in {kib} or use the
+{ref}/start-dfanalytics.html[start {dfanalytics-jobs}] API.
 +
 --
-Use the {ref}/start-dfanalytics.html[start {dfanalytics-jobs}] API to start the 
-job. It will stop automatically when the analysis is complete.
+The job takes a few minutes to run. Runtime depends on the local hardware and 
+also on the number of documents and fields that are analyzed. The more fields
+and documents, the longer the job runs. It stops automatically when the analysis
+is complete.
 
+.API example
+[%collapsible]
+====
 [source,console]
 --------------------------------------------------
 POST _ml/data_frame/analytics/model-flight-delays/_start
 --------------------------------------------------
 // TEST[skip:TBD]
-
-
-The job takes a few minutes to run. Runtime depends on the local hardware and 
-also on the number of documents and fields that analyzed. The more fields and 
-documents, the longer the job runs.
+====
 --
 
-. Check the job stats to follow the progress by using the 
+. Check the job stats to follow the progress in {kib} or use the 
 {ref}/get-dfanalytics-stats.html[get {dfanalytics-jobs} statistics API].
 +
 --
+[role="screenshot"]
+image::images/flights-regression-details.jpg["Statistics for a {dfanalytics-job} in {kib}"]
 
+The job has four phases (reindexing, loading data, analyzing, and writing
+results). When all the phases have completed, the job stops and the results are
+ready to view and evaluate.
 
+.API example
+[%collapsible]
+====
 [source,console]
 --------------------------------------------------
 GET _ml/data_frame/analytics/model-flight-delays/_stats
 --------------------------------------------------
 // TEST[skip:TBD]
 
-
 The API call returns the following response: 
 
 [source,console-result]
@@ -215,29 +238,42 @@ The API call returns the following response:
     }
   ]
 }
-----  
-
-
-The job has four phases. When all the phases have completed, the job stops and 
-the results are ready to view and evaluate.
+----
+====
 --
 
-
 [[flightdata-regression-results]]
 ==== Viewing {regression} results
 
 Now you have a new index that contains a copy of your source data with 
-predictions for your dependent variable. Use the standard {es} search command to 
-view the results in the destination index:
+predictions for your dependent variable.
+
+When you view the {regression} results in {kib}, it shows the contents of the
+destination index in a tabular format:
+
+[role="screenshot"]
+image::images/flights-regression-results.jpg["Results for a {dfanalytics-job} in {kib}"]
+
+In this example, the table shows a column for the dependent variable
+(`FlightDelayMin`), which contains the ground truth values that we are trying to
+predict with the {reganalysis}. It also shows a column for the prediction values
+(`ml.FlightDelayMin_prediction`) and a column that indicates whether the
+document was used in the training set (`ml.is_training`). You can filter the
+table to show only testing or training data and you can select which fields are
+shown in the table.
 
+If you do not use {kib}, you can see the same information by using the standard
+{es} search command to view the results in the destination index.
+
+.API example
+[%collapsible]
+====
 [source,console]
 --------------------------------------------------
 GET df-flight-delays/_search
 --------------------------------------------------
 // TEST[skip:TBD]
 
-
-
 The snippet below shows a part of a document with the annotated results:
 
 [source,console-result]
@@ -246,33 +282,43 @@ The snippet below shows a part of a document with the annotated results:
           "DestRegion" : "UK",
           "OriginAirportID" : "LHR",
           "DestCityName" : "London",
-          "FlightDelayMin" : 66,      <1>
+          "FlightDelayMin" : 66,
           "ml" : {
-            "FlightDelayMin_prediction" : 62.527,   <2>
-            "is_training" : false   <3>
+            "FlightDelayMin_prediction" : 62.527,
+            "is_training" : false
           }
           ...
 ----
-
-<1> The `dependent_variable` with the ground truth value. This is what we are 
-trying to predict with the {reganalysis}.
-<2> The prediction. The field name is suffixed with `_prediction`.
-<3> Indicates that this document was not used in the training set.
+====
 
 
 [[flightdata-regression-evaluate]]
-==== Evaluating results
+==== Evaluating {regression} results
 
-The results can be evaluated for documents which contain both the ground truth 
-field and the prediction. In the example below, `FlightDelayMins` contains the 
-ground truth and the prediction is stored as `ml.FlightDelayMin_prediction`.
+Though you can look at individual results and compare the predicted value
+(`ml.FlightDelayMin_prediction`) to the actual value (`FlightDelayMin`), you
+typically need to evaluate the success of the {regression} model as a whole.
 
-. Use the {dfanalytics} evaluate API to evaluate the results.
-+
---
-First, we want to know the training error that represents how well the model 
-performed on the training dataset:
+{kib} provides _training error_ metrics, which represent how well the model
+performed on the training data set. It also provides _generalization error_
+metrics, which represent how well the model performed on testing data.
+
+A mean squared error (MSE) of zero means that the model predicts the dependent
+variable with perfect accuracy. This is the ideal, but it is typically not
+possible. Likewise, an R-squared value of 1 indicates that all of the variance
+in the dependent variable can be explained by the feature variables. Typically,
+you compare the MSE and R-squared values from multiple {regression} models to
+find the best balance or fit for your data.
 
+For more information about interpreting the evaluation metrics, see
+<<ml-dfanalytics-regression-evaluation>>.
+
+You can alternatively generate these metrics with the
+{ref}/evaluate-dfanalytics.html[{dfanalytics} evaluate API].
+
+.API example
+[%collapsible]
+====
 [source,console]
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
@@ -297,13 +343,28 @@ POST _ml/data_frame/_evaluate
 --------------------------------------------------
 // TEST[skip:TBD]
 
-<1> The destination index which is the output of the analysis job.
-<2> We calculate the training error by only evaluating the training data.
-<3> The ground truth label.
-<4> Predicted value.
+<1> The destination index, which is the output of the {dfanalytics-job}.
+<2> We calculate the training error by evaluating only the training data.
+<3> The field that contains the actual (ground truth) value.
+<4> The field that contains the predicted value.
 
-Next, we calculate the generalization error that represents how well the model 
-performed on previously unseen data:
+The API returns a response like this:
+
+[source,console-result]
+----  
+{
+  "regression" : {
+    "mean_squared_error" : {
+      "error" : 3006.517622042659
+    },
+    "r_squared" : {
+      "value" : 0.6794200914263231
+    }
+  }
+}
+----
+
+Next, we calculate the generalization error:
 
 [source,console]
 --------------------------------------------------
@@ -329,28 +390,9 @@ POST _ml/data_frame/_evaluate
 --------------------------------------------------
 // TEST[skip:TBD]
 <1> We evaluate only the documents that are not part of the training data.
+====
 
 
-The evaluate {dfanalytics} API returns the following response:
-
-[source,console-result]
-----  
-{
-  "regression" : {
-    "mean_squared_error" : {
-      "error" : 3759.7242253334207
-    },
-    "r_squared" : {
-      "value" : 0.5853159777330623
-    }
-  }
-}
-----
-
-For more information about the evaluation metrics, see 
-<<dfa-regression-evaluation>>.
-
-If you don't want to keep the {dfanalytics-job}, you can delete it by using the 
-{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When you delete 
-{dfanalytics-jobs}, the destination indices remain intact.
---
+If you don't want to keep the {dfanalytics-job}, you can delete it. For example,
+use {kib} or the {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API].
+When you delete {dfanalytics-jobs}, the destination indices remain intact.
diff --git a/docs/en/stack/ml/df-analytics/images/flights-regression-details.jpg b/docs/en/stack/ml/df-analytics/images/flights-regression-details.jpg
diff --git a/docs/en/stack/ml/df-analytics/images/flights-regression-job.jpg b/docs/en/stack/ml/df-analytics/images/flights-regression-job.jpg
diff --git a/docs/en/stack/ml/df-analytics/images/flights-regression-results.jpg b/docs/en/stack/ml/df-analytics/images/flights-regression-results.jpg
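
The evaluation section in this change describes training error and generalization error in terms of MSE and R-squared. As a rough illustration of what those two numbers mean, here is a minimal Python sketch that computes them with the standard textbook formulas and splits documents on `ml.is_training`. The function names and the sample documents are hypothetical (only the field names come from the results snippet in the diff), and the evaluate API may differ from this in detail.

```python
from typing import Dict, List, Sequence


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the squared differences between actual and predicted values."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """1 minus (residual sum of squares / total sum of squares)."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values have zero variance; R-squared is undefined")
    return 1 - ss_res / ss_tot


def evaluate(docs: List[dict], is_training: bool) -> Dict[str, float]:
    """Compute both metrics for either the training or the testing documents."""
    subset = [d for d in docs if d["ml"]["is_training"] == is_training]
    actual = [d["FlightDelayMin"] for d in subset]
    predicted = [d["ml"]["FlightDelayMin_prediction"] for d in subset]
    return {
        "mean_squared_error": mean_squared_error(actual, predicted),
        "r_squared": r_squared(actual, predicted),
    }


# Hypothetical documents shaped like the annotated results in the destination
# index. The values are made up for illustration and are not real job output.
docs = [
    {"FlightDelayMin": 66, "ml": {"FlightDelayMin_prediction": 60.0, "is_training": True}},
    {"FlightDelayMin": 0, "ml": {"FlightDelayMin_prediction": 6.0, "is_training": True}},
    {"FlightDelayMin": 120, "ml": {"FlightDelayMin_prediction": 114.0, "is_training": True}},
    {"FlightDelayMin": 66, "ml": {"FlightDelayMin_prediction": 62.527, "is_training": False}},
    {"FlightDelayMin": 15, "ml": {"FlightDelayMin_prediction": 30.0, "is_training": False}},
]

training_error = evaluate(docs, is_training=True)
generalization_error = evaluate(docs, is_training=False)
print("training:", training_error)
print("generalization:", generalization_error)
```

A model that fits the training documents much better than the testing documents (lower MSE, higher R-squared) is likely overfitting, which is why the two sets are evaluated separately.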