[DOCS] Adds language identification documentation to the ML DFA docs (#…

…816) Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com> Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
elastic · Jan 29, 2020 · b9076ab
1 parent 410f2ed
commit b9076ab
Showing 2 changed files with 193 additions and 2 deletions.
diff --git a/docs/en/stack/ml/df-analytics/examples.asciidoc b/docs/en/stack/ml/df-analytics/examples.asciidoc
@@ -8,16 +8,18 @@
 
 beta[]
 
-These examples demonstrate how to use {dfanalytics} to derive useful 
-insights from your data.
+These examples demonstrate how to use {dfanalytics} to derive useful insights 
+from your data.
 
 * <<ecommerce-outliers>>
 * https://github.com/elastic/examples/tree/master/Machine%20Learning/Outlier%20Detection/Introduction[{oldetection-cap} example (Jupyter notebook)]
 * <<flightdata-regression>>
 * <<flightdata-classification>>
 * https://github.com/elastic/examples/tree/master/Machine%20Learning/Analytics%20Jupyter%20Notebooks[{classanalysis-cap} example (Jupyter notebook)]
+* <<ml-lang-ident>>
 
 
 include::ecommerce-outliers.asciidoc[]
 include::flightdata-regression.asciidoc[]
 include::flightdata-classification.asciidoc[]
+include::ml-lang-ident.asciidoc[]
diff --git a/docs/en/stack/ml/df-analytics/ml-lang-ident.asciidoc b/docs/en/stack/ml/df-analytics/ml-lang-ident.asciidoc
@@ -0,0 +1,189 @@
+[role="xpack"]
+[[ml-lang-ident]]
+=== {lang-ident-cap}
+
+experimental[]
+
+{lang-ident-cap} is an {infer} trained model (`lang_ident_model_1`) that you can 
+use to determine the language of text. You can reference the {lang-ident} model 
+in an {ref}/inference-processor.html[{infer} processor] of an ingest pipeline by 
+using its model ID (`lang_ident_model_1`). The input field name is `text`. If 
+you want to run {lang-ident} on a field with a different name, you must map your 
+field name to `text` in the ingest processor settings.
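+
+As a minimal sketch, assume that your documents store the text in a 
+hypothetical field named `message`, and that each `field_mappings` entry maps a 
+document field name to the field name that the model expects. Under those 
+assumptions, the processor settings might look like this:
+
+[source,js]
+----------------------------------
+POST _ingest/pipeline/_simulate
+{
+   "pipeline":{
+      "processors":[
+         {
+            "inference":{
+               "model_id":"lang_ident_model_1",
+               "inference_config":{
+                  "classification":{
+                     "num_top_classes":5
+                  }
+               },
+               "field_mappings":{
+                  "message":"text" <1>
+               }
+            }
+         }
+      ]
+   },
+   "docs":[
+      {
+         "_source":{
+            "message":"Sziasztok! Ez egy rövid magyar szöveg."
+         }
+      }
+   ]
+}
+----------------------------------
+//NOTCONSOLE
+
+<1> `message` is an illustrative field name, not one that the model defines. 
+Replace it with the name of the field that holds your text.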
+
+The longer the text passed into the {lang-ident} model, the more accurately the 
+model can identify the language. It is fairly accurate on short samples 
+(for example, 50-character streams) in certain languages, but languages 
+that are similar to each other are harder to identify based on a short 
+character stream.
+
+{lang-ident-cap} takes into account Unicode boundaries when the feature set is 
+built. If the text has diacritical marks, then the model uses that information 
+for identifying the language of the text. In certain cases, the model can 
+detect the source language even if it is not written in the script that the 
+language traditionally uses. These languages are marked in the supported 
+languages table (see below) with the `Latn` subtag. {lang-ident-cap} supports 
+Unicode input.
+
+
+[[ml-lang-ident-supported-languages]]
+==== Supported languages
+
+The table below contains the ISO codes and the English names of the languages 
+that {lang-ident} supports. If a language has a 2-letter `ISO 639-1` code, the 
+table contains that identifier. Otherwise, the 3-letter `ISO 639-2` code is 
+used. The `Latn` subtag indicates that the language is transliterated into Latin 
+script.
+
+[cols="<,<,<,<,<,<"]
+|===
+| Code    | Language           | Code    | Language       | Code    | Language
+
+| af      | Afrikaans          | hr      | Croatian       | pa      | Punjabi        
+| am      | Amharic            | ht      | Haitian        | pl      | Polish        
+| ar      | Arabic             | hu      | Hungarian      | ps      | Pashto        
+| az      | Azerbaijani        | hy      | Armenian       | pt      | Portuguese
+| be      | Belarusian         | id      | Indonesian     | ro      | Romanian
+| bg      | Bulgarian          | ig      | Igbo           | ru      | Russian
+| bg-Latn | Bulgarian          | is      | Icelandic      | ru-Latn | Russian
+| bn      | Bengali            | it      | Italian        | sd      | Sindhi
+| bs      | Bosnian            | iw      | Hebrew         | si      | Sinhala
+| ca      | Catalan            | ja      | Japanese       | sk      | Slovak
+| ceb     | Cebuano            | ja-Latn | Japanese       | sl      | Slovenian
+| co      | Corsican           | jv      | Javanese       | sm      | Samoan
+| cs      | Czech              | ka      | Georgian       | sn      | Shona
+| cy      | Welsh              | kk      | Kazakh         | so      | Somali
+| da      | Danish             | km      | Central Khmer  | sq      | Albanian
+| de      | German             | kn      | Kannada        | sr      | Serbian
+| el      | Greek, modern      | ko      | Korean         | st      | Southern Sotho
+| el-Latn | Greek, modern      | ku      | Kurdish        | su      | Sundanese
+| en      | English            | ky      | Kirghiz        | sv      | Swedish
+| eo      | Esperanto          | la      | Latin          | sw      | Swahili
+| es      | Spanish, Castilian | lb      | Luxembourgish  | ta      | Tamil
+| et      | Estonian           | lo      | Lao            | te      | Telugu
+| eu      | Basque             | lt      | Lithuanian     | tg      | Tajik
+| fa      | Persian            | lv      | Latvian        | th      | Thai
+| fi      | Finnish            | mg      | Malagasy       | tr      | Turkish
+| fil     | Filipino           | mi      | Maori          | uk      | Ukrainian
+| fr      | French             | mk      | Macedonian     | ur      | Urdu
+| fy      | Western Frisian    | ml      | Malayalam      | uz      | Uzbek
+| ga      | Irish              | mn      | Mongolian      | vi      | Vietnamese
+| gd      | Gaelic             | mr      | Marathi        | xh      | Xhosa
+| gl      | Galician           | ms      | Malay          | yi      | Yiddish
+| gu      | Gujarati           | mt      | Maltese        | yo      | Yoruba
+| ha      | Hausa              | my      | Burmese        | zh      | Chinese
+| haw     | Hawaiian           | ne      | Nepali         | zh-Latn | Chinese
+| hi      | Hindi              | nl      | Dutch, Flemish | zu      | Zulu
+| hi-Latn | Hindi              | no      | Norwegian      |         |   
+| hmn     | Hmong              | ny      | Chichewa       |         |   
+|===
+
+
+[[ml-lang-ident-example]]
+==== Example of {lang-ident}
+
+In the following example, we feed the {lang-ident} trained model a short 
+Hungarian text that contains diacritics and a couple of English words. The 
+model identifies the text correctly as Hungarian with high probability.
+
+[source,js]
+----------------------------------
+POST _ingest/pipeline/_simulate
+{
+   "pipeline":{
+      "processors":[
+         {
+            "inference":{
+               "model_id":"lang_ident_model_1", <1>
+               "inference_config":{
+                  "classification":{
+                     "num_top_classes":5 <2>
+                  }
+               },
+               "field_mappings":{
+
+               }
+            }
+         }
+      ]
+   },
+   "docs":[
+      {
+         "_source":{ <3>
+            "text":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
+         }
+      }
+   ]
+}
+----------------------------------
+//NOTCONSOLE
+
+<1> The ID of the {lang-ident} trained model.
+<2> Specifies the number of languages to report. In this example, the five 
+languages with the highest probability are reported.
+<3> The source object that contains the text to identify.
+
+
+The request returns the following response:
+
+[source,js]
+----------------------------------
+{
+  "docs" : [
+    {
+      "doc" : {
+        "_index" : "_index",
+        "_type" : "_doc",
+        "_id" : "_id",
+        "_source" : {
+          "text" : "Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz.",
+          "ml" : {
+            "inference" : {
+              "top_classes" : [ <1>
+                {
+                  "class_name" : "hu",
+                  "class_probability" : 0.9999936063740517,
+                  "class_score" : 0.9999936063740517
+                },
+                {
+                  "class_name" : "lv",
+                  "class_probability" : 2.5020248433413966E-6,
+                  "class_score" : 2.5020248433413966E-6
+                },
+                {
+                  "class_name" : "is",
+                  "class_probability" : 1.0150420723037688E-6,
+                  "class_score" : 1.0150420723037688E-6
+                },
+                {
+                  "class_name" : "ga",
+                  "class_probability" : 6.67935962773335E-7,
+                  "class_score" : 6.67935962773335E-7
+                },
+                {
+                  "class_name" : "tr",
+                  "class_probability" : 5.591166324774555E-7,
+                  "class_score" : 5.591166324774555E-7
+                }
+              ],
+              "predicted_value" : "hu", <2>
+              "model_id" : "lang_ident_model_1"
+            }
+          }
+        },
+        "_ingest" : {
+          "timestamp" : "2020-01-22T14:25:14.644912Z"
+        }
+      }
+    }
+  ]
+}
+----------------------------------
+//NOTCONSOLE
+
+<1> Contains the most probable languages and their probability scores. The 
+number of reported languages is defined by `num_top_classes`.
+<2> The predicted value is the ISO identifier of the language with the highest 
+probability.