[DOCS] Adds language identification documentation to the ML DFA docs (#…
…816)

Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com>
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
3 people authored Jan 29, 2020
1 parent 410f2ed commit b9076ab
Showing 2 changed files with 193 additions and 2 deletions.
6 changes: 4 additions & 2 deletions docs/en/stack/ml/df-analytics/examples.asciidoc
@@ -8,16 +8,18 @@

beta[]

These examples demonstrate how to use {dfanalytics} to derive useful
insights from your data.
These examples demonstrate how to use {dfanalytics} to derive useful insights
from your data.

* <<ecommerce-outliers>>
* https://github.com/elastic/examples/tree/master/Machine%20Learning/Outlier%20Detection/Introduction[{oldetection-cap} example (Jupyter notebook)]
* <<flightdata-regression>>
* <<flightdata-classification>>
* https://github.com/elastic/examples/tree/master/Machine%20Learning/Analytics%20Jupyter%20Notebooks[{classanalysis-cap} example (Jupyter notebook)]
* <<ml-lang-ident>>


include::ecommerce-outliers.asciidoc[]
include::flightdata-regression.asciidoc[]
include::flightdata-classification.asciidoc[]
include::ml-lang-ident.asciidoc[]
189 changes: 189 additions & 0 deletions docs/en/stack/ml/df-analytics/ml-lang-ident.asciidoc
@@ -0,0 +1,189 @@
[role="xpack"]
[[ml-lang-ident]]
=== {lang-ident-cap}

experimental[]

{lang-ident-cap} is an {infer} trained model (`lang_ident_model_1`) that you can
use to determine the language of text. You can reference the {lang-ident} model
in an {ref}/inference-processor.html[{infer} processor] of an ingest pipeline by
using its model ID (`lang_ident_model_1`). The input field name is `text`. If
you want to run {lang-ident} on a field with a different name, you must map your
field name to `text` in the ingest processor settings.
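
As a minimal sketch of that mapping, the following Python snippet builds the
body of a `_simulate` request for a document whose text lives in a hypothetical
`message` field. It assumes that `field_mappings` (the setting name used in the
example later in this section) maps the document field name to the model's
input field name. The helper `build_simulate_body` is made up for illustration;
it only constructs the JSON and does not send it anywhere.

```python
import json


def build_simulate_body(text, source_field="message", num_top_classes=3):
    """Build a body for POST _ingest/pipeline/_simulate (illustrative only).

    Assumption: `field_mappings` maps document field -> model input field.
    """
    inference = {
        "model_id": "lang_ident_model_1",
        "inference_config": {
            "classification": {"num_top_classes": num_top_classes}
        },
        # Map the hypothetical source field onto the model's `text` input.
        "field_mappings": {source_field: "text"},
    }
    return {
        "pipeline": {"processors": [{"inference": inference}]},
        "docs": [{"_source": {source_field: text}}],
    }


body = build_simulate_body("Das ist ein kurzer deutscher Satz.")
print(json.dumps(body, indent=2, ensure_ascii=False))
```

If your documents already store the text in a field named `text`, the mapping
can stay empty, as in the example request further down.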

The longer the text passed into the {lang-ident} model, the more accurately the
model can identify the language. It is fairly accurate on short samples
(for example, 50-character streams) in certain languages, but languages that
are similar to each other are harder to tell apart based on a short character
stream.

{lang-ident-cap} takes into account Unicode boundaries when the feature set is
built. If the text has diacritical marks, then the model uses that information
for identifying the language of the text. In certain cases, the model can
detect the source language even if it is not written in the script that the
language traditionally uses. These languages are marked in the supported
languages table (see below) with the `Latn` subtag. {lang-ident-cap} supports
Unicode input.


[[ml-lang-ident-supported-languages]]
==== Supported languages

The table below contains the ISO codes and the English names of the languages
that {lang-ident} supports. If a language has a 2-letter `ISO 639-1` code, the
table contains that identifier. Otherwise, the 3-letter `ISO 639-2` code is
used. The `Latn` subtag indicates that the language is transliterated into Latin
script.

[cols="<,<,<,<,<,<"]
|===
| Code | Language | Code | Language | Code | Language

| af | Afrikaans | hr | Croatian | pa | Punjabi
| am | Amharic | ht | Haitian | pl | Polish
| ar | Arabic | hu | Hungarian | ps | Pashto
| az | Azerbaijani | hy | Armenian | pt | Portuguese
| be | Belarusian | id | Indonesian | ro | Romanian
| bg | Bulgarian | ig | Igbo | ru | Russian
| bg-Latn | Bulgarian | is | Icelandic | ru-Latn | Russian
| bn | Bengali | it | Italian | sd | Sindhi
| bs | Bosnian | iw | Hebrew | si | Sinhala
| ca | Catalan | ja | Japanese | sk | Slovak
| ceb | Cebuano | ja-Latn | Japanese | sl | Slovenian
| co | Corsican | jv | Javanese | sm | Samoan
| cs | Czech | ka | Georgian | sn | Shona
| cy | Welsh | kk | Kazakh | so | Somali
| da | Danish | km | Central Khmer | sq | Albanian
| de | German | kn | Kannada | sr | Serbian
| el | Greek, modern | ko | Korean | st | Southern Sotho
| el-Latn | Greek, modern | ku | Kurdish | su | Sundanese
| en | English | ky | Kirghiz | sv | Swedish
| eo | Esperanto | la | Latin | sw | Swahili
| es | Spanish, Castilian | lb | Luxembourgish | ta | Tamil
| et | Estonian | lo | Lao | te | Telugu
| eu | Basque | lt | Lithuanian | tg | Tajik
| fa | Persian | lv | Latvian | th | Thai
| fi | Finnish | mg | Malagasy | tr | Turkish
| fil | Filipino | mi | Maori | uk | Ukrainian
| fr | French | mk | Macedonian | ur | Urdu
| fy | Western Frisian | ml | Malayalam | uz | Uzbek
| ga | Irish | mn | Mongolian | vi | Vietnamese
| gd | Gaelic | mr | Marathi | xh | Xhosa
| gl | Galician | ms | Malay | yi | Yiddish
| gu | Gujarati | mt | Maltese | yo | Yoruba
| ha | Hausa | my | Burmese | zh | Chinese
| haw | Hawaiian | ne | Nepali | zh-Latn | Chinese
| hi | Hindi | nl | Dutch, Flemish | zu | Zulu
| hi-Latn | Hindi | no | Norwegian | |
| hmn | Hmong | ny | Chichewa | |
|===
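
When you post-process results, it can help to separate the language code from
the optional `Latn` subtag. The snippet below is a small sketch of one way to do
that for the codes in the table above; the function name `split_language_tag`
is hypothetical and not part of any Elastic API.

```python
def split_language_tag(tag):
    """Split a code such as 'bg-Latn' into (language, script).

    `script` is None when the code has no subtag.
    """
    language, _, script = tag.partition("-")
    return language, (script or None)


for tag in ["hu", "bg-Latn", "ceb", "zh-Latn"]:
    language, script = split_language_tag(tag)
    # A `Latn` subtag marks text transliterated into Latin script.
    transliterated = script == "Latn"
    print(f"{tag}: language={language}, transliterated={transliterated}")
```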


[[ml-lang-ident-example]]
==== Example of {lang-ident}

In the following example, we feed the {lang-ident} trained model a short
Hungarian text that contains diacritics and a couple of English words. The
model identifies the text correctly as Hungarian with high probability.

[source,js]
----------------------------------
POST _ingest/pipeline/_simulate
{
  "pipeline":{
    "processors":[
      {
        "inference":{
          "model_id":"lang_ident_model_1", <1>
          "inference_config":{
            "classification":{
              "num_top_classes":5 <2>
            }
          },
          "field_mappings":{
          }
        }
      }
    ]
  },
  "docs":[
    {
      "_source":{ <3>
        "text":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
      }
    }
  ]
}
----------------------------------
//NOTCONSOLE

<1> The ID of the {lang-ident} trained model.
<2> Specifies the number of classes (in this case, languages) to report. In this
example, the five languages with the highest probability are returned.
<3> The source object that contains the text to identify.


The request returns the following response:

[source,js]
----------------------------------
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "text" : "Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz.",
          "ml" : {
            "inference" : {
              "top_classes" : [ <1>
                {
                  "class_name" : "hu",
                  "class_probability" : 0.9999936063740517,
                  "class_score" : 0.9999936063740517
                },
                {
                  "class_name" : "lv",
                  "class_probability" : 2.5020248433413966E-6,
                  "class_score" : 2.5020248433413966E-6
                },
                {
                  "class_name" : "is",
                  "class_probability" : 1.0150420723037688E-6,
                  "class_score" : 1.0150420723037688E-6
                },
                {
                  "class_name" : "ga",
                  "class_probability" : 6.67935962773335E-7,
                  "class_score" : 6.67935962773335E-7
                },
                {
                  "class_name" : "tr",
                  "class_probability" : 5.591166324774555E-7,
                  "class_score" : 5.591166324774555E-7
                }
              ],
              "predicted_value" : "hu", <2>
              "model_id" : "lang_ident_model_1"
            }
          }
        },
        "_ingest" : {
          "timestamp" : "2020-01-22T14:25:14.644912Z"
        }
      }
    }
  ]
}
----------------------------------
//NOTCONSOLE

<1> Contains the most probable languages and their probability scores. The
number of reported languages is defined by `num_top_classes`.
<2> The predicted value is the ISO identifier of the language with the highest
probability.
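
The following Python sketch shows one way to read such a response on the client
side. It assumes the layout shown above, with results under
`_source.ml.inference`. The helper `summarize_inference` and the
`min_probability` cut-off are hypothetical choices for illustration, not
recommended values.

```python
def summarize_inference(doc, min_probability=0.5):
    """Return (language, probability, is_confident) for one simulated doc.

    Assumes results live under _source.ml.inference, as in the response above.
    `min_probability` is an arbitrary, illustrative cut-off.
    """
    inference = doc["doc"]["_source"]["ml"]["inference"]
    predicted = inference["predicted_value"]
    # Look up the probability of the predicted class among the top classes.
    probability = next(
        (c["class_probability"] for c in inference.get("top_classes", [])
         if c["class_name"] == predicted),
        None,
    )
    confident = probability is not None and probability >= min_probability
    return predicted, probability, confident


# Trimmed copy of the response above (two of the five classes kept).
response = {
    "docs": [{"doc": {"_source": {"ml": {"inference": {
        "top_classes": [
            {"class_name": "hu", "class_probability": 0.9999936063740517},
            {"class_name": "lv", "class_probability": 2.5020248433413966e-06},
        ],
        "predicted_value": "hu",
        "model_id": "lang_ident_model_1",
    }}}}}]
}

for doc in response["docs"]:
    print(summarize_inference(doc))
```

A cut-off like this can be useful for short or mixed-language text, where the
top probability may be much lower than in this example.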
