[DOCS] Adds language identification documentation to the ML DFA docs (#…
…816) Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com> Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
1 parent 410f2ed, commit b9076ab. Showing 2 changed files with 193 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,189 @@ | ||
[role="xpack"] | ||
[[ml-lang-ident]] | ||
=== {lang-ident-cap} | ||
|
||
experimental[] | ||
|
||
{lang-ident-cap} is an {infer} trained model (`lang_ident_model_1`) that you can | ||
use to determine the language of text. You can reference the {lang-ident} model | ||
in an {ref}/inference-processor.html[{infer} processor] of an ingest pipeline by | ||
using its model ID (`lang_ident_model_1`). The input field name is `text`. If | ||
you want to run {lang-ident} on a field with a different name, you must map your | ||
field name to `text` in the ingest processor settings. | ||
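
If the text is stored in a field that is not called `text`, the mapping
described above might look like the following sketch. The `message` field name
is a hypothetical example, and the sketch assumes that `field_mappings` maps the
document field name (key) to the model input name (value).

[source,js]
----------------------------------
POST _ingest/pipeline/_simulate
{
  "pipeline":{
    "processors":[
      {
        "inference":{
          "model_id":"lang_ident_model_1",
          "inference_config":{
            "classification":{
              "num_top_classes":5
            }
          },
          "field_mappings":{
            "message":"text" <1>
          }
        }
      }
    ]
  },
  "docs":[
    {
      "_source":{
        "message":"Sziasztok! Ez egy rövid magyar szöveg." <2>
      }
    }
  ]
}
----------------------------------
//NOTCONSOLE

<1> Maps the hypothetical `message` document field to the `text` input that the
model expects (assumed key and value order).
<2> The document stores its text in `message` instead of `text`.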
|
||
The longer the text passed into the {lang-ident} model, the more accurately the | ||
model can identify the language. It is fairly accurate on short samples | ||
(for example, 50-character streams) in certain languages, but languages | ||
that are similar to each other are harder to identify based on a short | ||
character stream. | ||
|
||
{lang-ident-cap} takes into account Unicode boundaries when the feature set is | ||
built. If the text has diacritical marks, then the model uses that information | ||
for identifying the language of the text. In certain cases, the model can | ||
detect the source language even if it is not written in the script that the | ||
language traditionally uses. These languages are marked in the supported | ||
languages table (see below) with the `Latn` subtag. {lang-ident-cap} supports | ||
Unicode input. | ||
|
||
|
||
[[ml-lang-ident-supported-languages]] | ||
==== Supported languages | ||
|
||
The table below contains the ISO codes and the English names of the languages | ||
that {lang-ident} supports. If a language has a 2-letter `ISO 639-1` code, the | ||
table contains that identifier. Otherwise, the 3-letter `ISO 639-2` code is | ||
used. The `Latn` subtag indicates that the language is transliterated into Latin | ||
script. | ||
|
||
[cols="<,<,<,<,<,<"] | ||
|=== | ||
| Code | Language | Code | Language | Code | Language | ||
|
||
| af | Afrikaans | hr | Croatian | pa | Punjabi | ||
| am | Amharic | ht | Haitian | pl | Polish | ||
| ar | Arabic | hu | Hungarian | ps | Pashto | ||
| az | Azerbaijani | hy | Armenian | pt | Portuguese | ||
| be | Belarusian | id | Indonesian | ro | Romanian | ||
| bg | Bulgarian | ig | Igbo | ru | Russian | ||
| bg-Latn | Bulgarian | is | Icelandic | ru-Latn | Russian | ||
| bn | Bengali | it | Italian | sd | Sindhi | ||
| bs | Bosnian | iw | Hebrew | si | Sinhala | ||
| ca | Catalan | ja | Japanese | sk | Slovak | ||
| ceb | Cebuano | ja-Latn | Japanese | sl | Slovenian | ||
| co | Corsican | jv | Javanese | sm | Samoan | ||
| cs | Czech | ka | Georgian | sn | Shona | ||
| cy | Welsh | kk | Kazakh | so | Somali | ||
| da | Danish | km | Central Khmer | sq | Albanian | ||
| de | German | kn | Kannada | sr | Serbian | ||
| el | Greek, modern | ko | Korean | st | Southern Sotho | ||
| el-Latn | Greek, modern | ku | Kurdish | su | Sundanese | ||
| en | English | ky | Kirghiz | sv | Swedish | ||
| eo | Esperanto | la | Latin | sw | Swahili | ||
| es | Spanish, Castilian | lb | Luxembourgish | ta | Tamil | ||
| et | Estonian | lo | Lao | te | Telugu | ||
| eu | Basque | lt | Lithuanian | tg | Tajik | ||
| fa | Persian | lv | Latvian | th | Thai | ||
| fi | Finnish | mg | Malagasy | tr | Turkish | ||
| fil | Filipino | mi | Maori | uk | Ukrainian | ||
| fr | French | mk | Macedonian | ur | Urdu | ||
| fy | Western Frisian | ml | Malayalam | uz | Uzbek | ||
| ga | Irish | mn | Mongolian | vi | Vietnamese | ||
| gd | Gaelic | mr | Marathi | xh | Xhosa | ||
| gl | Galician | ms | Malay | yi | Yiddish | ||
| gu | Gujarati | mt | Maltese | yo | Yoruba | ||
| ha | Hausa | my | Burmese | zh | Chinese | ||
| haw | Hawaiian | ne | Nepali | zh-Latn | Chinese | ||
| hi | Hindi | nl | Dutch, Flemish | zu | Zulu | ||
| hi-Latn | Hindi | no | Norwegian | | | ||
| hmn | Hmong | ny | Chichewa | | | ||
|=== | ||
|
||
|
||
[[ml-lang-ident-example]] | ||
==== Example of {lang-ident} | ||
|
||
In the following example, we feed the {lang-ident} trained model a short | ||
Hungarian text that contains diacritics and a couple of English words. The | ||
model identifies the text correctly as Hungarian with high probability. | ||
|
||
[source,js] | ||
---------------------------------- | ||
POST _ingest/pipeline/_simulate | ||
{ | ||
"pipeline":{ | ||
"processors":[ | ||
{ | ||
"inference":{ | ||
"model_id":"lang_ident_model_1", <1> | ||
"inference_config":{ | ||
"classification":{ | ||
"num_top_classes":5 <2> | ||
} | ||
}, | ||
"field_mappings":{ | ||
} | ||
} | ||
} | ||
] | ||
}, | ||
"docs":[ | ||
{ | ||
"_source":{ <3> | ||
"text":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz." | ||
} | ||
} | ||
] | ||
} | ||
---------------------------------- | ||
//NOTCONSOLE | ||
|
||
<1> The ID of the {lang-ident} trained model. | ||
<2> Specifies the number of classes to report. In this example, the five classes | ||
(in this case, languages) with the highest probability are reported. | ||
<3> The source object that contains the text to identify. | ||
|
||
|
||
The request returns the following response: | ||
|
||
[source,js] | ||
---------------------------------- | ||
{ | ||
"docs" : [ | ||
{ | ||
"doc" : { | ||
"_index" : "_index", | ||
"_type" : "_doc", | ||
"_id" : "_id", | ||
"_source" : { | ||
"text" : "Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz.", | ||
"ml" : { | ||
"inference" : { | ||
"top_classes" : [ <1> | ||
{ | ||
"class_name" : "hu", | ||
"class_probability" : 0.9999936063740517, | ||
"class_score" : 0.9999936063740517 | ||
}, | ||
{ | ||
"class_name" : "lv", | ||
"class_probability" : 2.5020248433413966E-6, | ||
"class_score" : 2.5020248433413966E-6 | ||
}, | ||
{ | ||
"class_name" : "is", | ||
"class_probability" : 1.0150420723037688E-6, | ||
"class_score" : 1.0150420723037688E-6 | ||
}, | ||
{ | ||
"class_name" : "ga", | ||
"class_probability" : 6.67935962773335E-7, | ||
"class_score" : 6.67935962773335E-7 | ||
}, | ||
{ | ||
"class_name" : "tr", | ||
"class_probability" : 5.591166324774555E-7, | ||
"class_score" : 5.591166324774555E-7 | ||
} | ||
], | ||
"predicted_value" : "hu", <2> | ||
"model_id" : "lang_ident_model_1" | ||
} | ||
} | ||
}, | ||
"_ingest" : { | ||
"timestamp" : "2020-01-22T14:25:14.644912Z" | ||
} | ||
} | ||
} | ||
] | ||
} | ||
---------------------------------- | ||
//NOTCONSOLE | ||
|
||
<1> Contains the scores for the most probable languages. The number of reported | ||
languages is defined by `num_top_classes`. | ||
<2> The predicted value is the ISO identifier of the language with the highest | ||
probability. |
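
To apply {lang-ident} at index time rather than in a simulation, the same
processor could be stored in an ingest pipeline and referenced when a document
is indexed. The following is a minimal sketch, not a recommended configuration.
The pipeline ID `lang-ident-pipeline` and the index name `my-index` are
hypothetical.

[source,js]
----------------------------------
PUT _ingest/pipeline/lang-ident-pipeline <1>
{
  "processors":[
    {
      "inference":{
        "model_id":"lang_ident_model_1",
        "inference_config":{
          "classification":{
            "num_top_classes":5
          }
        },
        "field_mappings":{
        }
      }
    }
  ]
}

POST my-index/_doc?pipeline=lang-ident-pipeline <2>
{
  "text":"Sziasztok! Ez egy rövid magyar szöveg."
}
----------------------------------
//NOTCONSOLE

<1> Stores the {infer} processor from the example above under a hypothetical
pipeline ID.
<2> Indexes a document through the pipeline. The index name is hypothetical. The
{lang-ident} results are added to the document in the same way as in the
simulated response.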