
[ML] Make ml_standard tokenizer the default for new categorization jobs #73605

Conversation

droberts195
Contributor

Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.
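
For reference, the new default is roughly equivalent to specifying a categorization_analyzer like the following sketch yourself (only the char filter and tokenizer are shown; any token filters that the default analyzer also applies are omitted):

```
"categorization_analyzer": {
  "char_filter": [ "first_non_blank_line" ],
  "tokenizer": "ml_standard"
}
```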

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, and so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
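
If you want to compare the two tokenizers on your own messages, the _analyze API is a quick way to do it, assuming the ML tokenizers and the first_non_blank_line char filter are exposed to _analyze on your cluster (the sample message below is made up):

```
POST _analyze
{
  "tokenizer": "ml_classic",
  "text": [ "Failed to open /var/log/my_app.log: No such file or directory" ]
}

POST _analyze
{
  "tokenizer": "ml_standard",
  "char_filter": [ "first_non_blank_line" ],
  "text": [ "Failed to open /var/log/my_app.log: No such file or directory" ]
}
```

The first request should break the path apart at the slashes and colon, while the second should return /var/log/my_app.log as a single token.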

It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config, and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.
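
For example, a job that keeps the ml_classic tokenizer while still using the new char filter might look like the following sketch (the job ID, bucket span and field names are placeholders):

```
PUT _ml/anomaly_detectors/it_ops_logs
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "detectors": [
      { "function": "count", "by_field_name": "mlcategory" }
    ],
    "categorization_analyzer": {
      "char_filter": [ "first_non_blank_line" ],
      "tokenizer": "ml_classic"
    }
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```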

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.
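
For instance, a categorization_analyzer like this sketch uses the new tokenizer but omits the char filter, so categorization considers every line of the message rather than just the first non-blank one:

```
"categorization_analyzer": {
  "tokenizer": "ml_standard"
}
```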

If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.
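
In other words, an analysis_config fragment like the following sketch (the regex is made up, and other required job settings are omitted) is treated roughly as if you had specified a categorization_analyzer whose char_filter list contains first_non_blank_line followed by a pattern_replace char filter for each pattern:

```
"analysis_config": {
  "categorization_field_name": "message",
  "categorization_filters": [ "\\[[0-9]+\\]" ]
}
```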

Backport of #72805

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jun 1, 2021
Once elastic#73605
is merged this test should pass on master.
droberts195 merged commit 8cf1fdc into elastic:7.x Jun 2, 2021
droberts195 deleted the ml_standard_tokenizer_for_new_cat_jobs_7x branch June 2, 2021 06:04
droberts195 added a commit that referenced this pull request Jun 2, 2021
Once #73605
is merged this test should pass on master.