[ML] Warn if ML categorization job is using data that does not categorize well #50749
Comments
Pinging @elastic/ml-core (:ml) |
#51146 added a rudimentary check into 7.6. An audit message is created if 1000 or more categories exist for a job before 100 buckets of results have been created. For 7.7 the intention is to add extra fields into the |
This change adds support for the following new model_size_stats fields:
- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status
Relates #50749
In elastic#51146 a rudimentary check for poor categorization was added to 7.6. This change replaces that warning based on a Java-side check with a new one based on the categorization_status field that the ML C++ sets. categorization_status was added in 7.7 and above by elastic#51879, so this new warning based on more advanced conditions will also be in 7.7 and above. Closes elastic#50749
…2195) In #51146 a rudimentary check for poor categorization was added to 7.6. This change replaces that warning based on a Java-side check with a new one based on the categorization_status field that the ML C++ sets. categorization_status was added in 7.7 and above by #51879, so this new warning based on more advanced conditions will also be in 7.7 and above. Closes #50749
What
If an ML categorization job creates a very large number of categories, the data is probably not worth categorizing. To be defensive, we should audit a warning message for jobs where the number of categories is high. This warning would be visible in job messages in the UI but would not be intended to stop the job from continuing.
It is difficult to define what "high" means because this is data dependent. It could be a ratio of categories to records_processed once a useful learning period has elapsed. Or it could be a hard upper limit on the total number of categories (taking into account multiple partitions if they are configured). Or both.
Ideally this check can be performed in the early stages of the job, after it has had a chance to analyze a useful amount of data. This could be at the end of a lookback (before starting real-time) or, say, after 100 buckets or 1 day (whichever is sooner) for real-time only jobs.
Re-assessing this warning during the lifetime of a real-time job would also have some value in cases where the input data changes. However, this could get annoying if done too frequently.
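The check described above could be sketched roughly as follows. This is only an illustration of combining a learning period, a hard limit and a ratio, not the implementation that shipped. The 100-bucket learning period comes from the text above; the class name, the function name, the 1000-category limit used as a default, and the 0.5 ratio are hypothetical placeholders that would need tuning against real data.

```python
from dataclasses import dataclass


@dataclass
class CategorizationSnapshot:
    """Hypothetical summary of a job's categorization progress."""
    category_count: int      # total categories, summed over all partitions
    records_processed: int   # documents analyzed so far
    buckets_processed: int   # result buckets created so far


def should_warn(snapshot: CategorizationSnapshot,
                min_buckets: int = 100,        # learning period from the text
                max_categories: int = 1000,    # assumed hard upper limit
                max_ratio: float = 0.5) -> bool:  # assumed categories/records ratio
    """Return True if the job looks like it is categorizing poorly."""
    # Do not judge the job before it has seen a useful amount of data.
    if snapshot.buckets_processed < min_buckets:
        return False
    # Hard upper limit on the total number of categories.
    if snapshot.category_count >= max_categories:
        return True
    # Ratio check: too many distinct categories relative to records seen.
    if snapshot.records_processed == 0:
        return False
    return snapshot.category_count / snapshot.records_processed > max_ratio
```

Returning a boolean rather than raising keeps the check advisory, matching the intent that the warning is audited but never stops the job.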
Why
Log categorization groups unstructured log messages into categories. For example, "Fred accessed file bananas.txt" and "Wilma accessed file apples.txt" would be considered the same message category. From here, you can use current anomaly detection to model and identify unusual counts of log message categories and/or rare log message categories.
Creating an ML categorization job requires a timestamp field and a message field. Categorization works best on machine-written log messages, typically logging written by a developer for the purpose of system troubleshooting. For example, we would get very poor results trying to categorize each sentence in the complete works of Shakespeare, because the sentences all differ and do not share a similar structure. However, we would generally get good results when categorizing application logs with repeated messages (where only certain fields change in each doc, e.g. hostname, IP address, username).
Consequently, an ML categorization job is worth using provided the data it is analyzing is suitable for categorizing. This is not necessarily immediately obvious to all potential users of the system, so we should attempt to warn users if the job is not categorizing well.
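To make the contrast concrete, here is a deliberately simplified toy categorizer. It is not the algorithm used by the ML C++ code; it only assumes that two messages belong together when they have the same number of whitespace-separated tokens and at least half of those tokens match position by position. The function name and the 0.5 threshold are hypothetical.

```python
def categorize(messages, threshold=0.5):
    """Greedily assign each message to the first sufficiently similar template.

    Returns (templates, assignments). A template is a list of tokens in which
    None marks a position that varies between messages of that category.
    """
    templates = []
    assignments = []
    for message in messages:
        tokens = message.split()
        for index, template in enumerate(templates):
            if len(template) != len(tokens):
                continue
            # Count positions where the template token equals the message token.
            matches = sum(1 for t, tok in zip(template, tokens) if t == tok)
            if matches >= threshold * len(tokens):
                # Keep shared tokens, turn differing positions into wildcards.
                templates[index] = [t if t == tok else None
                                    for t, tok in zip(template, tokens)]
                assignments.append(index)
                break
        else:
            # No existing template is close enough, so start a new category.
            templates.append(tokens)
            assignments.append(len(templates) - 1)
    return templates, assignments
```

With this toy, the two "accessed file" messages collapse into a single template, while free-form sentences each open a new category, which is the pattern a poor-categorization warning is meant to catch.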
When
Log categorization has been part of ML anomaly detection for a long time, but has been a bit of a hidden feature. This is now changing.
In 7.6 (tbc) we are working on a new ML UI Wizard elastic/kibana#53009 which will make it easier to create categorization jobs. The Logs UI Observability team is also working on integrating with ML elastic/kibana#53004.
With more visibility of the categorization feature, we should look at how we can enhance its usability so users get a better experience of the functionality.