added telemetry with most common error from agent logs #146107

juliaElastic · 2022-11-23T09:31:58Z

Summary

Closes https://github.com/elastic/ingest-dev/issues/1261

The Elasticsearch change that gives kibana_system the missing privilege to read logs-elastic_agent* indices (elastic/elasticsearch#91701) has been merged.

Top 3 most common errors in the Elastic Agent logs

Added the most common error messages from the elastic-agent and fleet-server logs to telemetry.

The messages are grouped by querying the message field with a sampler aggregation and a categorize_text sub-aggregation. This is a workaround, because we can't run an aggregation directly on the message field.

GET logs-elastic_agent*/_search
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "log.level": "error"
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "gte": "now-1h"
                        }
                    }
                }
            ]
        }
    },
    "aggregations": {
        "message_sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "categories": {
                    "categorize_text": {
                        "field": "message",
                        "size": 10
                    }
                }
            }
        }
    }
}
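
For anyone wiring this request into code, the body can be built from a few parameters. This is only a minimal sketch and not the code from this PR: `buildTopErrorsQuery` and its option names are hypothetical, and the defaults simply mirror the console request above.

```typescript
// Hypothetical options; none of these names come from the PR.
interface TopErrorsQueryOptions {
  since?: string; // date-math lower bound for @timestamp
  shardSize?: number; // documents sampled per shard
  maxCategories?: number; // number of categories to return
}

// Builds the same search body as the console request above.
function buildTopErrorsQuery(opts: TopErrorsQueryOptions = {}) {
  const { since = 'now-1h', shardSize = 200, maxCategories = 10 } = opts;
  return {
    size: 0, // only the aggregation result is needed, not the hits
    query: {
      bool: {
        must: [
          { term: { 'log.level': 'error' } },
          { range: { '@timestamp': { gte: since } } },
        ],
      },
    },
    aggregations: {
      message_sample: {
        // Sampling keeps the categorization cost bounded per shard.
        sampler: { shard_size: shardSize },
        aggs: {
          categories: {
            categorize_text: { field: 'message', size: maxCategories },
          },
        },
      },
    },
  };
}
```

Calling `buildTopErrorsQuery()` with no arguments yields the body shown above; passing `{ since: 'now-24h' }` would widen the time window.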

Tested with the latest Elasticsearch snapshot and verified that the top errors are added to telemetry:

   {
      "agent_logs_top_errors": [
         "failed to dispatch actions error failed reloading q q q nil nil config failed reloading artifact config for composed snapshot.downloader failed to generate snapshot config failed to detect remote snapshot repo proceeding with configured not an agent uri",
         "fleet-server stderr level info time message No applicable limit for agents using default \\n level info time message No applicable limit for agents using default \\n",
         "stderr panic close of closed channel n ngoroutine running Stop"
      ],
      "fleet_server_logs_top_errors": [
         "Dispatch abort response",
         "error while closing",
         "failed to take ownership"
      ]
   }
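
The query asks for 10 categories while telemetry reports 3 per list, so the response has to be trimmed somewhere. The following is a sketch of one way to do that, assuming each categorize_text bucket exposes `key` and `doc_count`; `topErrorMessages` and the interface names are hypothetical and not taken from the PR.

```typescript
// Only the parts of the search response that this sketch reads (assumed shape).
interface CategoryBucket {
  key: string;
  doc_count: number;
}

interface TopErrorsResponse {
  aggregations?: {
    message_sample?: {
      categories?: { buckets: CategoryBucket[] };
    };
  };
}

// Hypothetical helper: returns the keys of the `limit` most frequent categories.
function topErrorMessages(resp: TopErrorsResponse, limit = 3): string[] {
  // Fall back to an empty list when the index has no matching documents.
  const buckets = resp.aggregations?.message_sample?.categories?.buckets ?? [];
  return [...buckets]
    .sort((a, b) => b.doc_count - a.doc_count) // defensive: most frequent first
    .slice(0, limit)
    .map((bucket) => bucket.key);
}
```

Running the same helper once per log source would produce the two lists in the telemetry payload above. Returning an empty array when the aggregation is missing keeps the telemetry field well-formed.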

I measured the query locally and it took only a few milliseconds. I'll also try it against larger datasets of Elastic Agent logs.

Checklist

Unit or functional tests were updated or added to match the most common scenarios

juliaElastic · 2022-11-28T10:15:33Z

@elasticmachine merge upstream

elasticmachine · 2022-11-28T10:15:45Z

Pinging @elastic/fleet (Team:Fleet)

juliaElastic · 2022-11-28T15:10:18Z

@elasticmachine merge upstream

kibana-ci · 2022-11-28T16:11:44Z

💚 Build Succeeded

Buildkite Build
Commit: 801302c

Metrics [docs]

Unknown metric groups

ESLint disabled in files

id	before	after	diff
`osquery`	1	2	+1

ESLint disabled line counts

id	before	after	diff
`enterpriseSearch`	19	21	+2
`fleet`	59	65	+6
`osquery`	109	115	+6
`securitySolution`	442	448	+6
total			+20

Total ESLint disabled count

id	before	after	diff
`enterpriseSearch`	20	22	+2
`fleet`	68	74	+6
`osquery`	110	117	+7
`securitySolution`	519	525	+6
total			+21

History

💔 Build #90916 failed 8bc53e4
💛 Build #90303 was flaky 42a0d49
💚 Build #90209 succeeded 5cb4a12

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

kpollich

LGTM! Awesome change.

jakommo · 2023-01-16T16:23:57Z

For reference, this actually made it into 8.6.0.

added telemetry with most common error from agent logs

5cb4a12

juliaElastic added the v8.7.0 label Nov 23, 2022

juliaElastic self-assigned this Nov 23, 2022

juliaElastic added release_note:skip Skip the PR/issue when compiling release notes 8.6.1 labels Nov 23, 2022

juliaElastic mentioned this pull request Nov 23, 2022

[Fleet] Added logs-elastic_agent* read privileges to kibana_system elastic/elasticsearch#91701

Merged

added agent logs telemetry to test

42a0d49

juliaElastic marked this pull request as ready for review November 24, 2022 08:09

juliaElastic requested a review from a team as a code owner November 24, 2022 08:09

Merge branch 'main' into telemetry/agent-logs-errors

8bc53e4

botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label Nov 28, 2022

Merge branch 'main' into telemetry/agent-logs-errors

801302c

kpollich approved these changes Nov 28, 2022

juliaElastic merged commit 585bf36 into elastic:main Nov 29, 2022

kibanamachine added the backport:skip This commit does not require backporting label Nov 29, 2022

juliaElastic mentioned this pull request Nov 29, 2022

[8.6] added telemetry with most common error from agent logs (#146107) #146507

Merged

juliaElastic added a commit that referenced this pull request Nov 29, 2022

[8.6] added telemetry with most common error from agent logs (#146107) (#146507)

bfad267

Summary: Backport #146107 to 8.6

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

juliaElastic mentioned this pull request Jan 16, 2023

[Fleet] Telemetry query uses a ES platinum feature (categorize text agg) #148976

Closed

sophiec20 added the v8.6.1 label Nov 24, 2023
