added telemetry with most common error from agent logs #146107

Merged on Nov 29, 2022 (4 commits)

@juliaElastic juliaElastic commented Nov 23, 2022

Summary

Closes https://github.com/elastic/ingest-dev/issues/1261

Already merged: the Elasticsearch change (elastic/elasticsearch#91701) that gives kibana_system the missing privilege to read logs-elastic_agent* indices.

Top 3 most common errors in the Elastic Agent logs

Added the most common elastic-agent and fleet-server error log messages to telemetry.

The query combines a sampler aggregation with a categorize text aggregation on the `message` field. This is a workaround, because we can't aggregate on the `message` field directly.

GET logs-elastic_agent*/_search
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "log.level": "error"
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "gte": "now-1h"
                        }
                    }
                }
            ]
        }
    },
    "aggregations": {
        "message_sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "categories": {
                    "categorize_text": {
                        "field": "message",
                        "size": 10
                    }
                }
            }
        }
    }
}
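
The request above can also be built programmatically, which makes the time window and sample sizes easy to adjust. The TypeScript below is a minimal sketch under stated assumptions, not the code from this PR: the helper name `buildTopErrorsQuery` and its option names are hypothetical, and only the shape of the request body is taken from the query above.

```typescript
// Hypothetical options for this sketch; the names are illustrative, not from the PR.
interface TopErrorsQueryOptions {
  since?: string; // lower bound for @timestamp, e.g. "now-1h"
  shardSize?: number; // max documents the sampler collects per shard
  categories?: number; // max number of categories categorize_text returns
}

// Builds the search request body shown above: keep only error-level logs in
// the time window, sample documents per shard, then categorize the messages.
function buildTopErrorsQuery(options: TopErrorsQueryOptions = {}) {
  const { since = "now-1h", shardSize = 200, categories = 10 } = options;
  return {
    size: 0, // no hits needed, only the aggregation result
    query: {
      bool: {
        must: [
          { term: { "log.level": "error" } },
          { range: { "@timestamp": { gte: since } } },
        ],
      },
    },
    aggregations: {
      message_sample: {
        sampler: { shard_size: shardSize },
        aggs: {
          categories: {
            categorize_text: { field: "message", size: categories },
          },
        },
      },
    },
  };
}

// Example: widen the window to the last 24 hours, keep the other defaults.
console.log(JSON.stringify(buildTopErrorsQuery({ since: "now-24h" }), null, 2));
```

Calling the helper without arguments reproduces the request body above; the sampler keeps the categorization cost bounded regardless of how many error documents match.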

Tested with the latest Elasticsearch snapshot and verified that the top errors are added to telemetry:

   {
      "agent_logs_top_errors": [
         "failed to dispatch actions error failed reloading q q q nil nil config failed reloading artifact config for composed snapshot.downloader failed to generate snapshot config failed to detect remote snapshot repo proceeding with configured not an agent uri",
         "fleet-server stderr level info time message No applicable limit for agents using default \\n level info time message No applicable limit for agents using default \\n",
         "stderr panic close of closed channel n ngoroutine running Stop"
      ],
      "fleet_server_logs_top_errors": [
         "Dispatch abort response",
         "error while closing",
         "failed to take ownership"
      ]
   }
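
A list like the ones above could be derived from the aggregation response roughly as follows. This is a sketch, not the PR's implementation: it assumes each `categorize_text` bucket carries a `key` and a `doc_count`, as described in the Elasticsearch documentation for that aggregation. The function name `topErrorMessages`, the interfaces, and the document counts in the sample response are hypothetical.

```typescript
// Assumed bucket shape for categorize_text results (only the fields used here).
interface CategoryBucket {
  key: string; // the categorized message text
  doc_count: number; // number of sampled documents in this category
}

// Assumed response shape, mirroring the aggregation names used in the query.
interface TopErrorsResponse {
  aggregations?: {
    message_sample?: {
      categories?: { buckets: CategoryBucket[] };
    };
  };
}

// Returns the keys of the `limit` largest categories, or [] if none exist.
function topErrorMessages(response: TopErrorsResponse, limit = 3): string[] {
  const buckets =
    response.aggregations?.message_sample?.categories?.buckets ?? [];
  return buckets
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.doc_count - a.doc_count)
    .slice(0, limit)
    .map((bucket) => bucket.key);
}

// Sample response: keys come from the telemetry output above, counts are made up.
const sampleResponse: TopErrorsResponse = {
  aggregations: {
    message_sample: {
      categories: {
        buckets: [
          { key: "error while closing", doc_count: 7 },
          { key: "Dispatch abort response", doc_count: 12 },
          { key: "failed to take ownership", doc_count: 3 },
          { key: "some rarer error", doc_count: 1 },
        ],
      },
    },
  },
};

console.log(topErrorMessages(sampleResponse));
```

Sorting explicitly by `doc_count` keeps the helper correct even if the buckets arrive in a different order, and the empty-array fallback covers the case where no error logs matched the query.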

I measured the query locally and it took only a few milliseconds. I'll also try to check it against larger datasets of Elastic Agent logs.

Checklist

@juliaElastic juliaElastic marked this pull request as ready for review November 24, 2022 08:09
@juliaElastic juliaElastic requested a review from a team as a code owner November 24, 2022 08:09
@juliaElastic
Contributor Author

@elasticmachine merge upstream

@botelastic botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label Nov 28, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic
Contributor Author

@elasticmachine merge upstream

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| osquery | 1 | 2 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 19 | 21 | +2 |
| fleet | 59 | 65 | +6 |
| osquery | 109 | 115 | +6 |
| securitySolution | 442 | 448 | +6 |
| total | | | +20 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 20 | 22 | +2 |
| fleet | 68 | 74 | +6 |
| osquery | 110 | 117 | +7 |
| securitySolution | 519 | 525 | +6 |
| total | | | +21 |

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

Member

@kpollich kpollich left a comment

LGTM! Awesome change.

@juliaElastic juliaElastic merged commit 585bf36 into elastic:main Nov 29, 2022
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Nov 29, 2022
juliaElastic added a commit to juliaElastic/kibana that referenced this pull request Nov 29, 2022
## Summary

Closes elastic/ingest-dev#1261

Merged: [elasticsearch
change](elastic/elasticsearch#91701) to give
kibana_system the missing privilege to read logs-elastic_agent* indices.

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
juliaElastic added a commit that referenced this pull request Nov 29, 2022
#146507)

## Summary

Backport #146107 to 8.6

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@jakommo
Contributor

jakommo commented Jan 16, 2023

For reference, this actually made it into 8.6.0.

Labels: backport:skip, release_note:skip, Team:Fleet, v8.6.1, v8.7.0