
[Self managed]: elastic_agent.metricbeat/filebeat datastreams generated on installing fleet-server agent. #376

Closed · amolnater-qasource opened this issue May 19, 2021 · 34 comments · Fixed by elastic/beats#27222
Labels: bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)

@amolnater-qasource (Collaborator):

Kibana version: 7.13.0 BC-7, self-managed Kibana environment

Host OS and Browser version: Windows 10 x64, All

Build Details:

 Artifact link used: https://staging.elastic.co/7.13.0-8eb98cbf/summary-7.13.0.html
 BUILD: 40864
 COMMIT: 6ce6847436ff9bef0ad91268b6585e0f9339c9fd

Preconditions:

  1. A 7.13.0 BC-7 self-managed Kibana environment should be available.
  2. A Fleet Server agent must be installed on Windows 10 x64 using the Default Fleet Server policy, which contains only the Fleet Server integration.

Steps to reproduce:

  1. Log in to the Kibana environment.
  2. Restart the Elastic Agent service from Windows Services (a command-line equivalent is sketched after this list).
  3. Navigate to the Data Streams tab.
  4. Observe that data for a few datasets stops generating after the agent restart.
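
For reference, a minimal command-line sketch of the restart in step 2, run from an elevated prompt; "Elastic Agent" is the default Windows service name and is an assumption here, adjust if yours differs:

```sh
# Restart the agent via the Windows service control tool (run elevated).
sc.exe stop "Elastic Agent"
sc.exe query "Elastic Agent"   # wait until STATE reports STOPPED
sc.exe start "Elastic Agent"
```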

Expected Result:
Data streams should resume for all datasets after the Fleet Server agent restarts.

Logs:
restart issue logs.zip

Note:

  • This issue is observed on self-managed Kibana when the first Fleet Server agent is installed with the Default Fleet Server policy.
  • Before restarting elastic-agent, data streams for all datasets were generating at regular intervals.

Screenshot: [image omitted]

@amolnater-qasource (Collaborator, Author):

@dikshachauhan-qasource Please review.

@dikshachauhan-qasource:

Reviewed and assigned to @EricDavisX

@dikshachauhan-qasource dikshachauhan-qasource removed their assignment May 19, 2021
@dikshachauhan-qasource dikshachauhan-qasource added bug Something isn't working Team:Fleet Label for the Fleet team labels May 19, 2021
@ruflin ruflin added the Team:Elastic-Agent Label for the Agent team label May 20, 2021
@EricDavisX EricDavisX removed their assignment May 20, 2021
@EricDavisX EricDavisX removed the Team:Fleet Label for the Fleet team label May 20, 2021
@EricDavisX (Contributor):

@ph @ruflin I'd like to understand this before putting it in the backlog, and I'd want to know a workaround too. It actually sounds severe, doesn't it? That is, if it is a real issue and not otherwise accounted for in other tickets.

The first step is to confirm whether, given more time (more than the 6 minutes in the screenshot), the other data streams would eventually send data. I don't see what would be different for elastic_agent.elastic_agent metrics vs. logs; there is no immediate pattern to the problem. We should test and confirm that all the same beats (system and monitoring) were restarted on the host.
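
One quick check (a sketch; run on the affected host after the restart) is the agent's own status command, which lists the supervised processes and their states:

```sh
# All supervised applications (fleet-server, filebeat, metricbeat) should
# report HEALTHY/Running after the restart; anything else points at a
# crashed or stuck beat.
elastic-agent status
```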

@EricDavisX (Contributor):

OK, digging into more git/email updates, this seems directly related: metricbeat crashes on a restart.
elastic/beats#25785

We can close this as a duplicate if everyone feels confident.

@amolnater-qasource (Collaborator, Author):

Hi @EricDavisX
We have revalidated this issue on the 7.13.0 BC-9 self-managed Kibana environment.

> The first step to confirm is, with more time (more than 6 minutes, per the screenshot) would the other datastreams not eventually send data?

We observed for more than 30 minutes, and still no new data was generated for a few datasets.
Please refer to the screenshot below: [image omitted]

Logs:
logs.zip

Build details:
Build: 40865
Commit: 9863e88bd63ad546b9d36e6b0c0c55cb65dd9081

Please let us know if anything else is required.
Thanks
QAS

@EricDavisX EricDavisX added impact:high Short-term priority; add to current release, or definitely next. v7.13.1 labels May 24, 2021
@EricDavisX (Contributor):

Adding the 7.13.1 label to help prioritize this until we are confident it is a duplicate and can close it, or we otherwise fix/resolve it.

@amolnater-qasource (Collaborator, Author):

Hi @EricDavisX

Thanks for the feedback on Slack, @michalpristas.
We have revalidated this issue and observed that after changing the logging level to "debug", we get data for all the expected datasets after an agent reboot.

Hence closing this out.

Thanks
QAS

@EricDavisX (Contributor):

I will work with Amol to retest based on my understanding, and we'll report back if anything further can be confirmed as a bug.

@amolnater-qasource (Collaborator, Author):

Hi @EricDavisX
We have revalidated this issue on the 7.13.0 self-managed Kibana environment with the "info" logging level.
We observed for 15-20 minutes before and after the reboot.

Please find below the observation table for the same.

| Dataset | Type | Before reboot | After reboot |
| --- | --- | --- | --- |
| elastic_agent.fleet_server | metrics | Generating normally | Generating normally |
| elastic_agent.elastic_agent | metrics | Generating normally | Generating normally |
| elastic_agent.metricbeat | metrics | Generating normally | Generated only once, at the beginning of the Fleet Server agent reboot |
| elastic_agent.filebeat | metrics | Generating normally | Generated only once, at the beginning of the Fleet Server agent reboot |
| elastic_agent.fleet_server | logs | Generated only once, at the beginning of the Fleet Server agent installation | Generated only once, at the beginning of the Fleet Server agent reboot |
| elastic_agent | logs | Generated only once, at the beginning of the Fleet Server agent installation | Generated only once, at the beginning of the Fleet Server agent reboot |

Screenshots: [DataStreams view omitted]

Please let us know if anything else is required.

Thanks
QAS

@EricDavisX (Contributor):

@michalpristas I am re-opening this for a re-review. It looks to me that after a reboot, the Agent stops collecting elastic_agent.metricbeat and elastic_agent.filebeat data. Can we discuss/review, please?

@EricDavisX (Contributor):

@amolnater-qasource can you re-test this, please? We believe this and
elastic/beats#25829 are duplicates (or share the same root cause) and are both fixed now. Thank you.

@amolnater-qasource (Collaborator, Author):

Hi @EricDavisX
We have revalidated this issue on 7.14.0 self-managed Kibana and found the same observations as shared in comment #376 (comment).

Build details:

Build: 41559
Commit: 9838db392e7fcfc12f004b68fb1b09739f131148
Artifact Link: https://snapshots.elastic.co/7.14.0-28665d9b/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip

Please let us know if we are missing anything.

Thanks
QAS

@EricDavisX (Contributor):

As of today the 7.14 snapshot is 7 days old, and I'm not sure it has the latest fixes we need. I requested the re-testing hoping that the build was new enough; that is my fault. Ideally, we'll re-test when the 7.x build is confirmed to be new. Let's wait.

@EricDavisX (Contributor) commented Jun 24, 2021:

All of the builds are green, so we have new artifacts; please retest this and report back. I am wondering if we still have the issue seen here and in elastic/beats#26034. @amolnater-qasource, thank you.

@amolnater-qasource (Collaborator, Author):

Hi @EricDavisX
We have revalidated this issue on the 7.14.0 self-managed Kibana environment with the "info" logging level.
We observed for 15-20 minutes before and after the reboot.

Observations are similar to the ones shared earlier at #376 (comment)
Observation table:

| Dataset | Type | Before reboot | After reboot |
| --- | --- | --- | --- |
| elastic_agent.elastic_agent | metrics | Generating regularly | Generating regularly |
| elastic_agent.fleet_server | metrics | Generating regularly | Generating regularly |
| elastic_agent.filebeat | metrics | Generating regularly | Generated only once after reboot |
| elastic_agent.metricbeat | metrics | Generating regularly | Generated only once after reboot |
| elastic_agent.fleet_server | logs | Generated only once | Generated only once after reboot |
| elastic_agent | logs | Missing | Missing |

Build details:

Build: 42089
Commit: 67a71c75d2da40e49fba2620f488c9b4ce2467d2
Artifact Link: 
https://snapshots.elastic.co/7.14.0-15b00b37/downloads/elasticsearch/elasticsearch-7.14.0-SNAPSHOT-windows-x86_64.zip
https://snapshots.elastic.co/7.14.0-15b00b37/downloads/kibana/kibana-7.14.0-SNAPSHOT-windows-x86_64.zip
https://snapshots.elastic.co/7.14.0-15b00b37/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip

Note:
We have reported an issue for the missing "elastic_agent" dataset at elastic/beats#26518

Please let us know if anything else is required from our end.
Thanks

@ph (Contributor) commented Jun 30, 2021:

@michalpristas is looking into this one.

@michalpristas (Contributor):

The missing dataset is most likely a duplicate of elastic/beats#26518.
I will check the missing data on the other ones.

@michalpristas (Contributor) commented Jul 6, 2021:

Do we have any logs from the experiment above?
I'm using the same version and cannot reproduce. Sometimes I needed to hit refresh on the UI multiple times for the timestamps to get updated, though; sometimes they got updated to a stale value and then, after a few more refreshes, to the latest ones. Maybe some kind of cache (I'm using a cloud instance).
Do we see events in these datasets using Discover and filters? (A query sketch follows.)
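
One way to rule out UI caching is to query the backing data stream directly; a minimal sketch, assuming the default data stream naming (`<type>-<dataset>-<namespace>`), Elasticsearch on localhost:9200, and placeholder credentials:

```sh
# Fetch the newest document from an affected data stream; if the UI is merely
# showing a stale cache, this @timestamp will still be advancing.
curl -s -u elastic:changeme \
  -H 'Content-Type: application/json' \
  "http://localhost:9200/metrics-elastic_agent.metricbeat-default/_search" \
  -d '{"size":1,"sort":[{"@timestamp":"desc"}],"_source":["@timestamp"]}'
```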

@EricDavisX (Contributor):

@amolnater-qasource can you re-run the test and capture the relevant logs (Agent, Fleet Server, Metricbeat, Filebeat)? The last time we posted logs was May 24; I am hoping things may be working better with the recent fixes (the one Michal noted was closed out, so that is good).

@amolnater-qasource (Collaborator, Author):

Hi @EricDavisX
We have revalidated this on the 7.14.0 BC-2 self-managed Kibana environment.

Please find the required logs below:
logs.zip

Build details:

Build: 42401
Commit: 9826a943dc2e47f26ec6de94816e7d297b752994
Artifact Link: https://staging.elastic.co/7.14.0-e99135ef/summary-7.14.0.html

Screenshot: [image omitted]

Please let us know if anything else is required from our end.

Thanks
QAS

@EricDavisX (Contributor):

Please do re-test on BC-3 or a newer snapshot (BC-2 is a week old at this point, and we think fixes were just merged).

@amolnater-qasource (Collaborator, Author):

Hi @EricDavisX
We have revalidated this issue on the 7.14.0 BC-3 self-managed Kibana environment with the "info" logging level.

Observations are still the same; please find them in the table below:

| Dataset | Type | Before reboot | After reboot |
| --- | --- | --- | --- |
| elastic_agent.fleet_server | metrics | Generating normally | Generating normally |
| elastic_agent.elastic_agent | metrics | Generating normally | Generating normally |
| elastic_agent.metricbeat | metrics | Generating normally | Generated only once, at the beginning of the Fleet Server agent reboot |
| elastic_agent.filebeat | metrics | Generating normally | Generated only once, at the beginning of the Fleet Server agent reboot |
| elastic_agent.fleet_server | logs | Generated only once, at the beginning of the Fleet Server agent installation | Generated only once, at the beginning of the Fleet Server agent reboot |
| elastic_agent | logs | Generated only once, at the beginning of the Fleet Server agent installation | Generated only once, at the beginning of the Fleet Server agent reboot |

Build details:

BC-3 Artifact Link: https://staging.elastic.co/7.14.0-682a8012/summary-7.14.0.html
Build: 42545
Commit: c314921a9893e0b46d9a3958f5520e3d6b1ce7d5

Screenshots: [images omitted]

Thanks
QAS

@michalpristas (Contributor):

So you don't see metrics coming into the dashboards either.

@michel-laterman, were you able to find time to see if you can repro locally?

@michel-laterman (Contributor):

I'm having trouble recreating this locally; I'm having issues setting up my environment.

@michel-laterman (Contributor):

@amolnater-qasource, I'm unable to reproduce this on QA cloud with a locally running agent/fleet-server. There have been occasional issues where Kibana fails to refresh the data stream displays, but clicking the "reload" button on the upper right of the UI resolves this.

@amolnater-qasource (Collaborator, Author):

Hi @michel-laterman

> QA cloud with a locally running agent/fleet server.

This issue is not reported for cloud builds. It is only reproducible on a self-managed/on-prem setup.

> There have been occasional issues where Kibana fails to refresh the data stream displays, but clicking the "reload" button on the upper right of the UI resolves this.

You can check that we have shared our observations based on a nearly 30-minute test at #376 (comment). We were getting new logs for two of the datasets, but not for the others.

Thanks
QAS

@michel-laterman (Contributor):

I think I have recreated this using `elastic-package stack up` on the latest snapshot (on macOS). I can see the same behaviour if I restart the Docker container. The log messages look the same; below are the errors in metricbeat_monitoring-json.log:

{"log.level":"info","@timestamp":"2021-07-19T23:07:01.765Z","log.origin":{"file.name":"module/wrapper.go","file.line":259},"message":"Error fetching data for metricset beat.state: failure to apply state schema: 4 errors: key `management` not found; key `module` not found; key `output` not found; key `queue` not found","service.name":"metricbeat","event.dataset":"metricbeat_monitor-json.log","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-07-19T23:07:01.771Z","log.origin":{"file.name":"module/wrapper.go","file.line":259},"message":"Error fetching data for metricset beat.stats: failure to apply stats schema: 1 error: key `libbeat` not found","service.name":"metricbeat","event.dataset":"metricbeat_monitor-json.log","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-07-19T23:07:11.765Z","log.origin":{"file.name":"module/wrapper.go","file.line":259},"message":"Error fetching data for metricset beat.state: failure to apply state schema: 4 errors: key `queue` not found; key `management` not found; key `module` not found; key `output` not found","service.name":"metricbeat","event.dataset":"metricbeat_monitor-json.log","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-07-19T23:07:11.771Z","log.origin":{"file.name":"module/wrapper.go","file.line":259},"message":"Error fetching data for metricset beat.stats: failure to apply stats schema: 1 error: key `libbeat` not found","service.name":"metricbeat","event.dataset":"metricbeat_monitor-json.log","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-07-19T23:07:21.766Z","log.origin":{"file.name":"module/wrapper.go","file.line":259},"message":"Error fetching data for metricset beat.state: failure to apply state schema: 4 errors: key `output` not found; key `queue` not found; key `management` not found; key `module` not found","service.name":"metricbeat","event.dataset":"metricbeat_monitor-json.log","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-07-19T23:07:21.771Z","log.origin":{"file.name":"module/wrapper.go","file.line":259},"message":"Error fetching data for metricset beat.stats: failure to apply stats schema: 1 error: key `libbeat` not found","service.name":"metricbeat","event.dataset":"metricbeat_monitor-json.log","ecs.version":"1.6.0"}

`elastic_agent.elastic_agent` and `elastic_agent.fleet_server` metrics appear up to date in the data stream; `elastic_agent.fleet_server` and `elastic_agent` log timestamps correspond to a restart (I believe because they do not emit log entries when everything is running as expected); and the filebeat and metricbeat metricsets are not transmitted. This is probably due to the above errors with metricbeat.
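
For anyone else trying to reproduce, the rough sequence is sketched below (assumes elastic-package and Docker are installed; the container name filter and placeholder ID are assumptions):

```sh
# Bring up a local stack from the snapshot, then bounce the fleet-server container.
elastic-package stack up -d --version 7.14.0-SNAPSHOT
docker ps --filter "name=fleet-server"   # locate the fleet-server container
docker restart <fleet-server-container-id>
```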

@michalpristas (Contributor):

This looks like what metricbeat receives during the call to stats does not conform to the schema defined in metricbeat's beat module (hence the missing fields).
Maybe it's returning an empty body or some error message.
Does it go live after the restart is performed? Which container are you trying to restart?

@michel-laterman (Contributor):

@michalpristas, I was trying to restart the fleet-server container

@michel-laterman (Contributor):

The error logs above are from metricbeat trying to collect state/stats from fleet-server.
The state endpoint returns null, and the stats endpoint is missing the libbeat attribute.
However, this is also the case when fleet-server initially starts (and all streams work as expected). A probe sketch follows.
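
To see this directly, one can query the monitoring socket that metricbeat_monitor scrapes; a minimal sketch, assuming a fleet-server socket path analogous to the filebeat/metricbeat paths that appear in the logs later in this thread:

```sh
# Query fleet-server's monitoring endpoints over its unix socket. On a fresh
# start /state returns null and /stats lacks the `libbeat` key, which is what
# trips metricbeat's schema above.
SOCK=/usr/share/elastic-agent/data/tmp/default/fleet-server/fleet-server.sock  # assumed path
curl -s --unix-socket "$SOCK" http://unix/state
curl -s --unix-socket "$SOCK" http://unix/stats
```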

@amolnater-qasource (Collaborator, Author) commented Jul 21, 2021:

Hi @michel-laterman
We have revalidated this issue on the 7.14.0 BC-3 self-managed Kibana environment with the "debug" logging level.

Please find our observations below:

| Dataset | Type | Behaviour after reboot |
| --- | --- | --- |
| elastic_agent.elastic_agent | metrics | Generating normally |
| elastic_agent.fleet_server | metrics | Generating normally |
| elastic_agent.fleet_server | logs | Generating normally |
| elastic_agent | logs | Generating consistently (after a longer period of time) |
| elastic_agent.metricbeat | metrics | Generated only once, at the beginning of the Fleet Server agent reboot |
| elastic_agent.filebeat | metrics | Generated only once, at the beginning of the Fleet Server agent reboot |

Screenshot: [image omitted]

Logs:
logs.zip

cc: @michalpristas

Please let us know if we are missing anything.
Thanks
QAS

@michel-laterman (Contributor):

Alright, after some discussion I think the description of this bug is wrong.

The elastic_agent.metricbeat and elastic_agent.filebeat streams should not be generated by the fleet-server default policy. However, currently, before the server is restarted, the metricbeat_monitor instance attempts to gather this data.

If you take a look at the entries from these datasets, they show errors:
[Screenshot from Jul 27, 2021 omitted]

These socket errors are expected, as metricbeat/filebeat should not be running integrations (just the _monitor versions, to gather agent monitoring info).

So it appears the bug is that on startup the metricbeat instance attempts to gather this data; a restart corrects this.

On startup (before a restart) the elastic-agent shows that the fleet-server is re-configuring:

bash-4.2$ elastic-agent status
Status: HEALTHY
Message: (no message)
Applications:
  * fleet-server	(CONFIGURING)
    Re-configuring
  * filebeat	(HEALTHY)
    Running
  * metricbeat	(HEALTHY)
    Running

The metricbeat_monitor log shows errors connecting to metricbeat and filebeat (the same as in the metrics stream):

{"log.level":"info","@timestamp":"2021-07-27T15:00:13.219Z","log.origin":{"file.name":"module/wrapper.go","file.line":259},"message":"Error fetching data for metricset http.json: error making http request: Get \"http://unix/stats\": dial unix /usr/share/elastic-agent/data/tmp/default/filebeat/filebeat.sock: connect: no such file or directory","service.name":"metricbeat","event.dataset":"metricbeat_monitor-json.log","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-07-27T15:00:13.219Z","log.origin":{"file.name":"module/wrapper.go","file.line":259},"message":"Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /usr/share/elastic-agent/data/tmp/default/metricbeat/metricbeat.sock: connect: no such file or directory","service.name":"metricbeat","event.dataset":"metricbeat_monitor-json.log","ecs.version":"1.6.0"}

After a restart, fleet-server appears healthy (elastic-agent status shows running).
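
A quick way to validate this explanation on an affected host is to check whether those sockets exist at all; a minimal sketch using the paths from the log lines above:

```sh
# Under the Fleet Server default policy no standalone filebeat/metricbeat
# processes serve these sockets, so both checks should fail, matching the
# dial errors in the metricbeat_monitor log.
ls -l /usr/share/elastic-agent/data/tmp/default/filebeat/filebeat.sock
ls -l /usr/share/elastic-agent/data/tmp/default/metricbeat/metricbeat.sock
```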

@EricDavisX EricDavisX removed impact:high Short-term priority; add to current release, or definitely next. v7.14.0 labels Jul 28, 2021
@EricDavisX (Contributor):

Thanks Michael!

@amolnater-qasource @dikshachauhan-qasource if you agree and understand the reasoning, I would like to know. In that case, you can also update the short description to align with the 'inverse' issue that was identified. We can also update our test suite steps to confirm what is configured and note the current bug until such time as it may be fixed. Thanks. I'll mark it as 'done' from the urgent-review side; it is nice to have this off the list even if we still have a lesser-priority bug we can evaluate fixing.

It remains on the 7.15 candidate list, and we can evaluate it against other issues/features we want to work on.

@amolnater-qasource amolnater-qasource changed the title [Self managed]: Data streams stops for some datasets on restarting Fleet Server agent. [Self managed]: elastic_agent.metricbeat/filebeat datastreams generated on installing fleet-server agent. Jul 30, 2021
@amolnater-qasource (Collaborator, Author):

Thanks @michel-laterman for looking into this issue.
@EricDavisX, yes, we agree with Michel, as on cloud we also observed that only 4 datasets are generated for the Fleet Server agent.
No elastic_agent.metricbeat/filebeat data is generated for Fleet Server on the 7.14.0 BC-5 cloud build.

Screenshot: [image omitted]

Further, we had a query regarding this: is there any action due to which elastic_agent.metricbeat/filebeat data could be generated in the future?

We have updated our expected result for the self-managed Fleet Server test case at C77071.

We will re-test this on self-managed 7.15 once it is fixed.

Please let us know if anything else is required from our end.
Thanks
QAS
