
Fleet Server goes to a permanent Offline state when the Fleet Server agent is rebooted after a Kibana restart. #357

Closed
amolnater-qasource opened this issue May 17, 2021 · 13 comments
Labels: bug, Feature:fleet-server, Team:Elastic-Agent

@amolnater-qasource (Collaborator)

Kibana version: 7.13.0 Snapshot Kibana cloud environment

Host OS and Browser version: Ubuntu 20, All

Build Details:

  Artifact link used: https://snapshots.elastic.co/7.13.0-7457d36a/downloads/beats/elastic-agent/elastic-agent-7.13.0-SNAPSHOT-linux-x86_64.tar.gz
  Build: 40856
  Commit: 7f416a18f500794f9705085cf02f9299dcccc38d

Preconditions:

  1. 7.13.0 Snapshot Kibana cloud environment should be available.
  2. Linux .tar Fleet Server must be installed on the 7.13.0 snapshot.
  3. A few agents must be enrolled through the Linux Fleet Server (example install commands are sketched after this list).
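
For reference, a minimal sketch of how the Fleet Server and secondary agents would be installed for these preconditions, assuming the standard 7.13 elastic-agent CLI flags; the angle-bracket values are placeholders, not details from this report:

  # Unpack the artifact and install the Fleet Server agent on the Linux host:
  tar xzvf elastic-agent-7.13.0-SNAPSHOT-linux-x86_64.tar.gz
  cd elastic-agent-7.13.0-SNAPSHOT-linux-x86_64
  sudo ./elastic-agent install -f \
    --fleet-server-es=https://<elasticsearch-host>:9200 \
    --fleet-server-service-token=<service-token>

  # Enroll each secondary agent against that Fleet Server:
  sudo ./elastic-agent install -f \
    --url=https://<fleet-server-host>:8220 \
    --enrollment-token=<enrollment-token>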

Steps to reproduce:

  1. Log in to the Kibana environment.
  2. Restart Kibana from the Kibana deployment page.
  3. After the Kibana restart, reboot the Linux Fleet Server host with the sudo reboot command.
  4. Observe that the Linux Fleet Server agent does not come back to the "Healthy" state (a status-check sketch follows this list).
  5. Observe that agents enrolled through this Fleet Server also go to the "Offline" state.
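
For steps 4 and 5, a quick way to confirm the state from the Fleet Server host itself; this is a sketch that assumes the agent was installed as the default systemd service via elastic-agent install:

  # On the Fleet Server host, after the reboot:
  sudo elastic-agent status                               # shows the agent and fleet-server state
  sudo journalctl -u elastic-agent --since "15 min ago"   # service logs since the reboot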

Expected Result:
Fleet Server should come back "Healthy" when the Fleet Server agent is rebooted after a Kibana restart.

Fleet-Server Logs:
Logs.zip

Note:

  • If the same Fleet Server agent is re-installed on the same endpoint, all secondary agents get back to the Healthy state.

Screenshot: (attached to the original issue)

@amolnater-qasource (Collaborator, Author)

@manishgupta-qasource Please review.

@manishgupta-qasource (Collaborator)

Reviewed & Assigned to @EricDavisX

@EricDavisX (Contributor)

@ph @ruflin @michalpristas @blakerouse any thoughts? Is this a blocker? It feels like it would prevent successful usage of the Beta. I don't know if it is actually the same as the other reboot-host tests we've seen; I'm thinking of this: https://github.com/elastic/obs-dc-team/issues/528

@ruflin (Collaborator) commented May 18, 2021

@ph @mostlyjason I wonder if for 7.13, we should just state that a local fleet-server is not supported. It adds a lot of variability and complexity.

There is a chance that this is related to the fix we did. We should test again when the next SNAPSHOT / BC is out. It would also be nice if we could reproduce these things in a local, non-Cloud setup.

@amolnater-qasource I assume you updated the settings page for the local fleet-server? Are these values by chance reset after the reboot?
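
One way to verify whether the Fleet Server host setting survived the restart (a sketch only; it assumes the 7.13 Fleet settings API path and placeholder Kibana credentials):

  curl -s -u <user>:<password> \
    -H 'kbn-xsrf: true' \
    'https://<kibana-host>:5601/api/fleet/settings'
  # the response should still list the local Fleet Server URL under fleet_server_hosts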

@mostlyjason commented May 18, 2021

@ruflin I think local fleet-server needs to be supported because about 60% of beta clusters are self-managed. We also have a GA release coming up, and feedback from real users during the beta will be critical for us to achieve confidence in its reliability. I'd treat it as a bug to fix.

@ruflin (Collaborator) commented May 18, 2021

Not sure if we're talking about the same thing. I'm talking about a local fleet-server with Elastic Cloud. Everything on-prem must work.

@EricDavisX (Contributor)

I expect this is still a valid bug for a full on-prem environment. @amolnater-qasource, BC7 is in progress and we expect it to be available sometime during your work day; can you set up a full on-prem environment, retest, and report back please?

@amolnater-qasource (Collaborator, Author)

Hi @ruflin
Thanks for the feedback.

I assume you updated the settings page for the local fleet-server?

Yes, after that we were able to install secondary agents with that Fleet Server. It was initially working fine:

  • We tried rebooting the Fleet Server agent several times; it came back "Healthy" and the related secondary agents also got back to Healthy. [As expected]

Are these values by chance reset after the reboot?

  • After restarting Kibana, no values were reset; everything looked the same as it did before the Kibana restart.
  • However, when we rebooted the Fleet Server agent after the Kibana restart, it went "Offline".
  • Agents enrolled through the above Fleet Server also went "Offline".

Thanks
QAS

@amolnater-qasource (Collaborator, Author)

Hi @EricDavisX
We have revalidated this issue on a 7.13.0 BC-7 self-managed Kibana environment.

Steps followed:

  • After a successful setup we restarted Elasticsearch and Kibana (commands sketched below).
  • Then we rebooted the Fleet Server agent and it came back "Healthy".
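
A sketch of the restart sequence; it assumes package-based (systemd) Elasticsearch and Kibana installs, so the service names are assumptions rather than details from this report:

  # On the self-managed stack host(s):
  sudo systemctl restart elasticsearch
  sudo systemctl restart kibana

  # On the Fleet Server host:
  sudo reboot
  # After it comes back up:
  sudo elastic-agent status    # expected to report Healthy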

Hence, this issue is not observed on self-managed 7.13 Kibana.

Please let us know if we are missing anything.
Thanks

@EricDavisX added the Team:Elastic-Agent label and removed the Team:Fleet and v7.13.0 labels on May 19, 2021
@EricDavisX (Contributor)

Knowing this now, it is less urgent; removing it from the urgent-issues concerns list and removing the 7.13 label.

@andresrc (Contributor)

Can we retest this once 7.14.0 is available?

@EricDavisX (Contributor)

FYI: I had previously downgraded the urgency because the full cloud-stack setup is working well and the full on-prem solution is working well; issues were seen only with the hybrid cloud stack and local Fleet Server (and that scenario may not yet be cited as fully supported; I need to dig up the ticket).

Regardless, and for now:
@amolnater-qasource it would be helpful to do 3 tests as part of 7.14:

  • 1. Full on-prem: reboot the stack and then the Fleet Server as well, and monitor / watch the system restart.
  • 2. Hybrid cloud stack and local Fleet Server with the Fleet & APM container turned off. We didn't specify this nuance before, but it may be important. In this test, reboot the stack in Cloud and then reboot the local Fleet Server as well, and monitor / watch the system. Only 1 agent needs to be connected in this test.
  • 3. Hybrid cloud stack and local Fleet Server, with the Fleet Server / APM container alive and functioning. In this test, reboot the stack in Cloud and then reboot the local Fleet Server as well, and monitor / watch the system. If you have 2 'standard' agents connected, at least 1 will have to be connected to the local Fleet Server. I can't recall any way to manually disable the Cloud APM / Fleet Server except to turn it off in Cloud, and it is a good test anyway.

... in this last test, we *could* update the Fleet Settings to render the Cloud Fleet Server unused, but it is a better test to leave it in place, so let's do that.
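
For completeness, if we did want to point Fleet at only the local Fleet Server in that last test, something like the call below should do it; this is a sketch that assumes the 7.14 Fleet settings API and placeholder host and credential values:

  curl -s -u <user>:<password> -X PUT \
    -H 'kbn-xsrf: true' -H 'Content-Type: application/json' \
    'https://<kibana-host>:5601/api/fleet/settings' \
    -d '{"fleet_server_hosts": ["https://<local-fleet-server>:8220"]}'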

Thank you.

@amolnater-qasource (Collaborator, Author)

Hi @EricDavisX
We have attempted these scenarios on 7.14.0 BC-3.

1. Full on-prem: reboot the stack and then the Fleet Server as well, and monitor / watch the system restart.

We observed that issue #376 is reproducible after following this scenario.

  • A few data streams stopped after the Fleet Server agent reboot.
  • However, the Fleet Server agent remained Healthy.

2. Hybrid cloud stack and local Fleet Server with the Fleet & APM container turned off, rebooting the stack in Cloud and then the local Fleet Server, with 1 agent connected.

We did not observe any errors while following this scenario:

  • The Fleet Server agent and the Elastic Agent enrolled through it remained Healthy.
  • Data under the Data Streams tab is generating normally.
  • No error logs were observed under the Logs tab.

3. Hybrid cloud stack and local Fleet Server, with the Fleet Server / APM container alive and functioning, rebooting the stack in Cloud and then the local Fleet Server, with at least 1 agent connected to the local Fleet Server.

On attempting this scenario, we did not observe agents going to the Offline state.
However, due to reported issue #563, all pre-installed agents went "Unhealthy".

We are not able to reproduce this issue on 7.14.0 BC-3. Hence, we are closing this issue.

Please let us know if anything else is required from our end.
Thanks
QAS
