
Fleet Server goes to a permanent Offline state when the Fleet Server agent is rebooted after a Kibana restart. #357

Closed
amolnater-qasource opened this issue May 17, 2021 · 13 comments
Labels: bug, Feature:fleet-server, Team:Elastic-Agent

@amolnater-qasource (Collaborator)

Kibana version: 7.13.0 Snapshot Kibana cloud environment

Host OS and Browser version: Ubuntu 20, All

Build Details:

  Artifact link used: https://snapshots.elastic.co/7.13.0-7457d36a/downloads/beats/elastic-agent/elastic-agent-7.13.0-SNAPSHOT-linux-x86_64.tar.gz
  Build: 40856
  Commit: 7f416a18f500794f9705085cf02f9299dcccc38d

Preconditions:

  1. 7.13.0 Snapshot Kibana cloud environment should be available.
  2. Linux .tar Fleet Server must be installed on the 7.13.0 snapshot.
  3. A few agents must be enrolled through the Linux Fleet Server (example install commands are sketched after this list).
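
For reference, a minimal sketch of how the Fleet Server and secondary agents would be installed for these preconditions, assuming the standard 7.13 elastic-agent CLI flags; the angle-bracket values are placeholders, not details from this report:

  # Unpack the artifact and install the Fleet Server agent on the Linux host:
  tar xzvf elastic-agent-7.13.0-SNAPSHOT-linux-x86_64.tar.gz
  cd elastic-agent-7.13.0-SNAPSHOT-linux-x86_64
  sudo ./elastic-agent install -f \
    --fleet-server-es=https://<elasticsearch-host>:9200 \
    --fleet-server-service-token=<service-token>

  # Enroll each secondary agent against that Fleet Server:
  sudo ./elastic-agent install -f \
    --url=https://<fleet-server-host>:8220 \
    --enrollment-token=<enrollment-token>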

Steps to reproduce:

  1. Log in to the Kibana environment.
  2. Restart Kibana from the Kibana deployment page.
  3. After the Kibana restart, reboot the Linux Fleet Server host with the sudo reboot command.
  4. Observe that the Linux Fleet Server agent does not come back to the "Healthy" state (a status-check sketch follows this list).
  5. Observe that agents enrolled through this Fleet Server also go to the "Offline" state.
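
For steps 4 and 5, a quick way to confirm the state from the Fleet Server host itself; this is a sketch that assumes the agent was installed as the default systemd service via elastic-agent install:

  # On the Fleet Server host, after the reboot:
  sudo elastic-agent status                               # shows the agent and fleet-server state
  sudo journalctl -u elastic-agent --since "15 min ago"   # service logs since the reboot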

Expected Result:
Fleet Server should come back "Healthy" when the Fleet Server agent is rebooted after a Kibana restart.

Fleet-Server Logs:
Logs.zip

Note:

  • If the same Fleet Server agent is re-installed on the same endpoint, all secondary agents get back to the Healthy state.

Screenshot: (attached to the original issue)

@amolnater-qasource (Collaborator, Author)

@manishgupta-qasource Please review.

@manishgupta-qasource (Collaborator)

Reviewed & Assigned to @EricDavisX

@EricDavisX (Contributor)

@ph @ruflin @michalpristas @blakerouse any thoughts? Is this a blocker? It feels like it would prevent successful usage of the Beta. I don't know if it is actually the same as the other reboot-host tests we've seen; I'm thinking of this: https://github.com/elastic/obs-dc-team/issues/528

@ruflin (Collaborator) commented May 18, 2021

@ph @mostlyjason I wonder if for 7.13, we should just state that a local fleet-server is not supported. It adds a lot of variability and complexity.

There is a chance that this is related to the fix we did. We should test again when the next SNAPSHOT / BC is out. It would also be nice if we could reproduce these things in a local, non-Cloud setup.

@amolnater-qasource I assume you updated the settings page for the local fleet-server? Are these values by chance reset after the reboot?
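
One way to verify whether the Fleet Server host setting survived the restart (a sketch only; it assumes the 7.13 Fleet settings API path and placeholder Kibana credentials):

  curl -s -u <user>:<password> \
    -H 'kbn-xsrf: true' \
    'https://<kibana-host>:5601/api/fleet/settings'
  # the response should still list the local Fleet Server URL under fleet_server_hosts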

@mostlyjason commented May 18, 2021

@ruflin I think local fleet-server needs to be supported because about 60% of beta clusters are self-managed. We also have a GA release coming up, and feedback from real users during the beta will be critical for us to achieve confidence in its reliability. I'd treat it as a bug to fix.

@ruflin (Collaborator) commented May 18, 2021

Not sure if we're talking about the same thing. I'm talking about a local fleet-server with Elastic Cloud. Everything on-prem must work.

@EricDavisX (Contributor)

I expect this is still a valid bug for a full on-prem environment. @amolnater-qasource, BC7 is in progress and we expect it to be available sometime during your work day; can you set up a full on-prem environment, retest, and report back please?

@amolnater-qasource (Collaborator, Author)

Hi @ruflin
Thanks for the feedback.

I assume you updated the settings page for the local fleet-server?

Yes, after that we were able to install secondary agents with that Fleet Server. It was initially working fine:

  • We tried rebooting the Fleet Server agent several times; it came back "Healthy" and the related secondary agents also got back to Healthy. [As expected]

Are these values by chance reset after the reboot?

  • After restarting Kibana, no values were reset; everything looked the same as it did before the Kibana restart.
  • However, when we rebooted the Fleet Server agent after the Kibana restart, it went "Offline".
  • Agents enrolled through the above Fleet Server also went "Offline".

Thanks
QAS

@amolnater-qasource (Collaborator, Author)

Hi @EricDavisX
We have revalidated this issue on a 7.13.0 BC-7 self-managed Kibana environment.

Steps followed:

  • After a successful setup we restarted Elasticsearch and Kibana (commands sketched below).
  • Then we rebooted the Fleet Server agent and it came back "Healthy".
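
A sketch of the restart sequence; it assumes package-based (systemd) Elasticsearch and Kibana installs, so the service names are assumptions rather than details from this report:

  # On the self-managed stack host(s):
  sudo systemctl restart elasticsearch
  sudo systemctl restart kibana

  # On the Fleet Server host:
  sudo reboot
  # After it comes back up:
  sudo elastic-agent status    # expected to report Healthy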

Hence, this issue is not observed on self-managed 7.13 Kibana.

Please let us know if we are missing anything.
Thanks

@EricDavisX added the Team:Elastic-Agent label and removed the Team:Fleet and v7.13.0 labels on May 19, 2021
@EricDavisX (Contributor)

Knowing this now, it is less urgent; removing it from the urgent-issues concerns list and removing the 7.13 label.

@andresrc (Contributor)

Can we retest this once 7.14.0 is available?

@EricDavisX (Contributor)

FYI: I had previously downgraded the urgency because the full cloud-stack setup is working well and the full on-prem solution is working well; issues were seen only with the hybrid cloud stack and local Fleet Server (and that scenario may not yet be cited as fully supported; I need to dig up the ticket).

Regardless, and for now:
@amolnater-qasource it would be helpful to do 3 tests as part of 7.14:

  • 1. Full on-prem: reboot the stack and then the Fleet Server as well, and monitor / watch the system restart.
  • 2. Hybrid cloud stack and local Fleet Server with the Fleet & APM container turned off. We didn't specify this nuance before, but it may be important. In this test, reboot the stack in Cloud and then reboot the local Fleet Server as well, and monitor / watch the system. Only 1 agent needs to be connected in this test.
  • 3. Hybrid cloud stack and local Fleet Server, with the Fleet Server / APM container alive and functioning. In this test, reboot the stack in Cloud and then reboot the local Fleet Server as well, and monitor / watch the system. If you have 2 'standard' agents connected, at least 1 will have to be connected to the local Fleet Server. I can't recall any way to manually disable the Cloud APM / Fleet Server except to turn it off in Cloud, and it is a good test anyway.

... in this last test, we *could* update the Fleet Settings to render the Cloud Fleet Server unused, but it is a better test to leave it in place, so let's do that.
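
For completeness, if we did want to point Fleet at only the local Fleet Server in that last test, something like the call below should do it; this is a sketch that assumes the 7.14 Fleet settings API and placeholder host and credential values:

  curl -s -u <user>:<password> -X PUT \
    -H 'kbn-xsrf: true' -H 'Content-Type: application/json' \
    'https://<kibana-host>:5601/api/fleet/settings' \
    -d '{"fleet_server_hosts": ["https://<local-fleet-server>:8220"]}'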

Thank you.

@amolnater-qasource (Collaborator, Author)

Hi @EricDavisX
We have attempted these scenarios on 7.14.0 BC-3.

1. Full on-prem: reboot the stack and then the Fleet Server as well, and monitor / watch the system restart.

We observed that issue #376 is reproducible after following this scenario.

  • A few data streams stopped after the Fleet Server agent reboot.
  • However, the Fleet Server agent remained Healthy.

2. Hybrid cloud stack and local Fleet Server with the Fleet & APM container turned off, rebooting the stack in Cloud and then the local Fleet Server, with 1 agent connected.

We did not observe any errors while following this scenario:

  • The Fleet Server agent and the Elastic Agent enrolled through it remained Healthy.
  • Data under the Data Streams tab is generating normally.
  • No error logs were observed under the Logs tab.

3. Hybrid cloud stack and local Fleet Server, with the Fleet Server / APM container alive and functioning, rebooting the stack in Cloud and then the local Fleet Server, with at least 1 agent connected to the local Fleet Server.

On attempting this scenario, we did not observe agents going to the Offline state.
However, due to reported issue #563, all pre-installed agents went "Unhealthy".

We are not able to reproduce this issue on 7.14.0 BC-3. Hence, we are closing this issue.

Please let us know if anything else is required from our end.
Thanks
QAS
