[Deployment]: Hosted fleet server gets unhealthy on 8.4 Snapshot. #1574

Closed
amolnater-qasource opened this issue Jun 21, 2022 · 19 comments · Fixed by elastic/apm-server#8478
Labels
bug (Something isn't working), impact:high (Short-term priority; add to current release, or definitely next.), QA:Validated (Validated by the QA Team)

Comments

@amolnater-qasource
Collaborator

Deployment Links:

Description:
The hosted Fleet Server becomes unhealthy on the 8.4 Snapshot, and we have observed that APM is disabled on the Deployment page.

Screenshots:
[4 screenshots]

@amolnater-qasource added the bug and impact:high labels Jun 21, 2022
@amolnater-qasource
Collaborator Author

@manishgupta-qasource Please review.

@amolnater-qasource
Collaborator Author

FYI @jlind23 @joshdover

@manishgupta-qasource
Collaborator

Secondary review for this ticket is Done

@jlind23
Contributor

jlind23 commented Jun 21, 2022

@amolnater-qasource Don't we need the APM integration to enable this? Do you have access to the fleet-server logs?

@amolnater-qasource
Collaborator Author

Hi @jlind23
Thanks for looking into this.
APM integration is already added in the Hosted fleet server.

Screenshot:
image

As we are testing on a cloud build, we are not sure how to get the hosted fleet-server logs.
Further, no logs are available under the Logs tab for the hosted agent, as log collection is disabled for the managed policy.

image

Could you please share any steps for this?

Thanks

@jlind23
Contributor

jlind23 commented Jun 22, 2022

As @cmacknz stated yesterday, no 8.4 snapshots had been built for more than 15 days. Could you please try again with a fresh snapshot?

@amolnater-qasource
Collaborator Author

Hi @jlind23
We have set up the latest 8.4 Snapshot Kibana cloud environment again and found this issue is still reproducible.

  • Hosted fleet server gets unhealthy on 8.4 Snapshot.

Deployment Links:

Build details:
VERSION: 8.4.0
BUILD: 53825
COMMIT: e0446dac822f55f75c1d97b6d9c3f4647c445973
(June 21, 2022, 04:04 PM GMT+5:30)

We have re-validated this issue on the build with the above commit.
Please let us know if anything else is required from our end.
Thanks

@jlind23
Contributor

jlind23 commented Jun 22, 2022

@amolnater-qasource I tried on my end; no logs were available, but after restarting the integration server it worked. Could you confirm?
@cmacknz @ph @pierrehilbert do you know how I can access more logs? Nothing is available in the Logs UI, and Agent monitoring is disabled by default.

@juliaElastic
Contributor

juliaElastic commented Jun 22, 2022

Have you tried checking the logs in https://admin.found.no ? It should work for staging instances; more info here: https://docs.google.com/presentation/d/1lIEQsQGgUR0H3MRhMqyZFP3wdofmZmiqmhYX--xT6FE/edit#slide=id.g12bd3a98d22_1_0

EDIT: this is the admin for staging: https://admin.staging.foundit.no/

@jlind23
Contributor

jlind23 commented Jun 22, 2022

Thanks @juliaElastic.
@amolnater-qasource I created a new deployment and could not reproduce it.
When you try again, could you please share your deployment ID so that I can take a look at the logs?

@jlind23
Contributor

jlind23 commented Jun 22, 2022

@elastic/apm-server on this particular deployment I see a lot of APM errors like:
precondition failed: dial tcp [::1]:9200: connect: cannot assign requested address

Does it ring a bell on your end?

@axw
Member

axw commented Jun 23, 2022

@jlind23 that error message means APM Server is trying to connect to Elasticsearch on localhost, which is obviously not going to work. This implies that Elastic Agent is not sending the Elasticsearch output config to APM Server, or otherwise there's a bug in APM Server related to handling the config.

We also have an issue open to investigate 8.4 failing here: elastic/apm-server#8426
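
To illustrate the diagnosis above, here is a minimal Python sketch that pulls the dial target out of the error message quoted in this thread and checks whether it is a loopback address. The helper names `dial_target` and `is_loopback` are hypothetical, and the regular expression only assumes the `dial tcp <host>:<port>:` shape seen in that one message; it is not part of APM Server or Elastic Agent.

```python
import ipaddress
import re

# Matches the "dial tcp <host>:<port>:" fragment of the error quoted above.
# IPv6 literals appear in brackets, e.g. [::1].
_DIAL_RE = re.compile(r"dial tcp (\[[^\]]+\]|[^\s:]+):(\d+):")


def dial_target(message):
    """Return (host, port) from the error message, or None if not present."""
    match = _DIAL_RE.search(message)
    if match is None:
        return None
    host = match.group(1).strip("[]")
    return host, int(match.group(2))


def is_loopback(host):
    """True if host is 'localhost' or a loopback IP literal."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # some other hostname, not an IP literal


msg = ("precondition failed: dial tcp [::1]:9200: connect: "
       "cannot assign requested address")
target = dial_target(msg)
print(target, is_loopback(target[0]))
```

A loopback target on port 9200 is consistent with a default Elasticsearch address being used, which fits the reading that the configured output never reached APM Server.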

@amolnater-qasource
Collaborator Author

Hi @jlind23
Thanks for the update.

We restarted the integration server and the hosted Fleet Server showed as Healthy once under the Agents tab.
However, the hosted Fleet Server becomes Unhealthy again after some time.
[2 screenshots]

Further, APM is still disabled on the deployment page.
[screenshot]

Deployment id for this build is: 159a7d04412248a9a5ad9d9bd9a0e365

Thanks

@jlind23
Contributor

jlind23 commented Jun 23, 2022

Hi @amolnater-qasource , thanks for the deployment id.
Indeed I do observe the same connection problem coming from apm-server.
@amolnater-qasource can you also check the policy content to check what is the ES output value?

@pierrehilbert @ph can we quickly have someone from the control plane team looking at it?

@amolnater-qasource
Collaborator Author

Hi @jlind23

you also check the policy content to check what is the ES output value?

Host value for ES is: http://fa60f7f1004648e78c8d6b853e89569a.containerhost:9244

For more detail, please find the Elastic Cloud agent policy attached below:
elastic-agent.zip

Please let us know if anything else is required from our end.
Thanks
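
The policy check requested above could be sketched roughly as follows in Python. This assumes the policy has already been loaded into a dictionary with an `outputs` map whose entries carry a `hosts` list; that layout and the `default` output name are assumptions for illustration, and only the host value is taken from this comment. The function names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def host_of(url):
    """Extract the hostname from a URL or a bare host:port string."""
    if "://" not in url:
        url = "//" + url  # let urlsplit treat it as a network location
    return urlsplit(url).hostname or ""


def is_loopback_host(host):
    """True for 'localhost' or a loopback IP literal such as ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other hostname


def loopback_outputs(policy):
    """Names of outputs whose hosts are missing or all loopback."""
    flagged = []
    for name, output in policy.get("outputs", {}).items():
        hosts = output.get("hosts") or []
        if not hosts or all(is_loopback_host(host_of(h)) for h in hosts):
            flagged.append(name)
    return flagged


# Hypothetical policy fragment; only the host value comes from this thread.
policy = {
    "outputs": {
        "default": {
            "type": "elasticsearch",
            "hosts": ["http://fa60f7f1004648e78c8d6b853e89569a.containerhost:9244"],
        }
    }
}
print(loopback_outputs(policy))
```

With the host value reported here nothing is flagged, which would suggest the policy itself points at a non-local Elasticsearch host even though APM Server was dialing localhost.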

@michel-laterman
Contributor

michel-laterman commented Jun 24, 2022

The agent logs indicate that the APM server is degraded - this looks like it's caused by the APM server not being able to connect to ES (as noted above)

2022-06-23T18:27:40Z - message: Application: apm-server--8.4.0-SNAPSHOT[e0138b78-b40c-4839-8f77-8afa6d420554]: State changed to DEGRADED: Missed last check-in - type: 'STATE' - sub_type: 'RUNNING'
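
For scanning many such entries, a small Python sketch like the one below could split a state-change line into fields. The pattern is derived only from the single log line quoted above, so treating it as the general log format is an assumption, and `parse_state_change` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one log line quoted above (an assumption, not a
# documented Elastic Agent log format).
_STATE_RE = re.compile(
    r"^(?P<timestamp>\S+) - message: Application: "
    r"(?P<application>[^\[]+)\[(?P<id>[^\]]+)\]: "
    r"State changed to (?P<state>[A-Z]+): (?P<reason>.*?)"
    r" - type: '(?P<type>[^']*)' - sub_type: '(?P<sub_type>[^']*)'$"
)


def parse_state_change(line):
    """Return a dict of fields from a state-change line, or None."""
    match = _STATE_RE.match(line.strip())
    return match.groupdict() if match else None


line = (
    "2022-06-23T18:27:40Z - message: Application: "
    "apm-server--8.4.0-SNAPSHOT[e0138b78-b40c-4839-8f77-8afa6d420554]: "
    "State changed to DEGRADED: Missed last check-in - type: 'STATE' "
    "- sub_type: 'RUNNING'"
)
event = parse_state_change(line)
print(event["application"], event["state"], event["reason"])
```

Filtering parsed events on `state == "DEGRADED"` would then show which application is missing check-ins and when.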

@michel-laterman
Contributor

One thing I've noted from the policy posted in #1574 (comment) is that the fleet-server input has

    server:
      port: 8220
      host: 0.0.0.0

and the APM server input also has

      host: '0.0.0.0:8200'

I don't think this would affect the ES output, but it may be another issue.

@axw
Member

axw commented Jun 25, 2022

I'm pretty sure this is related to some recent APM Server build changes I made, which inadvertently have us building without the Fleet management code. I'll merge a fix ASAP.

@amolnater-qasource
Collaborator Author

Hi @jlind23
We have revalidated this by setting up an 8.4 Snapshot Kibana cloud-staging environment and found it is fixed now.

  • Hosted fleet server remains Healthy on 8.4 Snapshot.
  • APM is enabled under Deployment settings

Screenshots:
[2 screenshots]

Build details:
8.4 Snapshot
BUILD: 53965
COMMIT: 7c8b8f8cf32d752fd405ddf680175299fbd8cd32

Hence marking this as QA:Validated.
Thanks

@amolnater-qasource added the QA:Validated label Jun 28, 2022