[Fleet]: Linux agents gets unhealthy with system integration on 7.17.27 #6519

Open
harshitgupta-qasource opened this issue Jan 13, 2025 · 12 comments
Labels
bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@harshitgupta-qasource

harshitgupta-qasource commented Jan 13, 2025

Kibana Build details:

VERSION: 7.17.27
BUILD 47755
COMMIT 828e49db669c29d8cc4f3a30f6abe5e8f69a4290
Artifact: https://staging.elastic.co/7.17.27-b47ca93f/summary-7.17.27.html#elastic-agent-package

Host OS and Browser version: [Ubuntu 22], [Ubuntu 18], [SLES]

Preconditions:

  1. 7.17.27 BC1 Kibana Cloud environment should be available.

Steps to reproduce:

  1. Navigate to the Agents tab.
  2. Wait until the agent becomes unhealthy.
  3. Observe that the Ubuntu agent becomes unhealthy with the system integration.
  4. Add the endpoint security integration and go to the Endpoint tab.
  5. Observe that the Ubuntu agent becomes unhealthy.

Expected:

  • Linux agents should be healthy with system integration on 7.17.27

Screenshot:
Image
Image
Image

Note: Reproducible on Ubuntu agents only.

Agent Logs:

elastic-agent-diagnostics-2025-01-13T05-29-48Z-00.zip

@harshitgupta-qasource harshitgupta-qasource added bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Jan 13, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@harshitgupta-qasource
Author

@amolnater-qasource Kindly review

@amolnater-qasource

Secondary review for this ticket is Done.

@jlind23
Contributor

jlind23 commented Jan 13, 2025

@nfritts @norrietaylor according to the Elastic Agent diagnostics, it seems that endpoint is degraded. Can someone on your end take a look, please?

@jlind23
Contributor

jlind23 commented Jan 13, 2025

Looks like endpoint reports this error:
error: 'Get "http://unix/": dial unix /opt/Elastic/Agent/data/tmp/default/endpoint-security/endpoint-security.sock: connect: no such file or directory'
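For anyone triaging this on an affected host: the `connect: no such file or directory` part of that message is what a Unix-socket connect returns when nothing exists at the path (ENOENT), as opposed to a stale socket file with no listener behind it (connection refused). Below is a minimal Python sketch for telling those cases apart. It assumes only the socket path quoted in the error above; `probe_unix_socket` and its return labels are hypothetical names for illustration, not part of Agent or Endpoint.

```python
import os
import socket
import stat


def probe_unix_socket(path: str, timeout: float = 1.0) -> str:
    """Classify a Unix domain socket path (hypothetical helper).

    Returns one of "missing", "not-a-socket", "refused", or "ok".
    Other failures (for example PermissionError) are left to propagate.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # Same condition as "connect: no such file or directory".
        return "missing"
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
        except ConnectionRefusedError:
            # Socket file exists but no process is listening on it.
            return "refused"
        except FileNotFoundError:
            # File was removed between the stat and the connect.
            return "missing"
    return "ok"
```

Calling `probe_unix_socket("/opt/Elastic/Agent/data/tmp/default/endpoint-security/endpoint-security.sock")` on the affected host would be expected to return "missing" if the error above is accurate. That outcome alone would not say why the socket was never created.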

@cmacknz
Member

cmacknz commented Jan 13, 2025

Hmm, I'm not sure that's the root cause. For some reason we are trying to connect to endpoint to get monitoring data the same way we do for Beats, but AFAIK endpoint has never exposed a monitoring socket like that. I suspect that log is a symptom of something else.

@nfritts

nfritts commented Jan 14, 2025

My initial thought is that we may have ended up out of sync on pipe/named socket bootstrapping?

Endpoint was hoping to merge (but hasn't yet merged to 7.17) what is effectively a backport of the change we made for 8.15 to the bootstrap process, moving it off of a localhost socket.

The endpoint PR isn't merged yet: https://github.com/elastic/endpoint-dev/pull/15344

Has Agent made changes in anticipation of changing the bootstrap? (I did a quick search but didn't see anything that stood out.) If so, then we're out of sync, and either the agent change will have to be reverted or we'll have to get the endpoint change merged before things will work.

@jlind23
Contributor

jlind23 commented Jan 14, 2025

These are the changes merged between 7.17.26 and 7.17.27; I'm not sure which of them could have caused this.
Image

@harshitgupta-qasource this problem was not there in 7.17.26, right?

@jlind23
Contributor

jlind23 commented Jan 14, 2025

@harshitgupta-qasource what was the system integration version you were using?

@pchila
Member

pchila commented Jan 14, 2025

@harshitgupta-qasource I tried reproducing this issue using a 7.17.27 deployment and a 7.17.27 BC1 Elastic Agent on Ubuntu 22.04, but I cannot reproduce the agent being unhealthy.

I created a new empty policy and enrolled an Elastic Agent.

After the agent was healthy, I added System integration v1.11.1, as shipped by the 7.17.27 cloud stack.

Image

I waited a few minutes for the agent to become unhealthy, but it stayed healthy, so I added the Defend integration to the same policy.

Image

Agent is still healthy after ~20 mins from the start of my test.

Image

How long would it take for the agent to become unhealthy in your test?
If I understood correctly, you saw the agent unhealthy with just the System integration, correct?
Is there any difference between my test steps and yours that could lead to a different result?

@marc-gr
Contributor

marc-gr commented Jan 14, 2025

Just adding my 2 cents here: it seems the latest system integration version with support for 7.17 was 1.15.1 (https://github.com/elastic/integrations/pull/3509/files#diff-d4cd9d386b49496970c932d312ae09b5a2acc2c3f85f75a7819064d67634248b), so it could be worth trying an update if necessary.

@nicholasberlin

Please gather an endpoint diagnostic package from the Ubuntu host.

$ sudo /opt/Elastic/Endpoint/elastic-endpoint diagnostics

Then upload it here. Thanks.

I suspect that the kernel on the Ubuntu system has moved beyond what 7.17 supports and that endpoint is failing to install event sources.
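To make the kernel hypothesis above easier to check, the running kernel release can be reduced to a comparable tuple. This is a rough sketch only: the thread does not say which kernels 7.17 supports, so `newest_supported` is a placeholder the caller would have to fill in from Endpoint's own support information, and both function names are hypothetical.

```python
import platform
import re
from typing import Optional, Tuple


def kernel_version(release: Optional[str] = None) -> Tuple[int, int, int]:
    """Parse the leading major.minor[.patch] from a kernel release string.

    Defaults to the running kernel as reported by platform.release().
    """
    release = platform.release() if release is None else release
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_newer_than(release: str, newest_supported: Tuple[int, int, int]) -> bool:
    """True if `release` is newer than the caller-supplied threshold.

    `newest_supported` is an assumption the caller must provide; no
    supported-kernel value is stated anywhere in this thread.
    """
    return kernel_version(release) > newest_supported
```

Including the output of `kernel_version()` from the affected Ubuntu hosts alongside the endpoint diagnostics would let the two be compared directly.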


9 participants