[Win-10-2016-ltsb specific]: Agent's Metricbeat process not running (so no Activity logs under Logs tab for Metricbeat), may relate to Endpoint or Linux / Windows Integrations #24180

Closed
amolnater-qasource opened this issue Feb 23, 2021 · 24 comments · Fixed by #24253
Assignees
Labels
bug impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent Label for the Agent team

Comments

@amolnater-qasource

Kibana version: 7.12.0 Snapshot Kibana Cloud environment

Host OS and Browser version: Windows 10, All

Preconditions:

  1. 7.12.0 Snapshot Kibana cloud environment should be available.
  2. Default policy must have System and Endpoint Security.

Build Details:

Artifact link: https://snapshots.elastic.co/7.12.0-999aaf18/downloads/beats/elastic-agent/elastic-agent-7.12.0-SNAPSHOT-windows-x86_64.zip
Build: 38956
Commit: 00f7be3e485a4ec7788675ccd3d302eb6b92dc50

Steps to reproduce:

  1. Login to Kibana Cloud environment.
  2. Navigate to Fleet>Agents tab.
  3. Enroll agent with default policy having System and Endpoint Security.
  4. Observe "Unhealthy" status of agent.
  5. Navigate to agent Logs tab.
  6. Observe no Activity Logs.
  7. Navigate to Data Streams tab.
  8. Observe no Data under Data Streams tab.

Expected Result:
Activity logs should stream on enrolling agent with policy having System and Endpoint Security.

Screenshots:
sys and es

Note:
It works fine with the macOS and Linux .tar agents.

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 23, 2021
@amolnater-qasource
Author

@manishgupta-qasource Please review.

@andresrc andresrc added the Team:Elastic-Agent Label for the Agent team label Feb 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 23, 2021
@manishgupta-qasource

Reviewed & assigned to @EricDavisX

@manishgupta-qasource manishgupta-qasource added bug impact:high Short-term priority; add to current release, or definitely next. labels Feb 23, 2021
@amolnater-qasource amolnater-qasource changed the title No agent activity logs under Logs tab on enrolling agent with policy having System and Endpoint Security. [Windows specific]: No agent activity logs under Logs tab on enrolling agent with policy having System and Endpoint Security. Feb 23, 2021
@EricDavisX
Contributor

As always, it would be helpful to have confirmation of what the Agent logs look like on the host. @amolnater-qasource, are there any?

Also, if there is a problem when Endpoint is configured, we will always request to know if it works without Endpoint included, so we can triage all the faster. Please let us know, thank you!

@michalpristas
Contributor

it would be worth verifying whether the logs are discoverable using Discover.
this would help us tell whether it is an agent issue or a fleet ui issue.

@amolnater-qasource
Author

amolnater-qasource commented Feb 23, 2021

Hi @EricDavisX

Below are the required log files:
elastic-agent-json.log
endpoint-000000.log

We haven't observed any data under Discover tab.

Please let us know if anything else is required.
Thanks
QAS

@EricDavisX
Contributor

EricDavisX commented Feb 23, 2021

I see this repeated in the logs provided, over 2000 times it seems:

{"log.level":"error","@timestamp":"2021-02-23T09:47:30.518-0500","log.origin":{"file.name":"application/fleet_gateway.go","file.line":185},"message":"failed to dispatch actions, error: operator: failed to execute step sc-run, error: failed to start 'C:\\Program Files\\Elastic\\Agent\\data\\elastic-agent-04d374\\install\\metricbeat-7.12.0-SNAPSHOT-windows-x86_64\\metricbeat': fork/exec C:\\Program Files\\Elastic\\Agent\\data\\elastic-agent-04d374\\install\\metricbeat-7.12.0-SNAPSHOT-windows-x86_64\\metricbeat.exe: %1 is not a valid Win32 application.: failed to start 'C:\\Program Files\\Elastic\\Agent\\data\\elastic-agent-04d374\\install\\metricbeat-7.12.0-SNAPSHOT-windows-x86_64\\metricbeat': fork/exec C:\\Program Files\\Elastic\\Agent\\data\\elastic-agent-04d374\\install\\metricbeat-7.12.0-SNAPSHOT-windows-x86_64\\metricbeat.exe: %1 is not a valid Win32 application.","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-02-23T09:47:30.518-0500","log.origin":{"file.name":"log/reporter.go","file.line":36},"message":"2021-02-23T09:47:30-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: metricbeat--7.12.0-SNAPSHOT[82473350-75a5-11eb-9285-9b78fc5e64de]: State changed to FAILED: failed to start 'C:\\Program Files\\Elastic\\Agent\\data\\elastic-agent-04d374\\install\\metricbeat-7.12.0-SNAPSHOT-windows-x86_64\\metricbeat': fork/exec C:\\Program Files\\Elastic\\Agent\\data\\elastic-agent-04d374\\install\\metricbeat-7.12.0-SNAPSHOT-windows-x86_64\\metricbeat.exe: %! (MISSING)is not a valid Win32 application.","ecs.version":"1.6.0"}
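A repetition count like the one mentioned above can be checked with a short script. This is only a sketch: it assumes the agent log holds one JSON object per line with the `log.level` and `log.origin` fields seen in the two entries above, and the function name `count_error_origins` is made up for illustration.

```python
import json
from collections import Counter


def count_error_origins(lines):
    """Count error-level entries, grouped by the source file and line that logged them."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict) or entry.get("log.level") != "error":
            continue
        origin = entry.get("log.origin", {})
        counts[(origin.get("file.name"), origin.get("file.line"))] += 1
    return counts
```

Opening the attached elastic-agent-json.log and printing `count_error_origins(f).most_common(5)` would show which call sites produce most of the errors.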

I wonder if this is human error in using the x64 artifact on an x86 environment? @amolnater-qasource can you double check for us?

It could be a build side problem, too
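One way to rule out an x86/x64 mix-up is to read the architecture recorded in the executable itself. The sketch below parses the PE/COFF header using the documented layout (`MZ` signature, PE offset at 0x3C, machine field after the `PE\0\0` signature); the helper name `pe_machine` is hypothetical. Note that a file can pass this check and still be unusable, because the headers sit at the very start of the file and survive truncation.

```python
import struct

# Machine-type codes from the PE/COFF file header.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x86_64 (64-bit)", 0xAA64: "ARM64"}


def pe_machine(path):
    """Return the target architecture recorded in a PE file's COFF header."""
    with open(path, "rb") as f:
        dos_header = f.read(64)
        if len(dos_header) < 64 or dos_header[:2] != b"MZ":
            raise ValueError("missing MZ signature: not a PE executable")
        # Offset 0x3C of the DOS header holds the file offset of the PE signature.
        (pe_offset,) = struct.unpack_from("<I", dos_header, 0x3C)
        f.seek(pe_offset)
        signature = f.read(6)
        if len(signature) < 6 or signature[:4] != b"PE\x00\x00":
            raise ValueError("missing PE signature: file may be truncated or corrupt")
        # The 2-byte machine field follows the 4-byte signature directly.
        (machine,) = struct.unpack_from("<H", signature, 4)
    return MACHINE_NAMES.get(machine, hex(machine))
```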

@EricDavisX
Contributor

@ph fyi. Also, the Security Engg prod group has offered to give some second opinions on using the latest snapshot. Thank you @charlie-pichette and @andrew-garfield101 for any info you can post on whether or not you are seeing the same thing on any Windows environments.

@amolnater-qasource
Author

amolnater-qasource commented Feb 24, 2021

Hi @EricDavisX

No Eric, we are using the required x64 artifact on a Windows 10 x64 machine. Please refer to the screenshot below for machine details:
image (2)

Please let us know if anything else is required.
Thanks

@dikshachauhan-qasource

Hi @EricDavisX,

While performing testing on 7.12 snapshot build, we observed that windows agent is still going unhealthy with no activity logs.

  • Agent policy had: windows and Linux integration only.

Build details:
BUILD 38965
COMMIT 00f7be3e485a4ec7788675ccd3d302eb6b92dc50
Artifact link: https://snapshots.elastic.co/7.12.0-6c7b44c1/downloads/beats/elastic-agent/elastic-agent-7.12.0-SNAPSHOT-windows-x86_64.zip

Hence, we are blocked from continuing testing on Windows package 4.1.

Agent logs:
elastic-agent-json.txt

Screenshot:
image

@michalpristas
Contributor

looks like today's build is the same and does not include anything from yesterday, so if yesterday's artifact contained an invalid metricbeat binary, this one would as well.
do you still see the logs eric mentioned before?
i tested the artifact on a local windows machine and metricbeat executes ok.

can you check the hash of metricbeat.exe?
you should see this:

PS C:\> CertUtil -hashfile 'C:\Program Files\Elastic\Agent\data\elastic-agent-04d374\install\metricbeat-7.12.0-SNAPSHOT-windows-x86_64\metricbeat.exe' MD5

MD5 hash of file C:\Program Files\Elastic\Agent\data\elastic-agent-04d374\install\metricbeat-7.12.0-SNAPSHOT-windows-x86_64\metricbeat.exe:
4754db5e469a1281803cb2fe520c0934
CertUtil: -hashfile command completed successfully.
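For hosts where CertUtil is not convenient, the same MD5 digest can be computed with a few lines of Python. This is a minimal sketch; the function name `file_md5` and the chunk size are arbitrary choices, not part of any Elastic tooling.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_md5` on the installed metricbeat.exe should give the same hex string that CertUtil prints for that file.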

@amolnater-qasource
Author

Hi @michalpristas
Today we revalidated this issue on the 7.12.0 Snapshot Kibana cloud environment and, as per your feedback, checked the hash of metricbeat.exe. Please refer to the screenshot below:
1

Elastic-agent logs:
elastic-agent-json.log

Build details:

Build: 38965
Commit: 00f7be3e485a4ec7788675ccd3d302eb6b92dc50

Please let us know if anything else is required.
Thanks

@michalpristas
Contributor

michalpristas commented Feb 24, 2021

seems like you have the same version/hash of agent as i was testing and hash of metricbeat does not match.

can we check that

  • hash of metricbeat in download directory is the same as in the install directory
  • hash of metricbeat in download directory is the same as hash of unpacked metricbeat
    • unpack downloaded agent, go to data/elastic-agent-04d374/download, unpack metricbeat.zip, check hash
  • metricbeat is able to be executed by direct call to the file for all the cases:
    • metricbeat.exe in install directory
    • metricbeat.exe in download directory
    • metricbeat.exe unpacked from downloaded artifact
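The hash comparisons in the checklist above can also be done without unpacking anything by hand, by hashing the member inside the zip and the installed copy side by side. This is a rough sketch under assumptions: the function names are made up, and the member path inside the real metricbeat archive is not known here, so it has to be passed in by the caller.

```python
import hashlib
import zipfile


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary file-like object."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def compare_with_archive(zip_path, member, installed_path):
    """Hash one archive member and the installed copy; return (packed, installed, equal)."""
    with zipfile.ZipFile(zip_path) as zf, zf.open(member) as packed:
        packed_md5 = md5_of_stream(packed)
    with open(installed_path, "rb") as installed:
        installed_md5 = md5_of_stream(installed)
    return packed_md5, installed_md5, packed_md5 == installed_md5
```

A mismatch between the two digests would mean the installed file differs from what was shipped in the archive, which narrows the problem to the unpack or copy step.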

@amolnater-qasource
Author

amolnater-qasource commented Feb 24, 2021

Hi @michalpristas

As per your feedback, we checked the hash in both locations and found that they differ. Please find the screenshot below:
11

Further we attempted to start the metricbeat from:

metricbeat.exe in install directory

Observed "Access is denied" even when run from an admin cmd.
13

metricbeat.exe unpacked from downloaded artifact

Observed no actions after running metricbeat.exe.
14

cc: @EricDavisX
Thanks

@EricDavisX
Contributor

Thank you Amol and Michal for working on this. Here is a summary of where we are today, after a quick chat with Michal:

  • we do think it is specific to something on Agent side.
  • it seems specific to Win 10 so far (a few confirmed failures on win 10, but other Windows OSes are working).
  • We were testing with the snapshot image, so we'll try again with the latest BC (number 2). ...The Endpoint in the snapshot is not signed, we know, so any testing that combined Endpoint with the snapshot is suspect as the cause on Windows or Mac.
  • I’m going to help Michal get a template in the endgame vSphere cluster to repro this

@EricDavisX
Contributor

ok, we have reports of other Win 10 versions working, and I used the same template (I believe) as is reported originally, and I DO see the problem. I've cloned one for Michal to take a look at, requires Endgame VPN access. notes sent in slack for access.

@EricDavisX EricDavisX changed the title [Windows specific]: No agent activity logs under Logs tab on enrolling agent with policy having System and Endpoint Security. [Win-10-2016-ltsb specific]: Agent's Metricbeat process not running (so no Activity logs under Logs tab for Metricbeat), may relate to Endpoint or Linux / Windows Integrations Feb 24, 2021
@michalpristas
Contributor

we had a nice chat with @EricDavisX and the QAS team. what we found is that they ran into an issue with a race between enroll and install, which was fixed a few days ago. after a restart we ran into the issue described here.

what we observed is that the agent unzips metricbeat, but only partially. this results in a metricbeat about 2/3 of the expected size (77 MB instead of 118 MB), and the rest of the files following metricbeat.exe were missing entirely.
no error was reported, so this leaves us with either a failure during unzip, where it thinks the zip ended while processing the metricbeat.exe file,
or something in the atomic installer, which i think is probably not the root cause.

i will check the unzip code and, as this was not changed, i will also check upstream for changes.
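A partial extraction like the one described (77 MB on disk instead of 118 MB, with later files missing) can be detected by comparing the install directory against the sizes recorded in the zip's own directory. This is a minimal sketch, not the agent's code; it assumes archive member paths map directly under `install_dir`, and `find_incomplete` is a made-up name.

```python
import os
import zipfile


def find_incomplete(zip_path, install_dir):
    """List archive members that are missing or have the wrong size under install_dir."""
    problems = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Zip member names always use "/" as the separator.
            target = os.path.join(install_dir, *info.filename.split("/"))
            if not os.path.isfile(target):
                problems.append((info.filename, "missing"))
            elif os.path.getsize(target) != info.file_size:
                actual = os.path.getsize(target)
                problems.append(
                    (info.filename, "size %d, expected %d" % (actual, info.file_size))
                )
    return problems
```

An empty list means every file is present with the expected uncompressed size; it does not prove the contents are identical, which would need a checksum comparison.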

@michalpristas
Contributor

have a thought about root cause:

my suspicion is this is related to change of order in install/enrollment process:
when we install we first start agent and then enroll
after enroll is done we restart agent to pick up fleet config.

what i think is going on is that agent is restarted while installing filebeat/metricbeat and then beat is not fully copied/unpacked.
then once agent is restarted it tries to run metricbeat but it fails because metricbeat is not whole and part of it is missing.
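A common way to make an install survive an interruption like this is to unpack into a staging directory and only rename it into place once extraction has finished. The sketch below shows that generic pattern; it is not the agent's actual installer, and the function name and `.staging-` prefix are assumptions for illustration.

```python
import os
import shutil
import tempfile
import zipfile


def install_atomically(zip_path, install_dir):
    """Extract zip_path so that install_dir never holds a half-unpacked tree."""
    install_dir = os.path.abspath(install_dir)
    parent = os.path.dirname(install_dir)
    os.makedirs(parent, exist_ok=True)
    # Stage next to the target so the final rename stays on one filesystem.
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(staging)
        # Drop any previous (possibly partial) install, then publish the new one.
        if os.path.exists(install_dir):
            shutil.rmtree(install_dir)
        os.rename(staging, install_dir)
    except BaseException:
        # Leave nothing behind if extraction or the rename fails.
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

If the process is killed mid-extraction, the worst case is a leftover staging directory. The target directory is either absent or a previous install, so a restarted process can tell that it has to install again rather than run a truncated binary.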

@blakerouse
Contributor

@michalpristas To ensure that Elastic Agent is always running with the correct (unmodified) version of a beat, would it be better to always verify/extract before starting it? That would really reduce the window of a beat being modified and then executed by an attacker, and always ensure that Elastic Agent is running with what it expects (not something that is corrupt or modified)?

I think the overhead of re-extraction is worth the benefit.
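The verify-before-start idea could look roughly like the sketch below: check every installed file against the archive and re-extract on any mismatch. This is an illustration only, with made-up function names, and it assumes archive paths map directly under `install_dir`. It uses the CRC-32 values stored in the zip, which catch truncation and accidental corruption but are not a defence against deliberate tampering; that would need a cryptographic hash or signature check on the archive itself.

```python
import os
import shutil
import zipfile
import zlib


def crc32_of(path, chunk_size=1024 * 1024):
    """Return the CRC-32 of a file as an unsigned integer."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def matches_archive(zip_path, install_dir):
    """True if every file in the archive exists under install_dir with the same CRC-32."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            target = os.path.join(install_dir, *info.filename.split("/"))
            if not os.path.isfile(target) or crc32_of(target) != info.CRC:
                return False
    return True


def ensure_fresh(zip_path, install_dir):
    """Re-extract the archive if the install is incomplete; return True if it did."""
    if matches_archive(zip_path, install_dir):
        return False
    shutil.rmtree(install_dir, ignore_errors=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(install_dir)
    return True
```

A caller would run `ensure_fresh` right before launching the beat, so the cost is one read of the installed files per start and a full extraction only when something is wrong.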

@EricDavisX
Contributor

I have asked @dikshachauhan-qasource to help test out the PR opened above to see how it works, just to help us with a data point. Request being sent in daily email to QAS team.

@dikshachauhan-qasource

Hi @EricDavisX ,

Today we are unable to deploy the build from the staging cloud, so we are blocked from verifying the PR merged above. We will validate it once an 8.0 build is available for deployment.

Thanks
QAS

@dikshachauhan-qasource

Hi @EricDavisX

We have validated this issue on 8.0 snapshot build and found it working fine. Build details are as follows:

BUILD 40910
COMMIT aa62a130eec855513bfc68fa863123a746879057
ARTIFACT LINK: https://storage.cloud.google.com/beats-ci-artifacts/pull-requests/pr-24253/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-windows-x86_64.zip

Observations:

  1. Installed agent with default policy having only the System integration.
     • Filebeat and Metricbeat are running.
     • The Metricbeat folder is now extracted completely by itself.
  2. Installed agent with a new policy having only the System, Windows, Linux and Endpoint integrations.
     • Filebeat, Metricbeat and Endpoint binaries are running.
     • Data streams are generated properly on the dataset page.
     • Host is visible on the Admin tab.
     • Logs are available on the Discover tab.

Screenshot:
image

Further, although the agent installed successfully, a few new error logs were displayed on the Agent Logs tab. Please refer below:
image

Agent logs from UI:
Agentlogs8.0.txt

Thanks
QAS

@EricDavisX
Contributor

@michalpristas , now or later (or skip it, if not valuable enough) we could change the log level to 'Warn' maybe here? ... That is since it is being recovered by our code logic and not a totally unexpected problem. Diksha can log a ticket to track it...

@dikshachauhan-qasource that is great. I'd like to add a 'logs message' ticket into the system and link it to our Logs Meta issue to track that those error citations are acceptable / accepted and not indicative of new problems we need to track. Depending on what Michal may say above, we can adjust our follow through

@EricDavisX
Contributor

This seems fixed indeed - though we are finding new problems with Metricbeat, but not as frequently. : / It isn't really very clear - but we can track the above in reference to subsequent problems.

8 participants