fix: rootless k8s integration tests #5290

pkoutsovasilis · 2024-08-13T09:16:34Z

What does this PR do?

This PR addresses the failures of k8s integrations tests:

Fix degraded agent status for rootless agent running with elastic-agent uid (1000). The main reason behind this was the setcap on the agentbeat binary in conjunction with it getting copied from a "builder" image during the building of the elastic-agent container image. I am not sure why this is the case, but I have observed in the past that copying files with capabilities across different docker layers results in an Effective set with no capabilities of the file at hand, ignoring completely the Ambient set of the parent which is elastic-agent in our case. This quirk got highlighted when metricbeat implemented status reporting. Now fun fact, we weren't bumping into this scenario for rootless agent with random uid:gid as in this case, elastic-agent is chowing everything and when you chown a file to a different user the capabilities are getting reset and hence the Ambient set of elastic-agent was getting applied. The solution is to remove setcap from agentbeat completely as it is redundant, elastic-agent will raise and pass down all capabilities in Bounding set and thus the former doesn't need to specify anything special in terms of capabilities.
Add extra required capabilities for rootless agent. The extra capabilities required to read /proc/${pid}/io are DAC_READ_SEARCH and SYS_PTRACE
increase memory limits for elastic-agent container as I observed some failures due to OOM when fixing the above issues
increase the timeout to 90 seconds while waiting elastic-agent to report healthy. As of now we get a degraded agent status until kube-state-metrics is properly initialised. Thus I feel like 60 seconds when running three kind clusters on the same machine with parallel integration k8s tests is kinda tight

Why is it important?

Because it fixes the k8s integration tests

Checklist

My code follows the style guidelines of this project
~~[ ] I have commented my code, particularly in hard-to-understand areas~~
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
~~[ ] I have added an entry in ./changelog/fragments using the changelog tool~~
I have added an integration test or an E2E test

Disruptive User Impact

N/A

How to test this PR locally

DEV=true SNAPSHOT=true EXTERNAL=true PLATFORMS=linux/arm64 PACKAGES=docker mage -v package
mage integration:auth
mage integration:kubernetes

Related issues

Closes #5275

pkoutsovasilis · 2024-08-13T11:57:17Z

cc @blakerouse @VihasMakwana

elasticmachine · 2024-08-13T12:11:54Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

blakerouse

Overall this change seems good, I just want to get a better understanding before I give it a +1.

Seems like this change removes the need for this PR as well #5271.

Does this only work because of this PR in agentbeat? Or is that PR in agentbeat also not required? elastic/beats#40466

testing/integration/kubernetes_agent_standalone_test.go

pkoutsovasilis · 2024-08-13T17:53:31Z

Overall this change seems good, I just want to get a better understanding before I give it a +1.

Seems like this change removes the need for this PR as well #5271.

Yep I don't think this one is needed and I would even propose to allow Agentbeat to get all capabilities through the Ambient set of elastic-agent and not define the permitted caps at the file level. Thus we will be able to pass down all required capabilities directly from the k8s-manifest->elastic-agent->agentbeat as the capabilities might vary given all the different beats that live inside agentbeat

Does this only work because of this PR in agentbeat? Or is that PR in agentbeat also not required? elastic/beats#40466

I think that this PR is also not required elastic/beats#40466 (comment)

pkoutsovasilis · 2024-08-13T17:58:57Z

@rdner is the setcap removal from agentbeat binary somehow affecting the wolfi-based image? just checking

dev-tools/packaging/templates/docker/Dockerfile.elastic-agent.tmpl

cmacknz · 2024-08-13T18:12:01Z

Let's revert elastic/beats#40466 and make sure this still passes, although it may be better for our CI runs if this merges first to ensure nothing is transiently broken.

cmacknz · 2024-08-13T18:13:04Z

Also, be sure to revert #5293 in this PR if it merges first.

blakerouse

Thank you for the detailed explanation's and for ensuring that the revert of the change for the agentbeat works with this change.

blakerouse · 2024-08-13T18:32:29Z

Created PR for the revert in agentbeat.

@cmacknz See @pkoutsovasilis comment here about testing with that PR reverted - elastic/beats#40466 (comment)

rdner · 2024-08-14T08:47:09Z

@rdner is the setcap removal from agentbeat binary somehow affecting the wolfi-based image? just checking

it should not affect the work on Wolfi but I'll keep this change in mind from now on, thanks!

ycombinator · 2024-08-14T13:47:52Z

The PR to temporarily disable k8s integration tests around capabilities has now been merged. Please remember to revert it once this PR here is merged.

pkoutsovasilis · 2024-08-14T16:03:52Z

@cmacknz I have rebased this PR to re-enable the k8s test with the fix in place, after this one is now merged (k8s tests CI run link). next steps? 🙂

elastic-sonarqube · 2024-08-14T16:05:11Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube

cmacknz · 2024-08-14T19:28:49Z

You are stuck behind some known flaky tests. I don't have a problem force merging past these since they do not even involve the container image.

rdner · 2024-08-15T11:25:09Z

Fix degraded agent status for rootless agent running with elastic-agent uid (1000). The main reason behind this was the setcap on the agentbeat binary in conjunction with it getting copied from a "builder" image during the building of the elastic-agent container image. I am not sure why this is the case, but I have observed in the past that copying files with capabilities across different docker layers results in an Effective set with no capabilities of the file at hand

@pkoutsovasilis could you elaborate more where this conclusion is coming from?

I've tested in my PR that capabilities were still on the binary after I built the final image with copying from the builder layer to the final layer. See the PR description for more information #5073

Is there a quirk I'm not aware of?

pkoutsovasilis · 2024-08-15T12:01:14Z

#5073

As I mentioned in my comment @rdner this is an experimental observation. However, when you say you've tested this image, how have you tested this? there were no k8s tests back in the day, elastic-agent wasn't raising any capabilities, and most importantly there was nothing in Agentbeat that required that capabilities you were permitting to run successfully (aka no Heartbeat tests). What do I miss?

rdner · 2024-08-15T12:04:09Z

@pkoutsovasilis is not it enough to test it with getcap like I did in my PR?
I'm not trying to say you're wrong, I'm trying to understand how one tests such a thing for the future. I have exactly the same problem to solve in the Heartbeat standalone Docker image.

pkoutsovasilis · 2024-08-15T12:33:31Z

@pkoutsovasilis is not it enough to test it with getcap like I did in my PR?
I'm not trying to say you're wrong, I'm trying to understand how one tests such a thing for the future. I have exactly the same problem to solve in the Heartbeat standalone Docker image.

so the quirk I have observed is that the capabilities are not passed down to child processes through the parent's Ambient set. Specifically, elastic-agent sets the Ambient set and then spawns an Agentbeat child process, if the latter has been copied from another docker image and had capabilities set, the parent's capabilities of the Ambient set are dropped and not passed to it.

With that said, in the Heartbeat standalone docker image I think you should set the capabilities at it, as there is no parent process that sets Ambient set in this case.

Does the above help? 🙂

rdner · 2024-08-15T12:39:06Z

@pkoutsovasilis so it's not about loosing capabilities between Docker layers while the image is built, it's about losing capabilities when spawning a sub-process, right?

Asking because I'm taking the same approach with setting capabilities in the builder layer here too elastic/beats#40524

I'm trying to make sure that this will work with standalone Heartbeat.

pkoutsovasilis · 2024-08-15T13:07:53Z

@pkoutsovasilis so it's not about loosing capabilities between Docker layers while the image is built, it's about losing capabilities when spawning a sub-process, right?

yes no capabilities are lost from the binary, the quirk regards only the Ambient set of the parent getting passed at the child processes.

Asking because I'm taking the same approach with setting capabilities in the builder layer here too elastic/beats#40524

I'm trying to make sure that this will work with standalone Heartbeat.

yep this should have the setcap at the Heartbeat binary.

In simple words, the issue here doesn't exist in the standalone Heartbeat image and you should keep the setcap at the Heartbeat binary

mergify bot assigned pkoutsovasilis Aug 13, 2024

mergify bot added the backport-skip label Aug 13, 2024

pkoutsovasilis added the skip-changelog label Aug 13, 2024

pkoutsovasilis force-pushed the pkoutsovasilis/fix_agentbeat_permissions branch from 7a5211e to 0a82981 Compare August 13, 2024 10:43

pkoutsovasilis added the bug Something isn't working label Aug 13, 2024

elastic deleted a comment from mergify bot Aug 13, 2024

pkoutsovasilis marked this pull request as ready for review August 13, 2024 11:56

pkoutsovasilis requested a review from a team as a code owner August 13, 2024 11:56

pkoutsovasilis requested review from andrzej-stencel and pchila August 13, 2024 11:56

pkoutsovasilis changed the title ~~fix: remove setcap from agentbeat~~ fix: rootless k8s integration tests Aug 13, 2024

pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Aug 13, 2024

pierrehilbert requested review from blakerouse and removed request for andrzej-stencel August 13, 2024 12:12

blakerouse reviewed Aug 13, 2024

View reviewed changes

testing/integration/kubernetes_agent_standalone_test.go Outdated Show resolved Hide resolved

This was referenced Aug 13, 2024

[agentbeat] Fix agentbeat capabilities elastic/beats#40466

Merged

Skip failing k8s capabilities test #5293

Merged

[docker][agentbeat]: add cap_sys_ptrace and cap_dac_override in permitted set #5271

Closed

cmacknz reviewed Aug 13, 2024

View reviewed changes

dev-tools/packaging/templates/docker/Dockerfile.elastic-agent.tmpl Show resolved Hide resolved

blakerouse approved these changes Aug 13, 2024

View reviewed changes

blakerouse mentioned this pull request Aug 13, 2024

Revert "[agentbeat] Fix agentbeat capabilities" elastic/beats#40510

Merged

pkoutsovasilis added 3 commits August 14, 2024 18:13

fix: remove redundant setcap on agentbeat

b17b172

fix: add required capabilities for rootless agent

616ed7d

fix: increase agent resources to prevent OOM restarts

ec02980

pkoutsovasilis force-pushed the pkoutsovasilis/fix_agentbeat_permissions branch from 7836ca1 to ec02980 Compare August 14, 2024 15:16

cmacknz approved these changes Aug 14, 2024

View reviewed changes

cmacknz merged commit ef69b58 into elastic:main Aug 14, 2024
10 of 13 checks passed

mergify bot mentioned this pull request Aug 21, 2024

[8.15](backport #40510) Revert "[agentbeat] Fix agentbeat capabilities" elastic/beats#40575

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: rootless k8s integration tests #5290

fix: rootless k8s integration tests #5290

pkoutsovasilis commented Aug 13, 2024 •

edited

Loading

pkoutsovasilis commented Aug 13, 2024

elasticmachine commented Aug 13, 2024

blakerouse left a comment

pkoutsovasilis commented Aug 13, 2024 •

edited

Loading

pkoutsovasilis commented Aug 13, 2024 •

edited

Loading

cmacknz commented Aug 13, 2024

cmacknz commented Aug 13, 2024

blakerouse left a comment

blakerouse commented Aug 13, 2024

rdner commented Aug 14, 2024

ycombinator commented Aug 14, 2024

pkoutsovasilis commented Aug 14, 2024 •

edited

Loading

elastic-sonarqube bot commented Aug 14, 2024

cmacknz commented Aug 14, 2024

rdner commented Aug 15, 2024

pkoutsovasilis commented Aug 15, 2024

rdner commented Aug 15, 2024

pkoutsovasilis commented Aug 15, 2024

rdner commented Aug 15, 2024

pkoutsovasilis commented Aug 15, 2024

fix: rootless k8s integration tests #5290

fix: rootless k8s integration tests #5290

Conversation

pkoutsovasilis commented Aug 13, 2024 • edited Loading

What does this PR do?

Why is it important?

Checklist

Disruptive User Impact

How to test this PR locally

Related issues

pkoutsovasilis commented Aug 13, 2024

elasticmachine commented Aug 13, 2024

blakerouse left a comment

Choose a reason for hiding this comment

pkoutsovasilis commented Aug 13, 2024 • edited Loading

pkoutsovasilis commented Aug 13, 2024 • edited Loading

cmacknz commented Aug 13, 2024

cmacknz commented Aug 13, 2024

blakerouse left a comment

Choose a reason for hiding this comment

blakerouse commented Aug 13, 2024

rdner commented Aug 14, 2024

ycombinator commented Aug 14, 2024

pkoutsovasilis commented Aug 14, 2024 • edited Loading

elastic-sonarqube bot commented Aug 14, 2024

Quality Gate passed

cmacknz commented Aug 14, 2024

rdner commented Aug 15, 2024

pkoutsovasilis commented Aug 15, 2024

rdner commented Aug 15, 2024

pkoutsovasilis commented Aug 15, 2024

rdner commented Aug 15, 2024

pkoutsovasilis commented Aug 15, 2024

pkoutsovasilis commented Aug 13, 2024 •

edited

Loading

pkoutsovasilis commented Aug 13, 2024 •

edited

Loading

pkoutsovasilis commented Aug 13, 2024 •

edited

Loading

pkoutsovasilis commented Aug 14, 2024 •

edited

Loading