Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: rootless k8s integration tests #5290

Merged

Conversation

pkoutsovasilis
Copy link
Contributor

@pkoutsovasilis pkoutsovasilis commented Aug 13, 2024

What does this PR do?

This PR addresses the failures of k8s integrations tests:

  1. Fix degraded agent status for rootless agent running with elastic-agent uid (1000). The main reason behind this was the setcap on the agentbeat binary in conjunction with it getting copied from a "builder" image during the building of the elastic-agent container image. I am not sure why this is the case, but I have observed in the past that copying files with capabilities across different docker layers results in an Effective set with no capabilities of the file at hand, ignoring completely the Ambient set of the parent which is elastic-agent in our case. This quirk got highlighted when metricbeat implemented status reporting. Now fun fact, we weren't bumping into this scenario for rootless agent with random uid:gid as in this case, elastic-agent is chowing everything and when you chown a file to a different user the capabilities are getting reset and hence the Ambient set of elastic-agent was getting applied. The solution is to remove setcap from agentbeat completely as it is redundant, elastic-agent will raise and pass down all capabilities in Bounding set and thus the former doesn't need to specify anything special in terms of capabilities.
  2. Add extra required capabilities for rootless agent. The extra capabilities required to read /proc/${pid}/io are DAC_READ_SEARCH and SYS_PTRACE
  3. increase memory limits for elastic-agent container as I observed some failures due to OOM when fixing the above issues
  4. increase the timeout to 90 seconds while waiting elastic-agent to report healthy. As of now we get a degraded agent status until kube-state-metrics is properly initialised. Thus I feel like 60 seconds when running three kind clusters on the same machine with parallel integration k8s tests is kinda tight

Why is it important?

Because it fixes the k8s integration tests

Checklist

  • My code follows the style guidelines of this project
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

N/A

How to test this PR locally

DEV=true SNAPSHOT=true EXTERNAL=true PLATFORMS=linux/arm64 PACKAGES=docker mage -v package
mage integration:auth
mage integration:kubernetes

Related issues

Closes #5275

@pkoutsovasilis pkoutsovasilis force-pushed the pkoutsovasilis/fix_agentbeat_permissions branch from 7a5211e to 0a82981 Compare August 13, 2024 10:43
@pkoutsovasilis pkoutsovasilis added the bug Something isn't working label Aug 13, 2024
@elastic elastic deleted a comment from mergify bot Aug 13, 2024
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review August 13, 2024 11:56
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner August 13, 2024 11:56
@pkoutsovasilis
Copy link
Contributor Author

cc @blakerouse @VihasMakwana

@pkoutsovasilis pkoutsovasilis changed the title fix: remove setcap from agentbeat fix: rootless k8s integration tests Aug 13, 2024
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Aug 13, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pierrehilbert pierrehilbert requested review from blakerouse and removed request for andrzej-stencel August 13, 2024 12:12
Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this change seems good, I just want to get a better understanding before I give it a +1.

Seems like this change removes the need for this PR as well #5271.

Does this only work because of this PR in agentbeat? Or is that PR in agentbeat also not required? elastic/beats#40466

testing/integration/kubernetes_agent_standalone_test.go Outdated Show resolved Hide resolved
@pkoutsovasilis
Copy link
Contributor Author

pkoutsovasilis commented Aug 13, 2024

Overall this change seems good, I just want to get a better understanding before I give it a +1.

Seems like this change removes the need for this PR as well #5271.

Yep I don't think this one is needed and I would even propose to allow Agentbeat to get all capabilities through the Ambient set of elastic-agent and not define the permitted caps at the file level. Thus we will be able to pass down all required capabilities directly from the k8s-manifest->elastic-agent->agentbeat as the capabilities might vary given all the different beats that live inside agentbeat

Does this only work because of this PR in agentbeat? Or is that PR in agentbeat also not required? elastic/beats#40466

I think that this PR is also not required elastic/beats#40466 (comment)

@pkoutsovasilis
Copy link
Contributor Author

pkoutsovasilis commented Aug 13, 2024

@rdner is the setcap removal from agentbeat binary somehow affecting the wolfi-based image? just checking

@cmacknz
Copy link
Member

cmacknz commented Aug 13, 2024

Let's revert elastic/beats#40466 and make sure this still passes, although it may be better for our CI runs if this merges first to ensure nothing is transiently broken.

@cmacknz
Copy link
Member

cmacknz commented Aug 13, 2024

Also, be sure to revert #5293 in this PR if it merges first.

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed explanation's and for ensuring that the revert of the change for the agentbeat works with this change.

@blakerouse
Copy link
Contributor

Created PR for the revert in agentbeat.

@cmacknz See @pkoutsovasilis comment here about testing with that PR reverted - elastic/beats#40466 (comment)

@rdner
Copy link
Member

rdner commented Aug 14, 2024

@rdner is the setcap removal from agentbeat binary somehow affecting the wolfi-based image? just checking

it should not affect the work on Wolfi but I'll keep this change in mind from now on, thanks!

@ycombinator
Copy link
Contributor

The PR to temporarily disable k8s integration tests around capabilities has now been merged. Please remember to revert it once this PR here is merged.

@pkoutsovasilis pkoutsovasilis force-pushed the pkoutsovasilis/fix_agentbeat_permissions branch from 7836ca1 to ec02980 Compare August 14, 2024 15:16
@pkoutsovasilis
Copy link
Contributor Author

pkoutsovasilis commented Aug 14, 2024

@cmacknz I have rebased this PR to re-enable the k8s test with the fix in place, after this one is now merged (k8s tests CI run link). next steps? 🙂

Copy link

@cmacknz
Copy link
Member

cmacknz commented Aug 14, 2024

You are stuck behind some known flaky tests. I don't have a problem force merging past these since they do not even involve the container image.

@cmacknz cmacknz merged commit ef69b58 into elastic:main Aug 14, 2024
10 of 13 checks passed
@rdner
Copy link
Member

rdner commented Aug 15, 2024

Fix degraded agent status for rootless agent running with elastic-agent uid (1000). The main reason behind this was the setcap on the agentbeat binary in conjunction with it getting copied from a "builder" image during the building of the elastic-agent container image. I am not sure why this is the case, but I have observed in the past that copying files with capabilities across different docker layers results in an Effective set with no capabilities of the file at hand

@pkoutsovasilis could you elaborate more where this conclusion is coming from?

I've tested in my PR that capabilities were still on the binary after I built the final image with copying from the builder layer to the final layer. See the PR description for more information #5073

Is there a quirk I'm not aware of?

@pkoutsovasilis
Copy link
Contributor Author

#5073

As I mentioned in my comment @rdner this is an experimental observation. However, when you say you've tested this image, how have you tested this? there were no k8s tests back in the day, elastic-agent wasn't raising any capabilities, and most importantly there was nothing in Agentbeat that required that capabilities you were permitting to run successfully (aka no Heartbeat tests). What do I miss?

@rdner
Copy link
Member

rdner commented Aug 15, 2024

@pkoutsovasilis is not it enough to test it with getcap like I did in my PR?
I'm not trying to say you're wrong, I'm trying to understand how one tests such a thing for the future. I have exactly the same problem to solve in the Heartbeat standalone Docker image.

@pkoutsovasilis
Copy link
Contributor Author

@pkoutsovasilis is not it enough to test it with getcap like I did in my PR?
I'm not trying to say you're wrong, I'm trying to understand how one tests such a thing for the future. I have exactly the same problem to solve in the Heartbeat standalone Docker image.

so the quirk I have observed is that the capabilities are not passed down to child processes through the parent's Ambient set. Specifically, elastic-agent sets the Ambient set and then spawns an Agentbeat child process, if the latter has been copied from another docker image and had capabilities set, the parent's capabilities of the Ambient set are dropped and not passed to it.

With that said, in the Heartbeat standalone docker image I think you should set the capabilities at it, as there is no parent process that sets Ambient set in this case.

Does the above help? 🙂

@rdner
Copy link
Member

rdner commented Aug 15, 2024

@pkoutsovasilis so it's not about loosing capabilities between Docker layers while the image is built, it's about losing capabilities when spawning a sub-process, right?

Asking because I'm taking the same approach with setting capabilities in the builder layer here too elastic/beats#40524

I'm trying to make sure that this will work with standalone Heartbeat.

@pkoutsovasilis
Copy link
Contributor Author

@pkoutsovasilis so it's not about loosing capabilities between Docker layers while the image is built, it's about losing capabilities when spawning a sub-process, right?

yes no capabilities are lost from the binary, the quirk regards only the Ambient set of the parent getting passed at the child processes.

Asking because I'm taking the same approach with setting capabilities in the builder layer here too elastic/beats#40524

I'm trying to make sure that this will work with standalone Heartbeat.

yep this should have the setcap at the Heartbeat binary.

In simple words, the issue here doesn't exist in the standalone Heartbeat image and you should keep the setcap at the Heartbeat binary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip bug Something isn't working skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
7 participants