fix: rootless k8s integration tests #5290
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Overall this change seems good, I just want to get a better understanding before I give it a +1.
Seems like this change removes the need for this PR as well #5271.
Does this only work because of this PR in agentbeat? Or is that PR in agentbeat also not required? elastic/beats#40466
Yep, I don't think this one is needed. I would even propose letting Agentbeat get all its capabilities through the Ambient set of elastic-agent instead of defining the permitted caps at the file level. That way we can pass all required capabilities down directly from k8s-manifest->elastic-agent->agentbeat, since the required capabilities may vary across the different beats that live inside agentbeat.
I think that this PR is also not required elastic/beats#40466 (comment) |
@rdner is the setcap removal from agentbeat binary somehow affecting the wolfi-based image? just checking |
Let's revert elastic/beats#40466 and make sure this still passes, although it may be better for our CI runs if this merges first to ensure nothing is transiently broken. |
Also, be sure to revert #5293 in this PR if it merges first. |
Thank you for the detailed explanations and for ensuring that reverting the agentbeat change works with this change.
Created PR for the revert in agentbeat. @cmacknz See @pkoutsovasilis's comment here about testing with that PR reverted - elastic/beats#40466 (comment) |
it should not affect the work on Wolfi but I'll keep this change in mind from now on, thanks! |
The PR to temporarily disable k8s integration tests around capabilities has now been merged. Please remember to revert it once this PR here is merged. |
You are stuck behind some known flaky tests. I don't have a problem force merging past these since they do not even involve the container image. |
@pkoutsovasilis could you elaborate more where this conclusion is coming from? I've tested in my PR that capabilities were still on the binary after I built the final image with copying from the Is there a quirk I'm not aware of? |
As I mentioned in my comment, @rdner, this is an experimental observation. However, when you say you've tested this image, how did you test it? There were no k8s tests back then, elastic-agent wasn't raising any capabilities, and most importantly nothing in Agentbeat required the capabilities you were permitting in order to run successfully (i.e. no Heartbeat tests). What am I missing? |
@pkoutsovasilis is not it enough to test it with |
So the quirk I have observed is that capabilities are not passed down to child processes through the parent's Ambient set. Specifically, elastic-agent sets the Ambient set and then spawns an Agentbeat child process; if the latter has been copied from another docker image and had capabilities set, the capabilities in the parent's Ambient set are dropped and not passed to it. That said, in the Heartbeat standalone docker image I think you should set the capabilities on the binary, as there is no parent process that sets the Ambient set in that case. Does the above help? 🙂 |
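The dropped Ambient set described in this comment is consistent with the `execve()` transformation rules documented in capabilities(7): the ambient set is cleared whenever the executed file is privileged, meaning it carries file capabilities or a set-user-ID/set-group-ID bit. Below is a minimal Python model of those rules for a non-root process, offered as a sketch rather than a description of the agent's code. `ProcCaps`, `FileCaps`, and `execve_caps` are illustrative names, and `CAP_NET_RAW` as the capability set on the file is an assumed value, not taken from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcCaps:
    """Capability sets of a (non-root) process."""
    permitted: frozenset
    inheritable: frozenset
    effective: frozenset
    bounding: frozenset
    ambient: frozenset


@dataclass(frozen=True)
class FileCaps:
    """Capabilities attached to an executable (e.g. via setcap)."""
    permitted: frozenset = frozenset()
    inheritable: frozenset = frozenset()
    effective_flag: bool = False
    setid: bool = False  # set-user-ID or set-group-ID bit

    @property
    def privileged(self) -> bool:
        # Simplification: treat "has file capabilities" as "any set is non-empty".
        return bool(self.permitted or self.inheritable
                    or self.effective_flag or self.setid)


def execve_caps(p: ProcCaps, f: FileCaps) -> ProcCaps:
    """Apply the execve() rules from capabilities(7) to a non-root process."""
    ambient = frozenset() if f.privileged else p.ambient
    permitted = (p.inheritable & f.inheritable) | (f.permitted & p.bounding) | ambient
    effective = permitted if f.effective_flag else ambient
    return ProcCaps(permitted, p.inheritable, effective, p.bounding, ambient)


wanted = frozenset({"CAP_DAC_READ_SEARCH", "CAP_SYS_PTRACE"})
parent = ProcCaps(
    permitted=wanted,
    inheritable=wanted,
    effective=wanted,
    bounding=wanted | {"CAP_NET_RAW"},
    ambient=wanted,
)

# Child binary without file capabilities: the ambient set carries over.
plain = execve_caps(parent, FileCaps())

# Child binary with permitted-only file capabilities (assumed CAP_NET_RAW):
# the ambient set is cleared, so the effective set ends up empty.
with_setcap = execve_caps(parent, FileCaps(permitted=frozenset({"CAP_NET_RAW"})))
```

In this model `plain.effective` equals the parent's ambient set, while `with_setcap.effective` is empty, which matches the observation that a binary with `setcap` applied does not receive the parent's Ambient capabilities.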
@pkoutsovasilis so it's not about losing capabilities between Docker layers while the image is built, it's about losing capabilities when spawning a sub-process, right? Asking because I'm taking the same approach with setting capabilities in the I'm trying to make sure that this will work with standalone Heartbeat. |
Yes, no capabilities are lost from the binary; the quirk concerns only the parent's Ambient set being passed to child processes.
Yep, this should have the setcap on the Heartbeat binary. Put simply, the issue here doesn't exist in the standalone Heartbeat image, and you should keep the setcap on the Heartbeat binary. |
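One way to check which sets a spawned process actually ends up with is to read the `CapInh`, `CapPrm`, `CapEff`, `CapBnd`, and `CapAmb` lines of `/proc/<pid>/status`, which hold hexadecimal bitmasks. The sketch below parses such text and tests individual bits. The helper names and the sample mask are illustrative assumptions; the bit numbers 2 (`CAP_DAC_READ_SEARCH`) and 19 (`CAP_SYS_PTRACE`) come from the Linux capability definitions.

```python
CAP_DAC_READ_SEARCH = 2
CAP_SYS_PTRACE = 19
CAP_NET_RAW = 13

CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def parse_cap_sets(status_text: str) -> dict:
    """Extract the capability bitmasks from /proc/<pid>/status content."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CAP_FIELDS:
            sets[key] = int(value.strip(), 16)
    return sets


def has_cap(mask: int, bit: int) -> bool:
    """Return True if the capability with the given bit number is in the mask."""
    return bool((mask >> bit) & 1)


# Hypothetical sample: only bits 2 and 19 are raised in the ambient set.
sample_status = "Name:\tagentbeat\nCapEff:\t0000000000000000\nCapAmb:\t0000000000080004\n"
sample_sets = parse_cap_sets(sample_status)
```

On a live system the same function can be fed the contents of `/proc/self/status` or the status file of the child process.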
What does this PR do?
This PR addresses the failures of the k8s integration tests:
agentbeat binary in conjunction with it being copied from a "builder" image while the elastic-agent container image is built. I am not sure why this is the case, but I have observed in the past that copying files with capabilities across different docker layers results in an Effective set with none of the capabilities of the file at hand, and the Ambient set of the parent, which is elastic-agent in our case, is ignored completely. This quirk was highlighted when metricbeat implemented status reporting. Fun fact: we weren't hitting this scenario for the rootless agent with a random uid:gid because in that case elastic-agent chowns everything, and when you chown a file to a different user its capabilities are reset, hence the Ambient set of elastic-agent was applied. The solution is to remove setcap from agentbeat completely, as it is redundant: elastic-agent will raise and pass down all capabilities in its Bounding set, so agentbeat doesn't need to specify anything special in terms of capabilities.
DAC_READ_SEARCH and SYS_PTRACE
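To confirm whether a binary in the final image still carries file capabilities (for example after a copy between build stages or after a chown), one option is to look for the `security.capability` extended attribute, which is where Linux stores them. The following is a minimal, Linux-only sketch; `has_file_capabilities` is a hypothetical helper and not part of elastic-agent.

```python
import errno
import os
import tempfile


def has_file_capabilities(path: str) -> bool:
    """Return True if the file has a security.capability xattr (Linux only)."""
    try:
        os.getxattr(path, "security.capability")
        return True
    except OSError as exc:
        # ENODATA: attribute not present; ENOTSUP: filesystem has no xattr support.
        if exc.errno in (errno.ENODATA, errno.ENOTSUP):
            return False
        raise


# A freshly created file has no file capabilities attached.
with tempfile.NamedTemporaryFile() as tmp:
    fresh_file_has_caps = has_file_capabilities(tmp.name)
```

Running the helper against the agentbeat binary inside the container would show whether the setcap removal took effect.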
Why is it important?
Because it fixes the k8s integration tests
Checklist
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added an entry in ./changelog/fragments using the changelog tool
Disruptive User Impact
N/A
How to test this PR locally
Related issues
Closes #5275