
[BUG] gitRepo resources aren't applying to clusters when changing fleet workspace #1845

Closed
slickwarren opened this issue Oct 5, 2023 · 27 comments

@slickwarren

slickwarren commented Oct 5, 2023

Rancher Server Setup

  • Rancher version: v2.8-head (fc12e7d)
  • Installation option (Docker install/Helm Chart): rke2
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): v1.27.6+rke2r1
  • Proxy/Cert Details: valid certs

Information about the Cluster

  • Kubernetes version: v1.27.6 (rke2, k3s, rke1)
  • Cluster Type (Local/Downstream): downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): aws

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions: tested with admin

Describe the bug

When changing a cluster's workspace, the GitRepo resources for Fleet are not applied to the cluster.

To Reproduce

  • create a cluster (by default, deploys to fleet-default workspace)
  • create a new workspace
  • in the new workspace, create a git-repo (a minimal example manifest is sketched after this list)
  • transfer the cluster to the new workspace
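
A minimal GitRepo manifest for that step could look like the sketch below; the name, namespace, and repository URL are illustrative, not the exact ones used when filing this issue:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: test-repo                # illustrative name
  namespace: new-workspace       # Fleet workspaces are backed by namespaces of the same name
spec:
  repo: https://github.com/rancher/fleet-examples   # public example repository
  branch: master
  paths:
    - simple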

Result
Resources from Fleet's new workspace are not deployed; the cluster count remains at 0.

Expected Result

Resources from the GitRepo in the user-added workspace should apply to any clusters that are matched.

Additional context

Adding a git-repo and switching the cluster back to fleet-default results in the expected behavior.
Tested with both the 'all clusters in workspace' setting and manually selecting a cluster in the workspace; neither is working.

possibly affected by rancher/rancher#43078 as this was found on the same setup

@aiyengar2
Contributor

The root cause of this issue is an error on the Job resource created by the GitJob resource, which is in turn created by the GitRepo resource, in any user-created Fleet workspace.

Error creating: pods "ctw-test1-44c94-bkdzs" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "gitcloner-initializer", "fleet" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "gitcloner-initializer", "fleet" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "gitcloner-initializer", "fleet" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "gitcloner-initializer", "fleet" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

This is because the base cluster used when filing this issue was a hardened 1.25+ (1.27) cluster, which means that it enforces a restrictive set of Pod Security Standards.

Since the new namespace is not "allow-listed" to deploy privileged pods, the initContainer of the Job tied to the GitJob tied to the GitRepo is not able to progress, which results in the GitRepo creating no new Bundle resources. From there, fleet-agent is responding as expected by not creating any resources.
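
For context, restricted enforcement on a namespace is commonly expressed through the standard Pod Security Admission labels shown below; this is a generic Kubernetes sketch for illustration (on hardened clusters the same effect can also come from a cluster-level AdmissionConfiguration), not Rancher-specific configuration:

apiVersion: v1
kind: Namespace
metadata:
  name: my-fleet-workspace                             # hypothetical user-created workspace namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest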

I suspect this might be a general issue that was not tested in previous releases, so here are some new steps for reproduction:

To Reproduce

  • create a Rancher instance that is running Kubernetes 1.25+ and follows the relevant hardening guide on the local cluster. Ensure PSA enforcement is happening.
  • create a new workspace
  • in the new workspace, create a git-repo
  • create a cluster in the new workspace

Expected Bad Result
The cluster sees the same symptoms as we see here. The Job in that namespace tied to the GitJob tied to the GitRepo should be prevented from running due to PSA.

@slickwarren can you explicitly test this on an older Rancher version?

cc: @manno @olblak, this is a Fleet issue, so we should ideally transfer this over.

@Sahota1225 Sahota1225 transferred this issue from rancher/rancher Oct 6, 2023
@rancherbot rancherbot added this to Fleet Oct 6, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in Fleet Oct 6, 2023
@slickwarren
Author

slickwarren commented Oct 6, 2023

Tested on a released version of Rancher, 2.7.8, with a hardened local cluster, and I am experiencing the same error there:

 7 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "working-dir-initializer", "place-tools", "step-git-source", "fleet" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "working-dir-initializer", "place-tools", "step-git-source", "fleet" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "working-dir-initializer", "place-tools", "step-git-source", "fleet" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "working-dir-initializer", "place-tools", "step-git-source", "fleet" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

I'm actually seeing similar behavior on non-hardened clusters (on 2.8-head):

Error creating: pods "ctwtest-44c94-nb7c2" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "gitcloner-initializer", "fleet" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "gitcloner-initializer", "fleet" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "gitcloner-initializer", "fleet" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "gitcloner-initializer", "fleet" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

@jiaqiluo
Member

jiaqiluo commented Oct 6, 2023

Here are my two cents:
If it is true that Fleet components must run with the mentioned permissions (allowPrivilegeEscalation, runAsNonRoot, etc.), we have to either modify the PSS level on that namespace or whitelist that namespace.
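
For reference, the first option (changing the PSS level on that one namespace) would roughly amount to relabelling it as in the sketch below; shown only to illustrate the trade-off being discussed, since it effectively disables the restricted checks in that namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: my-fleet-workspace                             # hypothetical user-created workspace namespace
  labels:
    pod-security.kubernetes.io/enforce: privileged     # per-namespace override of the restricted level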

@kkaempf
Collaborator

kkaempf commented Oct 9, 2023

@sbulage - can you please verify this bug?

@sbulage
Contributor

sbulage commented Oct 9, 2023

Hello @slickwarren, I am hitting this issue while installing Rancher from 2.8-head.

Can you please tell me how to install Rancher on a Kubernetes version > 1.27.0-0?

Thanks in advance.

@kkaempf kkaempf moved this from 🆕 New to 📋 Backlog in Fleet Oct 9, 2023
@aiyengar2
Contributor

here are my 2 cents: If it is true that Fleet components must run with those mentioned permissions ( allowPrivilegeEscalation, runAsNonRoot, etc.), we have to either modify the PSS level on that namespace or whitelist that namespace.

The issue here is that these workspaces are user-created resources, so there may be security implications with any automatic process that would attempt to whitelist such a namespace. cc: @macedogm @pjbgf, this may fall in your wheelhouse

@macedogm
Member

We should not add new namespaces to the PSAC exempt list already provided by Rancher, because such namespaces would be user controlled (outside of Rancher's trust boundaries). Additionally, doing such allow lists dynamically is risky.

The recommendation is to fix Fleet's deployment to make sure that all components run with an unprivileged securityContext, as Jiaqi and Arvind mentioned.

I haven't dug deep, but if I saw it correctly, wouldn't it be a case of fixing GitJob's deployment to match Fleet's and Fleet-Agent's deployments, respectively?

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop:
      - ALL

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop:
      - ALL

Note: @aiyengar2 thanks for pinging us on this.

@raulcabello
Contributor

We should not add new namespaces to the PSAC exempt list already provided by Rancher, because such namespaces would be user controlled (outside of Rancher's trust boundaries). Additionally, doing such allow lists dynamically is risky.

The recommendation is to fix Fleet's deployment to make sure that all components run with an unprivileged securityContext, as Jiaqi and Arvind mentioned.

I haven't dug deep, but if I saw it correctly, wouldn't it be a case of fixing GitJob's deployment to match Fleet's and Fleet-Agent's deployments, respectively?

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop:
      - ALL

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop:
      - ALL

Note: @aiyengar2 thanks for pinging us on this.

This is the GitJob deployment, not the k8s Job that is created when a GitRepo is created or modified. I think we should also add the securityContext here and here to fix this issue.
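
For reference, a container securityContext that satisfies the restricted profile flagged in the error above would roughly look like the sketch below (derived from the PSA message, not taken from the actual patch in those links):

securityContext:
  allowPrivilegeEscalation: false
  privileged: false
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop:
      - ALL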

I can't transfer a Cluster to a different workspace, but I don't see anything in the logs.

Rancher version: v2.8.0-alpha2
Installation option (Docker install/Helm Chart): rke2 v1.27.6+rke2r1
Downstream cluster: k3s v1.27.6+k3s1

Steps:

  • create a cluster (by default, deploys to fleet-default workspace)
  • create a new workspace
  • in the new workspace, create a GitRepo
  • transfer the cluster to the new workspace

Then the cluster stays in the Wait Check-In state and is never moved to the new workspace I created. However, I don't see anything in the gitjob or fleet-controller logs, and the k8s Job that should run fleet apply is never created. Where should I look for the error?

@aiyengar2
Contributor

aiyengar2 commented Oct 11, 2023

@raulcabello the relationship between the GitRepo and Job is through the GitJob. When the GitRepo is modified, the related GitJob will also be changed:

That’s why the GitJob spawns a new Job.

So it is indeed tied to the Job created when a GitRepo is modified; that is the k8s Job that runs fleet apply.

@aiyengar2
Contributor

aiyengar2 commented Oct 11, 2023

Transferring the Fleet cluster is gated by a feature flag in Rancher; once you enable it in the UI (or by modifying the Feature resource in the management cluster), you should be able to execute a transfer.
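
If it helps, the flag can also be flipped from the local (management) cluster by editing the corresponding Feature resource; a rough sketch assuming the usual Rancher Feature shape, with the flag name left as a placeholder since it is not spelled out in this thread:

apiVersion: management.cattle.io/v3
kind: Feature
metadata:
  name: <feature-flag-name>   # placeholder; use the actual flag name shown in the UI
spec:
  value: true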

@sbulage
Contributor

sbulage commented Oct 11, 2023

After enabling the feature flag mentioned in rancher/dashboard#9730 (comment) and #1845 (comment), it is working as expected: we (@raulcabello and I) can see that the cluster can be moved from the fleet-default workspace to the newly created workspace (where a GitRepo is already present).

Cluster details: it is a non-hardened cluster.

  • Rancher version: 2.8-head
  • Fleet version: 0.9.0-rc3
  • Upstream cluster RKE2 : v1.27.6+rke2r1
  • Downstream cluster k3s: v1.26.4+k3s1

Also tried creating a new GitRepo in the newly created workspace; it is also working as expected.

No error or warning traces found in the gitjob pods.

The k8s Job is created and completes with no errors.

We will try with a hardened cluster and attempt to reproduce it.

@raulcabello
Contributor

raulcabello commented Oct 11, 2023

I think #1852 and rancher/gitjob#331 should fix this issue. However, we can't test it as we are still not able to reproduce the issue.

@sbulage will try tomorrow to reproduce it on a hardened cluster.

@kkaempf
Collaborator

kkaempf commented Oct 11, 2023

It needs to be

  • a hardened cluster
  • k8s 1.25+
  • PSA enforcement needs to be turned on

Also:

One would not get 'violates PodSecurity' messages on non-hardened clusters (a restricted AdmissionConfiguration needs to be set at the cluster level; see https://kubernetes.io/docs/tutorials/security/cluster-level-pss/ for more context).
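
For completeness, that cluster-level setup typically looks like the sketch below: a generic PodSecurity AdmissionConfiguration passed to the kube-apiserver, where the exact file location, defaults, and exemption list depend on the hardening guide being followed:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      defaults:
        enforce: restricted
        enforce-version: latest
      exemptions:
        usernames: []
        runtimeClasses: []
        namespaces:
          - kube-system          # illustrative; the real exemption list comes from the hardening guide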

@raulcabello
Contributor

raulcabello commented Oct 13, 2023

We should not add new namespaces to the PSAC exempt list already provided by Rancher, because such namespaces would be user controlled (outside of Rancher's trust boundaries). Additionally, doing such allow lists dynamically is risky.

The recommendation is to fix Fleet's deployment to make sure that all components run with an unprivileged securityContext, as Jiaqi and Arvind mentioned.

I haven't dug deep, but if I saw it correctly, wouldn't it be a case of fixing GitJob's deployment to match Fleet's and Fleet-Agent's deployments, respectively?

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop:
      - ALL

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  privileged: false
  capabilities:
    drop:
      - ALL

Note: @aiyengar2 thanks for pinging us on this.

@macedogm is it OK if we don't add readOnlyRootFilesystem: true to the securityContext? This is causing problems when cloning git repos with go-git (a third-party library), which needs to create temporary files.

The following securityContext, applied to both containers of the Job created by gitjob, is working fine on hardened clusters in our test env.

	SecurityContext: &corev1.SecurityContext{
		AllowPrivilegeEscalation: &[]bool{false}[0],
		Privileged:               &[]bool{false}[0],
		RunAsNonRoot:             &[]bool{true}[0],
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
		Capabilities: &corev1.Capabilities{Drop: []corev1.Capability{"ALL"}},
	},

Is this securityContext enough? See #1860 and rancher/gitjob#331.

@pjbgf
Member

pjbgf commented Oct 13, 2023

This is causing problems when cloning git repos with go-git (a third-party library), which needs to create temporary files.

@raulcabello You can mount an emptyDir to that specific path (/tmp) to bypass this issue.
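
A minimal sketch of that suggestion, assuming the temporary files end up under /tmp (an illustrative pod manifest, not the actual gitjob code):

apiVersion: v1
kind: Pod
metadata:
  name: gitcloner-example          # hypothetical name
spec:
  containers:
    - name: fleet                  # illustrative container
      image: example/fleet:dev     # placeholder image
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp          # go-git temp files land on the emptyDir instead of the read-only root fs
  volumes:
    - name: tmp
      emptyDir: {}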

@macedogm
Member

@raulcabello I agree with Paulo's suggestion above (in case it's feasible).

@sbulage
Contributor

sbulage commented Oct 17, 2023

I have tested Raul's fix with different images and found that the fleet-agent pods are not re-created with the newer images. Until Raul's PR gets merged, he uploaded images and gave them to me for testing.

It throws the error below:

Pods "fleet-agent-79fc9f8d57-8k6kw" is forbidden: violates PodSecurity "restricted:latest": unrestricted capabilities (container "fleet-agent" must set securityContext.capabilities.drop=["ALL"]; container "fleet-agent" must not include "ALL" in securityContext.capabilities.add), runAsNonRoot != true (container "fleet-agent" must not set securityContext.runAsNonRoot=false), seccompProfile (pod or container "fleet-agent" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost"):Updated: 0/1

I looked closely at the PSA-exempted namespace list and saw that the cattle-fleet-local-system namespace, which is managed by the fleet-controller, is absent. In order to test that the patch is working, I followed the steps below:

  1. Updated the PSA file by adding the cattle-fleet-local-system namespace to the exempted namespace list.
  2. Restarted the RKE2 server.
  3. Updated the fleet-agent image.
    With this, I observed that the fleet-agent image gets updated without any error.

@pjbgf @macedogm What should I do? Should I create a new issue for it, and if yes, where? Please let me know, thanks 😄

Ignore this comment as I see cattle-fleet-local-system is already exempted. (Here)

@manno
Member

manno commented Oct 17, 2023

/backport v2.8.0 release/v0.9

@slickwarren
Author

slickwarren commented Oct 17, 2023

@manno or @sbulage it doesn't look like rancher has picked up the new RC. Is this something your team can update so that I can run another round of testing?
rancher/charts#3138

@sbulage
Contributor

sbulage commented Oct 18, 2023

@slickwarren There was an issue with CI in rancher/charts, which seems to be fixed. It will be available once rancher/charts#3138 is merged (hopefully in the next hour or so 🤞).

@raulcabello
Contributor

It is available now

@raulcabello raulcabello moved this from 🏗 In progress to Needs QA review in Fleet Oct 18, 2023
@sbulage
Contributor

sbulage commented Oct 18, 2023

All of the test scenarios below were performed on an RKE2 cluster, both with and without hardening.

Environment Details:

  • Rancher version: 2.8-head
  • Fleet version: 0.9.0-rc.5
  • Downstream cluster: k3s
  • K8s version: v1.27.6+rke2r1

QA TEST PLAN

Scenarios

  1. Test that a GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (hardened cluster).
  2. Test that a GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (non-hardened cluster).
  3. Test that a GitRepo deploys a Helm application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (hardened cluster).
  4. Test that a private GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (hardened cluster).
  5. Test that a private GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (non-hardened cluster).

@sbulage
Contributor

sbulage commented Oct 18, 2023

TEST RESULT

The RKE2 hardened cluster was created by following the documentation.

Scenarios

  1. Test that a GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (hardened cluster). ✔️
  2. Test that a GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (non-hardened cluster). ✔️
  3. Test that a GitRepo deploys a Helm application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (hardened cluster). ✔️
  4. Test that a private GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (hardened cluster). ✔️
  5. Test that a private GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the new workspace (non-hardened cluster). ✔️

REPRO STEPS

RKE2 Hardened cluster used
Scenario 1

  1. Create a new workspace new-workspace1.
  2. Create a GitRepo which deploys nginx.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change workspace of the imported-cluster from fleet-default to new-workspace1.
  5. Verified that the nginx app was deployed to the cluster that was moved to the new workspace, without any issue.

RKE2 Non-hardened cluster used
Scenario 2

  1. Create a new workspace new-workspace2.
  2. Create a GitRepo which deploys nginx.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change workspace of the imported-cluster from fleet-default to new-workspace2.
  5. Verified that the nginx app was deployed to the cluster that was moved to the new workspace, without any issue.

RKE2 Hardened cluster used
Scenario 3

  1. Create a new workspace new-workspace3.
  2. Create a GitRepo which deploys grafana helm application.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change workspace of the imported-cluster from fleet-default to new-3.
  5. Verified that the grafana app was deployed to the cluster that was moved to the new workspace, without any issue.

RKE2 Hardened cluster used
Scenario 4

  1. Create a new workspace new-workspace4.
  2. Create a GitRepo which deploys nginx from private github repository.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change workspace of the imported-cluster from fleet-default to new-workspace4.
  5. Verified that the nginx app was deployed to the cluster that was moved to the new workspace, without any issue.

RKE2 Non-hardened cluster used
Scenario 5

  1. Create a new workspace new-workspace5.
  2. Create a GitRepo which deploys grafana from private github repository.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change workspace of the imported-cluster from fleet-default to new-workspace5.
  5. Verified that the grafana app was deployed to the cluster that was moved to the new workspace, without any issue.

@kkaempf
Collaborator

kkaempf commented Oct 18, 2023

@slickwarren - please give it a try 😉

@slickwarren
Author

My tests through Rancher using rc5 are working well, and I've closed the Rancher-side issues. Not sure of Fleet's process for closing these, but rancher-qa signs off on this 👍🏼

@sbulage
Contributor

sbulage commented Oct 19, 2023

Thanks @slickwarren I will close this issue now. 👍

@sbulage sbulage moved this from Needs QA review to ✅ Done in Fleet Oct 19, 2023
@zube zube bot closed this as completed Oct 19, 2023
@zube zube bot removed the [zube]: Done label Jan 17, 2024