
How can feature flags in the provenance better reflect the execution of the pipeline run? #6797

Open
chitrangpatel opened this issue Jun 8, 2023 · 10 comments

Comments

@chitrangpatel
Contributor

This issue considers the use case of a PipelineRun (the same applies to a TaskRun).

Feature flags are used by the validation webhook and the reconciler to guard API spec usage. We currently store them in the PipelineRun.Status.Provenance field. The feature flags are used by Tekton Chains and stored in the generated provenance to indicate that the PipelineRun (or the build) was executed with these feature flags. However, the feature flags can be updated on the fly by updating the ConfigMap. If this happens mid-way during the build process, the information surfaced in the provenance is incomplete because the flags were mutated during the pipeline run. This means that some pipeline tasks could have run with a different set of feature flags.
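To make the mechanics concrete, here is a minimal, self-contained sketch of the copy that happens (the types below are simplified stand-ins, not the actual Tekton API): the controller records whatever flags it currently sees into the run's status, and that one-time snapshot is what ends up in the provenance.

```go
// Illustrative sketch only: FeatureFlags, Provenance and PipelineRunStatus
// are simplified stand-ins for the real Tekton types.
package main

import "fmt"

// FeatureFlags is a simplified snapshot of the feature-flags ConfigMap.
type FeatureFlags map[string]string

// Provenance is the part of the run status that Chains reads.
type Provenance struct {
	FeatureFlags FeatureFlags
}

type PipelineRunStatus struct {
	Provenance *Provenance
}

// recordProvenance copies whatever flags the controller currently sees into
// the run status. If the ConfigMap is edited mid-run, later reconciles (and
// later TaskRuns) can see different values than the ones recorded here.
func recordProvenance(status *PipelineRunStatus, current FeatureFlags) {
	snapshot := FeatureFlags{}
	for k, v := range current {
		snapshot[k] = v
	}
	status.Provenance = &Provenance{FeatureFlags: snapshot}
}

func main() {
	status := &PipelineRunStatus{}
	recordProvenance(status, FeatureFlags{"enable-api-fields": "beta"})
	fmt.Println(status.Provenance.FeatureFlags)
}
```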

Proposed solutions in Issue #5999 suggested the use of the status field instead of context.

The challenge was as follows:

  • Validation happens in two places: the webhook and the reconciler.
  • If the feature flags are saved in the status of the TaskRun/PipelineRun, then when we validate a stand-alone pipelineSpec/taskSpec (particularly at the webhook stage), that status field is not accessible, so we cannot rely on it. The status idea would only work if we were validating at the reconciler level alone.
  • This means that we need to rely on the context. Now, the challenge here is that the webhook and the controller watch for updates to the ConfigMap, and the context is updated when changes are detected (I think; please correct me if I'm wrong). This means that we cannot rely on the context directly either (see the sketch below).
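Here is a minimal, self-contained sketch (not Tekton's actual implementation) of why reading the flags through the context mirrors the live ConfigMap: the watcher updates a shared store, and every read via the context sees the latest values, so the webhook and a later reconcile can observe different flags for the same run.

```go
// Illustrative sketch of a config store attached to the context; the names
// flagStore, OnConfigMapChanged and FromContext are made up for this example.
package main

import (
	"context"
	"fmt"
	"sync"
)

type flagStore struct {
	mu    sync.RWMutex
	flags map[string]string
}

// OnConfigMapChanged is what a ConfigMap watcher would call on every update.
func (s *flagStore) OnConfigMapChanged(data map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.flags = data
}

type storeKey struct{}

// ToContext attaches the store (not a frozen snapshot) to the context.
func (s *flagStore) ToContext(ctx context.Context) context.Context {
	return context.WithValue(ctx, storeKey{}, s)
}

// FromContext returns whatever the store holds right now.
func FromContext(ctx context.Context) map[string]string {
	s := ctx.Value(storeKey{}).(*flagStore)
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := map[string]string{}
	for k, v := range s.flags {
		out[k] = v
	}
	return out
}

func main() {
	store := &flagStore{flags: map[string]string{"enable-api-fields": "beta"}}
	ctx := store.ToContext(context.Background())

	fmt.Println("webhook validation sees:", FromContext(ctx))

	// An operator edits the ConfigMap while the run is in flight.
	store.OnConfigMapChanged(map[string]string{"enable-api-fields": "stable"})

	fmt.Println("reconciler validation sees:", FromContext(ctx))
}
```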

This issue is to brainstorm about how the feature flags in the provenance can better reflect the execution of the pipeline run.

@lbernick
Member

lbernick commented Jun 8, 2023

Thanks Chitrang! Just want to note one additional comment you made: "The feature flags in the provenance... assumes that both pipeline and pipelinerun were validated with the same set of feature flags".

Currently, Pipelines are validated on creation and again when they are used in PipelineRuns. Feature flags might have changed since pipeline creation, but the value of feature flags when the pipelinerun is executed should reflect what was used to validate the pipeline spec. The one exception is "enable-api-fields" as of #6701.

@chitrangpatel
Contributor Author

chitrangpatel commented Jun 8, 2023

Execution of the pipeline/task spec.

e.g. after something like kubectl apply -f pipeline.yaml.

[screenshot: 2023-06-08 at 11:21:56 AM]
  • the validation webhook validates the API spec, which is then stored in etcd. Note that at this point, the feature flags stored in the ConfigMap are used during validation.

Execution of the pipelinerun spec.

e.g. after something like kubectl apply -f pipelinerun.yaml.

[screenshot: 2023-06-08 at 11:21:52 AM]

Note here that the feature flags are used in multiple places when reconciling a PipelineRun. A change at any point here could result in a new set of feature flags being used for part of the build.

What is a build?

  • a pipeline.yaml is an ingredient of the build. It gets accepted/rejected by the webhook depending on the feature flags defined at the time and the API syntax used. But do we say that the build process has begun from the point of applying a pipeline.yaml? I don't think so.
  • a pipelinerun.yaml references a pipeline that was applied via pipeline.yaml. I think the build starts as soon as the pipelinerun.yaml is applied. But the pipeline was validated a while ago and could have been validated with a different set of feature flags.

Possible solution

  1. We could use the mutating admission webhook to capture the feature flags for the PipelineRun in its status.
  2. We confirm that the feature flags in the ConfigMap are still the same as what we stored in the status (during the mutating admission webhook stage), ensuring that the config was not updated.
  3. As soon as we have access to the pipeline spec from the API server, we validate it again with the current feature flags. This enforces that the previously applied pipeline is re-validated against the current feature flags.
  4. Before creating the TaskRuns, we confirm that the feature flags in the ConfigMap are still the same as what we stored in the status (during the mutating admission webhook stage), ensuring that the config was not updated.
  5. We repeat steps 1, 2 and 3 for all the created TaskRuns/taskSpecs as well.
  6. We use the owner references for the task and compare the feature flags stored in the TaskRun's status to those in its owner (i.e. the PipelineRun's status). This ensures that the TaskRun ran with the same feature flags as its owner.

If at any point we detect a change between the feature flags stored in the status and those in the ConfigMap, we could throw an error (see the sketch below)! This ensures that we do not execute pipelines with mutated feature flags, which would in turn ensure that the feature flags stored in the provenance (i.e. for the entire duration of the PipelineRun/build) were not mutated.
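A rough sketch of what the comparison in steps 2, 4 and 6 could look like; the helper name and error handling are made up for illustration and are not an existing Tekton API.

```go
// Hypothetical helper: compare the flags frozen in the run's status against
// the flags currently in the ConfigMap and fail the run if they diverged.
package main

import (
	"fmt"
	"reflect"
)

// ensureFlagsUnchanged returns an error when the live ConfigMap no longer
// matches the snapshot taken by the mutating admission webhook.
func ensureFlagsUnchanged(stored, current map[string]string) error {
	if !reflect.DeepEqual(stored, current) {
		return fmt.Errorf("feature flags changed during execution: recorded %v, current %v", stored, current)
	}
	return nil
}

func main() {
	stored := map[string]string{"enable-api-fields": "beta"}
	current := map[string]string{"enable-api-fields": "stable"}
	if err := ensureFlagsUnchanged(stored, current); err != nil {
		// The reconciler would mark the PipelineRun as failed here.
		fmt.Println(err)
	}
}
```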

@chitrangpatel
Contributor Author

chitrangpatel commented Jun 8, 2023

@lbernick @chuangw6 I have a possible solution that might work. I'd like to get your thoughts on this. see ☝️

@lbernick
Member

lbernick commented Jun 9, 2023

The idea of a mutating admission webhook seems reasonable to me.
Instead of confirming that feature flags haven't changed, would it make sense to just use the feature flags from the status rather than feature flags from the configmap? If we threw an error if feature flags changed, this would make it hard for cluster operators to change the value of feature flags without causing running pipelineruns to fail.
It would be interesting to see if we can pass feature flags to the taskrun from the pipelinerun in some way-- I'm not sure if you can create a taskrun with a status.
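A tiny sketch of what "use the flags from the status" could look like; the function is hypothetical and only illustrates the lookup order.

```go
// Sketch of the alternative above: instead of erroring, the reconciler simply
// prefers the snapshot recorded in the run's status and only falls back to the
// live ConfigMap when no snapshot exists (e.g. for older runs).
package main

import "fmt"

func effectiveFlags(statusSnapshot, configMap map[string]string) map[string]string {
	if len(statusSnapshot) > 0 {
		// Frozen at admission time; ConfigMap edits don't affect this run.
		return statusSnapshot
	}
	return configMap
}

func main() {
	snapshot := map[string]string{"enable-api-fields": "beta"}
	live := map[string]string{"enable-api-fields": "stable"}
	fmt.Println(effectiveFlags(snapshot, live)) // map[enable-api-fields:beta]
}
```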

@vdemeester
Member

vdemeester commented Jun 20, 2023

If this happens mid-way during the build process, the information surfaced in the provenance is incomplete because the flags were mutated during the pipeline run. This means that some pipeline tasks could have run with a different set of feature flags.

This is indeed true, but is it something we want to disallow or not? We probably want to disallow it; a bit like storing the "spec" of a Pipeline in the PipelineRun.status so that we are not affected if it changes, it seems to make sense.

Proposed solutions in Issue #5999 suggested the use of the status field instead of context.

What does context mean here? (Is it context.Context in code? If so, how/why do we compare a field with something in code that is purely an implementation detail?)

In code we can use context even if we change how the content of the context is populated.

  • We could use the mutating admission webhook to capture the feature flags for the PipelineRun in its status.

This is what the "defaulting" admission controller does today, yes.

6. We use the owner references for the task and compare the feature flags stored in the TaskRun's status to those in its owner (i.e. the PipelineRun's status). This ensures that the TaskRun ran with the same feature flags as its owner.

This would be very slippery to do, as essentially it would mean the TaskRun controller needs to query the PipelineRun, creating a kind of cyclic loop between the two controllers. Up until now, the TaskRun has never had to know where it comes from (PipelineRun, standalone, something else), and it's for the best. For example, if I have my own object called Build that creates a TaskRun and I set it as the ownerRef of the TaskRun, how can the TaskRun controller verify anything from that owner if it doesn't know about it? I have a suggestion for this later on.

Instead of confirming that feature flags haven't changed, would it make sense to just use the feature flags from the status rather than feature flags from the configmap? If we threw an error if feature flags changed, this would make it hard for cluster operators to change the value of feature flags without causing running pipelineruns to fail.

Definitely yes.

It would be interesting to see if we can pass feature flags to the taskrun from the pipelinerun in some way-- I'm not sure if you can create a taskrun with a status.

No, it shouldn't be possible to create an object with an already-populated status. Even to mutate the status, through kubectl for example, you need to use --subresource='status' (in kubectl patch) to actually make it work.

I think the "worry" here is that we could create a Pipeline that is valid with one set of feature flags and not with another, right? That makes execution kind of hellish, because you could end up with a Pipeline in your cluster that is invalid according to the "current" values of the feature-flag ConfigMap.

This is the behaviour today, and there is little we can do about it really. I do think we should adopt the simplest path (both for users and for implementation). We need to decouple (in our heads) the Pipeline and the PipelineRun and where the Pipeline comes from: what I mean is that, from the PipelineRun's perspective, there should be absolutely no difference whether the referenced Pipeline comes from in-cluster or outside (embedded, resolvers, …).

  • If the Pipeline referenced by the PipelineRun is invalid with the feature flags at the time of creation, the reconciler fails the PipelineRun directly (with an error that is easy for the user to understand).
  • If the Pipeline referenced by the PipelineRun is valid with the feature flags at the time of creation, the reconciler (or the defaulting admission controller) stores the feature-flag values in the PipelineRun (somewhere) and relies on them for the rest of the execution.
  • The same applies to Task and TaskRun for the two points above.
  • A Pipeline that is in the cluster but whose spec is invalid with the feature flags at the time of mutation will be validated with the current feature flags when it is changed (it could surprise the user, but... meh).
  • Finally, one suggestion would be to think of a way to pass the feature flags per TaskRun (and probably even per object) and per namespace as well (see TEP-0085).

With the suggestion above, the existing defaulting admission controller could set those feature-flag values per object (let's say, using annotations), and the controllers (PipelineRun and TaskRun) would rely on them; if they are not present, they would fall back to the ConfigMap, as it works today.
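A rough sketch of what this could look like; the annotation prefix and helper names are invented for illustration and are not an existing Tekton convention.

```go
// Sketch of the per-object idea above: the defaulting webhook stamps the flags
// onto the object as annotations, and the controllers prefer those annotations,
// falling back to the ConfigMap as today.
package main

import (
	"fmt"
	"strings"
)

const flagAnnotationPrefix = "features.tekton.dev/" // hypothetical prefix

// stampFlags is what the defaulting admission webhook would do on creation.
func stampFlags(annotations, flags map[string]string) {
	for k, v := range flags {
		annotations[flagAnnotationPrefix+k] = v
	}
}

// flagsFor is what the PipelineRun/TaskRun controllers would do: read the
// stamped annotations if present, otherwise fall back to the live ConfigMap.
func flagsFor(annotations, configMap map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range annotations {
		if strings.HasPrefix(k, flagAnnotationPrefix) {
			out[strings.TrimPrefix(k, flagAnnotationPrefix)] = v
		}
	}
	if len(out) == 0 {
		return configMap
	}
	return out
}

func main() {
	ann := map[string]string{}
	stampFlags(ann, map[string]string{"enable-api-fields": "beta"})
	fmt.Println(flagsFor(ann, map[string]string{"enable-api-fields": "stable"}))
}
```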

Also, one thing to note: we do not expect users to change feature flags on a regular basis 👼🏼

@chitrangpatel
Contributor Author

chitrangpatel commented Jun 20, 2023

Thanks for your feedback @vdemeester!!

If this happens mid-way during the build process, the information surfaced in the provenance is incomplete because the flags were mutated during the pipeline run. This means that some pipeline tasks could have run with a different set of feature flags.

Proposed solutions in Issue #5999 suggested the use of the status field instead of context.

What does context mean here? (Is it context.Context in code? If so, how/why do we compare a field with something in code that is purely an implementation detail?)

Yes, I meant context.Context, which in turn gets its values from the ConfigMap that is watched by the controllers.

  1. We use the owner references for the task and compare the feature flags stored in the TaskRun's status to those in its owner (i.e. the PipelineRun's status). This ensures that the TaskRun ran with the same feature flags as its owner.

This would be very slippery to do, as essentially it would mean the TaskRun controller needs to query the PipelineRun, creating a kind of cyclic loop between the two controllers. Up until now, the TaskRun has never had to know where it comes from (PipelineRun, standalone, something else), and it's for the best. For example, if I have my own object called Build that creates a TaskRun and I set it as the ownerRef of the TaskRun, how can the TaskRun controller verify anything from that owner if it doesn't know about it? I have a suggestion for this later on.

Yes, I also feared this from the performance side (i.e. querying the parent object on every reconcile loop).

It would be interesting to see if we can pass feature flags to the taskrun from the pipelinerun in some way-- I'm not sure if you can create a taskrun with a status.

No, it shouldn't be possible to create an object with an already-populated status. Even to mutate the status, through kubectl for example, you need to use --subresource='status' (in kubectl patch) to actually make it work.

I agree. I tried creating a TaskRun with a status but it did not work: the status of the created TaskRun did not have what I had populated.

I think the "worry" here is that we could create a Pipeline that is valid with one set of feature flags and not with another, right? That makes execution kind of hellish, because you could end up with a Pipeline in your cluster that is invalid according to the "current" values of the feature-flag ConfigMap.

I think the worry here is more about mutating the feature flags in the middle of the PipelineRun: basically, we cannot guarantee that the feature flags recorded in the provenance are the ones that produced the build.
If we say that the build starts from the creation of the PipelineRun then, as you say below, it does not matter what the feature flags were when the pipeline spec was applied, because validation will make it fail with a clear message.

This is the behaviour today, and there is little we can do about it really. I do think we should adopt the simplest path (both for users and for implementation). We need to decouple (in our heads) the Pipeline and the PipelineRun and where the Pipeline comes from: what I mean is that, from the PipelineRun's perspective, there should be absolutely no difference whether the referenced Pipeline comes from in-cluster or outside (embedded, resolvers, …).

  • If the Pipeline referenced by the PipelineRun is invalid with the feature flags at the time of creation, the reconciler fails the PipelineRun directly (with an error that is easy for the user to understand).
  • If the Pipeline referenced by the PipelineRun is valid with the feature flags at the time of creation, the reconciler (or the defaulting admission controller) stores the feature-flag values in the PipelineRun (somewhere) and relies on them for the rest of the execution.
  • The same applies to Task and TaskRun for the two points above.
  • A Pipeline that is in the cluster but whose spec is invalid with the feature flags at the time of mutation will be validated with the current feature flags when it is changed (it could surprise the user, but... meh).
  • Finally, one suggestion would be to think of a way to pass the feature flags per TaskRun (and probably even per object) and per namespace as well (see TEP-0085).

With the suggestion above, the existing defaulting admission controller could set those feature-flag values per object (let's say, using annotations), and the controllers (PipelineRun and TaskRun) would rely on them; if they are not present, they would fall back to the ConfigMap, as it works today.

I did think of using annotations earlier, but the challenge is that annotations can be modified. If we embed these in the status, then we can use the SPIRE integration to guarantee that the feature flags were not modified in the middle of the run.

Also, one thing to note: we do not expect users to change feature flags on a regular basis 👼🏼
Agreed.

@chitrangpatel
Contributor Author

chitrangpatel commented Jun 20, 2023

Another possible solution:

What if we created a new ConfigMap/Secret per PipelineRun that stores a snapshot of the feature flags from the moment the PipelineRun object is created? The key could be the PipelineRun's metadata UID, which allows the feature flags for the pipeline and its underlying tasks to be fetched cleanly.
Once the PipelineRun finishes, the controller removes this resource so that we don't end up with lots of ConfigMaps over time.

This way, only operators with the privilege to update ConfigMaps can actually modify them in the middle of a run. End users cannot fake them into the provenance.
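A rough sketch of what this could look like with client-go; the ConfigMap naming, the owner reference, and the cleanup step are assumptions for illustration, not an existing Tekton mechanism.

```go
// Sketch of the per-run snapshot idea above. The owner reference is an
// assumption that would also give Kubernetes garbage collection as a backstop.
package flagsnapshot

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// snapshotFlags creates a ConfigMap keyed by the PipelineRun UID holding the
// feature flags at creation time, owned by the PipelineRun.
func snapshotFlags(ctx context.Context, c kubernetes.Interface, ns, runName string, runUID types.UID, flags map[string]string) error {
	ctrl := true
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "feature-flags-" + string(runUID),
			Namespace: ns,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "tekton.dev/v1",
				Kind:       "PipelineRun",
				Name:       runName,
				UID:        runUID,
				Controller: &ctrl,
			}},
		},
		Data: flags,
	}
	_, err := c.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
	return err
}

// cleanupFlags is called when the PipelineRun completes so snapshots don't
// accumulate over time.
func cleanupFlags(ctx context.Context, c kubernetes.Interface, ns string, runUID types.UID) error {
	return c.CoreV1().ConfigMaps(ns).Delete(ctx, "feature-flags-"+string(runUID), metav1.DeleteOptions{})
}
```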

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 18, 2023
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 18, 2023
@vdemeester
Member

/remove-lifecycle rotten

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 23, 2023