Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Volumes, Tasks and PipelineResources overlap #1272

Closed
vtereso opened this issue Sep 4, 2019 · 17 comments
Closed

Volumes, Tasks and PipelineResources overlap #1272

vtereso opened this issue Sep 4, 2019 · 17 comments
Labels
area/api Indicates an issue or PR that deals with the API. kind/design Categorizes issue or PR as related to design.

Comments

@vtereso
Copy link

vtereso commented Sep 4, 2019

Background

Tekton Pipelines/Tasks separate functionality from configuration by acting as reusable components for work. Work is actualized/instantiated in the correspondingPipelineRun/TaskRun objects (noted as Run objects onward for the sake of brevity). Run objects store the work configuration through both PipelineResources and parameters.

Goals

  • Understand the true intent/purpose behind PipelineResources
  • Maybe remove PipelineResources
  • Potentially augment volume handling in Tekton

Introduction

According to the documentation:

PipelinesResources in a pipeline are the set of objects that are going to be used as inputs to a Task and can be output by a Task.

This does not not explain their purpose, but rather their current embodiment.
To the case of resuability, it makes a lot of sense to structure things in a functional way (e.g. f(x)=y) where Pipelines/Tasks are the functions that ingest configuration. In contrast, it seems that PipelineResources seem to fall somewhere in-between.

PipelineResource Role

Although each kind of PipelineResource does something different, I believe it is reasonable to consider them as syntactically sweet mounts. In general, mounts are really helpful because they allow some foreign information to be attached to a pod/container.

Without mounts, there would only be two options:

  • Create wrapper images around base images that expect configuration to be injected
  • Have functionality and configuration coupled, which is less than ideal

In actuality, PipelineResources append steps to tasks and handle volumes. Since Tasks are supposed to be reusable (e.g. the catalog), it seems strange to add more carpentry to manipulate their definition. More strange, each PipelineResource does something unique rather than being a normalized operation, which also lends itself to an extensibility problem. PipelineResources would likely be more clear if their responsilities were divided between appending steps and handling volumes. It seems like the orchestration of the volume is the important piece, where the step appending is somewhat less so. PipelineResources can be reorganized as Tasks that could be composed together.

Let's look at a few different PipelineResources:

Git Resource

Git resource represents a git repository, that contains the source code to be built by the pipeline. Adding the git resource as an input to a Task will clone this repository and allow the Task to perform the required actions on the contents of the repo.

Pull Request Resource

Adding the Pull Request resource as an input to a Task will populate the workspace with a set of files containing generic pull request related metadata such as base/head commit, comments, and labels.

...

Adding the Pull Request resource as an output of a Task will update the source control system with any changes made to the pull request resource during the pipeline.

In just these two use cases (although there are more), PipelineResources do help simplify mounting. However, there is a bit of quirkiness between PipelineResources. Some resources like GitResources seem to only be input resources, while PullRequestResources are both, but likely never as inputs to other Tasks. In any case, they could very operate as Tasks rather than being shipped around between Tasks. This sort of behavior is outlined here.

All of the currently supported PipelineResources can be seen here, which is likely to grow especially with consideration for integrations with the notifications proposal. I think it's important to make a distinction on the overlap of responsibilities between Volumes, Tasks, and PipelineResources before we invest further.

PipelineResources Problems

As mentioned in the background section, Run objects take parameters and PipelineResources.
Since these are distinct objects, PipelineResources need to be created ahead of time (separately), which is an inconvenience.

As a byproduct, there is currently a proposal to allow for PipelineResource embedding into the PipelineRun (although it remains a distinct k8s resource) to address this as well as resource littering. The aforementioned extensability problem is also a concern. Further, PipelineResources can be tampered with/deleted. This introduces the cookie-cutter/templating problem, where Run objects utilizing PipelineResources (at least as current) cannot determine whether they have been modified during or between runs.

Potential Solution

Add some logic into the Tekton API to declaratively handle volumes to facilitate data between steps/Tasks in a reusable way.

There are multiple ways to do this, but ultimately this gets as making Tekton much more simple in a few different ways:

  • Simplified yaml
  • Cutting down on interpolation to only params.
  • Reduce load on reconciler
  • Clear separation of resources and more catalog contributions

This also has me wonder: If images and other fields coulds be overrided by Runs, would interpolation be necessary at all? With no interpolation (maybe this is a stretch), tools like Kustomize (not that I am familiar with it) or the sort would be able to be the engine to edit resources rather than it being done internally.

@ghost ghost added area/api Indicates an issue or PR that deals with the API. kind/design Categorizes issue or PR as related to design. labels Sep 4, 2019
@ghost
Copy link

ghost commented Sep 4, 2019

One of the benefits of PipelineResources is that they declare the expected types of inputs into and outputs out of a Task. If we were to get rid of them how could a Task describe what it operates on and what it produces?

@vtereso
Copy link
Author

vtereso commented Sep 4, 2019

One of the benefits of PipelineResources is that they declare the expected types of inputs into and outputs out of a Task. If we were to get rid of them how could a Task describe what it operates on and what it produces?

The above approach does everything with volumes. I imagine the PipelineRun effectively mapping volumes to Tasks where there is no such thing as an input/output, but rather a volume that was previously mounted and modified. Rather than a producer/consumer paradigm, I see this sort of pipeline like an assembly line where each piece just expects it has the necessary tools (more or less the same as current).

Since execution is a DAG, the path of the volume is also known. Of course, mounting would be optional since not all tasks need to share their state. The only weird situations seems like when the DAG fans out and multiple Tasks want access to the volume. In this case, volume cloning seems like a decent solution? Alternatively, this could be restricted or they could be run sequentially in lockstep stages. This has been mentioned similarly here were restricting seems to have been the suggestion.

This approach is less structured than PipelineResources (where a certain form is expected), but I think it has considerable merit. Users will likely be more familiar with volumes, there is less black magic/obfuscation, it is natively "extendable" and it seems easy enough to debug. Thoughts @sbwsg?

@ghost
Copy link

ghost commented Sep 5, 2019

Couple thoughts from my pov. I'm coming at this from a somewhat selfish point of view - I've written and run CI pipelines for a few different projects and so my reaction is based on some bad experiences with existing tools rather than necessarily from the perspective of someone writing a brand new platform on top of Tekton. I can see value in the idea of having easy generic storage passed between Tasks but I think there are considerable benefits to Tekton's declarative resources too. Anyway, here are my immediate thoughts:

  1. I'm generally in favor of anything that can help users or integrating systems validate input/output, detect errors, or otherwise fail early. CI/CD pipelines can take hours - days in some cases - and discovering halfway through that you referenced a file on a volume with the wrong path and just wasted time on a run can be frustrating. Using generic volumes of data in place of PipelineResources feels like we'd be losing a lot of opportunity to do some kind of helpful linting and validating. Would you push that kind of work into Tasks or Conditions instead, or something else? I see a risk that we end up burdening a potential user more with this approach, even if they have less YAML.

  2. I'm not clear which kinds of users we're aiming to help here. Do you have a sense of who benefits most from this change and why? And of the groups that it doesn't help, are there possibilities that we end up hurting their use cases instead? We've attempted to codify some of the user types here if it helps to frame the conversation: https://github.com/tektoncd/community/blob/master/user-profiles.md

  3. Is there an opportunity to have both instead of one-or-other? It may be that there are really strong use cases for improved volume support and also for typed resources. I'd personally always prefer to see a feature implemented non-destructively and then naturally adopted and approved by the community as the better choice before removing the thing it's replacing.

@vtereso
Copy link
Author

vtereso commented Sep 5, 2019

I'm generally in favor of anything that can help users or integrating systems validate input/output, detect errors, or otherwise fail early.

I think this is a really strong point and the main concern with this sort of approach. I wonder if the pipeline step/process had already been going for a long time, would the failure at this particular section be intrinsic to the volume? Maybe in the same way, a PipelineResource would have failed to be output? For concerns about mounting location, something like a volume diff of a step or restricting/recommendation all data to be stored in a single dir are my initial ideas. This way, if there is a failure relating to a volume, it is the fault of the previous step for succeeding.

Somewhat similar, if a contract needed to be fulfilled between Tasks, maybe a ConfigMap would be another option. As I see it, PipelineResources are mostly ConfigMaps wrappers anyway, albeit ConfigMaps have "arbitrary" data. Ultimately, I think it's always up to the user to correctly or incorrectly pass data around where constraining the pass data to a type like GitResource or PullRequestResource isn't infallible either (they could have empty/bogus values).

Originally, I put some consideration to using ConfigMaps to replace PipelineResources, but they still share some of the same issues such as having to be generated ahead of time of PipelineRuns/TaskRuns. However, if they were generated within PipelineRuns/TaskRuns (acting sort of like a volume), that might be more advantageous?

Is there an opportunity to have both instead of one-or-other?

I was just looking to start a discussion to consider this approach since there any many issues and conversations about improving PipelineResources and this is some sort of alternative. The release schedule includes items like PipelineResources are fixed, so I suppose this was an attempt to potentially forgo that work (many ifs), but that probably isn't realistic. My only concern is that it seems like having multiple resources encapsulate logic will always carry some caveats, where this seems better at face value. I don't claim to know everything necessary to make this happen or other potential problems this introduces. I'm not opinionated one way or the other, but I feel that if one design was satisfactory it would make sense to favor one in the long run. If considered, I agree it would only make sense to gauge adoption before any sort of removing of features.

@vtereso
Copy link
Author

vtereso commented Sep 5, 2019

To the point about who benefits from this:

Pipeline Distributor

I think broadly speaking it makes the catalog more well defined since you don't need to create PipelineResources ahead of time, where catalog Task with only params encapsulate everything (assuming dynamic volume generation/etc.)

Platform Builder

Instead of having to fork the code base to add their own PipelineResource, they could use this new paradigm that enables anything. Currently (and AFAIK), any use case that is not covered by the PipelineResource would have to be handled by volumes/secrets in any case?

Tekton Developer

Instead of having to deal with new PipelineResources, new first-party Tasks could be added that would make things more agile than otherwise. Also, assuming this was a success and adopted (a big if), the codebase could be consolidated somewhat.

I think other roles also get indirect benefits from the above.

@vtereso vtereso changed the title Potentially remove PipelineResources Volume support over PipelineResources Sep 6, 2019
@vtereso vtereso changed the title Volume support over PipelineResources Volumes, Tasks and PipelineResources overlap Sep 6, 2019
@vtereso
Copy link
Author

vtereso commented Sep 6, 2019

I haven't looked https://github.com/tektoncd/pipeline/pull/1184/files over extensively, but I suppose my thinking is that this is not just a particular PipelineResource kind, but could effectively model any inter-step communication. Then taking this one step forward, do so in a more flexible way that doesn't have the pain points as current. The above ConfigMap idea is more or less a generic way of passing around params between tasks that is a hybrid/middleground between volumes and current PipelineResources.

@ghost
Copy link

ghost commented Sep 6, 2019

I wonder if #1285 alleviates some of the same pain. A FileSet is essentially an "untyped" group of files to be output from one Task and input to another but a benefit I see there is that the paths of those files is declared as part of the resource.

@vtereso
Copy link
Author

vtereso commented Sep 6, 2019

To the same point of PipelineResources being dicey, I don't feel particularly strong about one PipelineResource being able to reference another as this further compounds/complicates things. I do like the idea between being able to select between raw data and config maps. Would the raw data support a compressed directory? That would essentially satisfy the "volume" approach. In any case, I think the volume PipelineResource and FileSet PipelineResource are a step toward being more generic, where at the very least, extensibility would be a non issue.

@imjasonh
Copy link
Member

@vtereso I'm having trouble imagining a world without PipelineResources.

Can you give me an example pipeline (perhaps a version of this canonical example pipeline) that doesn't define any PipelineResources? How would a Task know what a previous Task produced, and how would the overall PipelineRun report on the new state of things updated during its run?

@vtereso
Copy link
Author

vtereso commented Sep 12, 2019

There are many different ways this could be accomplished, but it goes without saying this isn't entirely without some compromise as mentioned previously by @sbwsg:

One of the benefits of PipelineResources is that they declare the expected types of inputs into and outputs out of a Task.

I had a demo on last week's community call where I showed a rough idea of using the PullRequestResource as follows:

apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  name: pull-request-pull-task
spec:
  inputs:
    params:
    - name: pullRequestUrl
      description: The url of the pull request
  volumes:
    - name: clone-dir
      persistentVolumeClaim:
        claimName: host-path-pvc
  steps:
  - name: pull-request-pull
    image: github.com/tektoncd/pipeline/cmd/pullrequest-init
    volumeMounts:
    - mountPath: /workspace2
      name: clone-dir
    command: 
    - /ko-app/pullrequest-init
    env:
    - name: GITHUBTOKEN
      valueFrom:
        secretKeyRef:
          name: github-secret
          key: tekton_volumes_token.txt
    args:
    - -url
    - ${inputs.params.pullRequestUrl}
    - -path
    - /workspace2
    - -mode
    - download

The same volume mount was used throughout. This is less than ideal since the claim is referenced within the Task, which should be reusable. The same should hold as for the above with pulling the repository, building images and the like. Ideally, the TaskRun/PipelineRun would inject the actual volume(s) where the Task would just expect it. Parameters/ConfigMaps/signals or something might pair well with this sort of flow to help provide more validation.

Going more with the implicit flow, perhaps some sort of dry run option/UI could be provided to explicitly model out what will happen for a particular run.

@bobcatfish
Copy link
Collaborator

Hey @vtereso ! Thanks for starting this discussion and helping us to critically examine PipelineResources :D

As you pointed out, our docs on PipelineResources focus completely on how they currently work and not really on why we think they're an important abstraction. I like @dlorenc 's definition from #1076 a lot more than the one in our docs right now:

Resources in general should represent reconciliation between a system in the outside world and a filesystem injected into a Tekton Task.

We can take this further by defining what input and output resources mean:

  • Input resources are a way to represent the outside world on disk
  • Output resources are a way to update the outside world from data on disk

Although each kind of PipelineResource does something different, I believe it is reasonable to consider them as syntactically sweet mounts

The way we use PipelineResources today is basically PVC (or GCS bucket) wrapped with input steps and output steps. This is actually changing thanks to #1076, and for many PipelineResources, there will be no mount at all. But there will be data, so I feel like maybe it's true to say something like:

  • Each PipelineResource is syntactically sweet data

I'd personally always prefer to see a feature implemented non-destructively and then naturally adopted and approved by the community as the better choice before removing the thing it's replacing.

Actually, everything that @vtereso is describing is possible today - folks could use Tasks completely without PipelineResources and can use volumes and ConfigMaps and any other kube thing that they want.

I think broadly speaking it makes the catalog more well defined since you don't need to create PipelineResources ahead of time, where catalog Task with only params encapsulate everything (assuming dynamic volume generation/etc.)

So I think you're saying @vtereso that if you look for example at the goland build Task in the catalog and you see:

    resources:
    - name: source
      type: git
      targetPath: src/${inputs.params.package}

The user still has to understand how to instantiate the git PipelineResource in order to use the Task, but in your example above, that's done explicitly in the Task so all of the logic is encapsulated in one place?

The biggest con that I see is that in the above example, the section that pulls from git will need to be repeated in every Task that needs to do it:

  - name: pull-request-pull
    image: github.com/tektoncd/pipeline/cmd/pullrequest-init
    volumeMounts:
    - mountPath: /workspace2
      name: clone-dir
    command: 
    - /ko-app/pullrequest-init
    env:
    - name: GITHUBTOKEN
      valueFrom:
        secretKeyRef:
          name: github-secret
          key: tekton_volumes_token.txt
    args:
    - -url
    - ${inputs.params.pullRequestUrl}
    - -path
    - /workspace2
    - -mode
    - download

With a PipelineResource that provides this, Task authors can assume that the git interaction is handled for them (and with FileSet like @sbwsg was pointing out, they don't even need to know where the files came from), and that it's well tested and vetted.

@vtereso
Copy link
Author

vtereso commented Sep 14, 2019

@bobcatfish That was a good summary.

The biggest con that I see is that in the above example, the section that pulls from git will need to be repeated in every Task that needs to do it

What do you mean by this? My rationale behind Tasks being analogous to the PipelineResources is because in the above/most/all scenarios a TaskRun/PipelineRun will need to be created regardless, where this would be where the parameters are injected rather than in the ahead of time PipelineResource. With consideration for the catalog or otherwise (e.g. shipping certain Task with the Tekton install), it should be easy enough for users incorporate these into workflows as current.

I don't think that volumes can be passed into Tasks in a declarative way at the moment (they need to be hard specified like my above Task example?), where this was my motivation behind the goal to "Potentially augment volume handling in Tekton".

Although it does not always apply, I generally like the idea of having one robust way to do things rather than multiple to achieve different use cases. If it is actually possible to do this already using Tekton in the same fully reusable form, I would find it strange that users probably don't know about this.

I think your point about PipelineResources being "syntactically sweet data" and potentially not using mounts is notable. Also, this also gets into the heard of the overlap as:

- Input resources are a way to represent the outside world on disk
- Output resources are a way to update the outside world from data on disk

seems very close to describing volumes and the like to me. If something like the FileSet resource were introduced, I believe this sort of becomes a one-size fits all sort of abstraction, where we again get into volume territory. With that said, I think it is a good idea since it gets around dealing with volumes and provides essentially the same at face value. To keep with the current PipelineResource paradigm, it makes sense to label it as a kind of such, but perhaps it would become the PipelineResource.

@vdemeester
Copy link
Member

vdemeester commented Sep 16, 2019

@vtereso do you envisonned Task as composable, a bit like GitHub actions ? Being able to refer and encapsulate a Task into another one ?

      - name: Set up Go 1.13
        uses: actions/setup-go@v1
        with:
          go-version: 1.13
      - name: Check out source code
        uses: actions/checkout@v1

On the "volume handling", currently you can do the following with Task.stepTemplate and TaskRun.podTemplate :

apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  name: task-volume
spec:
  steps:
  - name: write
    image: ubuntu
    command: ["/bin/bash"]
    args: ["-c", "echo some stuff > /im/a/custom/mount/path/file"]
  - name: read
    image: ubuntu
    command: ["/bin/bash"]
    args: ["-c", "cat /short/and/stout/file"]
  stepTemplate:
    volumeMounts:
    - name: custom
      mountPath: /short/and/stout
---
apiVersion: tekton.dev/v1alpha1
kind: TaskRun
metadata:
  name: test-template-volume
spec:
  taskRef:
    name:  task-volume
  podTemplate:
    volumes:
    - name: custom
      emptyDir: {}

That allows you to decouple the mount point from the volume type. You could have all your task referencing a volume that is "attached"/"declared" in your PipelineRun and do all by hands indeed.

One thing we do not yet tackle (and that we are not going to tackle with the current "Pipeline Resource Extensibility" design) is more generic input resource types – e.g. a Task that works with both a GitResource and a MercurialResource. But this would be possible to do with the design as we could have a VCSResource that can handle both.

Although it does not always apply, I generally like the idea of having one robust way to do things rather than multiple to achieve different use cases. If it is actually possible to do this already using Tekton in the same fully reusable form, I would find it strange that users probably don't know about this.

I generally agree with having one robust way to do things rather than multiple to achieve different use cases". But right now, doing things without PipelineResource, although possible, is very very verbose and induce a lot of duplication. If we had a way to include a Task definition into another one — à-la GitHub action — or maybe better a set of common steps, we might not need PipelineResource concept altogether.

This would mean :

  • have a concept of Step instead (as a CRD) that can be used into the Task.Steps
  • have a way to document which parameters a step uses ?
apiVersion: tekton.dev/v1alpha1
kind: Step
metadata:
  name: git-clone-with-params
    image: ubuntu
    command: ["/bin/bash"]
    args: ["-c", "git clone $(inputs.params.repository) /workspace/src"] # the dest. could be a param too
---
apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  name: task-foo-bar
spec:
  inputs:
    params:
    - name: repository
      value: https://github.com/tektoncd/pipeline.git
  steps:
  - name: foo
    uses: git-clone-with-params
  - name: read
    image: ubuntu
    command: ["/bin/bash"]
    args: ["-c", "cat /foo/bar/README.md"]
  stepTemplate:
    volumeMounts:
    - name: custom
      mountPath: /foo/bar

One of the main question is then, what happens if the referred steps is not present ? Should we have a tighter integration with the catalog ? and how should we version those ? Is this the responsability of the pipeline controller itself or something above (like the operator) ?

Also, as it's kinda going the opposite way of the idea of PipelineResource, is that approach compatible with PipelineResource or not ? Could/should they co-exists ?

@vtereso
Copy link
Author

vtereso commented Sep 16, 2019

If a step/Task can be encapsulated in another, the Task would still have to specify it uses the correct mount in order for it to interoperate correctly? In this way, there would be at least an implicit contract that the mountPath would be consistent across Task that work together. Alternatively and with consideration for encapsulating steps, creating a new EmbedStep or something like that could be one way, where it would inherit the stepTemplate of the Task, but some replication across Pipelines seem ok? However, Pipeline can already aggregates the correct Task references. This is an explicit declaration rather than updating the TaskRun spec at runtime as is current with PipelineResources.

As mentioned above, I believe the two value add of PipelineResources are to:

  • Append steps to tasks
  • Handle volumes/data

A solution to extensibility regarding source providers (preventing the above duplication) could be to have one big task that handles pulling from different providers. Another idea as current would be to use interpolation to map to the correct Task.

@chhsia0
Copy link
Contributor

chhsia0 commented Sep 16, 2019

In the current API, PipelineResources provides two functionalities:

  1. Task parametrization with static structured data
  2. Procedure to bring data to tasks
    Right now 1 is overlapped with task parameters, and 2 is very similar to task steps and/or volumes. It seems to me that this could make us think that PipelineResource is not absolutely necessary.

However, it seems to me that to there's a missing concept here: Task parametrization with dynamic structured data, and a redesigned PipelineResource could fill in this gap. What @vtereso proposed here as a potential solution is basically make volumes to be the redesigned PipelineResource (but with unstructured data). The main concerns I have for this approach is:
a. Lack of the ability to do static validation because data are unstructured, as @sbwsg raised above.
b. What's the "memory consistency model" for the volumes presenting the dynamic data in a DAG execution model, given that all tasks could read/write to the volumes?

It makes me feel that we probably could first agree on what a good "memory consistency model" for dynamic data between tasks is, then decide whether we need is just volumes or structured dynamic data between tasks.

Personally I like the design of having PipelineResources being generated by tasks and immutable to represent dynamic structured data, and volumes are separated from PipelineResources where users needs to explicitly specify the order between tasks sharing volumes.

@ghost
Copy link

ghost commented Sep 16, 2019

Notes from our initial meeting on this subject are available here for review: https://docs.google.com/document/d/1p6HJt-QMvqegykkob9qlb7aWu-FMX7ogy7UPh9ICzaE/edit

@bobcatfish
Copy link
Collaborator

Latest design from @sbwsg to address confusion and bugs by re-designing PipelineResources: https://docs.google.com/document/d/1euQ_gDTe_dQcVeX4oypODGIQCAkUaMYQH5h7SaeFs44/edit#

Note we also have #1076 which outlines similar problems and proposes a re-design.

Since there hasn't been any activity on this issue and we've stopped having the PipelineResources meetings (for now!) I'm going to resolve this issue, we can keep discussing in #1076 or re-open this if needed.

Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api Indicates an issue or PR that deals with the API. kind/design Categorizes issue or PR as related to design.
Projects
None yet
Development

No branches or pull requests

5 participants