# Prototypes rev 1.1: inputs/outputs #103

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
317 changes: 317 additions & 0 deletions 037-prototypes/proposal.md

### Inputs/Outputs

An open question is how best to provide inputs and outputs to prototypes,
particularly via the `run` step. The mockup of the `run` step in the [Pipeline
Usage](#pipeline-usage) section above suggested these could be configured via
`inputs`/`input_mapping` and `outputs`/`output_mapping`, but this approach has
some downsides:

* Stutter when `run.params` references the inputs, as you need to specify the
  artifact name twice, e.g.

  ```yaml
  run: build
  type: go
  inputs: [my-repo]
  params:
    package: my-repo/cmd/my-cmd
  output_mapping: {binary: my-repo-binary}
  ```

  * In this case, the stutter could be avoided by using an `input_mapping` to
    a name specific to the prototype. However, this only works when the
    prototype takes in a fixed set of inputs with a fixed set of names - it
    doesn't work when the prototype takes in a list of inputs, for instance.
* It's awkward to define a set of outputs that depends on the inputs. For
  instance, imagine a `go` prototype that can compile multiple packages
  simultaneously, and emit an `output` for each one. Under this input/output
  approach, this may look something like:

  ```yaml
  run: build
  type: go
  inputs: [repo1, repo2]
  params:
    packages:
      cmd1-binary: repo1/cmd/cmd1
      cmd2-binary: repo1/cmd/cmd2
      cmd3-binary: repo2/cmd/cmd3
  outputs:
  - cmd1-binary
  - cmd2-binary
  - cmd3-binary
  ```

  or

  ```yaml
  run: build
  type: go
  inputs: [repo1, repo2]
  params:
    packages:
    - repo1/cmd/cmd1
    - repo1/cmd/cmd2
    - repo2/cmd/cmd3
  output_mapping:
    binary1: cmd1-binary
    binary2: cmd2-binary
    binary3: cmd3-binary
  ```

  In the first case, the prototype defines a pseudo-`output_mapping` in its
  config, which requires repetition when defining the set of `outputs`. In the
  second case, the `outputs` repetition is gone, but the prototype needed to
  invent a naming scheme for the outputs (in this case, suffixing a fixed name
  with the 1-based index of the package). Both approaches are fairly awkward to
  work with.

Here are some alternative approaches that have been considered:

#### Option 1a - dynamic input/output config

The prototype's `info` response will include the required
inputs/outputs(/caches?) based on the request object (i.e. `run.params`).

For instance, with the following `run` step:

```yaml
run: some-message
type: some-prototype
params:
  files: [some-artifact-1/some-file, some-artifact-2/some-other-file]
  other_config: here
  output_as: some-output
```

...the prototype may emit the following config, ascribing special meaning to
`files` and `output_as`:

```yaml
inputs:
- name: some-artifact-1
- name: some-artifact-2

outputs:
- name: some-output
```

...and Concourse will mount these artifacts appropriately.
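To make this concrete, here is a minimal Python sketch of the logic a prototype's `info` handler might use to derive that config from the request object. The `files`/`output_as` conventions are just the ones assumed by the example above, not part of any defined protocol:

```python
# Hypothetical `info` handler logic for Option 1a: derive the inputs/outputs
# config from the request object (`run.params`).
def derive_io_config(params):
    inputs, seen = [], set()
    for ref in params.get("files", []):
        # each entry is assumed to be "<artifact-name>/<path-within-artifact>";
        # the artifact is the first path segment
        artifact = ref.split("/", 1)[0]
        if artifact not in seen:
            seen.add(artifact)
            inputs.append({"name": artifact})
    outputs = []
    if "output_as" in params:
        outputs.append({"name": params["output_as"]})
    return {"inputs": inputs, "outputs": outputs}
```

Running this against the `run` step above would yield exactly the `inputs`/`outputs` config shown.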

**Pros**

* No new concepts
  * If you're familiar with `task` configs, this is effectively just a more
    flexible version of the same concept
* No new pipeline syntax/semantics
* Prototype details can be encapsulated behind config
  * e.g. with the [oci-build-task], if you want to persist the build cache, you
    need to specify: `caches: [{path: cache}]`
  * With an approach like this, you could just specify: `cache: true` (i.e. you
    don't need to know where the cache is)

**Cons**

* Requires inventing a naming scheme when the set of outputs is dynamic based
  on the inputs
* Has performance implications. Ignoring caching, `run` steps will need to spin
  up two containers (one for making the `info` request, and one for running the
  prototype message)
  * Caching is possible when the configuration is fixed, but with variable
    interpolation it may not work so well
* More burden on prototype authors, as each message now needs two handlers - one
  for executing the message, and one for generating the inputs/outputs.

#### Option 1b - [JSON schema] based dynamic input/output config

Similar to 1a in that we rely on the prototype to instruct Concourse on what
the inputs/outputs are. However, rather than the `info` response providing the
inputs/outputs config, it instead gives a [JSON schema] for each message, with
some special semantics for defining inputs/outputs:

e.g. for the `some-message` example on `some-prototype` shown above:

```json
{
  "type": "object",
  "properties": {
    "files": {
      "type": "array",
      "items": {
        "type": "string",
        "concourse:input": {
          "name": "((name))"
        }
      }
    },
    "other_config": {
      "type": "string"
    },
    "output_as": {
      "type": "string",
      "concourse:output": {
        "name": "((name))"
      }
    }
  },
  "required": ["files"]
}
```

This isn't fully fleshed out, but the key concept is these `concourse:*`
keywords, which allow you to say: this element in the object is an
input/output.
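To illustrate how Concourse might consume such annotations, here is a hedged Python sketch that walks a params value alongside its schema and collects the elements flagged by the hypothetical `concourse:input`/`concourse:output` keywords (the "artifact is the first path segment" rule is an assumption carried over from the earlier example):

```python
# Sketch of schema-driven input/output discovery for Option 1b: pair the
# params value with its (annotated) JSON schema and collect artifact names.
def collect_io(schema, value):
    inputs, outputs = [], []

    def walk(sch, val):
        if "concourse:input" in sch:
            # assume a string value of the form "<artifact>/<path>"
            inputs.append(str(val).split("/", 1)[0])
        if "concourse:output" in sch:
            outputs.append(str(val))
        if sch.get("type") == "object" and isinstance(val, dict):
            for key, sub in sch.get("properties", {}).items():
                if key in val:
                    walk(sub, val[key])
        elif sch.get("type") == "array" and isinstance(val, list):
            for item in val:
                walk(sch.get("items", {}), item)

    walk(schema, value)
    return inputs, outputs
```

Because this walk depends only on the schema and the (fully interpolated) params, it can run entirely inside Concourse, which is what makes this option easier to cache than 1a.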

**Pros**

* Performance implications from 1a disappear - much easier to cache
* All inputs/outputs need to be mentioned in the config, so nothing is implicit
  * Can also be viewed as a Con - more verbose, and is closer to the original
  * Unlike the `inputs`/`input_mapping`, `outputs`/`output_mapping` approach,
    however, this approach requires less repetition, as the inputs/outputs only
    need to be defined in one place
* Gives a way for Concourse to easily validate input to a prototype, and for
  IDEs to provide more useful auto-complete suggestions - related to
  https://github.com/concourse/concourse/issues/481
  * The IDE aspect may not be practical, as getting the JSON schema requires
    making the `info` request against a Docker image.

**Cons**

* Makes it much more of a burden to write prototypes, as each message *needs* a
JSON schema
* Probably too restrictive - requires Concourse to support specific custom
keywords in the JSON schema. If you wanted to define a map of inputs to
paths, for instance, we'd need to provide a special keyword for that (as
`concourse:input` alone isn't enough)

#### Option 2a - emit outputs in response objects

This option implies two things:

1. A way for pipelines to explicitly specify what inputs should be provided to
a prototype (whereas options 1a/b had the prototypes telling Concourse what
inputs should be provided)
2. A way for prototypes to provide output artifacts as part of their message
   responses (whereas option 1a included them in the `info` response, i.e. not
   at "runtime" w.r.t. running the message)

The response objects could look something like:

```json
{
  "object": {
    "some_data": 123,
    "some_output": {"artifact": "./path"}
  }
}
```

Concourse will interpret `{"artifact": "./path"}` as an output, where `./path`
is relative to some path that's mounted by default to all prototypes. *This
means that multiple outputs may share an output volume, and differ only by the
path within that volume.*
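As an illustration of that interpretation, here is a sketch of how Concourse might scan a response object for these markers (Python; the marker shape `{"artifact": "./path"}` is taken from the example above, and the returned key is the marker's location within the object):

```python
# Sketch of how Concourse might scan an Option 2a response object for
# `{"artifact": "<path>"}` markers, returning {location-tuple: path}.
def find_artifacts(obj, prefix=()):
    found = {}
    if isinstance(obj, dict):
        # a dict whose only key is "artifact" is treated as an output marker
        if set(obj) == {"artifact"}:
            return {prefix: obj["artifact"]}
        for key, val in obj.items():
            found.update(find_artifacts(val, prefix + (key,)))
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            found.update(find_artifacts(val, prefix + (i,)))
    return found
```

Note that, unlike options 1a/1b, this scan can only happen *after* the message runs, since the outputs live in the response.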

Since prototypes can emit multiple response objects, this also means you can
have *streams* of outputs sharing a name that are identified by some other
field(s). e.g. the above artifact could be identified as
`some_output{some_data: 123}` (or something along those lines). That gives you
the option of aggregating/filtering the stream of outputs by a subset of the
fields that identify them - a similar idea has been fleshed out for the
`across` step in
https://github.com/concourse/rfcs/pull/29#discussion_r619863020.

> **@aoldershaw (Author) commented on lines +810 to +813:**
>
> I can see use cases for emitting streams of outputs (across MessageResponses) - for instance, you have a Go prototype that can compile multiple packages simultaneously and you want each package to have its own artifact - or you have a code scanner prototype that can scan multiple repos simultaneously.
>
> However, in these cases, it's probably not uncommon to want single standalone outputs as well. For the Go build prototype, for instance, you'd probably want to cache the module cache in the GOPATH. Since all packages would be built using the same GOPATH, there's really only a single modcache output - not one per MessageResponse. Forcing the outputs to be defined alongside streams of data makes this a bit awkward - should the same artifact be emitted in each of the MessageResponses? Just the first?
>
> Granted, the use cases I mentioned are mainly around performance (not wanting to spin up many containers, and instead shoehorning some of the `across` step's behaviour into prototypes). Tbh, I don't have any strong use cases in mind for streams of outputs that can't be accomplished by having a stream of `run` steps instead, at the cost of performance.

w.r.t. providing inputs, we can use an approach similar to `put.inputs:
detect`, but more explicit (from the pipeline's perspective). e.g.

```yaml
run: some-message
type: some-prototype
params:
  files: [@some-artifact-1/some-file, @some-artifact-2/some-other-file]
  other_config: here
```

Here, `@` is a special syntax that points to an artifact name and both *mounts
it to the container* and *resolves to an absolute path to the artifact*. In
this example, the prototype would receive something like:

```json
{
  "object": {
    "files": ["/tmp/build/some-artifact-1/some-file", "/tmp/build/some-artifact-2/some-other-file"],
    "other_config": "here"
  },
  "response_path": "..."
}
```
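A rough sketch of how that resolution could work follows (Python; the `/tmp/build` mount base and the exact `@<artifact>[/<path>]` grammar are assumptions drawn from the example, not settled syntax):

```python
import re

# Illustrative resolver for the proposed `@` syntax: each `@artifact/...`
# reference is rewritten to an absolute path under an assumed mount base,
# and the set of artifacts that need mounting is recorded.
AT_REF = re.compile(r"^@([^/]+)(/.*)?$")

def resolve_params(params, base="/tmp/build"):
    mounts = set()

    def resolve(value):
        if isinstance(value, str):
            m = AT_REF.match(value)
            if m:
                mounts.add(m.group(1))
                return base + "/" + m.group(1) + (m.group(2) or "")
            return value
        if isinstance(value, list):
            return [resolve(v) for v in value]
        if isinstance(value, dict):
            return {k: resolve(v) for k, v in value.items()}
        return value

    return resolve(params), sorted(mounts)
```

Concourse would mount each artifact in `mounts` into the container and pass the resolved params as the request object.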

Interestingly, data and artifacts are all collocated, which raises the question
- do we need a separate namespace for artifacts like we have now, or can they
be treated as "just vars" and share the local vars namespace?

In order to use emitted artifacts within the pipeline, you can "set" them to
the local namespace:

```yaml
run: some-message
type: some-prototype
params:
  files: [@some-artifact-1/some-file, @some-artifact-2/some-other-file]
  other_config: here
set_artifacts: # or, if we do collapse the namespaces, this could be `set_vars`
  some_output: my-output # some awkwardness around _ vs - here
```

**Pros**

* Removes burden from prototype authors to define what inputs/outputs are
  required up-front
* More flexible - if the set of outputs depends on things only determinable at
  runtime, you can express that here
* Output streams let you do interesting filtering (example in
  https://github.com/concourse/rfcs/pull/29#discussion_r619863020), and avoid
  the issue of needing to invent a naming scheme for sets of common outputs.

**Cons**

* New pipeline syntax/concepts to learn
* Can't mount inputs at specific paths - if your prototype requires a specific
  filesystem layout, it needs to shuffle the inputs around
  * e.g. [oci-build-task] may depend on inputs being at certain paths relative
    to the `Dockerfile`
> **@aoldershaw (Author) commented on lines +874 to +877:**
>
> @vito brought up an interesting idea for this - adding a special syntax for defining a directory layout within `run.params`. For instance, in order to build an OCI image (like the [oci-build-task] does) with a dependency mounted under the main build context, you could do something like:
>
> ```yaml
> run: build
> type: oci-image
> params:
>   context:
>     tree:                     # this is a special syntax that says...
>       .: @concourse           # ...remount the "concourse" artifact to . (within the tree)
>       ./dumb-init: @dumb-init # ...remount the "dumb-init" artifact to ./dumb-init (within the tree)
> ```
>
> (p.s. this is just an example syntax)
>
> The prototype would then receive something like:
>
> ```json
> {
>   "object": {
>     "context": "/tmp/build/context"
>   }
> }
> ```
>
> ...where `/tmp/build/context` has the following directory structure:
>
> ```
> /tmp/build/context
>     README.md
>     Dockerfile
>     atc/
>     ... # other concourse files
>     dumb-init/
>         ... # dumb-init binaries
> ```
>
> For comparison, this is how you'd do something similar with the [oci-build-task]:
>
> ```yaml
> task: build
> privileged: true
> config:
>   platform: linux
>   image_resource:
>     type: registry-image
>     source:
>       repository: vito/oci-build-task
>   inputs:
>   - name: concourse
>     path: .
>   - name: dumb-init
>   outputs:
>   - name: image
>   run:
>     path: build
> ```
> **@aoldershaw (Author) commented on May 11, 2021:**
>
> Related to this thought - the `@` syntax is a way to say "mount this artifact as an input and give me the path to it". `{tree: {...}}` (or whatever syntax we may come up with) is saying something very similar: "mount this tree of artifacts as an input and give me the path to it". What if we unified the two concepts, such that `@artifact` is just a short-hand for `{tree: {.: artifact}}` (or `{tree: artifact}`)?
>
> One difficulty here is that now you can't really append strings to the `{tree: ...}` syntax, e.g. to represent globs. For instance, you might do something like `@binaries/*-linux` to refer to the linux binary, but you can't really do `{tree: binaries}/*-linux` without some hackery (especially if the `tree: {...}` spans multiple lines, like with the initial `context:` example).

> **@aoldershaw (Author) commented on May 13, 2021:**
>
> Or maybe, what if we supported configuring `run.inputs: [...]` - we just provided a short-hand syntax (e.g. `@foo`) for avoiding needing to repeat artifact names in both `params` and `inputs` (where possible).
>
> Consider the following `run` step:
>
> ```yaml
> run: build
> type: go
> params:
>   module: @my-repo
>   package: ./cmd/my-cmd
> ```
>
> This could be a short-hand for the following equivalent `run` step:
>
> ```yaml
> run: build
> type: go
> inputs:
> - name: my-repo
> params:
>   module: my-repo
>   package: ./cmd/my-cmd
> ```
>
> If you need a more complex setup, e.g. inputs mounted at specific paths, you could still do that explicitly:
>
> ```yaml
> run: build
> type: oci-image
> params:
>   context: @concourse
> inputs:
> - name: dumb-init
>   path: concourse/dumb-init
> ```
>
> or:
>
> ```yaml
> run: build
> type: oci-image
> params:
>   context: concourse
> inputs:
> - name: concourse
> - name: dumb-init
>   path: concourse/dumb-init
> ```
>
> or even:
>
> ```yaml
> run: build
> type: oci-image
> params:
>   context: .
> inputs:
> - name: concourse
>   path: .
> - name: dumb-init
> ```

> **@aoldershaw (Author) commented on May 18, 2021:**
>
> One question with the `@` syntax is how we can differentiate it from normal uses of the "@" sign (e.g. to signify a Slack handle/Twitter username). There are a couple options I can see:
>
> 1. Provide a syntax to escape the "@" sign (e.g. `handle: \@username_here` or `handle: @@username_here`). However, this adds a bit of overhead when you have to work with "@" (which admittedly is likely to be pretty rare, I would guess)
> 2. Disambiguate the `@` syntax. For instance, maybe you need to wrap the artifact name in `{...}`, e.g. `context: @{concourse}`. Given that this is the common case, though, it could be a bit annoying - but should have less mental overhead
>    * I realize that if you wanted to use the literal string `@{something}`, you'd still need some escape mechanism - but that seems like a pretty unlikely case
>
> Wrapping the artifact name in `{...}` also may allow us to extend the syntax. For instance, currently the `@artifact` syntax only lets you mount artifacts at a path identical to their name, and if you want a different path you need to use `inputs: [...]`. However, we could possibly introduce some syntax to configure the mount path as well, e.g.:
>
> ```yaml
> run: build
> type: oci-image
> params:
>   context: @{concourse:.} # resolves to ".", and mounts concourse at "."
> ```
>
> ...which would be equivalent to:
>
> ```yaml
> run: build
> type: oci-image
> inputs:
> - name: concourse
>   path: .
> params:
>   context: .
> ```

> **@aoldershaw (Author) commented:**
>
> I just realized that the proposed `@` syntax isn't very practical, since `@` is a reserved character in YAML. It's still possible to use, but requires quoting the value, which is pretty annoying.
>
> Another possible syntax is using `<` to mean input, since it points "inward" toward the params, e.g.:
>
> ```yaml
> run: build
> type: oci-image
> params:
>   context: <{concourse} # feels like injecting `concourse` into the context
> ```

> **@aoldershaw (Author) commented on May 23, 2021:**
>
> To build on the idea of having a short-hand syntax for specifying inputs, what if we did the same for specifying outputs? Currently, all the proposed approaches have the prototype reporting the outputs. What if instead we made it explicit in the build plan? For instance, we could use another character to denote outputs, e.g.:
>
> ```yaml
> run: build
> type: oci-image
> params:
>   context: <{concourse:.} # `concourse` is an input mounted to `.`
>   targets:
>     ={builder-image}: builder # `builder-image` is an output
>     ={final-image}: app # `final-image` is an output
> ```
>
> ...which is equivalent to:
>
> ```yaml
> run: build
> type: oci-image
> params:
>   context: .
>   targets:
>     builder-image: builder
>     final-image: app
> inputs:
> - name: concourse
>   path: .
> outputs:
> - name: builder-image
> - name: final-image
> ```
>
> Here, `=` is used to denote "the name (and optionally path) following is an output of the step". The thinking is that `=` denotes assignment to the build scope. I wanted to use something like `>` to parallel `<` for inputs, but unfortunately that's a control flow character, so it's not possible to use without quoting. Plus, it could be easy to mix `<` and `>` up.
>
> The example in the motivation for alternative approaches could be written as:
>
> ```yaml
> run: build
> type: go
> params:
>   packages:
>     ={cmd1-binary}: <{repo1}/cmd/cmd1
>     ={cmd2-binary}: <{repo1}/cmd/cmd2
>     ={cmd3-binary}: <{repo2}/cmd/cmd3
> ```

> **@aoldershaw (Author) commented on May 23, 2021:**
>
> One other thought - we don't have to try to come up with symbols to denote inputs and outputs, and could make it more verbose, e.g.:
>
> ```yaml
> run: build
> type: go
> params:
>   packages:
>     output{cmd1-binary}: input{repo1}/cmd/cmd1
>     output{cmd2-binary}: input{repo1}/cmd/cmd2
>     output{cmd3-binary}: input{repo2}/cmd/cmd3
> ```

> **Member comment:**
>
> Ah, that's a bummer about the `@` syntax. I guess one alternative to consider could be `${foo}`, although it kind of carries a 'regular old var syntax' connotation, not artifacts. TBH I'm not a huge fan of the proposed syntaxes in the last few comments; `<` made me think of the `>` YAML multiline syntax, and the rest don't feel like they suggest inputs/outputs/interpolation to me. (TBH I'm not sure why `@` did either - maybe from my background in Ruby. 😆)
>
> I like the idea behind `@{foo:mount-path}` - though one downside I see is that in situations where you're trying to set up a nested tree, you might not actually have a valid place to pass the value for setting up the nested paths in `params`. For example, if you're setting up a directory tree to `docker build`, you may be tempted to pass other artifacts as bogus values just to mount them beneath the context dir:
>
> ```yaml
> params:
>   context: @{concourse:context}
>   blah: @{dumb-init:context/dumb-init}
> ```
>
> Which of course wouldn't work if the prototype validates its fields. In this case the more explicit `inputs:` form would probably be favorable.
>
> Just noting, not arguing in favor of it, but the `tree:` approach from #103 (comment) avoids this issue by representing the root path and its subpaths as one nested value under `params:`, with only the root path being ultimately passed to the prototype, and the nested values just used for setting up mounts beneath it.
>
> Re: #103 (comment) - explicitly declaring outputs is closer to how the prototype RFC is now, just without special syntax, and it only has a list of output names. (ref.)
>
> Here's a thought: what if instead of having `params:` implicitly fill in `inputs:`, we went the other way around and had `inputs:` fill in `params:`?
>
> In this mockup, I've added `artifact:` to the input config, which is the artifact that will be passed along, and `param` is a field to set under `params:`. This way you can configure artifact mounts however you like, and you don't have to repeat yourself in `params:`.
>
> ```yaml
> run: build
> type: oci-image
> inputs:
> - param: context
>   artifact: concourse
>   path: .
> - artifact: dumb-init
>   path: ./dumb-init
> ```
>
> With this example the prototype would receive `{"context": "."}`.
>
> One downside of this approach is that it works best with simple named params, not arbitrary data structures such as arrays or maps of artifacts. I guess we could support path-style things like `params: foo.bar`, but it'd get pretty awkward with arrays; `foo[0]`, `foo[1]` would be annoying to maintain. But maybe we just don't care about either case and we ain't gonna need it.
>
> Not sure how this approach could be extended to outputs. I'll leave this comment for now and see if anything springs to mind. :)

> **@aoldershaw (Author) commented on May 27, 2021:**
>
> `inputs` filling in `params` is an interesting approach, but as you mention, it does feel a little awkward for certain cases. For instance, suppose you wanted to specify a list of globs, e.g. for creating a GitHub release. Using some syntax for embedding inputs in the params (I'll go with the more explicit syntax in #103 (comment)), it feels pretty natural:
>
> ```yaml
> put: gh-release # could also be a `run:` step
> params:
>   globs: [input{concourse-binaries}/*, input{fly-binaries}/*, input{license}/*]
> ```
>
> ...whereas with `inputs[].param`, it's not so easy - in addition to the array indexing issue, how do you add the `/*` suffix? Would you ever want to add a prefix?
>
> > one downside I see is that in situations where you're trying to set up a nested tree, you might not actually have a valid place to pass the value for setting up the nested paths in params
>
> Yeah, that's true - the thinking was that you'd have to use `run.inputs` for cases where it doesn't make sense to embed the inputs directly in `params`. Something like the `tree` syntax could solve that, but there are a few things I'm not sure about there:
>
> * Would you ever need one of these "sibling" inputs to be up a directory from the root of the tree? e.g. you have a Go module which has a `replace` directive to a sibling directory (e.g. guardian does this - https://github.com/cloudfoundry/guardian/blob/2f945c09a983e4/go.mod#L60-L63). If you want to compile guardian, you'd want `params.module` to point to the mount path of guardian, but then you also want to mount "implicit" inputs (garden, etc.) up a directory. Do we want to allow going up a directory in `tree`? e.g.:
>
>   ```yaml
>   run: build
>   type: go
>   params:
>     module:
>       tree:
>         .: guardian
>         ../garden: garden
>   ```
>
>   Maybe `tree` isn't the best name in this case, given that it would be able to modify mounts outside of the directory tree that it returns?
>
> * Is there a syntax we can use to add suffixes (e.g. globs) to the `tree` result? If it's just defined as a YAML object, it's not obvious how you could append a string.

* Since outputs can appear anywhere in the response object, different
  prototypes may provide a different way to interact with outputs, rather than
  having a single flat namespace of outputs
  * Can also be viewed as a Pro, but I feel like having a consistent way of
    referring to artifacts within a pipeline is beneficial

**Questions**

* Is merging the concepts of vars and artifacts confusing/unintuitive?

#### Option 2b - emit outputs adjacent to the response object

This approach uses the same `@` syntax semantics as 2a - the main difference is
that it explicitly differentiates between outputs and data by emitting outputs
in the message response, but not in the response object. e.g.

```json
{
  "object": {
    "some_data": 123
  },
  "outputs": [
    {"name": "some_output", "path": "./path/to/output"}
  ]
}
```
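Parsing such a response is straightforward, which is part of the appeal - a minimal sketch (Python, assuming the response shape shown above):

```python
# Option 2b responses keep data and outputs separate, so extraction is a
# simple destructuring rather than a recursive scan of the object.
def parse_response(response):
    data = response.get("object", {})
    outputs = {o["name"]: o["path"] for o in response.get("outputs", [])}
    return data, outputs
```

The `object` can still be used to identify/filter entries in a stream of outputs, while the `outputs` list keeps the flat artifact namespace.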

**Pros**

Like 2a, but also:
* Consistent with the existing notion of outputs (i.e. a flat namespace of
  artifacts that can be referenced by name)
* Still allows for filtering the stream down by the corresponding `object`

**Cons**

Like 2a, but also:
* Can't unify concepts of vars and artifacts


[rfc-1]: https://github.com/concourse/rfcs/pull/1
[rfc-1-comment]: https://github.com/concourse/rfcs/pull/1#issuecomment-477749314
[rfc-24]: https://github.com/concourse/rfcs/pull/24
[rfc-38]: https://github.com/concourse/rfcs/pull/38
[oci-build-task]: https://github.com/vito/oci-build-task
[JSON schema]: https://json-schema.org/