
RFC: TFX DSL Data Model and IR #271

Merged (5 commits merged into tensorflow:master on Sep 14, 2020)
Conversation

@ruoyu90 ruoyu90 (Contributor) commented Jul 24, 2020

Update: we have extended the feedback phase to Friday, August 21, 2020.

Status Proposed
RFC # 271
Author(s) Ruoyu Liu (ruoyu@google.com), Hui Miao (huimiao@google.com), Hongye Sun (hongyes@google.com), Renmin Gu (renming@google.com)
Sponsor Konstantinos Katsiapis (katsiapis@google.com), Mitch Trott (trott@google.com), Zhitao Li (zhitaoli@google.com)
Updated 2020-07-05

Objective

This RFC documents the data model that supports the TFX DSL
semantics.
It also introduces the TFX DSL intermediate representation (IR) and the workflow
based on that. The IR is the bridge between the DSL and its orchestration /
execution on all supported platforms and the workflow is the procedure that all
platforms should follow to reflect the data model in MLMD.

NOTE: While this doc contains more detail than a typical design doc, it is
still a design doc rather than a spec doc. However, the long-term goal is to
make the IR a specification for ML pipelines.

@hughmiao
@hongye-sun
@rmgogogo
@zhitaoli
@paveldournov
@neuromage
@james-jwu
@theadactyl
@rcrowe-google

@ruoyu90 ruoyu90 (Contributor, Author) commented Jul 24, 2020

@casassg FYI this is the RFC I mentioned in this issue.

@theadactyl theadactyl added the RFC: Proposed RFC Design Document label Jul 24, 2020
@theadactyl theadactyl changed the title RFC for TFX DSL data model and IR RFC: TFX DSL Data Model and IR Jul 24, 2020
@casassg casassg (Member) commented Jul 24, 2020

@ruoyu90 Thanks for sharing; I did a first read now. I will follow up with comments (if any) next week, as I need to re-read a couple more times to understand some extra concepts from ml-metadata. The problems this is trying to solve seem very aligned with what I found (and the reason for my issue), especially the "The importance of a consistent data model across platforms" section.

@aoen aoen commented Jul 27, 2020

Regarding the "consistent data model", how will dependencies for component execution be handled, e.g. Python libraries? It seems that to make pipelines fully hermetic there needs to be a serialized representation of the pipeline definition, but also some kind of container for dependencies (e.g., a Docker image, PEX, pickle, etc.).

@ruoyu90 ruoyu90 (Contributor, Author) commented Jul 28, 2020

@aoen the 'consistent data model' mainly refers to the data model of the underlying orchestration of the pipeline. We need this consistent data model so that the orchestration / inter-connection between nodes is portable across platforms.

Regarding dependencies of component execution, we do plan to provide a solution for that, which is more related to ExecutorSpec. We will have a follow-up design for platform-related extensions to it :)
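
To make the ExecutorSpec direction more concrete, here is a minimal sketch of how a container-based executor can be declared with today's TFX API; the module path, class, image, and entrypoint are assumptions based on the current TFX codebase, not something defined by this RFC or the follow-up design.

```python
# Minimal sketch, assuming the current TFX executor_spec module; the
# follow-up platform-extension design may change these names.
from tfx.dsl.components.base import executor_spec

# Packaging the executor and its Python dependencies in a container image
# keeps the component hermetic, independent of the pipeline definition.
trainer_executor = executor_spec.ExecutorContainerSpec(
    image='gcr.io/my-project/my-trainer:0.1',       # hypothetical image
    command=['python', '-m', 'my_trainer.task'],    # hypothetical entrypoint
    args=['--train-steps', '1000'],
)
```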


### 2.1. The importance of a uniform data model

When TFX DSL and orchestration were first
[introduced](https://github.com/tensorflow/community/blob/master/rfcs/20190718-tfx-orchestration.md),
Member:
It would be interesting to get a renewed architecture diagram with the TFX IR, to make it easier to understand and compare the changes being introduced (and compare with https://github.com/tensorflow/community/blob/master/rfcs/20190718-tfx-orchestration.md#architecture )

Comment on lines +1541 to +1542
- `ExecutorSpec` needs to be enhanced to support more executor form
factors and platforms.
Member:
nitpick: It may be worth elaborating on what this means (i.e. component environment declaration, dependency declaration, container environment declaration). I understand this is a generic phrase since it will be specified in a future RFC, but it may be worthwhile tweaking the language to make it clearer (I had to read #271 (comment) to understand this one).

@jlewi jlewi commented Jul 29, 2020

Thanks for this detailed proposal; it's clear a lot of thought went into it.

One of the key motivations for the proposal is to support portability; e.g.

  • Building different front ends/UIs
  • Building different SDKs
  • Building different backends/execution engines

Does anyone have suggestions for how to evaluate how easy it would be to implement one or more of the above based on the IR?

/cc @animeshsingh

@animeshsingh animeshsingh commented:
Thanks @jlewi for pointing to this. Will dive in and come back with comments.

@animeshsingh animeshsingh left a comment:
First pass; will be going through more.

It also introduces the TFX DSL intermediate representation (IR) and the workflow
based on that. The IR is the bridge between the DSL and its orchestration /
execution on all supported platforms and the workflow is the procedure that all
platforms should follow to reflect the data model in MLMD.

@animeshsingh:
I would recommend making a statement about the IR being a foundation for KFP as well somewhere in the objectives.

@zhitaoli (Contributor) replied:
AFAIK, KFP will move towards a slightly modified version of this proposed IR as its own foundation, most likely dropping some async pipeline related semantics to keep things simple.

With that understanding, we would probably not call that out this way.

platform can hardly reuse a python-based module), a more explicit contract is
desired to serve as the bridge between the pipeline definition script and the
platforms that run the pipelines. For this reason, we would like to introduce
TFX Intermediate representation (IR) in this RFC.

@animeshsingh:

Again, pointing to somewhere that the motivations to align KFP and TFX around a common IR should also be listed...?


* **Artifact**: an `Artifact` maps to an output of a node in a TFX pipeline
and can be potentially fed into another node as an input. An `artifact` is
always typed and the payload of the data is always referenced by the `uri`

@animeshsingh:

As discussed during the community call, this notion of "always typed" doesn't necessarily hold true in the context of KFP. cc @Ark-kun

@zhitaoli (Contributor) replied:

ML Metadata is being extended by @charlesccychen to support several "GenericType"s as escape hatches. That should make sure we can map an untyped artifact object into the system.
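
To illustrate the data model the quoted text describes (an artifact is always typed and its payload lives behind a `uri`), here is a minimal sketch using the ml-metadata Python client; the type name, property, and paths are made up for illustration.

```python
# Minimal sketch using the ml-metadata Python client; names and paths are
# illustrative only.
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = '/tmp/mlmd.sqlite'  # hypothetical local store
config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE)
store = metadata_store.MetadataStore(config)

# Every artifact is typed: register (or reuse) an ArtifactType first.
examples_type = metadata_store_pb2.ArtifactType()
examples_type.name = 'Examples'
examples_type.properties['span'] = metadata_store_pb2.INT
type_id = store.put_artifact_type(examples_type)

# The artifact record carries the type id and a uri pointing at the payload.
examples = metadata_store_pb2.Artifact()
examples.type_id = type_id
examples.uri = 'gs://my-bucket/pipelines/my-pipeline/examples/1'  # hypothetical
examples.properties['span'].int_value = 1
[examples_id] = store.put_artifacts([examples])
```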

combining `ResolverNode` and its only consumer) or can be specialized to
a certain platform (e.g., better leverage
[Fusion](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization)
in Dataflow to make data processing more efficient).
@animeshsingh animeshsingh commented Aug 6, 2020:

Additionally, for organizations with existing metadata/lineage tracking services, the IR should provide an extension mechanism to define and plug in custom "post-processors" so that the metadata can be synced to a different target.

@casassg (Member) replied:

This sounds like platform-specific extensions? Or does this need to be done in the TFX layer?

@zhitaoli (Contributor) replied:

+1 to @casassg. A consistent data model published to MLMD should be viewed as a firm goal of this doc. I can imagine different ways to extend the "publisher" part to also publish to different metadata tracking services, but a non-trivial amount of data-model design would probably need to happen, which I don't know if we could really address in the IR.

@ruoyu90 (Contributor, Author) replied:

Echoing @zhitaoli: the top goal of this RFC is to guarantee a consistent data model published to the underlying datastore. MLMD is the one TFX chose to use, and it is itself a layered solution with support for multiple backends. For more customized use cases, customization at the code level will be needed.

as a pair of symmetrical actions under the context of TFX DSL, they are very
different when being mapped to the underlying data model:

* ‘Writing to a `Channel`’ essentially means publishing artifacts to MLMD. In
@animeshsingh animeshsingh commented Aug 6, 2020:

I would give users the flexibility here to create their own "channels", so that artifacts can be published to a different metadata backend if they are written to that channel.

@zhitaoli (Contributor) replied:

I guess that is better achieved by extending the publisher object? Channel actually keeps the entire publishing and resolving abstract.

@ruoyu90 (Contributor, Author) replied:

@animeshsingh I see your point. The Channel here is mainly an abstract concept that carries the data model. The implementation can be customized in the execution stack (launcher, etc.).
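
As a concrete (if simplified) reading of "writing to a `Channel` essentially means publishing artifacts to MLMD", a launcher/publisher could record an output artifact and attach it to the producing execution roughly as sketched below; the helper name, ids, and key are hypothetical, and a custom publisher is exactly where the extension discussed above would plug in.

```python
# Simplified sketch of the publish step; `store` is an
# ml_metadata.metadata_store.MetadataStore, and `execution_id` /
# `output_artifact` / `output_key` come from a hypothetical component launcher.
from ml_metadata.proto import metadata_store_pb2

def publish_output(store, execution_id, output_artifact, output_key):
    # Record the artifact itself; it becomes visible to downstream resolution.
    [artifact_id] = store.put_artifacts([output_artifact])

    # Link it to the producing execution with an OUTPUT event, which is what
    # a later 'read from the Channel' traverses.
    event = metadata_store_pb2.Event()
    event.execution_id = execution_id
    event.artifact_id = artifact_id
    event.type = metadata_store_pb2.Event.OUTPUT
    event.path.steps.add().key = output_key
    store.put_events([event])
    return artifact_id
```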

@jlewi jlewi commented Aug 7, 2020

@neuromage @james-jwu Per the discussion in kubeflow/pipelines#3703 does the proposed IR define the contract between the KFP SDK, UI, and backend? Would an SDK/UI/Backend have to fully support the IR to be compliant?

/cc @animeshsingh

copybara-service bot pushed a commit to tensorflow/tfx that referenced this pull request Aug 11, 2020
… realize the IR-based execution workflow introduced in the TFX IR RFC: tensorflow/community#271

PiperOrigin-RevId: 325953252
copybara-service bot pushed a commit to tensorflow/tfx that referenced this pull request Aug 12, 2020
… realize the IR-based execution workflow introduced in the TFX IR RFC: tensorflow/community#271

PiperOrigin-RevId: 325953252
copybara-service bot pushed a commit to tensorflow/tfx that referenced this pull request Aug 12, 2020
… realize the IR-based execution workflow introduced in the TFX IR RFC: tensorflow/community#271

PiperOrigin-RevId: 326141955
@Tomcli Tomcli commented Aug 18, 2020

This RFC for TFX DSL looks great. Is there a recommended way to define the KFP exit handler in terms of this TFX IR? Thanks.


Required? | Type of Predicates
:-------: | :------------------------------------------------------------------
Yes | Predicates on the type of artifacts
Comment:

Sorry, I'm new to TFX Channel. What are the supported artifact types in a Channel? Is it possible to define an artifact with NoType?

@ruoyu90 (Contributor, Author) replied:

The currently supported 'type' can be any MLMD ArtifactType.
A GenericType is something we're working on.

/cc @charlesccychen
/cc @zhitaoli
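
For readers new to `Channel`, a minimal sketch of declaring a typed channel with today's TFX types API (assuming the current `tfx.types` surface; the artifact type shown is one of TFX's standard types):

```python
# Minimal sketch, assuming the current tfx.types API.
from tfx.types import Channel, standard_artifacts

# The channel's type predicate: only artifacts of type Examples match.
examples_channel = Channel(type=standard_artifacts.Examples)

# Any MLMD ArtifactType can back a channel, including custom ones; the
# GenericType mentioned above is an escape hatch still in progress.
```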

@RedbackThomson RedbackThomson commented:
  • How strongly does this IR depend on MLMD as the specific artifact store? Is it possible to replace this with any generic key-value or SQL database?
  • From which parts of the existing TFX implementation does the IR differ? (Where did you generalise from the existing TFX representation)

@tonanhngo tonanhngo commented Aug 21, 2020

Great to see this much detail in this design doc. I suspect that arriving at a full, formal IR specification will take a number of iterations, trying things out and getting feedback from the community. It would be helpful to lay out the process for developing the spec; for instance, will there be additional RFCs in the future? Should a working group be formed to discuss the spec? Or some other process?

Comment on lines +119 to +120
* Enables attaching platform specific extensions to IR in a transparent
way so that it can be understood by specific platform runners.

Comment:

This should be useful for targeting features in different backends, for instance K8s options in Kubeflow. It also implies that the DSL would provide ways to expose these platform-specific features to users. Is there such support currently in TFX? If so, an example would be helpful.

@zhitaoli (Contributor) replied:

@ruoyu90 is working on this and I believe it will be available soon.
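
For reference, the pre-IR way of exposing K8s options through the Kubeflow runner looks roughly like the sketch below, assuming the TFX 0.2x `KubeflowDagRunner` API (the IR-based platform-extension mechanism @ruoyu90 mentions is separate and still being designed). The PVC and image names are hypothetical.

```python
# Rough sketch of attaching platform-specific (K8s) options today, assuming
# the TFX 0.2x Kubeflow runner API; names may differ in newer releases.
from kfp import onprem
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# pipeline_operator_funcs mutate the generated Kubernetes pod specs, e.g.
# mounting a persistent volume into every component container.
pipeline_operator_funcs = (
    kubeflow_dag_runner.get_default_pipeline_operator_funcs()
    + [onprem.mount_pvc('my-pvc', 'shared-volume', '/mnt/shared')]  # hypothetical PVC
)

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    pipeline_operator_funcs=pipeline_operator_funcs,
    tfx_image='gcr.io/my-project/my-tfx-image:latest',  # hypothetical image
)
runner = kubeflow_dag_runner.KubeflowDagRunner(config=runner_config)
# runner.run(my_pipeline)  # my_pipeline: a tfx.orchestration.pipeline.Pipeline
```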

The IR should provide a serialized format of TFX DSL that has the following
properties:

1. Carries over all TFX DSL semantics in a uniform way. This means that the

Comment:

There needs to be strong support for existing Kubeflow Pipelines 1.0 based components and DSLs. I would imagine existing pipeline components would need additional markup to be converted to TFX IR, which TFX can then convert into KFP/Argo-usable YAML. However, this upgrade path must be there for strong adoption in a reasonable time frame.

@zhitaoli (Contributor) replied:

We did not cover the topic of adapting existing KFP components into TFX in this doc, because it's unclear how far we will attempt to go there.

@charlesccychen might have some further proposals to load (some subset of) KFP components in YAML format into TFX as components, but that is still at an early stage.

Note that KFP might propose its own IR which is similar to this one, so I'm not sure we need to address this path immediately.
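
Purely for intuition about the serialized format described in the quoted text above (this is not from the RFC and is not the actual proto schema), a pipeline IR along those lines would carry the nodes, their typed inputs/outputs, and any attached platform extensions, roughly like:

```python
# Hypothetical, hand-written illustration of what a serialized pipeline IR
# could contain; all field names are invented for this sketch and are NOT
# the RFC's actual schema.
pipeline_ir = {
    'pipeline_info': {'id': 'my_pipeline'},
    'nodes': [
        {
            'id': 'ExampleGen',
            'outputs': {'examples': {'artifact_type': 'Examples'}},
        },
        {
            'id': 'Trainer',
            'inputs': {'examples': {'producer': 'ExampleGen', 'key': 'examples'}},
            'outputs': {'model': {'artifact_type': 'Model'}},
            # A platform-specific extension attached transparently to the node.
            'platform_config': {'kubernetes': {'cpu': '4', 'memory': '8Gi'}},
        },
    ],
}
```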

* Enables using different frontend languages to compose TFX pipelines.
* Enables attaching platform specific extensions to IR in a transparent
way so that it can be understood by specific platform runners.
* Enables applying different optimization strategies on top of the

Comment:

Are there any research papers / more details on how optimization can be applied? The Dataflow doc isn't clear on how to create optimizations or what this actually means.

@ruoyu90 (Contributor, Author) replied:

AFAIK there is no paper about the specific optimization mentioned here yet. We will have an RFC for it once the detailed design is available :)

@zhitaoli zhitaoli (Contributor) left a comment:

Thanks. I think we can proceed to merge this PR. Some healthy discussion may still happen, but it doesn't seem to suggest major concerns about the direction.


@ruoyu90 ruoyu90 (Contributor, Author) commented Sep 10, 2020

> This RFC for TFX DSL looks great. Is there a recommended way to define the KFP exit handler in terms of this TFX IR? Thanks.

The KFP community has a similar but separate IR proposal that might have the exit handler explicitly on its roadmap.

/cc @hongye-sun

@ruoyu90 ruoyu90 (Contributor, Author) commented Sep 10, 2020

Re: @RedbackThomson:

  • How strongly does this IR depend on MLMD as the specific artifact store? Is it possible to replace this with any generic key-value or SQL database?

MLMD itself is already a layered solution with support for multiple backends, so it is possible to extend it with a new datastore backend (see the configuration sketch after this reply).

  • From which parts of the existing TFX implementation does the IR differ? (Where did you generalise from the existing TFX representation)

This proposal is mainly about formalizing the data model and the execution pattern of TFX, so that more features / generalizations can be added on top in a principled way.
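
To illustrate the "layered with multiple backends" point above, the same ml-metadata client API can be pointed at different storage backends purely through its connection config (a minimal sketch; hosts and credentials are placeholders):

```python
# Minimal sketch of swapping MLMD storage backends via ConnectionConfig;
# the MetadataStore API on top stays the same. All values are placeholders.
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Local SQLite backend.
sqlite_config = metadata_store_pb2.ConnectionConfig()
sqlite_config.sqlite.filename_uri = '/tmp/mlmd.sqlite'

# MySQL backend.
mysql_config = metadata_store_pb2.ConnectionConfig()
mysql_config.mysql.host = 'mlmd-db.example.com'
mysql_config.mysql.port = 3306
mysql_config.mysql.database = 'mlmd'
mysql_config.mysql.user = 'mlmd_user'
mysql_config.mysql.password = 'not-a-real-password'

store = metadata_store.MetadataStore(sqlite_config)  # or mysql_config
```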

@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Sep 14, 2020
@ematejska ematejska (Contributor) commented:
This RFC has been accepted.

@ematejska ematejska merged commit cf6faa2 into tensorflow:master Sep 14, 2020