
RFC: TFX DSL Data Model and IR #271

Merged (5 commits merged into tensorflow:master on Sep 14, 2020)
Conversation

@ruoyu90 ruoyu90 (Contributor) commented Jul 24, 2020

Update: we have extended the feedback phase to Friday, August 21, 2020.

Status Proposed
RFC # 271
Author(s) Ruoyu Liu (ruoyu@google.com), Hui Miao (huimiao@google.com), Hongye Sun (hongyes@google.com), Renmin Gu (renming@google.com)
Sponsor Konstantinos Katsiapis (katsiapis@google.com), Mitch Trott (trott@google.com), Zhitao Li (zhitaoli@google.com)
Updated 2020-07-05

Objective

This RFC documents the data model that supports the TFX DSL
semantics.
It also introduces the TFX DSL intermediate representation (IR) and the workflow
based on that. The IR is the bridge between the DSL and its orchestration /
execution on all supported platforms and the workflow is the procedure that all
platforms should follow to reflect the data model in MLMD.

NOTE: While this doc contains more detail than a typical design doc, it is
still a design doc rather than a spec doc. However, the long-term goal is to
make the IR a specification for ML pipelines.

@hughmiao
@hongye-sun
@rmgogogo
@zhitaoli
@paveldournov
@neuromage
@james-jwu
@theadactyl
@rcrowe-google

@ruoyu90 ruoyu90 (Contributor, Author) commented Jul 24, 2020

@casassg FYI this is the RFC I mentioned in this issue.

@theadactyl theadactyl added the RFC: Proposed RFC Design Document label Jul 24, 2020
@theadactyl theadactyl changed the title RFC for TFX DSL data model and IR RFC: TFX DSL Data Model and IR Jul 24, 2020
@casassg casassg (Member) commented Jul 24, 2020

@ruoyu90 Thanks for sharing; I did a first read now. I will follow up with comments (if any) next week, as I need to re-read a couple more times to understand some extra concepts from ml-metadata. The problems this is trying to solve seem very aligned with what I found (and the reason for my issue), especially the "The importance of a consistent data model across platforms" section.

@aoen aoen commented Jul 27, 2020

Regarding the "consistent data model", how will dependencies for component execution be handled, e.g. Python libraries? It seems that to make pipelines fully hermetic there needs to be a serialized representation of the pipeline definition, but also some kind of container for dependencies (e.g., a Docker image, PEX, pickle, etc.).

@ruoyu90 ruoyu90 (Contributor, Author) commented Jul 28, 2020

@aoen the 'consistent data model' mainly refers to the data model of the underlying orchestration of the pipeline. We need this consistent data model so that the orchestration / inter-connection between nodes is portable across platforms.

Regarding dependencies of component execution, we do plan to provide a solution for that, which is more related to ExecutorSpec. We will have a follow-up design for platform-related extensions to it :)
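
To make the ExecutorSpec direction more concrete, here is a minimal sketch of how a container-based executor can be declared with today's TFX API; the module path, class, image, and entrypoint are assumptions based on the current TFX codebase, not something defined by this RFC or the follow-up design.

```python
# Minimal sketch, assuming the current TFX executor_spec module; the
# follow-up platform-extension design may change these names.
from tfx.dsl.components.base import executor_spec

# Packaging the executor and its Python dependencies in a container image
# keeps the component hermetic, independent of the pipeline definition.
trainer_executor = executor_spec.ExecutorContainerSpec(
    image='gcr.io/my-project/my-trainer:0.1',       # hypothetical image
    command=['python', '-m', 'my_trainer.task'],    # hypothetical entrypoint
    args=['--train-steps', '1000'],
)
```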


### 2.1. The importance of a uniform data model

When TFX DSL and orchestration were first
[introduced](https://github.com/tensorflow/community/blob/master/rfcs/20190718-tfx-orchestration.md),
Member:
It would be interesting to get a renewed architecture diagram with the TFX IR, to make it easier to understand and compare the changes being introduced (and compare with https://github.com/tensorflow/community/blob/master/rfcs/20190718-tfx-orchestration.md#architecture )

Comment on lines +1541 to +1542
- `ExecutorSpec` needs to be enhanced to support more executor form
factors and platforms.
Member:
nitpick: It may be worth elaborating on what this means (i.e. component environment declaration, dependency declaration, container environment declaration). I understand this is a generic phrase since it will be specified in a future RFC, but it may be worthwhile tweaking the language to make it clearer (I had to read #271 (comment) to understand this one).

@jlewi jlewi commented Jul 29, 2020

Thanks for this detailed proposal; it's clear a lot of thought went into it.

One of the key motivations for the proposal is to support portability; e.g.

  • Building different front ends/UIs
  • Building different SDKs
  • Building different backends/execution engines

Does anyone have suggestions for how to evaluate how easy it would be to implement one or more of the above based on the IR?

/cc @animeshsingh

@animeshsingh animeshsingh commented:
Thanks @jlewi for pointing to this. Will dive in and come back with comments.

@animeshsingh animeshsingh left a comment:
First pass; will be going through more.

It also introduces the TFX DSL intermediate representation (IR) and the workflow
based on that. The IR is the bridge between the DSL and its orchestration /
execution on all supported platforms and the workflow is the procedure that all
platforms should follow to reflect the data model in MLMD.

@animeshsingh:
I would recommend making a statement about the IR being a foundation for KFP as well somewhere in the objectives.

@zhitaoli (Contributor) replied:
AFAIK, KFP will move towards a slightly modified version of this proposed IR as its own foundation, most likely dropping some async pipeline related semantics to keep things simple.

With that understanding, we would probably not call that out this way.

platform can hardly reuse a python-based module), a more explicit contract is
desired to serve as the bridge between the pipeline definition script and the
platforms that run the pipelines. For this reason, we would like to introduce
TFX Intermediate representation (IR) in this RFC.

@animeshsingh:

Again, pointing to somewhere that the motivations to align KFP and TFX around a common IR should also be listed...?


* **Artifact**: an `Artifact` maps to an output of a node in a TFX pipeline
and can be potentially fed into another node as an input. An `artifact` is
always typed and the payload of the data is always referenced by the `uri`

@animeshsingh:

As discussed during the community call, this notion of "always typed" doesn't necessarily hold true in the context of KFP. cc @Ark-kun

@zhitaoli (Contributor) replied:

ML Metadata is being extended by @charlesccychen to support several "GenericType"s as escape hatches. That should make sure we can map an untyped artifact object into the system.
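
To illustrate the data model the quoted text describes (an artifact is always typed and its payload lives behind a `uri`), here is a minimal sketch using the ml-metadata Python client; the type name, property, and paths are made up for illustration.

```python
# Minimal sketch using the ml-metadata Python client; names and paths are
# illustrative only.
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = '/tmp/mlmd.sqlite'  # hypothetical local store
config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE)
store = metadata_store.MetadataStore(config)

# Every artifact is typed: register (or reuse) an ArtifactType first.
examples_type = metadata_store_pb2.ArtifactType()
examples_type.name = 'Examples'
examples_type.properties['span'] = metadata_store_pb2.INT
type_id = store.put_artifact_type(examples_type)

# The artifact record carries the type id and a uri pointing at the payload.
examples = metadata_store_pb2.Artifact()
examples.type_id = type_id
examples.uri = 'gs://my-bucket/pipelines/my-pipeline/examples/1'  # hypothetical
examples.properties['span'].int_value = 1
[examples_id] = store.put_artifacts([examples])
```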

combining `ResolverNode` and its only consumer) or can be specialized to
a certain platform (e.g., better leverage
[Fusion](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization)
in Dataflow to make data processing more efficient).
@animeshsingh animeshsingh commented Aug 6, 2020:

Additionally, for organizations with existing metadata/lineage tracking services, the IR should provide an extension mechanism to define and plug in custom "post-processors" so that the metadata can be synced to a different target.

@casassg (Member) replied:

This sounds like platform-specific extensions? Or does this need to be done in the TFX layer?

@zhitaoli (Contributor) replied:

+1 to @casassg. A consistent data model published to MLMD should be viewed as a firm goal of this doc. I can imagine different ways to extend the "publisher" part to also publish to different metadata tracking services, but a non-trivial amount of data-model design would probably need to happen, which I don't know if we could really address in the IR.

@ruoyu90 (Contributor, Author) replied:

Echoing @zhitaoli: the top goal of this RFC is to guarantee a consistent data model published to the underlying datastore. MLMD is the one TFX chose to use, and it is itself a layered solution with support for multiple backends. For more customized use cases, customization at the code level will be needed.

as a pair of symmetrical actions under the context of TFX DSL, they are very
different when being mapped to the underlying data model:

* ‘Writing to a `Channel`’ essentially means publishing artifacts to MLMD. In
@animeshsingh animeshsingh commented Aug 6, 2020:

I would give users the flexibility here to create their own "channels", so that artifacts can be published to a different metadata backend if they are written to that channel.

@zhitaoli (Contributor) replied:

I guess that is better achieved by extending the publisher object? Channel actually keeps the entire publishing and resolving abstract.

@ruoyu90 (Contributor, Author) replied:

@animeshsingh I see your point. The Channel here is mainly an abstract concept that carries the data model. The implementation can be customized in the execution stack (launcher, etc.).
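
As a concrete (if simplified) reading of "writing to a `Channel` essentially means publishing artifacts to MLMD", a launcher/publisher could record an output artifact and attach it to the producing execution roughly as sketched below; the helper name, ids, and key are hypothetical, and a custom publisher is exactly where the extension discussed above would plug in.

```python
# Simplified sketch of the publish step; `store` is an
# ml_metadata.metadata_store.MetadataStore, and `execution_id` /
# `output_artifact` / `output_key` come from a hypothetical component launcher.
from ml_metadata.proto import metadata_store_pb2

def publish_output(store, execution_id, output_artifact, output_key):
    # Record the artifact itself; it becomes visible to downstream resolution.
    [artifact_id] = store.put_artifacts([output_artifact])

    # Link it to the producing execution with an OUTPUT event, which is what
    # a later 'read from the Channel' traverses.
    event = metadata_store_pb2.Event()
    event.execution_id = execution_id
    event.artifact_id = artifact_id
    event.type = metadata_store_pb2.Event.OUTPUT
    event.path.steps.add().key = output_key
    store.put_events([event])
    return artifact_id
```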

@jlewi jlewi commented Aug 7, 2020

@neuromage @james-jwu Per the discussion in kubeflow/pipelines#3703 does the proposed IR define the contract between the KFP SDK, UI, and backend? Would an SDK/UI/Backend have to fully support the IR to be compliant?

/cc @animeshsingh

copybara-service bot pushed a commit to tensorflow/tfx that referenced this pull request Aug 11, 2020
… realize the IR-based execution workflow introduced in the TFX IR RFC: tensorflow/community#271

PiperOrigin-RevId: 325953252
copybara-service bot pushed a commit to tensorflow/tfx that referenced this pull request Aug 12, 2020
… realize the IR-based execution workflow introduced in the TFX IR RFC: tensorflow/community#271

PiperOrigin-RevId: 325953252
copybara-service bot pushed a commit to tensorflow/tfx that referenced this pull request Aug 12, 2020
… realize the IR-based execution workflow introduced in the TFX IR RFC: tensorflow/community#271

PiperOrigin-RevId: 326141955
@Tomcli Tomcli commented Aug 18, 2020

This RFC for TFX DSL looks great. Is there a recommended way to define the KFP exit handler in terms of this TFX IR? Thanks.


Required? | Type of Predicates
:-------: | :------------------------------------------------------------------
Yes | Predicates on the type of artifacts
Comment:

Sorry, I'm new to TFX Channel. What are the supported artifact types in a Channel? Is it possible to define an artifact with NoType?

@ruoyu90 (Contributor, Author) replied:

The currently supported 'type' can be any MLMD ArtifactType.
A GenericType is something we're working on.

/cc @charlesccychen
/cc @zhitaoli
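
For readers new to `Channel`, a minimal sketch of declaring a typed channel with today's TFX types API (assuming the current `tfx.types` surface; the artifact type shown is one of TFX's standard types):

```python
# Minimal sketch, assuming the current tfx.types API.
from tfx.types import Channel, standard_artifacts

# The channel's type predicate: only artifacts of type Examples match.
examples_channel = Channel(type=standard_artifacts.Examples)

# Any MLMD ArtifactType can back a channel, including custom ones; the
# GenericType mentioned above is an escape hatch still in progress.
```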

@RedbackThomson RedbackThomson commented:
  • How strongly does this IR depend on MLMD as the specific artifact store? Is it possible to replace this with any generic key-value or SQL database?
  • From which parts of the existing TFX implementation does the IR differ? (Where did you generalise from the existing TFX representation)

@tonanhngo tonanhngo commented Aug 21, 2020

Great to see this much detail in this design doc. I suspect that arriving at a full, formal IR specification will take a number of iterations, trying things out and getting feedback from the community. It would be helpful to lay out the process for developing the spec; for instance, will there be additional RFCs in the future? Should a working group be formed to discuss the spec? Or some other process?

Comment on lines +119 to +120
* Enables attaching platform specific extensions to IR in a transparent
way so that it can be understood by specific platform runners.

Comment:

This should be useful for targeting features in different backends, for instance K8s options in Kubeflow. It also implies that the DSL would provide ways to expose these platform-specific features to users. Is there such support currently in TFX? If so, an example would be helpful.

@zhitaoli (Contributor) replied:

@ruoyu90 is working on this and I believe it will be available soon.
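
For reference, the pre-IR way of exposing K8s options through the Kubeflow runner looks roughly like the sketch below, assuming the TFX 0.2x `KubeflowDagRunner` API (the IR-based platform-extension mechanism @ruoyu90 mentions is separate and still being designed). The PVC and image names are hypothetical.

```python
# Rough sketch of attaching platform-specific (K8s) options today, assuming
# the TFX 0.2x Kubeflow runner API; names may differ in newer releases.
from kfp import onprem
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# pipeline_operator_funcs mutate the generated Kubernetes pod specs, e.g.
# mounting a persistent volume into every component container.
pipeline_operator_funcs = (
    kubeflow_dag_runner.get_default_pipeline_operator_funcs()
    + [onprem.mount_pvc('my-pvc', 'shared-volume', '/mnt/shared')]  # hypothetical PVC
)

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    pipeline_operator_funcs=pipeline_operator_funcs,
    tfx_image='gcr.io/my-project/my-tfx-image:latest',  # hypothetical image
)
runner = kubeflow_dag_runner.KubeflowDagRunner(config=runner_config)
# runner.run(my_pipeline)  # my_pipeline: a tfx.orchestration.pipeline.Pipeline
```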

The IR should provide a serialized format of TFX DSL that has the following
properties:

1. Carries over all TFX DSL semantics in a uniform way. This means that the

Comment:

There needs to be strong support for existing Kubeflow Pipelines 1.0 based components and DSLs. I would imagine existing pipeline components would need additional markup to be converted to TFX IR, which TFX can then convert into KFP/Argo-usable YAML. However, this upgrade path must be there for strong adoption in a reasonable time frame.

@zhitaoli (Contributor) replied:

We did not cover the topic of adapting existing KFP components into TFX in this doc, because it's unclear how far we will attempt to go there.

@charlesccychen might have some further proposals to load (some subset of) KFP components in YAML format into TFX as components, but that is still at an early stage.

Note that KFP might propose its own IR which is similar to this one, so I'm not sure we need to address this path immediately.
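
Purely for intuition about the serialized format described in the quoted text above (this is not from the RFC and is not the actual proto schema), a pipeline IR along those lines would carry the nodes, their typed inputs/outputs, and any attached platform extensions, roughly like:

```python
# Hypothetical, hand-written illustration of what a serialized pipeline IR
# could contain; all field names are invented for this sketch and are NOT
# the RFC's actual schema.
pipeline_ir = {
    'pipeline_info': {'id': 'my_pipeline'},
    'nodes': [
        {
            'id': 'ExampleGen',
            'outputs': {'examples': {'artifact_type': 'Examples'}},
        },
        {
            'id': 'Trainer',
            'inputs': {'examples': {'producer': 'ExampleGen', 'key': 'examples'}},
            'outputs': {'model': {'artifact_type': 'Model'}},
            # A platform-specific extension attached transparently to the node.
            'platform_config': {'kubernetes': {'cpu': '4', 'memory': '8Gi'}},
        },
    ],
}
```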

* Enables using different frontend languages to compose TFX pipelines.
* Enables attaching platform specific extensions to IR in a transparent
way so that it can be understood by specific platform runners.
* Enables applying different optimization strategies on top of the

Comment:

Are there any research papers / more details on how optimization can be applied? The Dataflow doc isn't clear on how to create optimizations or what this actually means.

@ruoyu90 (Contributor, Author) replied:

AFAIK there is no paper about the specific optimization mentioned here yet. We will have an RFC for it once the detailed design is available :)

@zhitaoli zhitaoli (Contributor) left a comment:

Thanks. I think we can proceed to merge this PR. Some healthy discussion may still happen, but it doesn't seem to suggest major concerns about the direction.


@ruoyu90 ruoyu90 (Contributor, Author) commented Sep 10, 2020

> This RFC for TFX DSL looks great. Is there a recommended way to define the KFP exit handler in terms of this TFX IR? Thanks.

The KFP community has a similar but separate IR proposal that might have the exit handler explicitly on its roadmap.

/cc @hongye-sun

@ruoyu90 ruoyu90 (Contributor, Author) commented Sep 10, 2020

Re: @RedbackThomson:

  • How strongly does this IR depend on MLMD as the specific artifact store? Is it possible to replace this with any generic key-value or SQL database?

MLMD itself is already a layered solution with support for multiple backends, so it is possible to extend it with a new datastore backend (see the configuration sketch after this reply).

  • From which parts of the existing TFX implementation does the IR differ? (Where did you generalise from the existing TFX representation)

This proposal is mainly about formalizing the data model and the execution pattern of TFX, so that more features / generalizations can be added on top in a principled way.
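
To illustrate the "layered with multiple backends" point above, the same ml-metadata client API can be pointed at different storage backends purely through its connection config (a minimal sketch; hosts and credentials are placeholders):

```python
# Minimal sketch of swapping MLMD storage backends via ConnectionConfig;
# the MetadataStore API on top stays the same. All values are placeholders.
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Local SQLite backend.
sqlite_config = metadata_store_pb2.ConnectionConfig()
sqlite_config.sqlite.filename_uri = '/tmp/mlmd.sqlite'

# MySQL backend.
mysql_config = metadata_store_pb2.ConnectionConfig()
mysql_config.mysql.host = 'mlmd-db.example.com'
mysql_config.mysql.port = 3306
mysql_config.mysql.database = 'mlmd'
mysql_config.mysql.user = 'mlmd_user'
mysql_config.mysql.password = 'not-a-real-password'

store = metadata_store.MetadataStore(sqlite_config)  # or mysql_config
```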

@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Sep 14, 2020
@ematejska ematejska (Contributor) commented:
This RFC has been accepted.

@ematejska ematejska merged commit cf6faa2 into tensorflow:master Sep 14, 2020