RFC: TFX DSL Data Model and IR #271
Conversation
@ruoyu90 Thanks for sharing, did a first read now. Will follow up with comments (if any) next week, as I need to re-read a couple more times to understand some extra concepts from ml-metadata. The problems it is trying to solve seem very aligned with what I found (and the reason for my issue), especially the …
Regarding the "consistent data model": how will dependencies for component execution be handled, e.g. Python libraries? It seems that to make pipelines fully hermetic there needs to be a serialized representation of the pipeline definition, but also some kind of container for dependencies (e.g. a Docker image, PEX, pickle, etc.).
@aoen The "consistent data model" mainly refers to the data model of the underlying orchestration of the pipeline. We need this consistent data model so that the orchestration / inter-connection between nodes can be portable across platforms. Regarding dependencies of component execution, we do plan to provide a solution for that, which might be more related to …
### 2.1. The importance of a uniform data model

When TFX DSL and orchestration were first
[introduced](https://github.com/tensorflow/community/blob/master/rfcs/20190718-tfx-orchestration.md),
It would be interesting to get a renewed architecture diagram with the TFX IR, to make it easier to understand and compare the changes being introduced (and compare with https://github.com/tensorflow/community/blob/master/rfcs/20190718-tfx-orchestration.md#architecture).
- `ExecutorSpec` needs to be enhanced to support more executor form
  factors and platforms.
Nitpick: it may be worth elaborating more on what this means (i.e. component environment declaration, dependency declaration, container environment declaration). I understand this is a generic phrase since it will be specified in a future RFC, but it may be worthwhile tweaking the language to make it clearer (I had to read #271 (comment) to understand this one).
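For readers less familiar with the DSL, here is a rough sketch of what "more executor form factors" can mean in today's Python API: a Python executor class versus a container-based executor. This only illustrates the existing `ExecutorSpec` variants (module paths may differ across TFX versions; the image and command are hypothetical), not the enhancement this RFC proposes.

```python
# Illustrative only: two executor form factors in the current TFX Python DSL.
# Module paths may differ across TFX versions.
from tfx.components.base import base_executor, executor_spec


class MyExecutor(base_executor.BaseExecutor):
  """A Python-class executor that runs in the orchestrator's environment."""

  def Do(self, input_dict, output_dict, exec_properties):
    pass  # Component business logic would go here.


# Form factor 1: a Python executor class.
python_spec = executor_spec.ExecutorClassSpec(MyExecutor)

# Form factor 2: an arbitrary container image, for platforms that can run one
# (e.g. Kubernetes via KFP).
container_spec = executor_spec.ExecutorContainerSpec(
    image='gcr.io/my-project/my-trainer:latest',   # hypothetical image
    command=['python', '-m', 'my_trainer.task'],   # hypothetical entrypoint
    args=['--train-steps', '1000'],
)
```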
Thanks for this detailed proposal; it's clear a lot of thought went into it. One of the key motivations for the proposal is to support portability, e.g.:
Does anyone have suggestions for how to evaluate how easy it would be to implement one or more of the above based on the IR? /cc @animeshsingh
Thanks @jlewi for pointing to this. Will dive in and come back with comments.
Fixes format
First pass; will be going through more.
It also introduces the TFX DSL intermediate representation (IR) and the workflow
based on it. The IR is the bridge between the DSL and its orchestration /
execution on all supported platforms, and the workflow is the procedure that all
platforms should follow to reflect the data model in MLMD.
I would recommend making a statement about the IR being a foundation for KFP as well somewhere in the objectives.
AFAIK, KFP will move towards a slightly modified version of this proposed IR as its own foundation, most likely dropping some async pipeline related semantics to keep things simple.
With that understanding, we would probably not call that out this way.
platform can hardly reuse a python-based module), a more explicit contract is
desired to serve as the bridge between the pipeline definition script and the
platforms that run the pipelines. For this reason, we would like to introduce
the TFX intermediate representation (IR) in this RFC.
Again, the motivations for aligning KFP and TFX around a common IR should be listed somewhere...?
* **Artifact**: an `Artifact` maps to an output of a node in a TFX pipeline
  and can potentially be fed into another node as an input. An `Artifact` is
  always typed, and the payload of the data is always referenced by the `uri`
As discussed during the community call, this notion of "always typed" doesn't necessarily hold true in the context of KFP. cc @Ark-kun
ML-metadata is being extended by @charlesccychen to support several "GenericType"s as escape hatches. That should make sure we can map an untyped artifact object into the system.
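For concreteness, a minimal sketch of what "always typed" and "payload referenced by `uri`" look like in today's `tfx.types` Python API (illustrative only, and independent of the `GenericType` work mentioned above; the uri is hypothetical):

```python
# Illustrative only: a typed artifact in the current TFX Python API.
from tfx.types import standard_artifacts

# `Examples` is a typed artifact; it is backed by an MLMD ArtifactType.
examples = standard_artifacts.Examples()

# The payload is never stored in metadata itself; it is referenced by uri.
examples.uri = '/pipeline_root/example_gen/examples/1'

print(examples.type_name)  # -> 'Examples'
```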
combining `ResolverNode` and its only consumer) or can be specialized to
a certain platform (e.g., better leverage
[Fusion](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization)
in Dataflow to make data processing more efficient).
Additionally, for organizations with existing metadata/lineage tracking services, the IR should provide an extension mechanism to define and plug in custom "post-processors" so that the metadata can be synced to a different target...
This sounds like platform-specific extensions? Or does this need to be done in the TFX layer?
+1 to @casassg. A consistent data model published to MLMD should be viewed as a firm goal of this doc. I can imagine different ways to extend the "publisher" part to also publish to different metadata tracking services, but a non-trivial amount of data model design would probably need to happen, which I don't know if we could really address in the IR.
Echoing @zhitaoli: the top goal of this RFC is to guarantee a consistent data model published to the underlying datastore. MLMD is the one TFX chooses to use, and it is itself a layered solution that supports multiple backends. For more customized use cases, customization at the code level will be needed.
as a pair of symmetrical actions under the context of TFX DSL, they are very
different when being mapped to the underlying data model:

* ‘Writing to a `Channel`’ essentially means publishing artifacts to MLMD. In
I would give flexibility here for users to create their own "channels", so that artifacts can be published to a different metadata backend if they are written to that channel.
I guess that is better achieved by extending the publisher object? `Channel` actually keeps the entire publishing and resolving abstract.
@animeshsingh I see your point. The `Channel` here is mainly an abstract concept that carries the data model. The implementation can be customized in the execution stack (launcher, etc.).
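To make the asymmetry concrete, here is a rough sketch of what "writing to a `Channel`" boils down to at the MLMD layer, using the ml-metadata Python client directly. In practice this happens inside the publisher in the execution stack; the snippet (and its paths) is illustrative, not the `Channel` API itself.

```python
# Rough sketch: publishing an artifact to MLMD with the ml-metadata client.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Local SQLite-backed store, for illustration.
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = '/tmp/mlmd.sqlite'
config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE)
store = metadata_store.MetadataStore(config)

# Register (or look up) the artifact type, then publish the artifact.
type_id = store.put_artifact_type(
    metadata_store_pb2.ArtifactType(name='Examples'))
artifact = metadata_store_pb2.Artifact(
    type_id=type_id,
    uri='/pipeline_root/example_gen/examples/1')
[artifact_id] = store.put_artifacts([artifact])
```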
@neuromage @james-jwu Per the discussion in kubeflow/pipelines#3703, does the proposed IR define the contract between the KFP SDK, UI, and backend? Would an SDK/UI/backend have to fully support the IR to be compliant? /cc @animeshsingh
… realize the IR-based execution workflow introduced in the TFX IR RFC: tensorflow/community#271 PiperOrigin-RevId: 325953252
… realize the IR-based execution workflow introduced in the TFX IR RFC: tensorflow/community#271 PiperOrigin-RevId: 326141955
This RFC for TFX DSL looks great. Is there a recommended way to define the KFP exit handler in terms of this TFX IR? Thanks.
Required? | Type of Predicates
:-------: | :------------------------------------------------------------------
Yes       | Predicates on the type of artifacts
Sorry, I'm new to TFX `Channel`; what are the supported types of artifacts in a `Channel`? Is it possible to define an artifact with no type?
The currently supported 'type' can be any MLMD ArtifactType. A `GenericType` is something we're working on.
/cc @charlesccychen
/cc @zhitaoli
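For reference, the required type predicate is essentially what the `Channel` constructor already takes in the Python DSL today (a small sketch, assuming the `tfx.types` API):

```python
# Illustrative only: every Channel declares the artifact type it carries,
# which is the required "predicate on the type of artifacts" in the table above.
from tfx.types import Channel, standard_artifacts

examples_channel = Channel(type=standard_artifacts.Examples)
print(examples_channel.type_name)  # -> 'Examples'
```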
Great to see this much detail in this design doc. I suspect that arriving at a full, formal IR specification will take a number of iterations, trying things out and getting feedback from the community. It would be helpful to lay out the process for developing the spec; for instance, will there be additional RFCs in the future? Or should a working group be formed to discuss the spec? Or some other process?
* Enables attaching platform specific extensions to IR in a transparent
  way so that it can be understood by specific platform runners.
This should be useful for targeting features in different backends, for instance K8s options in Kubeflow. It also implies that the DSL would provide ways to expose these platform-specific features to the users. Is there such support currently in TFX? If so, an example would be helpful.
@ruoyu90 is working on this and I believe it will be available soon.
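For context, the closest existing mechanism is the platform-specific configuration accepted by the Kubeflow runner, e.g. `pipeline_operator_funcs` for K8s-level options. This is a sketch assuming the current `KubeflowDagRunner` API (the image and secret names are hypothetical); the transparent per-node IR extension is what this bullet proposes to generalize.

```python
# Illustrative only: attaching platform-specific (K8s/KFP) options today.
from kfp import gcp
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Start from the default KFP pod mutations and add a K8s-level option.
operator_funcs = kubeflow_dag_runner.get_default_pipeline_operator_funcs()
operator_funcs.append(gcp.use_gcp_secret('user-gcp-sa'))  # hypothetical secret

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    pipeline_operator_funcs=operator_funcs,
    tfx_image='gcr.io/my-project/my-tfx-image:latest',  # hypothetical image
)
runner = kubeflow_dag_runner.KubeflowDagRunner(config=runner_config)
```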
The IR should provide a serialized format of TFX DSL that has the following
properties:

1. Carries over all TFX DSL semantics in a uniform way. This means that the
There needs to be strong support for existing Kubeflow Pipelines 1.0-based components and DSLs. I would imagine existing pipeline components would need additional markup to be converted to TFX IR, which TFX can then convert into KFP/Argo-usable YAML. However, this upgrade path must be there for strong adoption in a reasonable time frame.
We did not cover the topic of adapting existing KFP components into TFX in this doc, because it's unclear how far we will attempt to go with that.
@charlesccychen might have further proposals to load (some subset of) KFP components in YAML format into TFX as components, but that is still at an early stage.
Note that KFP might propose its own IR which is similar to this, so I'm not sure we need to address this path immediately.
* Enables using different frontend languages to compose TFX pipelines.
* Enables attaching platform specific extensions to IR in a transparent
  way so that it can be understood by specific platform runners.
* Enables applying different optimization strategies on top of the
Are there any research papers / more details on how optimization can be applied? The Dataflow doc isn't clear on how to create optimizations or what this actually means.
AFAIK there is no paper yet about the specific optimization mentioned here. We will have an RFC for that once a detailed design is available :)
Thanks. I think we can proceed to merge this PR. Some healthy discussion may still happen, but it doesn't seem to suggest major concerns about the direction.
The KFP community has a similar but separate IR proposal that might have the exit handler explicitly on the roadmap. /cc @hongye-sun
Re: @RedbackThomson:
MLMD itself is already a layered solution that supports multiple backends. It is possible to extend it with a new datastore backend.
This proposal is mainly about formalizing the data model as well as the execution pattern of TFX, so that more features / generalizations can be added on top in a principled way.
This RFC has been accepted.
Update: we have extended the feedback phase to Friday, August 21, 2020.
Objective
This RFC documents the data model that supports the TFX DSL
semantics.
It also introduces the TFX DSL intermediate representation (IR) and the workflow
based on it. The IR is the bridge between the DSL and its orchestration /
execution on all supported platforms, and the workflow is the procedure that all
platforms should follow to reflect the data model in MLMD.
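To make the intended split concrete, below is a hypothetical, self-contained sketch of the workflow this paragraph describes: a frontend compiles the DSL into a serialized IR, and any platform runner walks that IR while recording artifacts and executions in MLMD. Every name in the sketch is illustrative; the real IR will be a proto whose details are defined in this and follow-up designs.

```python
# Hypothetical sketch only: a toy stand-in for the IR and the platform workflow.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeIR:
  """One node of the pipeline, as a platform runner would see it."""
  id: str
  executor_image: str
  inputs: Dict[str, str] = field(default_factory=dict)  # input key -> producer node id
  output_keys: List[str] = field(default_factory=list)


@dataclass
class PipelineIR:
  """The serialized pipeline handed from the DSL frontend to any runner."""
  name: str
  nodes: List[NodeIR]


def run_on_platform(ir: PipelineIR) -> None:
  """The procedure every platform follows: resolve, execute, publish to MLMD."""
  for node in ir.nodes:  # a real runner would respect the dependency graph
    # 1) resolve `node.inputs` from MLMD, 2) launch `node.executor_image`,
    # 3) publish output artifacts and the execution back to MLMD.
    print(f'[{ir.name}] would execute node {node.id} ({node.executor_image})')


run_on_platform(PipelineIR(
    name='demo',
    nodes=[NodeIR(id='ExampleGen', executor_image='gcr.io/example/example_gen')]))
```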
NOTE: While this doc has more detail than a typical design doc, it is still a design doc rather than a spec. However, the long-term goal is to make the IR a specification for ML pipelines.
@hughmiao
@hongye-sun
@rmgogogo
@zhitaoli
@paveldournov
@neuromage
@james-jwu
@theadactyl
@rcrowe-google