
[Design Proposal] Vega/Vega-Lite as visualization dsl for Kubeflow metadata-ui #3187

Closed
eterna2 opened this issue Feb 28, 2020 · 17 comments
Labels: area/frontend, kind/discussion, kind/feature, lifecycle/stale, needs investigation, status/triaged

Comments

@eterna2 (Contributor) commented Feb 28, 2020

Background

Currently, kfp manages visualization through a collection of viewer components.

Ignoring viewers like markdown, html, tensorboard, etc., visualization in kfp falls into 2 groups:

  • vis pre-built by kfp developers with react-vis (usually via metadata-ui artifacts)
  • pre-defined and user-defined vis based on python (generated via the jupyter notebook display function)

Proposal

Use Vega/Vega-Lite as a visualization dsl for

  • specifying metadata-ui artifacts, and
  • generating custom vis for data that has no readily-available python lib for visualizing it - i.e. an alternative to python custom vis.

Pros

  • language agnostic: uses a JSON-based dsl to describe visualizations

  • simple: a simple and concise grammar generates the most common visualizations (esp Vega-Lite)
    Example: bar chart

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "A simple bar chart with embedded data.",
  "data": {
    "values": [
      {"a": "A", "b": 28}, {"a": "B", "b": 55}, {"a": "C", "b": 43},
      {"a": "D", "b": 91}, {"a": "E", "b": 81}, {"a": "F", "b": 53},
      {"a": "G", "b": 19}, {"a": "H", "b": 87}, {"a": "I", "b": 52}
    ]
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "a", "type": "ordinal"},
    "y": {"field": "b", "type": "quantitative"}
  }
}
  • supports multiple data formats: e.g. csv, tsv, geojson/topojson (for maps), json, etc. (see the sketch after this list)

  • supports multiple/custom loader types: e.g. http request, inlined, data stream, etc.

  • composable: the Vega dsl is designed to be composable, which makes it easy to create visualizations from existing vis components - i.e. it is easy to wrap the dsl as composable vis components in the UI.

  • can be implemented entirely client-side: no backend service is needed to generate the html for the visualization
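For example (a minimal sketch; the url is a placeholder), the same bar chart spec can load remote csv data instead of inlined values:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "The same bar chart, loading its data from a remote csv file.",
  "data": {
    "url": "https://example.com/my-data.csv",
    "format": {"type": "csv"}
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "a", "type": "ordinal"},
    "y": {"field": "b", "type": "quantitative"}
  }
}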

Cons

None I can think of.

Concept Details

  1. Propose that the mlpipeline-ui-metadata.json artifact support Vega and Vega-Lite, i.e.:

{
  "version": 1,
  "outputs": [
    {
      "type": "vega-lite",
      "data": { "my-matrix": "my-dir/my-matrix.csv" },  // data to be passed to vega spec to be rendered
      "spec": { ... },  // vega or vega-lite spec or reference pre-defined vega/vega-lite specs
    }
  ]
}
  2. Replace existing vis components created with react-vis with react-vega:

It is easier to map the existing ui-metadata schema to a vega-lite spec and then generate the corresponding vis component, rather than implementing an individual vis component each time we need to support a new vis.

import React from 'react';
import { Vega } from 'react-vega';

export const RocCurve = props => {
  const spec = {...};  // pre-defined spec
  return <Vega spec={spec} data={props.data} />;
};
  3. Enhance the custom vis creator to support Vega/Vega-Lite:
  • provide a simple UI to edit a Vega/Vega-Lite spec and render the vis (does not need a backend)

See https://vega.github.io/editor/#/examples/vega-lite/airport_connections

@Bobgy (Contributor) commented Feb 28, 2020

Thanks for the suggestion. This is a great idea!
I have some concerns about integrating vega as a first-party visualization:

  1. How popular is it among data scientists?
  2. How big is the bundle size? Do we need to include both vega and vega-lite?

Would it be enough if we provide some documentation on using them with embedded HTML?
e.g. https://vega.github.io/vega-lite/usage/embed.html#start-using-vega-lite-with-vega-embed
After my recent change supporting inline HTML visualization in #3177, I think it's fairly straightforward to generate an HTML file using vega without any changes to KFP. And we could also make a python wrapper for that, if there isn't one already (a sketch of such a wrapper follows).
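A minimal sketch of such a python wrapper, following the vega-embed usage page linked above (the function name is hypothetical):

import json


def vega_embed_html(spec: dict) -> str:
    """Wraps a Vega-Lite spec in a self-contained HTML page using vega-embed,
    suitable for the inline HTML visualization added in #3177."""
    return f"""<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-lite@4"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
</head>
<body>
  <div id="vis"></div>
  <script>vegaEmbed("#vis", {json.dumps(spec)});</script>
</body>
</html>"""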

@eterna2 (Contributor, Author) commented Feb 28, 2020

How popular is it among data scientists?

Hard to quantify its popularity, as there are too many vis tools out there.

It is probably more an engineer's tool than a data scientist's tool - it is generally used as a specification for vis (i.e. instead of saving a png of the charts, you save the spec together with your experiment metadata, params, dataset, etc.).

Data scientists would probably use an abstraction layer on top of it, e.g. altair or py-vega.

How big is the bundle size? Do we need to include both vega and vega-lite?

Fairly big, as it is quite comprehensive:
165 kb for Vega. Not sure about Vega-Lite. You only need Vega-Lite to transpile a Vega-Lite spec into a Vega spec.

You can probably do tree shaking to remove features you don't intend to support.

If size is a concern, we can do server-side rendering for the vis. Vega can output svg, png/jpeg, or a data url.

Would it be enough if we provide some documentation on using them with embedded HTML?

That would work, but it adds overhead for data scientists. Or do you think a better solution is for me to add a ui-metadata sdk?

Because one of the issues I have is that I always have to search for the format of ui-metadata, where to store it, and how to generate it for my kfp operator. In an ideal world, I would prefer a simple sdk to generate whatever vis I want, without needing to know the actual io.

import kfp.dsl

from kfp.dsl.vis import ConfusionMatrix, WebVis

@kfp.dsl.pipeline()
def some_pipeline():
    op = some_op()
    conf_mat = ConfusionMatrix(..., source=op.outputs.data1)
    # or
    op.add_vis(WebVis(html=some_html_creator_func))

It is probably not important enough a justification to switch to Vega, unless kubeflow is going to provide a richer set of vis.

But I like Vega particularly because the grammar is elegant and easy to remember. And switching between different vis for the same data is quite trivial - because it is composition rather than templates (unlike many other solutions, where different chart types have different params - Vega has very good separation).

Tldr
Essentially, I want to package vis artifacts as Vega specs together with data artifacts - i.e. vis should have its own consistent specification, and should be stored just like data artifacts.

Then the front end can render vis artifacts as-is, without much additional work.

And these vis artifacts can be used in different parts of kubeflow or other apps, because the Vega spec can serve as a common standard for vis artifacts - i.e. it is easy to render Vega charts from a provided spec.

Currently, there is no consistent standard for vis in kubeflow, as there is a mix of solutions - from dynamically generated py vis, to custom formats for specific vis (e.g. roc, confusion matrix, etc), to html web apps.

Alternatively, we could consider a separate vis service for kubeflow with its own crd - it would generate the required vis from a rest or grpc service.

@eterna2 (Contributor, Author) commented Feb 28, 2020

Something like this:

I have a simple cloud function that renders my chart (taking data from an http source) as a png.

Or as a web app link: Link

@Bobgy (Contributor) commented Feb 28, 2020

I think my main argument is that vega support is a feature that can be made convenient entirely by 3rd-party libraries/components, so I'm not seeing a strong enough reason to integrate it into the KFP system.

There are also other visualization libraries, with new ones coming out all the time.

The only exception: if we re-implement or introduce new first-party visualizations using vega directly, then it's probably worth supporting the vega json spec directly.

@eterna2 (Contributor, Author) commented Feb 28, 2020

Yeah, I agree with you on that. I can probably build it as an extension/plugin outside of kfp.

But what do you think about my suggestion of adding a vis sdk to the kfp dsl? Not about vega, but more about my pain with /mlpipeline-ui-metadata.json.

Because the biggest pain point for me when creating kfp ops is remembering the path and the format (i.e. how to populate this json).

I am proposing to add a kfp.dsl.vis module which can either:

  • append a vis to the op inside a pipeline, or
  • serve as a lib (like tf.io) to generate the /mlpipeline-ui-metadata.json inside the op itself (a rough sketch follows this list).
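A rough sketch of the second option (the module and function names are hypothetical, not an existing kfp API):

import json
from typing import List


def write_ui_metadata(outputs: List[dict], path: str = "/mlpipeline-ui-metadata.json"):
    """Writes the KFP UI metadata file, so op authors don't need to
    remember the well-known path or the json layout."""
    with open(path, "w") as f:
        json.dump({"version": 1, "outputs": outputs}, f)


# usage inside an op:
write_ui_metadata(
    [{"type": "markdown", "storage": "inline", "source": "### hello world"}]
)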

Maybe I will do an actual mvp as a kfp.contrib to demonstrate my idea.

@Bobgy (Contributor) commented Feb 28, 2020

@eterna2 I'm no expert on the sdk, but it also took me some effort to figure out how to write metadata with the sdk, so I'd prefer the sdk to have builtin support.

/cc @Ark-kun
/cc @hongye-sun
/cc @numerology
for sdk related proposal

@eterna2 (Contributor, Author) commented Mar 2, 2020

Ok, I created a kfx package at https://github.com/e2fyi/kfx/ to demonstrate my idea.

It works now, although I feel it is a bit convoluted.

In this example, I am using ArtifactLocationHelper to modify the kfp task with env variables that contain metadata about the Argo configs (which must be provided by the user).

  • I then use KfpArtifact to retrieve this metadata to generate the url to the artifact (to be used as the source for mlpipeline-ui-metadata), as well as the API call to the UI to get the artifact (for data loading in Vega).

This is bad mostly because it relies on the user knowing the Argo configmap and setting it.

I would prefer the UI artifact API to support workflow.name or some other identifier - i.e. instead of just source, bucket and key, we could support workflow.name + artifact name, which could be used to retrieve the necessary info to get the artifact (similar to what I did to get the pod logs from the Argo artifact repository).

This removes the need for the user to know anything about Argo. And we can meta-declare a source or url to be an artifact generated by kfp tasks.

import kfp.components
import kfp.dsl
import kfx.dsl

from kfp.components import OutputTextFile


# creates the helper that has the argo configs (tells you how artifacts will be stored)
# see https://github.com/argoproj/argo/blob/master/docs/workflow-controller-configmap.yaml
helper = kfx.dsl.ArtifactLocationHelper(
    scheme="minio", bucket="mlpipeline", key_prefix="artifacts/"
)


@kfp.components.func_to_container_op
def test_op(
    mlpipeline_ui_metadata: OutputTextFile(str),
    markdown_data_file: OutputTextFile(str),
    vega_data_file: OutputTextFile(str),
):
    "A test kubeflow pipeline task."

    import json

    import kfx.dsl
    import kfx.vis
    import kfx.vis.vega

    data = [
        {"a": "A", "b": 28},
        {"a": "B", "b": 55},
        {"a": "C", "b": 43},
        {"a": "D", "b": 91},
        {"a": "E", "b": 81},
        {"a": "F", "b": 53},
        {"a": "G", "b": 19},
        {"a": "H", "b": 87},
        {"a": "I", "b": 52},
    ]
    vega_data_file.write(json.dumps(data))

    # `KfpArtifact` provides the reference to data artifact created
    # inside this task
    spec = {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "description": "A simple bar chart",
        "data": {
            "url": kfx.dsl.KfpArtifact("vega_data_file"),
            "format": {"type": "json"},
        },
        "mark": "bar",
        "encoding": {
            "x": {"field": "a", "type": "ordinal"},
            "y": {"field": "b", "type": "quantitative"},
        },
    }

    # write the markdown to the `markdown-data` artifact
    markdown_data_file.write("### hello world")

    # creates an ui metadata object
    ui_metadata = kfx.vis.kfp_ui_metadata(
        # Describes the vis to generate in the kubeflow pipeline UI.
        [
            # markdown vis from a markdown artifact.
            # `KfpArtifact` provides the reference to data artifact created
            # inside this task
            kfx.vis.markdown(kfx.dsl.KfpArtifact("markdown_data_file")),
            # a vega web app from the vega data artifact.
            kfx.vis.vega.vega_web_app(spec),
        ]
    )

    # writes the ui metadata object as the `mlpipeline-ui-metadata` artifact
    mlpipeline_ui_metadata.write(kfx.vis.asjson(ui_metadata))

    # prints the uri to the markdown artifact
    print(ui_metadata.outputs[0].source)


@kfp.dsl.pipeline()
def test_pipeline():
    "A test kubeflow pipeline"

    op: kfp.dsl.ContainerOp = test_op()

    # modify kfp operator with artifact location metadata through env vars
    op.apply(helper.set_envs())

@eterna2 (Contributor, Author) commented Mar 2, 2020

I also wrote pydantic data models for mlpipeline-ui-metadata and generated the corresponding json schema for the file (a rough sketch of the idea follows).
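A minimal sketch of what such models could look like (the class names and field defaults here are illustrative; see the kfx repo for the actual models):

from typing import List, Optional

from pydantic import BaseModel


class KfpUiOutput(BaseModel):
    """One entry of `outputs` in mlpipeline-ui-metadata."""

    type: str                      # e.g. "markdown", "web-app", "table"
    source: Optional[str] = None   # a uri, or the inline content itself
    storage: Optional[str] = None  # e.g. "inline"


class KfpUiMetadata(BaseModel):
    """Top-level mlpipeline-ui-metadata document."""

    version: int = 1
    outputs: List[KfpUiOutput] = []


# pydantic can emit the corresponding json schema directly:
print(KfpUiMetadata.schema_json(indent=2))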

@Bobgy (Contributor) commented Mar 2, 2020

@eterna2 Looks great!
A quick question: is it a requirement to store visualization data in an external source? If you just store it inline inside mlpipeline-ui-metadata, then the user doesn't need to know so much other context.

I guess you have good reasons to do so; I just want to understand them.

@eterna2 (Contributor, Author) commented Mar 2, 2020

It depends on the data size. For small datasets, we probably can inline.

My previous use cases were mostly geospatial simulations, which generate quite a lot of logs.

And usually we want to store these logs separately.

@eterna2 (Contributor, Author) commented Mar 2, 2020

But I agree that inline should solve 90% of the use cases, and is probably the better solution. I did not think of that.

I can probably generate multiple "baked" vis separately from the logs - e.g. an inlined variant like the sketch below.
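For illustration, an inlined variant of the earlier mlpipeline-ui-metadata proposal could look like this (the exact field layout for a vega-lite output is hypothetical; "storage": "inline" follows the existing ui-metadata convention):

{
  "version": 1,
  "outputs": [
    {
      "type": "vega-lite",
      "storage": "inline",
      "spec": {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "mark": "bar",
        "data": {"values": [{"a": "A", "b": 28}, {"a": "B", "b": 55}]},
        "encoding": {
          "x": {"field": "a", "type": "ordinal"},
          "y": {"field": "b", "type": "quantitative"}
        }
      }
    }
  ]
}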

@eterna2 (Contributor, Author) commented Mar 2, 2020

In this case, I can probably provide helper classes to convert a sklearn confusion matrix etc. into inline UI metadata (sketched below).
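A rough sketch of such a helper, reusing KFP's existing confusion_matrix ui-metadata format with inline storage (the function name is hypothetical):

import json

from sklearn.metrics import confusion_matrix


def confusion_matrix_ui_output(y_true, y_pred, labels) -> dict:
    """Converts sklearn predictions into an inline KFP confusion_matrix output."""
    matrix = confusion_matrix(y_true, y_pred, labels=labels)
    rows = "\n".join(
        f"{labels[i]},{labels[j]},{matrix[i][j]}"
        for i in range(len(labels))
        for j in range(len(labels))
    )
    return {
        "type": "confusion_matrix",
        "format": "csv",
        "schema": [
            {"name": "target", "type": "CATEGORY"},
            {"name": "predicted", "type": "CATEGORY"},
            {"name": "count", "type": "NUMBER"},
        ],
        "storage": "inline",
        "source": rows,
        "labels": [str(label) for label in labels],
    }


# usage inside an op:
ui_metadata = {
    "version": 1,
    "outputs": [
        confusion_matrix_ui_output(
            ["cat", "dog", "cat"], ["cat", "cat", "cat"], ["cat", "dog"]
        )
    ],
}
with open("/mlpipeline-ui-metadata.json", "w") as f:
    json.dump(ui_metadata, f)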

@Bobgy (Contributor) commented Mar 2, 2020

I see - an external source is definitely needed if the data size is huge or an ACL is needed on the data. That seems like complexity unrelated to KFP, so a helper sdk would be useful in this case.

Also, for inline cases, a helper is as good as a 1st-party integration.

@eterna2 (Contributor, Author) commented Mar 3, 2020

I encountered some issues getting my artifacts to work inside the iframe, as the iframe does not grant the allow-same-origin permission - the request origin is null.

I.e., I can't make any requests to the node server.

The only 3 solutions I see are:

  • only support inlined data
  • add a CORS flag to the node server (as a param)
  • add the allow-same-origin permission to the iframe

The latter two seem equally risky from a security point of view.

@Bobgy (Contributor) commented Mar 3, 2020

Yes, the visualization is expected to only access inline data and open data for security reasons.

Bobgy added the status/triaged label on Mar 4, 2020
stale bot commented Jun 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the lifecycle/stale label on Jun 24, 2020
stale bot commented Jul 1, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

The stale bot closed this issue on Jul 1, 2020