-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support extracting transformed chart data using VegaFusion #3081
Conversation
Make method internal while still experimental
altair/utils/transformed_data.py
Outdated
# Compile to Vega and extract inline DataFrames | ||
with data_transformers.enable("vegafusion-inline"): | ||
vega_spec = chart.to_dict(format="vega") | ||
inline_datasets = get_inline_datasets_for_spec(vega_spec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to bring the vegafusion-inline
transformer and the get_inline_datasets_for_spec
function over to Altair eventually, but will save that for a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my own understanding, does the vegafusion-inline
transformer use the entrypoint system to register itself or why is it available here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's defined and registered in vegafusion during import: https://github.com/hex-inc/vegafusion/blob/ae2040ebf93b1a69328be9df2ec831bd8fbb6ff4/python/vegafusion/vegafusion/transformer.py#L256
altair/utils/transformed_data.py
Outdated
... ] | ||
... } | ||
>>> get_group_mark_for_scope(spec, (1,)) | ||
{'type': 'group', 'marks': [{'type': 'rect'}]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does hatch run test
automatically run doctests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hatch run test
runs the following:
black --diff --color --check .
ruff check .
mypy altair tests
python -m pytest --pyargs --doctest-modules tests
See https://github.com/altair-viz/altair/blob/master/pyproject.toml#L104-L107
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jonmmease, this is going to be a great feature! I don't have a preference if we keep it as ._transformed_data
for now until we have some documentation as well.
altair/utils/transformed_data.py
Outdated
DataFrame or list of DataFrame | ||
If input chart is a Chart or Facet Chart, returns a DataFrame of the transformed data | ||
Otherwise, returns a list of DataFrames of the transformed data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrame or list of DataFrame | |
If input chart is a Chart or Facet Chart, returns a DataFrame of the transformed data | |
Otherwise, returns a list of DataFrames of the transformed data | |
Pandas DataFrame, list of Pandas DataFrames or None | |
If input chart is a Chart or Facet Chart, returns a Pandas DataFrame of the transformed data if it exists and None if the chart does not have a dataset | |
Otherwise, returns a list of Pandas DataFrames of the transformed data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 75cf958.
As a side note, pandas is supposed to always be lower case, even at the start of a sentence. See https://pandas.pydata.org/about/citing.html.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know, thanks!
altair/utils/transformed_data.py
Outdated
# Compile to Vega and extract inline DataFrames | ||
with data_transformers.enable("vegafusion-inline"): | ||
vega_spec = chart.to_dict(format="vega") | ||
inline_datasets = get_inline_datasets_for_spec(vega_spec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my own understanding, does the vegafusion-inline
transformer use the entrypoint system to register itself or why is it available here?
altair/utils/transformed_data.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about making this module "internal/private" by renaming it to _transformed_data.py
? Almost everything in Altair so far is public which makes further development difficult in some areas (and bloats up the top-level API and the "API reference" section in the documentation, see #2918, although that would not be the case here). I'd be in favour that we are more selective moving forward with new features and mark modules or functions as private if we don't expect a user to access them directly. Users should use the Chart.transformed_data
method anyway and it would make it easier to iterate on this once you start integrating more of the VegaFusion functionality.
altair/utils/transformed_data.py
Outdated
from altair.utils.schemapi import Undefined | ||
|
||
Scope = Tuple[int, ...] | ||
FacetMapping = Dict[Tuple[str, Scope], Tuple[str, Scope]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate that you're adding type hints! :)
altair/vegalite/v5/api.py
Outdated
DataFrame | ||
Transformed data as a DataFrame | ||
""" | ||
from ...utils.transformed_data import transformed_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing codebase uses relative imports in many places so this would be consistent but I'm in favour of switching to absolute ones. They are easier to read, especially as Altair has many modules with the same names on different levels.
source = pkgutil.get_data(examples_methods_syntax.__name__, filename) | ||
chart = eval_block(source) | ||
df = chart._transformed_data() | ||
assert len(df) == rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it makes sense to also check if this dataframe no nulls? assert df.notnull().all().all()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. When the input DataFrame has nulls it's possible for these to be pass through to the transformed data. Vega-Lite usually filters null values for the columns that are used in the chart, but transformed_data
returns all of the columns, so the unused columns can still have nulls.
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
…tair into jonmmease/transformed_data
Thanks for the review @binste! I think I've addressed all of your feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, thanks! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jonmmease! I've placed a few more questions inline.
altair/utils/_transformed_data.py
Outdated
Scope = Tuple[int, ...] | ||
FacetMapping = Dict[Tuple[str, Scope], Tuple[str, Scope]] | ||
|
||
MAGIC_CHART_NAME = "_vf_mark{}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so fond of the word magic
. Can we describe what it actually is? And replace all places where the word is used?
Is this the name of each view? I think we name if elsewhere view
+_
+<incrementing value>
. Maybe align? Or, if different, add a trailing underscore before the incrementing number.
Or is this the reference name of the dataset within a view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, happy to rename this. This is the name that is applied to each unnamed Chart/Subchart/View.
Changed to:
VIEW_NAME = "altair_view_{}"
in 280eb0f.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question, if this also defines a view name, could we also use _get_name()
? Or will you run into conflicts then?
_get_name()
is defined on the Chart
class, see here: https://github.com/altair-viz/altair/blob/master/altair/vegalite/v5/api.py#L2569-L2571.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't familiar with _get_name(). Yeah, I think this would work. I'll give it a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 88fceb5
altair/utils/_transformed_data.py
Outdated
) from err | ||
|
||
if isinstance(chart, Chart): | ||
# Add dummy mark if None specified to satisfy Vega-Lite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename into template mark
if this section is really necessary, but I was wondering if we need this? Shouldn't this raise a warning or error instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded comment in aabf5d6.
The reason to do this is so that it's possible to call chart.transformed_data()
on a chart with transforms even if no mark has been defined. For example:
import pandas as pd
chart = alt.Chart(
pd.DataFrame({"a": [1, 2, 3], "b": ["A", "BB", "CCC"]})
)
chart.transform_filter("datum.a > 1")._transformed_data()
a | b | |
---|---|---|
0 | 2 | BB |
1 | 3 | CCC |
Not sure if this will be a common use case, but I think it's neat to be able to use an Altair chart kind of like a DataFrame this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a chart anymore😉! Thinking aloud, without judging: I always thought it was good that dataframe libraries make a clear distinction between dataframe operations and visualizations. With the powerful options introduced in this PR, altair is slowly turning into a dataframe library as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One idea I've had is that Altair could provide a alt.Dataset
object that supports the .transform_*
methods and the mark_*
methods. The .transform_*
methods would return a new Dataset
and the .mark_*
methods would return a Chart
.
This is somewhat similar to how HoloViews works (https://holoviews.org/getting_started/Tabular_Datasets.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this concept. I don't think it is possible to run transforms in Vega-Lite without defining a mark.
Naming is also an issue. alt.Data
and alt.Dataset
already exists. Maybe we can extend these? Worth raising a follow up issue on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this it is possible to run transforms in Vega-Lite without defining a mark
Yes, this is why I add a mark when none is defined before calling vega-lite through vl-convert.
Naming is also an issue. alt.Data and alt.Dataset already exists
Or maybe just alt.DataFrame
? I'll open a follow up issue.
subcharts = chart.vconcat | ||
elif isinstance(chart, ConcatChart): | ||
subcharts = chart.concat | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make a special raise for RepeatChart
? Or is this not applicable here? It may feel like an omission.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, RepeatChart
doesn't have a _transformed_data()
method, so it won't get this far. We could add a _transformed_data()
method to RepeatChart
that always raises an exception that it's not supported. But this might also be confusing (why have the method as all if it never works). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering, why can it never works? A NotImplementedError
seems OK for me for now. It is also confusing if RepeatChart
is the only Chart
without a transformed_data()
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a NotImplementedError sounds like a good approach. I'll add that. I think it's probably possible to support RepeatChart
in the future, I just haven't work through quite what it means, and how to process the structure of the Vega that Vega-Lite produces in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in a738408
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that all top-level charts have the ._transformed_data
method, you could add it to TopLevelMixin
so there is no need anymore to repeat it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. The return type in the type signature is different between Chart
/FacetChart
(where it's Optional[DataFrameLike]
, and the others (where it's List[DataFrameLike]
.
For the standalone utility transformed_data
function I used typeing.overload
to express the two signatures. But I wasn't sure how to accomplish this in a superclass method when the type signature depends on the type of self
. Do you have any ideas on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Just tried and got this far with defining _transformed_data on TopLevelMixin:
@overload
def _transformed_data(
self: Union["Chart", "FacetChart", "RepeatChart"],
row_limit: Optional[int] = ...,
exclude: Optional[Iterable[str]] = ...,
) -> Optional[DataFrameLike]:
...
@overload
def _transformed_data(
self: Union["LayerChart", "ConcatChart", "HConcatChart", "VConcatChart"],
row_limit: Optional[int] = ...,
exclude: Optional[Iterable[str]] = ...,
) -> List[DataFrameLike]:
...
def _transformed_data(
self,
row_limit=None,
exclude=None,
):
But then mypy complains, rightly so, that these types are not superclasses of TopLevelMixin
:
altair/vegalite/v5/api.py:2407: error: The erased type of self "Union[altair.vegalite.v5.api.Chart, altair.vegalite.v5.api.FacetChart, altair.vegalite.v5.api.RepeatChart]" is not a supertype of its class "altair.vegalite.v5.api.TopLevelMixin" [misc]
altair/vegalite/v5/api.py:2415: error: The erased type of self "Union[altair.vegalite.v5.api.LayerChart, altair.vegalite.v5.api.ConcatChart, altair.vegalite.v5.api.HConcatChart, altair.vegalite.v5.api.VConcatChart]" is not a supertype of its class "altair.vegalite.v5.api.TopLevelMixin" [misc]
According to this GH issue and some others, the solution might be to create protocols whcih identify the individual chart types but that seems to be more complicated then just keeping it as it is now -> I'm ok with leaving it.
altair/vegalite/v5/api.py
Outdated
self, | ||
row_limit: Optional[int] = None, | ||
exclude: Optional[Iterable[str]] = None, | ||
) -> List[pd.DataFrame]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Output is always a list of pandas dataframes? Do we want this to eventually support other dataframes-a-like as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now in VegaFusion, these results can actually be an Arrow tables (if the input Chart(s) wrap arrow tables), a Polars DataFrames (if the input Chart(s) wrap Polars DataFrames), or pandas DataFrames (all other cases).
It seemed to me that this would make for a pretty awkward type signature, so I left it as just pandas. It could be a Union
, but arrow and polars are optional dependencies so I'm not sure how to have the Union include different options based on what's installed. Do you have any ideas here @mattijn or @binste?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define a DataFrameLike
type (elsewhere in the codebase) that can refer to any object with a .__dataframe__
attribute, regardless of whether it's a pandas DataFrame, polars DataFrame, or any other object? And use that type here? Would that help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea, I'll look into that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 6f43d6b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you double check this @binste? Should a type definition comes back in the root of altair (see changes in __init__
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of renaming it to _DataFrameLike
so no one starts using it themselves and it won't show up in __init__
. I'm doing the same in #2976. Gives us the flexibility to gather these types at a later stage into a types.py
module if that makes sense or move them around/redefine otherwise.
Apart from that I like the protocol usage. We can narrow down the type further once we figure out how to best do that but at least it's better then Any
while being correct. At a later stage (not in this PR), we can try to use generics to define the return type of transformed_data
based on the type of the .data
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also move the Protocol
import into a if sys.version_info >= (3, 8):
statement? It was introduced in Python 3.8 and we try to import from the official library as soon as it's available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll move the protocol.
I might not have the right understanding here, but since DataFrameLike
is in the return type of the public method, doesn't the type need to be public as well? Otherwise, Altair users wouldn't be able to work with it in their own typed code.
The returned DataFrames can be arrow or Polars as well if those are used as the input to the Chart.
One comment. Naming is hard, but I’m not sure if |
I chose |
What about
|
I'm not a fan of If we think of the transforms on a chart as being analogous to a lazy data frame, we could consider names like I like |
Agree, naming is hard... If GitHub CoPilot could solve this I would be very happy ;) Two points from my side:
Hence, I like |
I liked the
Should the output type be identical to the input type (eg. what is the input type of a url)? Otherwise can do Anyhow, no strong feelings for or against the suggestions. |
That's a good point. Makes it more stable if a user can declare what they want in return independent of what they put in. As another idea: |
To summarize, up until now we have discussed:
Currently in Altair we do it as such:
So than it could be consistent to say
Personally, I like it when there is a grouping using For example:
I'm aware that the latter (using |
Thanks for the thorough summary @mattijn. It sounds like we've all landed on leaving the name as transformed_data. I'll still leave it as an internal method for this PR, so it's not completely final, but let's move forward as-is. I think the last remaining question for this PR is whether the |
It's a great point that the types should be public as well. I've never type hinted a publicly accessible library so if someone has more experience please chime in but it sounds reasonable to me. I'm still unsure how to best organise these types. In #2976 I actually wanted to introduce the same protocol. I'll probably switch to yours in case this PR is merged first. How about this: We mark these types as private for now to give us the freedom to refactor. Altair is not yet marked as a typed package (i.e. it does not have a Thanks @mattijn for the overview, this is helpful! I also find the |
Overview
As discussed in #3054, this PR ports the
transformed_data
logic from VegaFusion into Altair. See https://vegafusion.io/transformed_data.html for some background on VegaFusion'stransformed_data
function.A standalone
transformed_data
utility function is added toaltair/utils/transformed_data.py
, and achart._transformed_data
method is added toChart
,FacetChart
,LayerChart
,ConcatChart
,HConcatChart
, andVConcatChart
(but notRepeatChart
, as I'm not sure how to do this yet). I added a leading underscore to the chart method so that we can use this for internal testing (and documentation) before making it a public, but happy to reconsider that decision.Example
Testing
I added tests for the expected transformed data length and (a subset of) the columns for all of the example charts that are currently supported
Limitations
VegaFusion doesn't support every Vega transform, and it doesn't support every option for all of the transforms it does support. When an unsupported transform is required, VegaFusion raises an, admittedly not very helpful, error like this:
Tracking improvements to this error message in vega/vegafusion#336. Tracking improvements to the documentation of supported transforms in vega/vegafusion#337.