Support DataFrame Interchange Protocol (allow Polars DataFrames) #2888
I'll review this later, it looks good, but I'm not against being a bit more experimental here. Maybe we can explore using the dataframe protocol, https://data-apis.org/dataframe-protocol/latest/index.html. I know pyarrow, polars and pandas already support this through I tried a few things, see here: https://gist.github.com/mattijn/45752432a65e1018512305d0ca228d40. Fallback would be to serialise it into an IPC stream / feather byte array and parse this into the Vega-lite spec or in the |
I noticed a few issues that I could not address with inline review suggestions. You can look at commit 1fbb7c1 to see what I changed.
Now the following example:
import polars as pl
import pandas as pd
import altair as alt
df_pl = pl.DataFrame(
{
"A": [9, 8, 7, 6, 5],
"cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
"optional": [28, 300, None, 2, -30],
}
)
df_pd = pd.DataFrame(
{
"A": [1, 2, 3, 4, 5],
"cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
"optional": [28, 300, None, 2, -30],
}
)
c1 = alt.Chart(df_pl, height=20, title='polars.DataFrame').mark_bar().encode(
x="A:O",
y="optional:Q",
color="cars:N"
)
c2 = alt.Chart(df_pd, height=20, title='pandas.DataFrame').mark_bar().encode(
x="A:O",
y="optional:Q",
color="cars:N"
)
print(alt.vconcat(c1, c2).to_json())
returns:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.2.0.json",
"config": {
"view": {
"continuousHeight": 300,
"continuousWidth": 300
}
},
"datasets": {
"data-90855424d1d3c3df2b450ef6e4564242": [
{
"A": 9,
"cars": "beetle",
"optional": 28
},
{
"A": 8,
"cars": "audi",
"optional": 300
},
{
"A": 7,
"cars": "beetle",
"optional": null
},
{
"A": 6,
"cars": "beetle",
"optional": 2
},
{
"A": 5,
"cars": "beetle",
"optional": -30
}
],
"data-faf1e5382ce70f32dc6c22613bf3493d": [
{
"A": 1,
"cars": "beetle",
"optional": 28.0
},
{
"A": 2,
"cars": "audi",
"optional": 300.0
},
{
"A": 3,
"cars": "beetle",
"optional": null
},
{
"A": 4,
"cars": "beetle",
"optional": 2.0
},
{
"A": 5,
"cars": "beetle",
"optional": -30.0
}
]
},
"vconcat": [
{
"data": {
"name": "data-90855424d1d3c3df2b450ef6e4564242"
},
"encoding": {
"color": {
"field": "cars",
"type": "nominal"
},
"x": {
"field": "A",
"type": "ordinal"
},
"y": {
"field": "optional",
"type": "quantitative"
}
},
"height": 20,
"mark": {
"type": "bar"
},
"title": "polars.DataFrame"
},
{
"data": {
"name": "data-faf1e5382ce70f32dc6c22613bf3493d"
},
"encoding": {
"color": {
"field": "cars",
"type": "nominal"
},
"x": {
"field": "A",
"type": "ordinal"
},
"y": {
"field": "optional",
"type": "quantitative"
}
},
"height": 20,
"mark": {
"type": "bar"
},
"title": "pandas.DataFrame"
}
]
} |
@ChristopherDavisUCI, can you extend this PR so that the experimental Polars support also covers the following two functions? https://github.com/altair-viz/altair/blob/45bbbb7398e68e6c696d3af6cbfcb16addb6c803/altair/utils/data.py#L189-L190 |
I just saw the recent merges of apache/arrow#14804 and pola-rs/polars#6581. Based on this I could get this to work:
import pyarrow as pa
import pyarrow.interchange as pi
import polars as pl
import pandas as pd
data = {'a': [1, 2, 3], 'b': [4, 5, 6]}
pa_table = pa.table(data)
pl_df = pl.DataFrame(data)
pd_df = pd.DataFrame(data)
interchange_pyarrow = pa_table.__dataframe__()
interchange_polars = pl_df.__dataframe__()
interchange_pandas = pd_df.__dataframe__()
interchange_pyarrow2table = pi.from_dataframe(interchange_pyarrow)
interchange_polars2table = pi.from_dataframe(interchange_polars)
interchange_pandas2table = pi.from_dataframe(interchange_pandas)
print(interchange_pyarrow2table.to_pylist() == interchange_polars2table.to_pylist() == interchange_pandas2table.to_pylist())
interchange_pyarrow2table.to_pylist()
For now we can do the serialization to a pylist-style structure on the Python side. In the future we could transfer the buffer within the Vega-Lite specification or HTML template and do the serialization on the JavaScript side. This means we can support the dataframe protocol with a single soft dependency on |
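The Python-side route described in this comment can be sketched as a small helper. This is only an illustration under stated assumptions, not the code in this PR: the names `supports_interchange` and `to_records` are hypothetical, and the sketch assumes pyarrow 11 or newer, where `pyarrow.interchange.from_dataframe` is available.

```python
from typing import Any, Dict, List


def supports_interchange(obj: Any) -> bool:
    """Return True if obj exposes the DataFrame Interchange Protocol."""
    return callable(getattr(obj, "__dataframe__", None))


def to_records(obj: Any) -> List[Dict[str, Any]]:
    """Convert an interchange-capable dataframe to a list of row dicts.

    Hypothetical helper: goes through a pyarrow Table, as in the snippet above.
    """
    if not supports_interchange(obj):
        raise TypeError(f"{type(obj).__name__} does not implement __dataframe__")
    # Imported lazily so that pyarrow stays a soft dependency.
    import pyarrow.interchange as pi

    table = pi.from_dataframe(obj)
    return table.to_pylist()
```

With the objects from the snippet above, `to_records(pl_df)` and `to_records(pd_df)` should give the same rows as `pa_table.to_pylist()`.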
Thanks for all these improvements @mattijn, it looks very promising! I don't think I'll have a chance to look closely before Saturday, but I will go through this over the weekend. |
Updated PR to use the dataframe protocol in combination with
import altair as alt
import pyarrow as pa
import polars as pl
import pandas as pd
import vaex
def chart(source, title):
return (
alt.Chart(source, height=20, title=title)
.mark_bar()
.encode(x="x:O", y="y:Q")
)
data = {"x": [1, 2, 3], "y": [4, 5, 6]}
pa_table = pa.table(data)
df_polars = pl.DataFrame(data)
df_pandas = pd.DataFrame(data)
df_vaex = vaex.from_pandas(df_pandas)
dataframes = {
"pyarrow": pa_table,
"polars": df_polars,
"pandas": df_pandas,
"vaex": df_vaex,
}
alt.hconcat(*[chart(dataframes[df], df) for df in dataframes]) |
This is cool! Could we add a note in the changelog saying that Altair now has basic support for all data frame libraries that support the |
Since @ChristopherDavisUCI would like to go through this over the weekend, I will wait for his approval or suggestions before merging. |
This seems great to me @mattijn, much more ambitious than what I started with! Is pyarrow a dependency for Altair now, or only for using these new data sources? Do you see any downside to that? (When I tried my code above with the new updates, I got an error that I needed to install pyarrow. It worked fine after I installed pyarrow.) Am I right in understanding that pyarrow is a much more lightweight requirement than something like Polars? (I see pandas itself is a dependency of pyarrow.) I trust your and @joelostblom's intuition, so good to merge from my perspective! Is there anything in particular you'd like me to try out? (I think your requests from #2888 (comment) are no longer relevant, because you got them implemented yourself, right?) |
Thanks for the comment @ChristopherDavisUCI! I made some changes to the error messages in the latest commit: e0cda9e.
"Usage of the DataFrame Interchange Protocol requires the package 'pyarrow', but it is not installed."
And if the installed version of
"The installed version of 'pyarrow' does not meet the minimum requirement of version 11.0.0. "
"Please update 'pyarrow' to use the DataFrame Interchange Protocol."
I would be surprised if pandas is a dependency of pyarrow. Where did you see that? I can't see it in here: https://github.com/apache/arrow/tree/main/python (only in the test-requirements). |
To add: if this works out well and we can sanitize through pyarrow tables and do proper type checking of the fields, then eventually we can replace pandas with pyarrow. At that point pyarrow would become a hard dependency and pandas would no longer be a dependency. For now, though, this is just experimental. |
You're right, I misremembered. pandas is listed here, but only as an optional dependency: https://arrow.apache.org/docs/python/install.html Good to merge from my perspective! |
Based on the discussion in #2868, this is an attempt to provide modest support for specifying data as a Polars DataFrame. We do not convert to a pandas DataFrame, so at this point, the encoding type (e.g., "Q", "N", ...) needs to be specified explicitly. (My reasoning is that it would be easier to add type inference later than to take it away.) Example: