Support DataFrame Interchange Protocol (allow Polars DataFrames) #2888

Merged
merged 11 commits into from
Feb 18, 2023

ChristopherDavisUCI
Contributor

Based on the discussion in #2868, this is an attempt to provide modest support for specifying data as a Polars DataFrame. We do not convert to a pandas DataFrame, so at this point, the encoding type (e.g., "Q", "N", ...) needs to be specified explicitly. (My reasoning is that it would be easier to add type inference later than to take it away.)

Example:

import polars as pl
import altair as alt

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
        "optional": [28, 300, None, 2, -30],
    }
)

alt.Chart(df).mark_bar().encode(
    x="A:O",
    y="optional:Q",
    color="cars:N"
)

[Screenshot of the rendered chart]
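
Because no type inference happens for a Polars DataFrame here, each encoding in the example carries an explicit type code after the colon. As a rough illustration of what those codes stand for, here is a minimal sketch; `expand_shorthand` and `TYPE_CODES` are hypothetical names used only for this note, not Altair's actual shorthand parser, and the mapping is deliberately limited to the four common codes.

```python
# Simplified mapping of shorthand type codes to Vega-Lite type names.
# Hypothetical helper for illustration; not Altair's real parser.
TYPE_CODES = {
    "Q": "quantitative",
    "N": "nominal",
    "O": "ordinal",
    "T": "temporal",
}


def expand_shorthand(shorthand):
    """Split 'field:CODE' into a field name and a full type name."""
    field, sep, code = shorthand.rpartition(":")
    if not sep or code not in TYPE_CODES:
        # Without inference, a missing or unknown code cannot be resolved.
        raise ValueError(f"explicit type code required, got {shorthand!r}")
    return {"field": field, "type": TYPE_CODES[code]}


print(expand_shorthand("A:O"))         # {'field': 'A', 'type': 'ordinal'}
print(expand_shorthand("optional:Q"))  # {'field': 'optional', 'type': 'quantitative'}
```

A bare field name such as "A" raises in this sketch, which mirrors the requirement that the type be given explicitly.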

@mattijn
Contributor

mattijn commented Feb 15, 2023

I'll review this later, it looks good, but I'm not against being a bit more experimental here.

Maybe we can explore using the dataframe protocol, https://data-apis.org/dataframe-protocol/latest/index.html. I know pyarrow, polars and pandas already support this through .__dataframe__().

I tried a few things, see here: https://gist.github.com/mattijn/45752432a65e1018512305d0ca228d40.
I was somehow hoping that the __dataframe__() returns an arrow serialisable object, but I'm not sure if this is the case, see also here: apache/arrow#33986 (comment) and pola-rs/polars#3727 (comment)

A fallback would be to serialise it into an IPC stream / Feather byte array and pass this into the Vega-Lite spec, or into the HTML_TEMPLATE as a var object.
I could not get this part to work, but it should be similar to https://observablehq.com/@vega/vega-lite-and-apache-arrow-no-plugin (more info https://github.com/vega/vega-loader-arrow#browser-use).

@mattijn
Contributor

mattijn commented Feb 15, 2023

I noticed a few issues that I was not able to address using inline review suggestions. You can look at commit 1fbb7c1 to see what I changed.

  • I introduced a check for the dataframe protocol using hasattr(data, "__dataframe__"). For now I then also check whether polars is in the __module__ name, but if we do this right, the protocol check becomes the library-agnostic part and the polars check can be removed.
  • I made sure that all of this happens after the check for pandas.DataFrame, so we can experiment with this without touching the current behaviour for pandas DataFrames.
  • There is currently no sanitization of the polars DataFrame; we write it out directly using .to_dicts(). Up to that point everything is still a dictionary, so there is no need to write it to row-oriented JSON.
  • I extended the check_data_type and limit_rows functions with support for the __dataframe__ protocol.
  • I changed the order in _prepare_data so that this step runs before _consolidate_data, which moves the inline data to the top level under a unique name.
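
The ordering described in the list above (pandas first, then the protocol check) can be sketched as follows. This is a simplified illustration, not the code from the commit: `classify_data` and `FakeFrame` are hypothetical names, and the pandas test is replaced by a module-name stand-in so the sketch runs without pandas installed.

```python
def classify_data(data):
    """Return a label for how `data` would be handled (illustrative only)."""
    # Stand-in for `isinstance(data, pd.DataFrame)`. pandas is checked first
    # so that its existing behaviour is left untouched.
    if type(data).__module__.split(".")[0] == "pandas":
        return "pandas"
    # Library-agnostic branch: any object exposing the interchange entry point.
    if hasattr(data, "__dataframe__"):
        return "interchange"
    if isinstance(data, dict):
        return "dict"
    raise TypeError(f"Unsupported data type: {type(data).__name__}")


class FakeFrame:
    """Hypothetical object that only advertises the interchange protocol."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("only the attribute's presence matters here")


print(classify_data(FakeFrame()))     # interchange
print(classify_data({"values": []}))  # dict
```

Since a pandas DataFrame also exposes __dataframe__, the order of the two checks is what keeps pandas on its existing code path.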

Now the following example:

import polars as pl
import pandas as pd
import altair as alt


df_pl = pl.DataFrame(
    {
        "A": [9, 8, 7, 6, 5],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
        "optional": [28, 300, None, 2, -30],
    }
)

df_pd = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
        "optional": [28, 300, None, 2, -30],
    }
)

c1 = alt.Chart(df_pl, height=20, title='polars.DataFrame').mark_bar().encode(
    x="A:O",
    y="optional:Q",
    color="cars:N"
)

c2 = alt.Chart(df_pd, height=20, title='pandas.DataFrame').mark_bar().encode(
    x="A:O",
    y="optional:Q",
    color="cars:N"
)

print(alt.vconcat(c1, c2).to_json())

returns:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.2.0.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 300
    }
  },
  "datasets": {
    "data-90855424d1d3c3df2b450ef6e4564242": [
      {
        "A": 9,
        "cars": "beetle",
        "optional": 28
      },
      {
        "A": 8,
        "cars": "audi",
        "optional": 300
      },
      {
        "A": 7,
        "cars": "beetle",
        "optional": null
      },
      {
        "A": 6,
        "cars": "beetle",
        "optional": 2
      },
      {
        "A": 5,
        "cars": "beetle",
        "optional": -30
      }
    ],
    "data-faf1e5382ce70f32dc6c22613bf3493d": [
      {
        "A": 1,
        "cars": "beetle",
        "optional": 28.0
      },
      {
        "A": 2,
        "cars": "audi",
        "optional": 300.0
      },
      {
        "A": 3,
        "cars": "beetle",
        "optional": null
      },
      {
        "A": 4,
        "cars": "beetle",
        "optional": 2.0
      },
      {
        "A": 5,
        "cars": "beetle",
        "optional": -30.0
      }
    ]
  },
  "vconcat": [
    {
      "data": {
        "name": "data-90855424d1d3c3df2b450ef6e4564242"
      },
      "encoding": {
        "color": {
          "field": "cars",
          "type": "nominal"
        },
        "x": {
          "field": "A",
          "type": "ordinal"
        },
        "y": {
          "field": "optional",
          "type": "quantitative"
        }
      },
      "height": 20,
      "mark": {
        "type": "bar"
      },
      "title": "polars.DataFrame"
    },
    {
      "data": {
        "name": "data-faf1e5382ce70f32dc6c22613bf3493d"
      },
      "encoding": {
        "color": {
          "field": "cars",
          "type": "nominal"
        },
        "x": {
          "field": "A",
          "type": "ordinal"
        },
        "y": {
          "field": "optional",
          "type": "quantitative"
        }
      },
      "height": 20,
      "mark": {
        "type": "bar"
      },
      "title": "pandas.DataFrame"
    }
  ]
}

@mattijn
Contributor

mattijn commented Feb 16, 2023

I just saw the recent merges of apache/arrow#14804 and pola-rs/polars#6581. Based on this I could get this to work:

import pyarrow as pa
import pyarrow.interchange as pi
import polars as pl
import pandas as pd

data = {'a': [1, 2, 3], 'b': [4, 5, 6]}
pa_table = pa.table(data)
pl_df = pl.DataFrame(data)
pd_df = pd.DataFrame(data)

interchange_pyarrow = pa_table.__dataframe__()
interchange_polars = pl_df.__dataframe__()
interchange_pandas = pd_df.__dataframe__()

interchange_pyarrow2table = pi.from_dataframe(interchange_pyarrow)
interchange_polars2table = pi.from_dataframe(interchange_polars)
interchange_pandas2table = pi.from_dataframe(interchange_pandas)

print(interchange_pyarrow2table.to_pylist() == interchange_polars2table.to_pylist() == interchange_pandas2table.to_pylist())

interchange_pyarrow2table.to_pylist()

returns:

True
[{'a': 1, 'b': 4}, {'a': 2, 'b': 5}, {'a': 3, 'b': 6}]

For now we can do the serialization to a list of row dictionaries (to_pylist style) on the Python side. In the future we could instead transfer the buffer within the Vega-Lite specification or HTML template and do the serialization on the JavaScript side.

This means we can support the dataframe protocol with a single soft dependency on pyarrow>=11.0.0.
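
For readers unfamiliar with the to_pylist output shape, the following standard-library sketch shows the same column-to-row conversion on a plain dictionary. It only illustrates the resulting structure and is not how pyarrow performs the conversion; `columns_to_records` is a hypothetical helper name.

```python
def columns_to_records(columns):
    """Convert {'col': [values...]} into a list of one dict per row."""
    names = list(columns)
    # zip(*...) walks all columns in lockstep, yielding one tuple per row.
    rows = zip(*(columns[name] for name in names))
    return [dict(zip(names, row)) for row in rows]


data = {"a": [1, 2, 3], "b": [4, 5, 6]}
print(columns_to_records(data))
# [{'a': 1, 'b': 4}, {'a': 2, 'b': 5}, {'a': 3, 'b': 6}]
```

This row-oriented list is the form that ends up under "datasets" in the Vega-Lite JSON.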

@ChristopherDavisUCI
Contributor Author

Thanks for all these improvements @mattijn, it looks very promising! I don't think I'll have a chance to look closely before Saturday, but I will go through this over the weekend.

@mattijn
Contributor

mattijn commented Feb 16, 2023

Updated PR to use the dataframe protocol in combination with pyarrow.interchange.

import altair as alt
import pyarrow as pa
import polars as pl
import pandas as pd
import vaex


def chart(source, title):
    return (
        alt.Chart(source, height=20, title=title)
        .mark_bar()
        .encode(x="x:O", y="y:Q")
    )


data = {"x": [1, 2, 3], "y": [4, 5, 6]}
pa_table = pa.table(data)
df_polars = pl.DataFrame(data)
df_pandas = pd.DataFrame(data)
df_vaex = vaex.from_pandas(df_pandas)

dataframes = {
    "pyarrow": pa_table,
    "polars": df_polars,
    "pandas": df_pandas,
    "vaex": df_vaex,
}
alt.hconcat(*[chart(dataframes[df], df) for df in dataframes])

[Image: rendered output of the four concatenated charts]

I think this is good to go.

@mattijn mattijn changed the title Allow Polars DataFrames Support dataframe protocol (allow Polars DataFrames) Feb 16, 2023
@mattijn mattijn changed the title Support dataframe protocol (allow Polars DataFrames) Support DataFrame Interchange Protocol (allow Polars DataFrames) Feb 16, 2023
@joelostblom
Contributor

This is cool! Could we add a note in the changelog saying that Altair now has basic support for all data frame libraries that support the __dataframe__ exchange protocol? I don't know how we would create tests for this without requiring all the df libraries to be in the dev requirements, so maybe let's skip that and just mention that this is rudimentary and might not pick up on all types of data, like ordinal, correctly (if that is true)?

@mattijn
Contributor

mattijn commented Feb 18, 2023

Since @ChristopherDavisUCI would like to go through this over the weekend, I will wait for his approval or suggestions before merging.

@ChristopherDavisUCI
Contributor Author

This seems great to me @mattijn, much more ambitious than what I started with!

Is pyarrow a dependency for Altair now, or only for using these new data sources? Do you see any downside to that? (When I tried my code above with the new updates, I got an error that I needed to install pyarrow. It worked fine after I installed pyarrow.) Am I right in understanding that pyarrow is a much more lightweight requirement than something like Polars? (I see pandas itself is a dependency of pyarrow.)

I trust your and @joelostblom's intuition, so good to merge from my perspective! Is there anything in particular you'd like me to try out? (I think your requests from #2888 (comment) are no longer relevant, because you got them implemented yourself, right?)

@mattijn
Contributor

mattijn commented Feb 18, 2023

Thanks for the comment @ChristopherDavisUCI! I made some changes to the error messages in the latest commit: e0cda9e.
Now, if pyarrow is not installed, it will say:

"Usage of the DataFrame Interchange Protocol requires the package 'pyarrow', but it is not installed."

And if the installed version of pyarrow is too low:

"The installed version of 'pyarrow' does not meet the minimum requirement of version 11.0.0. "
"Please update 'pyarrow' to use the DataFrame Interchange Protocol."

pyarrow is not a hard dependency of altair, but it is required as a soft dependency to use the DataFrame Interchange Protocol.
I see what you mean with your question about pyarrow vs polars. Polars supports reading via the DataFrame Interchange Protocol through pyarrow, so you'll need both if you read the dataframe using polars functionality.
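
A soft-dependency check that produces the two messages quoted above could look roughly like the sketch below. It is an assumption-laden illustration, not the code from commit e0cda9e: the helper names are hypothetical and the version parsing is a deliberately simple stand-in for a proper version parser.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_PYARROW = (11, 0, 0)


def parse_release(version_string):
    """Turn '11.0.0' or '12.0.1.dev3' into a 3-tuple of leading integers."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '11' compares equal to (11, 0, 0).
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def pyarrow_problem(installed):
    """Return an error message for `installed` (a version string or None)."""
    if installed is None:
        return (
            "Usage of the DataFrame Interchange Protocol requires the package "
            "'pyarrow', but it is not installed."
        )
    if parse_release(installed) < MIN_PYARROW:
        return (
            "The installed version of 'pyarrow' does not meet the minimum "
            "requirement of version 11.0.0. "
            "Please update 'pyarrow' to use the DataFrame Interchange Protocol."
        )
    return None


def installed_pyarrow_version():
    """Look up the installed pyarrow version without importing pyarrow."""
    try:
        return version("pyarrow")
    except PackageNotFoundError:
        return None
```

Calling pyarrow_problem(installed_pyarrow_version()) only when a __dataframe__ object is passed keeps pyarrow optional for users who stick to pandas.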

I would be surprised if pandas is a dependency of pyarrow. Where did you see that? I can't see it in here: https://github.com/apache/arrow/tree/main/python (only in the test-requirements).

@mattijn
Contributor

mattijn commented Feb 18, 2023

To add: if this works out well and we can sanitize through pyarrow tables and do proper type checking of the fields, then eventually we can replace pandas with pyarrow. At that point pyarrow would become a hard dependency and pandas would no longer be a dependency. But currently it is just experimental.

@ChristopherDavisUCI
Contributor Author

I would be surprised if pandas is a dependency of pyarrow. Where did you see that?

You're right, I misremembered. pandas is listed here, but only as an optional dependency: https://arrow.apache.org/docs/python/install.html

Good to merge from my perspective!
