Release the spec as PyPI package #73

Open
vnlitvinov opened this issue Feb 7, 2022 · 9 comments

Comments

@vnlitvinov
Contributor

vnlitvinov commented Feb 7, 2022

Hello everyone!

I've been mulling over introducing the Dataframe Exchange protocol in Pandas and Modin, and I think it would be beneficial for every end library implementing the protocol to have the exact same base.

Right now the protocol interface is defined by code, but that code is not "published" as ready-to-use Python source.

I would like to make it a real PyPI package to use it in type hinting and (ideally) mypy type checking and to enable other libraries to do the same.

I propose to publish the package as dataframe-protocol or df-protocol and rename the protocol/ directory to df_protocol, turning it into a real Python package.
That way any library implementing the protocol could just do from df_protocol import exchange and use it for type hints (and for enum values, which are currently embedded in docstrings, which just looks really weird to me).
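To illustrate the enum point, here is a minimal sketch of what importable enum values could look like instead of docstring tables. The member values are assumed to mirror the dtype kinds listed in the protocol docstring and should be checked against the spec; `is_numeric` is a hypothetical helper added only for illustration.

```python
import enum


class DtypeKind(enum.IntEnum):
    """Dtype kinds; values are assumed to mirror the protocol docstring."""

    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


def is_numeric(kind: DtypeKind) -> bool:
    """Hypothetical helper: True for integer and floating-point kinds."""
    return kind in (DtypeKind.INT, DtypeKind.UINT, DtypeKind.FLOAT)


# A raw integer received from a producer maps back to a named member.
kind = DtypeKind(21)
print(kind.name, is_numeric(kind))
```

With a shared enum, a consumer compares against `DtypeKind.STRING` rather than a magic number copied out of a docstring, and type checkers can verify the comparison.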

Am I missing something here? Are there any objections?

I can make the PR with the necessary changes if there is agreement, and can keep the package published on both PyPI and conda-forge (or can hand publishing over to someone else in the consortium).

P.S. Keeping the top-level df_protocol would allow us to add another subpackage for cross-operation API if/when we feel ready for that (so keeping this future-proof).

@rgommers
Member

rgommers commented Feb 7, 2022

Thanks @vnlitvinov

> Am I missing something here? Are there any objections?

I think the main thing is that a runtime dependency is expensive, and I would not imagine Pandas or others would want a new dependency just for some improved type hinting. Nor would I recommend anyone doing that - vendoring seems preferable here. In which case, does it make sense to release a PyPI package - just tagging versions may be better?

@vnlitvinov
Contributor Author

That's a good point, though I personally don't think it is so expensive, given how many packages a library usually depends on already, and taking into account that this particular package would be updated very rarely.

But anyway... I think I will make a PR preparing for the package (probably without setup.cfg), so we can publish that or vendor later on if we decide against publishing.

@vnlitvinov
Contributor Author

We've also started working on a simple, "smoke"-like test suite which should be pluggable anywhere (@Rubtsowa is working on that now).

@rgommers
Member

Related, the array-api-tests is used in test suites of array libraries; it's not on PyPI, but does make releases by tagging with a date-based version scheme: https://github.com/data-apis/array-api-tests/tags

@rgommers
Member

@vnlitvinov given that all that is left in this repo is a single file, dataframe_protocol.py, can we close this? I doubt it's helpful to add packaging files now, and if that changes in the future we can still reconsider.

@astrojuanlu

> I think the main thing is that a runtime dependency is expensive, and I would not imagine Pandas or others would want a new dependency just for some improved type hinting.

Maybe implementors (like pandas) don't need such an additional dependency, but downstream users would benefit from it I think. Both Altair (vega/altair#2888) and Plotly Express (plotly/plotly.py#3387) seem to make use of if hasattr(data, "__dataframe__"), but one could envision dedicated methods that could do

from dataframe_protocol import ImplementsDataFrameProtocol  # an actual PEP 544 protocol and not an ABC

def process(df: ImplementsDataFrameProtocol) -> ...:
    return df.__dataframe__()...

which would definitely help during development.
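As a rough, self-contained sketch of that idea: the protocol class below is defined locally because no such package is published, and it reuses the name `ImplementsDataFrameProtocol` from the snippet above purely for illustration. The `__dataframe__` parameters are assumed to match the interchange spec, and `ToyFrame` is a made-up producer.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ImplementsDataFrameProtocol(Protocol):
    """Structural (PEP 544) type: any object exposing __dataframe__."""

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True) -> Any:
        ...


class ToyFrame:
    """Made-up producer; it does not inherit from the protocol class."""

    def __init__(self, data: dict) -> None:
        self._data = data

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True) -> Any:
        # Stand-in: a real producer would return an interchange object here.
        return self._data


def process(df: ImplementsDataFrameProtocol) -> Any:
    """Accept anything that structurally matches the protocol."""
    return df.__dataframe__()


# isinstance() on a runtime_checkable protocol only checks that the
# method exists, so it behaves like the hasattr() check at runtime.
print(isinstance(ToyFrame({"a": [1, 2]}), ImplementsDataFrameProtocol))
print(isinstance([1, 2], ImplementsDataFrameProtocol))
```

The runtime behaviour is no stronger than `hasattr(data, "__dataframe__")`; the gain is that a static checker such as mypy can flag a call like `process([1, 2])` before the code runs.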

@MarcoGorelli
Contributor

MarcoGorelli commented Aug 31, 2023

Big +1 to publishing

My experience using both the interchange protocol and the dataframe API is that it's not a great user experience to type such unergonomic and unusual names (e.g. select_columns_by_name and friends)

Being able to tab-complete would be a big plus


Regarding dependencies - this definitely wouldn't be a runtime dependency of pandas', but consumers of the api / interchange protocol could use it as a dev dependency
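One way a consumer could keep this a dev-only dependency is a typing-only import, sketched below. The module path `df_protocol.exchange` and the alias `InterchangeFrame` are hypothetical (no such package is published), `num_columns()` is assumed to be the interchange method of that name, and `ToyInterchange` is a made-up stand-in.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Hypothetical package: resolved by type checkers and IDEs only,
    # never imported when the program actually runs.
    from df_protocol.exchange import DataFrame as InterchangeFrame


def column_count(df: "InterchangeFrame") -> int:
    """Count columns of an interchange object via num_columns()."""
    return df.num_columns()


class ToyInterchange:
    """Made-up stand-in exposing only the method used above."""

    def __init__(self, names: list) -> None:
        self._names = list(names)

    def num_columns(self) -> int:
        return len(self._names)


print(column_count(ToyInterchange(["a", "b"])))
```

Because the annotation is a string and the import sits under `TYPE_CHECKING`, the code runs without the package installed, while editors that do have it installed can still offer completion for method names.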

@MarcoGorelli
Contributor

You can now install the dataframe api directly from github:

pip install 'git+https://github.com/data-apis/dataframe-api.git#egg=dataframe_api&subdirectory=spec/API_specification'

Then, for the dataframe api at least, you'll get type hints / autocomplete:

image

@astrojuanlu

xref #278


4 participants