Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more tests for the dataframe interchange protocol #75

Closed
wants to merge 5 commits into from

Conversation

Rubtsowa
Copy link

@Rubtsowa Rubtsowa commented Feb 16, 2022

Here we add generic tests for the df protocol. These tests can be used for any product that works with dataframes. It is only necessary to change conftest, replace the existing constructor for pandas with the one necessary for the project that will be used.

@dchigarev
Copy link

@Rubtsowa could you please elaborate on what these tests are supposed to be testing? Going from what I see in the PR, this tests the protocol implementation of some pandas-like dataframe (the suites use pandas specific methods of df like df.columns.values, df.categorize, an assumption that string column has 'np.object' type, etc).

Is there a plan to make these tests more generic so not only pandas-like dataframes could be tested with them?

@rgommers rgommers changed the title add tests Add more tests for the dataframe interchange protocol Feb 23, 2022
@rgommers
Copy link
Member

Thanks for the tests @Rubtsowa and for the initial review @dchigarev! @Rubtsowa it would be nice to fill in the PR description a bit, e.g., what are the main things this PR aims to cover, and are these tests written from scratch or taken from an existing implementation (cuDF, Vaex)?

Is there a plan to make these tests more generic so not only pandas-like dataframes could be tested with them?

That would be quite nice, and was indeed something that multiple maintainers of Pandas and other libraries have asked for.

protocol/tests/conftest.py Outdated Show resolved Hide resolved
protocol/tests/conftest.py Outdated Show resolved Hide resolved
@YarShev
Copy link

YarShev commented Mar 1, 2022

@Rubtsowa, please leave comments what is fixed.

@Rubtsowa
Copy link
Author

@rgommers please, check PR

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping @Rubtsowa. I resolved all the comments that were addressed. This is looking good overall. I do see some failures though. It's a little tricky to run these tests, since this is now a proper pytest test suite with a conftest.py but it doesn't apply the needed monkeypatch to attach __dataframe__ to pd.DataFrame. So what I did to run the tests is:

$ ipython  # in the `protocol/` directory
In [1]: run pandas_implementation.py

In [2]: run -m pytest tests/

The result is:

FAILED tests/test_protocol.py::test_dataframe - pandas.core.indexing.IndexingError: Too ...
FAILED tests/test_protocol.py::test_df_get_chunks[10-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_df_get_chunks[12-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_df_get_chunks[12-5] - assert 1 == 5
FAILED tests/test_protocol.py::test_column_get_chunks[10-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_column_get_chunks[12-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_column_get_chunks[12-5] - assert 1 == 5
FAILED tests/test_protocol.py::test_buffer - TypeError: cannot unpack non-iterable metho...
========================= 8 failed, 8 passed, 1 warning in 0.15s ==========================

Do the tests pass for you, and if so are you working from a different version of the protocol implementation for Pandas, or running the tests differently?

@Rubtsowa
Copy link
Author

@rgommers thanks for your comment.

tests/test_protocol.py::test_dataframe - In this test I see problem with function select_columns. I have error:

def select_columns(self, indices: Sequence[int]) -> '_PandasDataFrame':
>       if not isinstance(indices, collections.Sequence):
E       AttributeError: module 'collections' has no attribute 'Sequence'

tests/test_protocol.py::test_df_get_chunks, tests/test_protocol.py::test_column_get_chunks - Pandas do not plan to use more chunks than 1. Because assert len(chunks) == n_chunks will be don't correct. I fixed this.

tests/test_protocol.py::test_buffer - In this test was don't correct assignment. I fixed this.

Comment on lines 97 to 106
chunks = list(dfX.get_chunks(n_chunks))
"""
Pandas do not plan to use more chunks than 1.
But if chunking is supported in your implementation,
then it is the place to use the following check

``assert len(chunks) == n_chunks``

"""
assert len(chunks) == 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we explicitly specify the number of chunks that we desire to recieve for the .get_chunks() via n_chunks parameter. Going by the protocol's spec, if the parameter is specified, the producer must subdivide each chunk before yielding it [1]. So it doesn't mater whether an underlying implementation internally chunks data or not, it still has to return n_chunks amount of chunks when we explicitly request it.

The reason why @rgommers experienced failures [2] for this test, is that the draft protocol's implementation in pandas_implementation.py doesn't respect n_chunks parameter:

def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable['_PandasDataFrame']:
"""
Return an iterator yielding the chunks.
"""
return (self,)

The latest known pandas implementation for the protocol (pandas-dev/pandas#46141) do respect this parameter and should work properly with the previous version of this test (I haven't tested it for myself)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if a library doesn't support physical chunking (i.e. the column of a dataframe is a continuous block of memory), should that library implement virtual chunking to support get_chunks()?

cc @rgommers, @vnlitvinov

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanations @Rubtsowa and @dchigarev. That makes perfect sense.

should that library implement virtual chunking to support get_chunks()?

Yes indeed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This explanation is right, so the last commit should be reverted I think.

@YarShev
Copy link

YarShev commented Mar 21, 2022

Do the tests pass for you, and if so are you working from a different version of the protocol implementation for Pandas, or running the tests differently?

I would like the tests from this PR fit every implementation of the protocol, but not the implementation for pandas only.

@rgommers
Copy link
Member

I would like the tests from this PR fit every implementation of the protocol, but not the implementation for pandas only.

They should indeed. That was not the point of the comment though, I just had the wrong (in-progress) implementation of the protocol for Pandas. We should just link to pandas-dev/pandas#46141, I think that's the right implementation and not the one in this repo.

@Rubtsowa Rubtsowa force-pushed the add_tests_protocol branch from 72bdae4 to 90a6598 Compare March 21, 2022 12:20
@Rubtsowa
Copy link
Author

Rubtsowa commented Mar 21, 2022

Then Failed in tests which have @rgommers at the moment they will be present, but with the updated pandas, they will not occur.

@data-apis data-apis deleted a comment from Rubtsowa Jul 14, 2022
@honno
Copy link
Member

honno commented Jul 29, 2022

So dataframe-interchange-tests now covers everything in this PR, except for asserting the actual data held in buffers e.g.

for idx, truth in enumerate(arr):
    val = ctype.from_address(dataBuf.ptr + idx * (bitwidth // 8)).value
    assert val == truth, f"Buffer at index {idx} mismatch"

I can mull doing something like this, similarly only in scenarios where checking the actually values of a buffer is easy (e.g. on CPU, and only when libraries are returning valid Column.dtype tuples in the first place).

@rgommers
Copy link
Member

Thanks @honno. I'll close this then. Can you migrate that last loose end to a new issue (or PR) on the dataframe-interchange-tests repo so it doesn't get lost?

Thanks a lot @Rubtsowa and @dchigarev. Given that:

Thanks for opening this PR! And also for working on the Pandas prototype (I think).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants