-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add more tests for the dataframe interchange protocol #75
Conversation
@Rubtsowa could you please elaborate on what these tests are supposed to be testing? Going from what I see in the PR, this tests the protocol implementation of some pandas-like dataframe (the suites use pandas specific methods of Is there a plan to make these tests more generic so not only pandas-like dataframes could be tested with them? |
Thanks for the tests @Rubtsowa and for the initial review @dchigarev! @Rubtsowa it would be nice to fill in the PR description a bit, e.g., what are the main things this PR aims to cover, and are these tests written from scratch or taken from an existing implementation (cuDF, Vaex)?
That would be quite nice, and was indeed something that multiple maintainers of Pandas and other libraries have asked for. |
768ca9f
to
46830f4
Compare
@Rubtsowa, please leave comments what is fixed. |
@rgommers please, check PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the ping @Rubtsowa. I resolved all the comments that were addressed. This is looking good overall. I do see some failures though. It's a little tricky to run these tests, since this is now a proper pytest test suite with a conftest.py
but it doesn't apply the needed monkeypatch to attach __dataframe__
to pd.DataFrame
. So what I did to run the tests is:
$ ipython # in the `protocol/` directory
In [1]: run pandas_implementation.py
In [2]: run -m pytest tests/
The result is:
FAILED tests/test_protocol.py::test_dataframe - pandas.core.indexing.IndexingError: Too ...
FAILED tests/test_protocol.py::test_df_get_chunks[10-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_df_get_chunks[12-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_df_get_chunks[12-5] - assert 1 == 5
FAILED tests/test_protocol.py::test_column_get_chunks[10-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_column_get_chunks[12-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_column_get_chunks[12-5] - assert 1 == 5
FAILED tests/test_protocol.py::test_buffer - TypeError: cannot unpack non-iterable metho...
========================= 8 failed, 8 passed, 1 warning in 0.15s ==========================
Do the tests pass for you, and if so are you working from a different version of the protocol implementation for Pandas, or running the tests differently?
@rgommers thanks for your comment. tests/test_protocol.py::test_dataframe - In this test I see problem with function
tests/test_protocol.py::test_df_get_chunks, tests/test_protocol.py::test_column_get_chunks - Pandas do not plan to use more chunks than 1. Because tests/test_protocol.py::test_buffer - In this test was don't correct assignment. I fixed this. |
protocol/tests/test_protocol.py
Outdated
chunks = list(dfX.get_chunks(n_chunks)) | ||
""" | ||
Pandas do not plan to use more chunks than 1. | ||
But if chunking is supported in your implementation, | ||
then it is the place to use the following check | ||
|
||
``assert len(chunks) == n_chunks`` | ||
|
||
""" | ||
assert len(chunks) == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we explicitly specify the number of chunks that we desire to recieve for the .get_chunks()
via n_chunks
parameter. Going by the protocol's spec, if the parameter is specified, the producer must subdivide each chunk before yielding it [1]. So it doesn't mater whether an underlying implementation internally chunks data or not, it still has to return n_chunks
amount of chunks when we explicitly request it.
The reason why @rgommers experienced failures [2] for this test, is that the draft protocol's implementation in pandas_implementation.py
doesn't respect n_chunks
parameter:
dataframe-api/protocol/pandas_implementation.py
Lines 762 to 766 in 81be345
def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable['_PandasDataFrame']: | |
""" | |
Return an iterator yielding the chunks. | |
""" | |
return (self,) |
The latest known pandas implementation for the protocol (pandas-dev/pandas#46141) do respect this parameter and should work properly with the previous version of this test (I haven't tested it for myself)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if a library doesn't support physical chunking (i.e. the column of a dataframe is a continuous block of memory), should that library implement virtual chunking to support get_chunks()?
cc @rgommers, @vnlitvinov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanations @Rubtsowa and @dchigarev. That makes perfect sense.
should that library implement virtual chunking to support get_chunks()?
Yes indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This explanation is right, so the last commit should be reverted I think.
I would like the tests from this PR fit every implementation of the protocol, but not the implementation for pandas only. |
They should indeed. That was not the point of the comment though, I just had the wrong (in-progress) implementation of the protocol for Pandas. We should just link to pandas-dev/pandas#46141, I think that's the right implementation and not the one in this repo. |
72bdae4
to
90a6598
Compare
Then Failed in tests which have @rgommers at the moment they will be present, but with the updated pandas, they will not occur. |
So for idx, truth in enumerate(arr):
val = ctype.from_address(dataBuf.ptr + idx * (bitwidth // 8)).value
assert val == truth, f"Buffer at index {idx} mismatch" I can mull doing something like this, similarly only in scenarios where checking the actually values of a buffer is easy (e.g. on CPU, and only when libraries are returning valid |
Thanks @honno. I'll close this then. Can you migrate that last loose end to a new issue (or PR) on the Thanks a lot @Rubtsowa and @dchigarev. Given that:
Thanks for opening this PR! And also for working on the Pandas prototype (I think). |
Here we add generic tests for the df protocol. These tests can be used for any product that works with dataframes. It is only necessary to change conftest, replace the existing constructor for pandas with the one necessary for the project that will be used.