Add more tests for the dataframe interchange protocol #75

Rubtsowa · 2022-02-16T20:03:23Z

Here we add generic tests for the df protocol. These tests can be used for any product that works with dataframes. It is only necessary to change conftest, replace the existing constructor for pandas with the one necessary for the project that will be used.

protocol/tests/test_protocol.py

dchigarev · 2022-02-22T09:34:49Z

@Rubtsowa could you please elaborate on what these tests are supposed to be testing? Going from what I see in the PR, this tests the protocol implementation of some pandas-like dataframe (the suites use pandas specific methods of df like df.columns.values, df.categorize, an assumption that string column has 'np.object' type, etc).

Is there a plan to make these tests more generic so not only pandas-like dataframes could be tested with them?

rgommers · 2022-02-23T13:54:21Z

Thanks for the tests @Rubtsowa and for the initial review @dchigarev! @Rubtsowa it would be nice to fill in the PR description a bit, e.g., what are the main things this PR aims to cover, and are these tests written from scratch or taken from an existing implementation (cuDF, Vaex)?

Is there a plan to make these tests more generic so not only pandas-like dataframes could be tested with them?

That would be quite nice, and was indeed something that multiple maintainers of Pandas and other libraries have asked for.

protocol/tests/conftest.py

protocol/tests/test_protocol.py

YarShev · 2022-03-01T10:57:06Z

@Rubtsowa, please leave comments what is fixed.

protocol/tests/test_protocol.py

Rubtsowa · 2022-03-10T12:24:52Z

@rgommers please, check PR

rgommers

Thanks for the ping @Rubtsowa. I resolved all the comments that were addressed. This is looking good overall. I do see some failures though. It's a little tricky to run these tests, since this is now a proper pytest test suite with a conftest.py but it doesn't apply the needed monkeypatch to attach __dataframe__ to pd.DataFrame. So what I did to run the tests is:

$ ipython  # in the `protocol/` directory

In [1]: run pandas_implementation.py

In [2]: run -m pytest tests/

The result is:

FAILED tests/test_protocol.py::test_dataframe - pandas.core.indexing.IndexingError: Too ...
FAILED tests/test_protocol.py::test_df_get_chunks[10-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_df_get_chunks[12-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_df_get_chunks[12-5] - assert 1 == 5
FAILED tests/test_protocol.py::test_column_get_chunks[10-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_column_get_chunks[12-3] - assert 1 == 3
FAILED tests/test_protocol.py::test_column_get_chunks[12-5] - assert 1 == 5
FAILED tests/test_protocol.py::test_buffer - TypeError: cannot unpack non-iterable metho...
========================= 8 failed, 8 passed, 1 warning in 0.15s ==========================

Do the tests pass for you, and if so are you working from a different version of the protocol implementation for Pandas, or running the tests differently?

Rubtsowa · 2022-03-21T07:17:57Z

@rgommers thanks for your comment.

tests/test_protocol.py::test_dataframe - In this test I see problem with function select_columns. I have error:

def select_columns(self, indices: Sequence[int]) -> '_PandasDataFrame':
>       if not isinstance(indices, collections.Sequence):
E       AttributeError: module 'collections' has no attribute 'Sequence'

tests/test_protocol.py::test_df_get_chunks, tests/test_protocol.py::test_column_get_chunks - Pandas do not plan to use more chunks than 1. Because assert len(chunks) == n_chunks will be don't correct. I fixed this.

tests/test_protocol.py::test_buffer - In this test was don't correct assignment. I fixed this.

dchigarev · 2022-03-21T07:38:10Z

protocol/tests/test_protocol.py

+    chunks = list(dfX.get_chunks(n_chunks))
+    """
+    Pandas do not plan to use more chunks than 1. 
+    But if chunking is supported in your implementation, 
+    then it is the place to use the following check
+
+    ``assert len(chunks) == n_chunks``
+
+    """
+    assert len(chunks) == 1


But we explicitly specify the number of chunks that we desire to recieve for the .get_chunks() via n_chunks parameter. Going by the protocol's spec, if the parameter is specified, the producer must subdivide each chunk before yielding it [1]. So it doesn't mater whether an underlying implementation internally chunks data or not, it still has to return n_chunks amount of chunks when we explicitly request it.

The reason why @rgommers experienced failures [2] for this test, is that the draft protocol's implementation in pandas_implementation.py doesn't respect n_chunks parameter:

dataframe-api/protocol/pandas_implementation.py

Lines 762 to 766 in 81be345

def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable['_PandasDataFrame']:

"""

Return an iterator yielding the chunks.

"""

return (self,)

The latest known pandas implementation for the protocol (pandas-dev/pandas#46141) do respect this parameter and should work properly with the previous version of this test (I haven't tested it for myself)

I wonder if a library doesn't support physical chunking (i.e. the column of a dataframe is a continuous block of memory), should that library implement virtual chunking to support get_chunks()?

cc @rgommers, @vnlitvinov

Thanks for the explanations @Rubtsowa and @dchigarev. That makes perfect sense.

should that library implement virtual chunking to support get_chunks()?

Yes indeed.

This explanation is right, so the last commit should be reverted I think.

YarShev · 2022-03-21T08:55:40Z

Do the tests pass for you, and if so are you working from a different version of the protocol implementation for Pandas, or running the tests differently?

I would like the tests from this PR fit every implementation of the protocol, but not the implementation for pandas only.

rgommers · 2022-03-21T10:36:54Z

I would like the tests from this PR fit every implementation of the protocol, but not the implementation for pandas only.

They should indeed. That was not the point of the comment though, I just had the wrong (in-progress) implementation of the protocol for Pandas. We should just link to pandas-dev/pandas#46141, I think that's the right implementation and not the one in this repo.

Rubtsowa · 2022-03-21T13:55:41Z

Then Failed in tests which have @rgommers at the moment they will be present, but with the updated pandas, they will not occur.

honno · 2022-07-29T09:06:34Z

So dataframe-interchange-tests now covers everything in this PR, except for asserting the actual data held in buffers e.g.

for idx, truth in enumerate(arr):
    val = ctype.from_address(dataBuf.ptr + idx * (bitwidth // 8)).value
    assert val == truth, f"Buffer at index {idx} mismatch"

I can mull doing something like this, similarly only in scenarios where checking the actually values of a buffer is easy (e.g. on CPU, and only when libraries are returning valid Column.dtype tuples in the first place).

rgommers · 2022-07-29T13:15:48Z

Thanks @honno. I'll close this then. Can you migrate that last loose end to a new issue (or PR) on the dataframe-interchange-tests repo so it doesn't get lost?

Thanks a lot @Rubtsowa and @dchigarev. Given that:

after Remove prototype Pandas implementation from protocol/ #81 the implementation (including all the existing tests) has been removed from this repo,
Pandas now has its own implementation merged with significant improvements compared to the prototype, and
https://github.com/data-apis/dataframe-interchange-tests covers protocol testing comprehensively now,
I will close this PR.

Thanks for opening this PR! And also for working on the Pandas prototype (I think).

add tests

0c70d7c

dchigarev reviewed Feb 22, 2022

View reviewed changes

protocol/tests/test_protocol.py Outdated Show resolved Hide resolved

dchigarev reviewed Feb 22, 2022

View reviewed changes

protocol/tests/test_protocol.py Outdated Show resolved Hide resolved

rgommers added the interchange-protocol label Feb 23, 2022

rgommers changed the title ~~add tests~~ Add more tests for the dataframe interchange protocol Feb 23, 2022

vnlitvinov mentioned this pull request Feb 24, 2022

ENH: Implement DataFrame interchange protocol pandas-dev/pandas#46141

Merged

4 tasks

change tests

46830f4

Rubtsowa force-pushed the add_tests_protocol branch from 768ca9f to 46830f4 Compare February 24, 2022 22:27

change conftest

87231e0

vnlitvinov reviewed Feb 28, 2022

View reviewed changes

protocol/tests/conftest.py Outdated Show resolved Hide resolved

protocol/tests/conftest.py Outdated Show resolved Hide resolved

dchigarev reviewed Feb 28, 2022

View reviewed changes

protocol/tests/test_protocol.py Outdated Show resolved Hide resolved

dchigarev reviewed Feb 28, 2022

View reviewed changes

protocol/tests/test_protocol.py Outdated Show resolved Hide resolved

dchigarev reviewed Feb 28, 2022

View reviewed changes

protocol/tests/test_protocol.py Show resolved Hide resolved

dchigarev reviewed Feb 28, 2022

View reviewed changes

protocol/tests/test_protocol.py Show resolved Hide resolved

change tests & conftest

9c3f352

vnlitvinov reviewed Mar 1, 2022

View reviewed changes

protocol/tests/test_protocol.py Outdated Show resolved Hide resolved

change tests

90a6598

dchigarev mentioned this pull request Mar 8, 2022

FEAT-#4244: Implement dataframe exchange protocol for OmniSci modin-project/modin#4269

Merged

8 tasks

rgommers reviewed Mar 16, 2022

View reviewed changes

dchigarev reviewed Mar 21, 2022

View reviewed changes

Rubtsowa force-pushed the add_tests_protocol branch from 72bdae4 to 90a6598 Compare March 21, 2022 12:20

honno mentioned this pull request Jun 21, 2022

Fold in existing tests? data-apis/dataframe-interchange-tests#3

Closed

data-apis deleted a comment from Rubtsowa Jul 14, 2022

rgommers closed this Jul 29, 2022

honno mentioned this pull request Jul 29, 2022

Test values of buffers data-apis/dataframe-interchange-tests#10

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add more tests for the dataframe interchange protocol #75

Add more tests for the dataframe interchange protocol #75

Rubtsowa commented Feb 16, 2022 •

edited

Loading

dchigarev commented Feb 22, 2022

rgommers commented Feb 23, 2022

YarShev commented Mar 1, 2022

Rubtsowa commented Mar 10, 2022

rgommers left a comment

Rubtsowa commented Mar 21, 2022

dchigarev Mar 21, 2022

YarShev Mar 21, 2022

rgommers Mar 21, 2022

rgommers Mar 21, 2022

YarShev commented Mar 21, 2022

rgommers commented Mar 21, 2022

Rubtsowa commented Mar 21, 2022 •

edited

Loading

honno commented Jul 29, 2022

rgommers commented Jul 29, 2022

	def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable['_PandasDataFrame']:
	"""
	Return an iterator yielding the chunks.
	"""
	return (self,)

Add more tests for the dataframe interchange protocol #75

Add more tests for the dataframe interchange protocol #75

Conversation

Rubtsowa commented Feb 16, 2022 • edited Loading

dchigarev commented Feb 22, 2022

rgommers commented Feb 23, 2022

YarShev commented Mar 1, 2022

Rubtsowa commented Mar 10, 2022

rgommers left a comment

Choose a reason for hiding this comment

Rubtsowa commented Mar 21, 2022

dchigarev Mar 21, 2022

Choose a reason for hiding this comment

YarShev Mar 21, 2022

Choose a reason for hiding this comment

rgommers Mar 21, 2022

Choose a reason for hiding this comment

rgommers Mar 21, 2022

Choose a reason for hiding this comment

YarShev commented Mar 21, 2022

rgommers commented Mar 21, 2022

Rubtsowa commented Mar 21, 2022 • edited Loading

honno commented Jul 29, 2022

rgommers commented Jul 29, 2022

Rubtsowa commented Feb 16, 2022 •

edited

Loading

Rubtsowa commented Mar 21, 2022 •

edited

Loading