Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/BUG: implement __iter__ for IntegerArray so conversions (to_dict, tolist, etc.) return python native types #37377

Closed
wants to merge 27 commits into from

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Oct 24, 2020

Reheating #31328 (as it's been stale for a few months)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good - need to generalize this

also need a whats new note

@@ -357,6 +357,13 @@ def __pos__(self):
def __abs__(self):
return type(self)(np.abs(self._data), self._mask)

def __iter__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually be in our base class so that bool and float will work as well here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

lambda s: list(iter(s))[0],
],
)
def test_conversion_methods_return_type_is_native(func):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paramterize per all Int and UInt types

also add tests for float / bool

may need o move this to the base class

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the tests. Atm they're in:

pandas/tests/arrays/boolean/test_astype.py
pandas/tests/arrays/floating/test_astype.py
pandas/tests/arrays/integer/test_dtypes.py

Do you want them moved?

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Dtype Conversions Unexpected or buggy dtype conversions labels Oct 24, 2020
@jreback
Copy link
Contributor

jreback commented Oct 24, 2020

likely also close

Possibly related to: #27616, #25969, #21256

pls indicate this by adding tests (or comment that we don't close these)

Copy link
Member Author

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likely also close

Possibly related to: #27616, #25969, #21256

pls indicate this by adding tests (or comment that we don't close these)

Looking into these

@@ -357,6 +357,13 @@ def __pos__(self):
def __abs__(self):
return type(self)(np.abs(self._data), self._mask)

def __iter__(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

lambda s: list(iter(s))[0],
],
)
def test_conversion_methods_return_type_is_native(func):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the tests. Atm they're in:

pandas/tests/arrays/boolean/test_astype.py
pandas/tests/arrays/floating/test_astype.py
pandas/tests/arrays/integer/test_dtypes.py

Do you want them moved?

@arw2019
Copy link
Member Author

arw2019 commented Oct 24, 2020

Getting a doctest failure after this patch: is it ok to change the doc here?

_____________ [doctest] pandas.core.arrays.floating.FloatingArray ______________
240 
241     Returns
242     -------
243     FloatingArray
244 
245     Examples
246     --------
247     Create an FloatingArray with :func:`pandas.array`:
248 
249     >>> pd.array([0.1, None, 0.3], dtype=pd.Float32Dtype())
Differences (unified diff with -expected +actual):
    @@ -1,3 +1,3 @@
     <FloatingArray>
    -[0.1, <NA>, 0.3]
    +[0.10000000149011612, <NA>, 0.30000001192092896]
     Length: 3, dtype: Float32

@arw2019
Copy link
Member Author

arw2019 commented Oct 24, 2020

likely also close

Possibly related to: #27616, #25969, #21256

pls indicate this by adding tests (or comment that we don't close these)

#21256 - fixed on current master (I'll add tests)
#27616, #25969 - open & unaffected by this fix (they're bugs in the NumPy-backed data structures). I'll investigate in a separate PR

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this!

As mentioned in #29738, if we see this as a general rule for EAs, we should also document this expected behaviour in the base class. And maybe also try to add a base extension test (although the expected result is probably not that easy to know, we could at least check it's not a numpy scalar)

lambda s: list(iter(s))[0],
],
)
def test_conversion_methods_return_type_is_native(func):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could maybe move this test to the common arrays/masked/ section? We should be able to know the expected python class for each of the data types

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure. I'll make a separate file there (call it test_conversions.py maybe)

@@ -118,3 +118,19 @@ def test_astype_object(dtype):
# check exact element types
assert isinstance(result[0], float)
assert result[1] is pd.NA


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah can you move this to the base class to avoid this duplication

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use the data fixture

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved + rewritten with fixtures (though not 100% sure that what I did is what you asked)

assert isinstance(func(s), int)


def test_conversion_to_dict_oriented_record_returns_native(any_nullable_int_dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments about about moving this then these tests are much simpler

@@ -32,11 +32,13 @@ def test_tolist(self, data):
expected = list(data)
assert result == expected

@pytest.mark.skip(reason="Floating precision issues")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for this skip? Because the conversion to string is different for numpy scalars vs native scalars?

(the skip should also not be here, I think, but only for those dtypes that are failing them)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think so. It's specifically that in the construction of expected

expected = pd.Series([str(x) for x in data[:5]], dtype=str)

we get a lot more decimal places. The astype operation works as before

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what if you remove the str(..) call around x? Because by specifying string dtype, we will already use our internal machinery to handle the conversion, which should do it consistently for numpy vs native numbers, I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried this, problem persists (for both tests)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth opening an issue about?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into this. It's surprisingly hard to get a copy-pastable reproducer but the test failure is reliable - will keep looking

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took a while but I finally got a copy-pastable example:

In [8]: import numpy as np
   ...: import pandas as pd
   ...: 
   ...: data = pd.array([0, 0.1, 0.2, 0.3, 0.4, 0.5], dtype="Float32")
   ...: s = pd.Series(data)
   ...: res = s.astype(str)
   ...: expected = pd.Series([x for x in data], dtype=str)

In [9]: expected
Out[9]: 
0                    0.0
1    0.10000000149011612
2    0.20000000298023224
3    0.30000001192092896
4     0.4000000059604645
5                    0.5
dtype: object

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only the 32 bit that fails for me locally (on a 64-bit system) - I wonder if it's something to do with that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be a problem with .item() with np.float32 scalar:

In [15]: scalar = np.float32(0.1)

In [16]: scalar
Out[16]: 0.1

In [17]: scalar.item()
Out[17]: 0.10000000149011612

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was resolved by the numpy folks at numpy/numpy#17880 (not a bug)

@@ -0,0 +1,60 @@
import pytest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous comment was actually about moving them to a file in tests/arrays/masked/ (so to share the test with the different nullable dtypes).

Now, I am also fine to move it to the base extension tests as you did here, but then we should add this to all EAs. If it's only activated for nullable floating/integer/boolean as is done now, then it should move to tests/arrays/masked/, as it's not a general test for all EAs.

Making it truly generic is probably a bit annoying (you need to know the exact expected type for the scalars), but so maybe the subclasses can override get_native_dtype where needed? Or maybe it can be an attribute on the test class (if it's always a single type for each dtype, then that's simpler)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous comment was actually about moving them to a file in tests/arrays/masked/ (so to share the test with the different nullable dtypes).

Now, I am also fine to move it to the base extension tests as you did here, but then we should add this to all EAs. If it's only activated for nullable floating/integer/boolean as is done now, then it should move to tests/arrays/masked/, as it's not a general test for all EAs.

I did misunderstand but this makes sense

Making it truly generic is probably a bit annoying (you need to know the exact expected type for the scalars), but so maybe the subclasses can override get_native_dtype where needed? Or maybe it can be an attribute on the test class (if it's always a single type for each dtype, then that's simpler)

This sounds like the way to go. I like not having to keep a registry of the mapping between scalars and native types in a single place

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 31, 2020
@arw2019 arw2019 removed the Stale label Dec 31, 2020
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 31, 2021
@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

closing as stale. if you want to continue, pls ping and can re-open.

@jreback jreback closed this Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Stale
Projects
None yet
3 participants