Replace with nested dict raises for overlapping keys #27696

Merged
merged 10 commits into from
Aug 27, 2019

charlesdong1991
Member

@charlesdong1991 charlesdong1991 commented Aug 1, 2019

Member

@WillAyd WillAyd left a comment

Having second thoughts about this change - not sure working around hashes is universally appropriate.

Instead of replace should users just be doing an astype(int) if they want to represent True / False as 1 and 0?

@charlesdong1991
Member Author

charlesdong1991 commented Aug 1, 2019

Indeed, there are several ways to represent True/False as 1/0, and astype is one of them. But according to the documentation of replace, {col_name: {replace dict}} can be used to replace values, so it should also work for this boolean case. It currently does not, so I feel it is a bug that should be fixed. I just have not thought of a nice way to do it without working around the hash, and would be very happy to hear a nicer idea! @WillAyd

Member

@jschendel jschendel left a comment

I also have the same thoughts/concerns as @WillAyd. I think the issue in question is an implementation detail that we might have to live with, as I'm not sure there's a clean fix that doesn't involve a substantial rewrite or a lot of workarounds that hurt maintainability. I could be wrong about this, but no clean solutions immediately come to mind.

# GH 27660
df = DataFrame({"col": [False, True, 0, 1]})

result = df.replace({"col": {False: 0, True: 1}})
Member

If you were to switch the replacement to {False: 1, True: 0} then this test will fail as the integers will get incorrectly replaced:

In [2]: df = pd.DataFrame({'col': [False, True, 0, 1]})

In [3]: df.replace({'col': {False: 1, True: 0}})
Out[3]: 
   col
0    1
1    0
2    1
3    0

I think the current version of the test is passing because 0/1 are just getting replaced by themselves.

This is a consequence of how Python handles hashing: 0 and False have the same hash and compare equal with == (the check Python falls back on when hashes match), so they are treated as the same element in set operations and as the same key in dict lookups (likewise for 1 and True).
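
As a minimal illustration of the point above, here is a plain-Python sketch (no pandas involved; the variable names are only for this example) showing that `False`/`0` and `True`/`1` are indistinguishable as dict keys and set members:

```python
# Plain-Python illustration; pandas is not used here.
print(hash(False) == hash(0), False == 0)   # True True
print(hash(True) == hash(1), True == 1)     # True True

# Equal keys with equal hashes collapse into a single dict entry:
# the first key object is kept and the last value wins.
collapsed = {False: "bool", 0: "int"}
print(collapsed)   # {False: 'int'}

# A lookup-based replacement therefore cannot tell 0 from False.
mapping = {False: 1, True: 0}
values = [False, True, 0, 1]
replaced = [mapping.get(v, v) for v in values]
print(replaced)    # [1, 0, 1, 0]

# Sets behave the same way, so an overlap check built on set
# intersection sees {False, True} and {0, 1} as fully overlapping.
overlap = {False, True} & {0, 1}
print(len(overlap))  # 2
```

The `replaced` list mirrors the shape of the DataFrame output shown above: the integer rows are swapped along with the boolean rows.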

Member Author

ehh, you are very right, it's not a good solution @jschendel

# add another check to avoid boolean being regarded
# as binary in python set
if set(keys) & set(values) and set(map(str, keys)) & set(
map(str, values)
Member

I think this might be a little bit too permissive now, as it will allow {0: 1.0, 1: 'a'}, which was previously rejected (might not actually matter but is a change in behavior we should be cognizant of).
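
To make the concern concrete, here is a rough sketch of how the two-part condition from the diff fragment evaluates. It assumes the truncated fragment simply closes its parenthesis after `map(str, values)`; `would_raise` is a hypothetical helper written for this example and is not part of pandas:

```python
def would_raise(mapping):
    """Evaluate the two-part overlap condition from the diff fragment.

    Hypothetical helper for illustration only; not pandas code.
    """
    keys, values = list(mapping.keys()), list(mapping.values())
    # Original check: keys and values share an element (by hash and ==).
    raw_overlap = set(keys) & set(values)
    # Added check: keys and values also share a string representation.
    str_overlap = set(map(str, keys)) & set(map(str, values))
    return bool(raw_overlap and str_overlap)


# Boolean case the change was aimed at: 'False'/'True' differ from '0'/'1'.
print(would_raise({False: 0, True: 1}))     # False
# Case from the review comment: 1 == 1.0, but '1' != '1.0'.
print(would_raise({0: 1.0, 1: "a"}))        # False
# A genuine overlap is still caught.
print(would_raise({"a": "b", "b": "c"}))    # True
```

Under that assumption, `{0: 1.0, 1: 'a'}` passes the combined check even though the original set-intersection check alone would have rejected it.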

Member Author

yeah, I should have thought it through more thoroughly

Member

Can this just not be removed altogether? Not clear on the purpose of it

@WillAyd
Member

WillAyd commented Aug 1, 2019

Though it is interesting that this works currently:

>>> df = pd.DataFrame({"col":[False, True]})
>>> df.replace({False:0, True:1})
   col
0    0
1    1

While the example from the OP doesn't

>>> df.replace({"col": {False:0, True:1}})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pandas/pandas/core/frame.py", line 4210, in replace
    method=method,
  File "/Users/williamayd/clones/pandas/pandas/core/generic.py", line 6645, in replace
    "Replacement not allowed with "
ValueError: Replacement not allowed with overlapping keys and values

@charlesdong1991 any idea how the path without column selection handles this?

@charlesdong1991
Member Author

@WillAyd I also noticed this. It happens because in the case of replace({col_name: {replace dict}}) there is a nested dictionary, so are_mappings on line 6628 of generic.py contains a True. In the case of replace({False:0, True:1}), any(are_mappings) is False and the code jumps directly to line 6657. I am very sorry, I don't know how to copy and paste the code here. Hope my description is clear enough ^^

@WillAyd
Member

WillAyd commented Aug 2, 2019

@charlesdong1991 you can click on a line number in GH and share a permalink to make things easier.

Here's what I think you are referencing

are_mappings = [is_dict_like(v) for v in values]
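
The branching described in this exchange can be sketched roughly as follows. This is a simplified stand-in, not the pandas source: `replace_path` and its return strings are hypothetical, and `isinstance(v, Mapping)` is assumed here as a substitute for pandas' `is_dict_like`:

```python
from collections.abc import Mapping


def replace_path(to_replace):
    """Report which branch a dict argument would take (illustrative only)."""
    # Stand-in for is_dict_like: treat any Mapping as "dict-like".
    are_mappings = [isinstance(v, Mapping) for v in to_replace.values()]
    if any(are_mappings):
        # Nested form {column: {old: new}}: the overlapping keys/values
        # check discussed in this thread is applied per inner dict.
        for inner in to_replace.values():
            if set(inner.keys()) & set(inner.values()):
                return "nested: raises ValueError (overlap)"
        return "nested: ok"
    # Flat form {old: new}: the overlap check is never reached.
    return "flat: no overlap check"


print(replace_path({False: 0, True: 1}))
print(replace_path({"col": {False: 0, True: 1}}))
```

In this sketch only the nested form reaches the overlap check, which matches the inconsistency noted above: the flat dictionary works while the nested one raises.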

@charlesdong1991
Member Author

charlesdong1991 commented Aug 2, 2019 via email

@WillAyd
Member

WillAyd commented Aug 2, 2019

It looks like this behavior was intentionally added in #6429 quite a few years back. I'm not sure why the distinction is made only for nested dictionaries, and maybe it's not valid any more (at least if it works with only a single dictionary). So if you'd like to investigate and put forth a proposal, that would certainly be appreciated.

@charlesdong1991 you always seem to find these tricky PRs ha!

@charlesdong1991
Member Author

charlesdong1991 commented Aug 2, 2019 via email

@WillAyd
Member

WillAyd commented Aug 2, 2019

These contributions are all really good. What happens if you remove the error check?

@charlesdong1991
Member Author

charlesdong1991 commented Aug 2, 2019 via email

@charlesdong1991
Member Author

If the error check is removed, everything works as is. The test file from the original PR has since been moved to pandas/tests/frame/test_replace.py, and those two tests will fail. I think it largely depends on which behavior people want to keep, but I do feel some consistency would be good to have. @WillAyd

@WillAyd
Member

WillAyd commented Aug 23, 2019

@charlesdong1991 can you update to use the consistent approach described above?

@WillAyd WillAyd added this to the 0.25.2 milestone Aug 23, 2019
@charlesdong1991
Member Author

charlesdong1991 commented Aug 23, 2019

Just to double check that I understand correctly: the preferable way is to remove the error check to avoid replace({True: 1, False: 0}) to keep consistency of not supporting such replacement, right? @WillAyd

@WillAyd
Member

WillAyd commented Aug 23, 2019

Right - I don't think we should have that error check since it only applies to nested dictionaries when it works as is for a "simple" dictionary replacement

@charlesdong1991
Member Author

Right - I don't think we should have that error check since it only applies to nested dictionaries when it works as is for a "simple" dictionary replacement

Agreed, thanks for your quick reply! @WillAyd I appreciate it a lot!

@WillAyd WillAyd modified the milestones: 0.25.2, 1.0 Aug 24, 2019
@pep8speaks

pep8speaks commented Aug 24, 2019

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-27 06:54:23 UTC

Member

@WillAyd WillAyd left a comment

Can you add a whatsnew for v1.0.0?

@charlesdong1991
Member Author

ping

sorry again for that unintentional push, now I think this PR is ready for review @WillAyd

Member

@WillAyd WillAyd left a comment

Minor things but generally this looks good. Back over to you @jschendel

@WillAyd WillAyd changed the title from "Fix pandas replace does not work with boolean" to "Replace with nested dict raises for overlapping keys" Aug 25, 2019
@WillAyd WillAyd added the DataFrame (DataFrame data structure) label Aug 25, 2019
Member

@WillAyd WillAyd left a comment

lgtm. @jschendel or whomever else wants to take a look merge if happy

Member

@jschendel jschendel left a comment

lgtm; a couple small comments since there's a merge conflict that needs to be resolved

@charlesdong1991
Member Author

thanks @jschendel @WillAyd

ping

@jschendel jschendel merged commit 041b6b1 into pandas-dev:master Aug 27, 2019
@jschendel
Member

thanks @charlesdong1991

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Labels
DataFrame DataFrame data structure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas replace does not work with booleans
4 participants