Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict Pandas merge suffixes param type to list/tuple to avoid interchange in right and left suffix order #34208

Merged
merged 10 commits into from
Jun 8, 2020

Conversation

PuneethaPai
Copy link
Contributor

@simonjayhawkins
Copy link
Member

Thanks @PuneethaPai for the PR. can you explain the changes? why not just validate in merge in pandas\core\reshape\merge.py? Can you also explain the benefit of adding list as an acceptable type? A two tuple can be statically type checked, whereas a list requires runtime length validation.

@simonjayhawkins simonjayhawkins added Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels May 17, 2020
@PuneethaPai
Copy link
Contributor Author

HI @simonjayhawkins ,

  • Added list also to acceptable type, as many tests written in test_merge.py, we have used both list and tuple for suffixes.
  • Had to introduce type checking as people should not pass set which is causing change in order, as set returns sorted elements when you unpack.
  • Added length validation error tests as no one had done that in tests. But this is not really part of bug BUG: pandas merge suffixes accepting set can interchange right and left suffix order #33740.
  • I can move the validation in merge method it-self before creating _MergeOperation object. Do let me know.

A two tuple can be statically type checked

did you refer to static checking tools for python like pylint, pyflake, etc?

whereas a list requires runtime length validation.

I think both list/tuple needs length check. I haven't done length checking explicitly. Relying on python to through error. Example:

In [16]: a, b = (1,2,3)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-22772806e702> in <module>
----> 1 a, b = (1,2,3)

ValueError: too many values to unpack (expected 2)

In [17]: a, b = [1, 2, 3]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-c702cf699c19> in <module>
----> 1 a, b = [1, 2, 3]

ValueError: too many values to unpack (expected 2)

In [18]: a, b = [1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-605f28859c6e> in <module>
----> 1 a, b = [1]

ValueError: not enough values to unpack (expected 2, got 1)

I have added tests for these error messages that's all.

@simonjayhawkins
Copy link
Member

  • Added list also to acceptable type, as many tests written in test_merge.py, we have used both list and tuple for suffixes.

the docs clearly state tuple, so I would be ok with changing the tests. need to wait and see what others think.

  • Had to introduce type checking as people should not pass set which is causing change in order, as set returns sorted elements when you unpack.

again docs clearly state tuple, so could be regarded as a user error, but ok with validating parameters and raising exception to avoid unexpected results.

sgtm

  • I can move the validation in merge method it-self before creating _MergeOperation object. Do let me know.

it seems logical to me minimise the changes to validate prior to the code removed, L670 in pandas/core/reshape/merge.py

The other changes do reduce the number of parameters passed around, but is that cleanup required here. My preference is to keep PRs atomic where possible. again wait and see what others think.

A two tuple can be statically type checked

did you refer to static checking tools for python like pylint, pyflake, etc?

thinking mypy ( and end users using mypy)

I think both list/tuple needs length check. I haven't done length checking explicitly. Relying on python to through error. Example:

sgtm.

@simonjayhawkins simonjayhawkins added this to the 1.1 milestone May 17, 2020
pandas/core/reshape/merge.py Outdated Show resolved Hide resolved
pandas/core/reshape/merge.py Outdated Show resolved Hide resolved
pandas/core/reshape/merge.py Outdated Show resolved Hide resolved
pandas/core/reshape/merge.py Outdated Show resolved Hide resolved
@PuneethaPai PuneethaPai force-pushed the merge_suffixes_fix branch from 847dd16 to 466b175 Compare May 19, 2020 15:30
@pep8speaks
Copy link

pep8speaks commented May 19, 2020

Hello @PuneethaPai! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-06 04:03:24 UTC

@PuneethaPai
Copy link
Contributor Author

@simonjayhawkins or @jreback ,

Line 2059:74: E231 missing whitespace after ','

flake8 as you can see above asks me to add white space after ,.

But black doesn't like this.

black --diff pandas/tests/reshape/merge/test_merge.py 
--- pandas/tests/reshape/merge/test_merge.py    2020-05-20 03:05:31 +0000
+++ pandas/tests/reshape/merge/test_merge.py    2020-05-20 03:14:29.050606 +0000
@@ -2054,11 +2054,11 @@
     tm.assert_frame_equal(result, expected)
 
 
 @pytest.mark.parametrize(
     "col1, col2, suffixes",
-    [("a", "a", (None, None)), ("a", "a", ("", None)), (0, 0, (None, "")), ],
+    [("a", "a", (None, None)), ("a", "a", ("", None)), (0, 0, (None, "")),],
 )
 def test_merge_suffix_error(col1, col2, suffixes):
     # issue: 24782
     a = pd.DataFrame({col1: [1, 2, 3]})
     b = pd.DataFrame({col2: [3, 4, 5]})
reformatted pandas/tests/reshape/merge/test_merge.py
All done! ✨ 🍰 ✨
1 file reformatted.

As you can see it will delete white space. Can anyone of you suggest fix for this?
This PR is stuck on this simple issue. 😸

@PuneethaPai PuneethaPai force-pushed the merge_suffixes_fix branch from d67fc79 to 49c5c7a Compare May 22, 2020 08:07
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PuneethaPai for working on this. generally lgtm.

There is a ci failure in the doc build

in doc\source\user_guide\merging.rst

result = pd.merge(left, right, on='k', suffixes=['_l', '_r'])

needs to be changed also

@PuneethaPai PuneethaPai force-pushed the merge_suffixes_fix branch from 49c5c7a to 22cc686 Compare May 22, 2020 14:18
@PuneethaPai
Copy link
Contributor Author

Thanks @simonjayhawkins . I wasn't able to debug the issue as I couldn't run that ci stage in my local. Thanks for identifying bug.

@simonjayhawkins
Copy link
Member

@PuneethaPai I think we are going to need a note in the Backwards incompatible API changes
section of doc\source\whatsnew\v1.1.0.rst, something along the line that the parameter is now checked and an exception raised, whereas before a list or set were accepted and that the set could produce unexpected results and add the issue number.

@PuneethaPai PuneethaPai force-pushed the merge_suffixes_fix branch from 4592838 to b060059 Compare June 6, 2020 04:03
@PuneethaPai
Copy link
Contributor Author

PuneethaPai commented Jun 6, 2020

Hi, @simonjayhawkins and @jreback
I have refreshed the pull request to resolve merge conflict. Hope it's all good now.
Sorry for multiple to-and-fro conversation. Appreciate your support.

Thanks

@jreback jreback merged commit be0b61c into pandas-dev:master Jun 8, 2020
@jreback
Copy link
Contributor

jreback commented Jun 8, 2020

thanks @PuneethaPai nice!

@PuneethaPai PuneethaPai deleted the merge_suffixes_fix branch June 8, 2020 03:15
@jorisvandenbossche
Copy link
Member

This is breaking GeoPandas (where we are using list and not tuple).

the docs clearly state tuple

That's not fully correct. The merge docstring indeed say tuple, but eg merge_asof still says "tuple or list". And in our user guide, we were ourselves using a list.

In general, I am certainly fine with restricting the set of accepted types (eg to avoid the confusing issue with sets, the original report), but this is also a change that could easily be done with a deprecation warning.
And given our versioning policy, if it is easy to do a change with a deprecation, then we should try to do that, IMO.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Jun 15, 2020
Closes pandas-dev#34741, while
retaining the spirit of the spirit of
pandas-dev#34208.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pandas merge suffixes accepting set can interchange right and left suffix order
5 participants