-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Keep float dtype in merge on int and float column #18352
Conversation
pandas/tests/reshape/test_merge.py
Outdated
B = DataFrame({'Y': float_vals}) | ||
|
||
res = A.merge(B, left_on='X', right_on='Y') | ||
assert is_float_dtype(res['Y'].dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to check the results
the last merge is not what you would expect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my branch:
In [1]: A = pd.DataFrame({'X': [1, 2, 3]})
In [2]: B = pd.DataFrame({'Y': [1.1, 2.5, 3.0]})
In [3]: A.merge(B, left_on='X', right_on='Y')
Out[3]:
X Y
0 3 3.0
In [4]: A.merge(B, left_on='X', right_on='Y').dtypes
Out[4]:
X int64
Y float64
dtype: object
Is this not what we would expect?
And could you give me an example of the problems caused when merging integer and float columns when the float values are not equal to the int representation of those same values? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm that looks ok, in any event compare the expected results (not the dtypes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the issue with comparing floats and integers is almost always wrong, IOW the user needs to be aware of this, as its almost always an error (so warning is ok here).
Codecov Report
@@ Coverage Diff @@
## master #18352 +/- ##
==========================================
- Coverage 91.35% 91.34% -0.02%
==========================================
Files 163 163
Lines 49691 49697 +6
==========================================
- Hits 45397 45394 -3
- Misses 4294 4303 +9
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs a sub-section in other enhancements
pandas/core/reshape/merge.py
Outdated
# further if we are object, but we infer to | ||
# the same, then proceed | ||
if is_numeric_dtype(lk) and is_numeric_dtype(rk): | ||
if lk.dtype.kind == rk.dtype.kind: | ||
continue | ||
|
||
# check whether ints and floats | ||
if is_integer_dtype(rk) and is_float_dtype(lk): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little leary of merging integer & float columns when the floats are not actually == ints. A warning would be good here if
not (lk == lk.astype(rk)).all()
is True (and the reverse for 2nd case)
pandas/core/reshape/merge.py
Outdated
@@ -906,13 +906,20 @@ def _maybe_coerce_merge_keys(self): | |||
continue | |||
|
|||
# if we are numeric, then allow differing | |||
# kinds to proceed, eg. int64 and int8 | |||
# kinds to proceed, eg. int64 and int8, int and float | |||
# further if we are object, but we infer to | |||
# the same, then proceed | |||
if is_numeric_dtype(lk) and is_numeric_dtype(rk): | |||
if lk.dtype.kind == rk.dtype.kind: | |||
continue | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, let's change this to an if/elsif/else
clause (and use pass rather than continue to delineate the cases)
e6469d5
to
eb511a0
Compare
pandas/core/reshape/merge.py
Outdated
continue | ||
|
||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I meant put the 'houston we have a problem' in the else. then everything else is just a pass.
pandas/tests/reshape/test_merge.py
Outdated
B = DataFrame({'Y': float_vals}) | ||
|
||
res = A.merge(B, left_on='X', right_on='Y') | ||
assert is_float_dtype(res['Y'].dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm that looks ok, in any event compare the expected results (not the dtypes)
pandas/tests/reshape/test_merge.py
Outdated
B = DataFrame({'Y': float_vals}) | ||
|
||
res = A.merge(B, left_on='X', right_on='Y') | ||
assert is_float_dtype(res['Y'].dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the issue with comparing floats and integers is almost always wrong, IOW the user needs to be aware of this, as its almost always an error (so warning is ok here).
e464769
to
2b8190d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments. ping on green.
pandas/core/reshape/merge.py
Outdated
'columns where the float values ' | ||
'are not equal to their int ' | ||
'representation', UserWarning) | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't actually need the pass here (as you have the if statement)
pandas/core/reshape/merge.py
Outdated
'columns where the float values ' | ||
'are not equal to their int ' | ||
'representation', UserWarning) | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
if lib.infer_dtype(lk) == lib.infer_dtype(rk): | ||
continue | ||
elif lib.infer_dtype(lk) == lib.infer_dtype(rk): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this ones good!
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -29,6 +29,7 @@ Other Enhancements | |||
- :class:`pandas.io.formats.style.Styler` now has method ``hide_columns()`` to determine whether columns will be hidden in output (:issue:`14194`) | |||
- Improved wording of ``ValueError`` raised in :func:`to_datetime` when ``unit=`` is passed with a non-convertible value (:issue:`14350`) | |||
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`) | |||
- :func:`pandas.DataFrame.merge` no longer casts a ``float`` column to ``object`` when merging on ``int`` and ``float`` columns (:issue:`16572`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to api_breaking
pandas/tests/reshape/test_merge.py
Outdated
# merging on float and int columns | ||
A = DataFrame({'X': int_vals}) | ||
B = DataFrame({'Y': float_vals}) | ||
exp_res = DataFrame(exp_vals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this expected
pandas/tests/reshape/test_merge.py
Outdated
res = A.merge(B, left_on='X', right_on='Y') | ||
assert_frame_equal(res, exp_res) | ||
|
||
res = B.merge(A, left_on='Y', right_on='X') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call these result
pandas/tests/reshape/test_merge.py
Outdated
# equal to their int representation | ||
A = DataFrame({'X': [1, 2, 3]}) | ||
B = DataFrame({'Y': [1.1, 2.5, 3.0]}) | ||
exp_res = DataFrame({'X': [3], 'Y': [3.0]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
2b8190d
to
a3d6fe6
Compare
@jreback thanks for the comments. Green now. |
thanks @reidy-p nice patch and great responsiveness! |
git diff upstream/master -u -- "*.py" | flake8 --diff