BUG: Keep float dtype in merge on int and float column #18352

Merged: 8 commits merged on Nov 23, 2017

@reidy-p reidy-p commented Nov 18, 2017

B = DataFrame({'Y': float_vals})

res = A.merge(B, left_on='X', right_on='Y')
assert is_float_dtype(res['Y'].dtype)

You need to check the results; the last merge is not what you would expect.

@reidy-p reidy-p Nov 20, 2017

On my branch:

In [1]: A = pd.DataFrame({'X': [1, 2, 3]})
In [2]: B = pd.DataFrame({'Y': [1.1, 2.5, 3.0]})

In [3]: A.merge(B, left_on='X', right_on='Y')
Out[3]: 
   X    Y
0  3  3.0

In [4]: A.merge(B, left_on='X', right_on='Y').dtypes
Out[4]:
X      int64
Y    float64
dtype: object

Is this not what we would expect?

And could you give me an example of the problems caused when merging integer and float columns when the float values are not equal to the int representation of those same values? Thanks.

Hmm, that looks OK. In any event, compare against the expected results (not just the dtypes).

Comparing floats and integers is almost always wrong. IOW the user needs to be aware of this, as it's almost always an error (so a warning is OK here).
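
As a minimal illustration of that concern (plain Python rather than pandas; the variable names are made up for this sketch), a float that is mathematically an integer can still fail an exact-equality match against the integer key:

```python
# A float key built by repeated addition: mathematically 1, but not exactly 1.0
computed = sum([0.1] * 10)

int_keys = [1, 2, 3]
float_keys = [computed, 2.0, 3.0]

# Inner join on exact equality, which is the notion of "match"
# this sketch assumes for a merge on these keys
matches = [(i, f) for i in int_keys for f in float_keys if i == f]

print(computed == 1.0)  # False: rounding error accumulated
print(matches)          # only the keys 2 and 3 pair up
```

The row for key 1 silently drops out of the join, which is the kind of surprise a warning is meant to flag.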

codecov bot commented Nov 18, 2017

Codecov Report

Merging #18352 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18352      +/-   ##
==========================================
- Coverage   91.35%   91.34%   -0.02%     
==========================================
  Files         163      163              
  Lines       49691    49697       +6     
==========================================
- Hits        45397    45394       -3     
- Misses       4294     4303       +9

Flag        Coverage Δ
#multiple   89.14% <100%> (ø) ⬆️
#single     39.67% <0%> (-0.07%) ⬇️

Impacted Files                 Coverage Δ
pandas/core/reshape/merge.py   94.32% <100%> (+0.05%) ⬆️
pandas/io/gbq.py               25% <0%> (-58.34%) ⬇️
pandas/core/frame.py           97.8% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fedc503...a3d6fe6. Read the comment docs.

@jreback jreback left a comment

this needs a sub-section in other enhancements

# further if we are object, but we infer to
# the same, then proceed
if is_numeric_dtype(lk) and is_numeric_dtype(rk):
    if lk.dtype.kind == rk.dtype.kind:
        continue

    # check whether ints and floats
    if is_integer_dtype(rk) and is_float_dtype(lk):

I am a little leery of merging integer & float columns when the floats are not actually == ints. A warning would be good here if

not (lk == lk.astype(rk)).all() is True (and the reverse for 2nd case)
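
A rough sketch of what such a check could look like in isolation. This is not the code from the PR: `warn_if_floats_not_integral` is a hypothetical helper, the warning text is made up for the example, passing `int_key.dtype` to `astype` is an assumption about how the cast would be spelled, and the float key is assumed to contain no missing values (casting NaN to an integer dtype fails).

```python
import warnings

import pandas as pd


def warn_if_floats_not_integral(float_key, int_key):
    """Hypothetical helper: warn when casting the float key to the
    integer key's dtype would change any of its values."""
    # Cast the floats to the integer dtype and compare element-wise
    round_trips = (float_key == float_key.astype(int_key.dtype)).all()
    if not round_trips:
        warnings.warn(
            "float merge key holds values that are not equal to "
            "their int representation",  # illustrative message only
            UserWarning,
        )
        return True
    return False


ints = pd.Series([1, 2, 3])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lossy = warn_if_floats_not_integral(pd.Series([1.1, 2.5, 3.0]), ints)
    clean = warn_if_floats_not_integral(pd.Series([1.0, 2.0, 3.0]), ints)
```

Only the first call warns, since `1.1` and `2.5` do not survive the round trip through the integer dtype.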

@@ -906,13 +906,20 @@ def _maybe_coerce_merge_keys(self):
         continue

     # if we are numeric, then allow differing
-    # kinds to proceed, eg. int64 and int8
+    # kinds to proceed, eg. int64 and int8, int and float
     # further if we are object, but we infer to
     # the same, then proceed
     if is_numeric_dtype(lk) and is_numeric_dtype(rk):
         if lk.dtype.kind == rk.dtype.kind:
             continue

also, let's change this to an if/elif/else clause (and use pass rather than continue to delineate the cases)

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 19, 2017
continue

else:

oh I meant put the 'houston we have a problem' in the else. then everything else is just a pass.
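
The shape being asked for can be sketched in plain Python. This is a simplified stand-in, not the pandas implementation: `check_merge_keys`, the `(name, left_kind, right_kind)` tuples, and the error message are all hypothetical, with the kinds given as NumPy-style dtype kind codes ('i' for int, 'f' for float, 'O' for object).

```python
def check_merge_keys(key_pairs):
    """Hypothetical sketch: accept compatible key pairs with `pass`
    and keep the single failure case in the final `else`."""
    accepted = []
    for name, left_kind, right_kind in key_pairs:
        if left_kind == right_kind:
            pass  # same kind, e.g. int64 and int8
        elif {left_kind, right_kind} == {"i", "f"}:
            pass  # int and float: allowed (a lossy-value warning could go here)
        else:
            # the "we have a problem" case lives only here
            raise ValueError(
                "cannot merge key %r: kinds %r and %r"
                % (name, left_kind, right_kind)
            )
        # Reached by every accepted pair, which is why the branches
        # use `pass` rather than `continue`
        accepted.append(name)
    return accepted


ok = check_merge_keys([("a", "i", "i"), ("b", "i", "f")])
```

With `pass`, each accepted case falls through to any shared code after the chain, and the error path is easy to spot because it is the only branch that does anything.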

@jreback jreback left a comment

minor comments. ping on green.

'columns where the float values '
'are not equal to their int '
'representation', UserWarning)
pass

you don't actually need the pass here (as you have the if statement)

'columns where the float values '
'are not equal to their int '
'representation', UserWarning)
pass

same

-    if lib.infer_dtype(lk) == lib.infer_dtype(rk):
-        continue
+    elif lib.infer_dtype(lk) == lib.infer_dtype(rk):
+        pass

this one's good!

@@ -29,6 +29,7 @@ Other Enhancements
- :class:`pandas.io.formats.style.Styler` now has method ``hide_columns()`` to determine whether columns will be hidden in output (:issue:`14194`)
- Improved wording of ``ValueError`` raised in :func:`to_datetime` when ``unit=`` is passed with a non-convertible value (:issue:`14350`)
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`)
- :func:`pandas.DataFrame.merge` no longer casts a ``float`` column to ``object`` when merging on ``int`` and ``float`` columns (:issue:`16572`)

move to api_breaking

# merging on float and int columns
A = DataFrame({'X': int_vals})
B = DataFrame({'Y': float_vals})
exp_res = DataFrame(exp_vals)

call this expected

res = A.merge(B, left_on='X', right_on='Y')
assert_frame_equal(res, exp_res)

res = B.merge(A, left_on='Y', right_on='X')

call these result

# equal to their int representation
A = DataFrame({'X': [1, 2, 3]})
B = DataFrame({'Y': [1.1, 2.5, 3.0]})
exp_res = DataFrame({'X': [3], 'Y': [3.0]})

same as above
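
With the `expected` / `result` naming applied, the second test might read roughly as follows. This is a sketch, not the test as merged: the function name is hypothetical, it uses the public `pandas.testing.assert_frame_equal`, it silences any `UserWarning` the merge may emit for the non-integral floats, and the column reordering in the reversed-merge assertion is an assumption.

```python
import warnings

import pandas as pd
from pandas.testing import assert_frame_equal


def test_merge_on_int_and_float_keys():
    # float values that are not all equal to their int representation
    A = pd.DataFrame({'X': [1, 2, 3]})
    B = pd.DataFrame({'Y': [1.1, 2.5, 3.0]})
    expected = pd.DataFrame({'X': [3], 'Y': [3.0]})

    with warnings.catch_warnings():
        # the merge may warn about the lossy float keys; ignore it here
        warnings.simplefilter('ignore', UserWarning)

        result = A.merge(B, left_on='X', right_on='Y')
        # compares values and dtypes, not just the dtypes
        assert_frame_equal(result, expected)

        result = B.merge(A, left_on='Y', right_on='X')
        assert_frame_equal(result, expected[['Y', 'X']])
```

Because `assert_frame_equal` checks dtypes by default, this also covers the original `is_float_dtype` assertion on the `Y` column.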

reidy-p commented Nov 23, 2017

@jreback thanks for the comments. Green now.

@jreback jreback added this to the 0.22.0 milestone Nov 23, 2017
@jreback jreback merged commit 4e98a7b into pandas-dev:master Nov 23, 2017

jreback commented Nov 23, 2017

thanks @reidy-p nice patch and great responsiveness!

Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Successfully merging this pull request may close these issues.

merge of int and float column results in column of dtype object
2 participants