ENH: enable pd.cut to handle i8 convertibles #14714

Closed
jreback opened this issue Nov 22, 2016 · 15 comments · Fixed by #14737
Labels
Datetime Datetime data dtype Enhancement Groupby Timedelta Timedelta data type

Comments

@jreback
Contributor

jreback commented Nov 22, 2016

This should work for timedeltas AND datetimes.

It should be straightforward: detect an i8 convertible, turn it into i8, do the cut, then turn the result back into the original dtype.

In [15]: s = pd.Series(pd.to_timedelta(np.random.randint(0,100,size=10),unit='ms')).sort_values()

In [16]: s
Out[16]:
3   00:00:00.005000
5   00:00:00.007000
7   00:00:00.010000
4   00:00:00.017000
9   00:00:00.023000
1   00:00:00.043000
0   00:00:00.045000
6   00:00:00.047000
8   00:00:00.065000
2   00:00:00.090000
dtype: timedelta64[ns]

In [18]: pd.cut(s, 5)
TypeError: unsupported operand type(s) for +: 'Timedelta' and 'float'

# works when converted
In [17]: pd.cut(s.astype('timedelta64[ms]'), 5)
Out[17]:
3    (4.915, 22]
5    (4.915, 22]
7    (4.915, 22]
4    (4.915, 22]
9       (22, 39]
1       (39, 56]
0       (39, 56]
6       (39, 56]
8       (56, 73]
2       (73, 90]
dtype: category
Categories (5, object): [(4.915, 22] < (22, 39] < (39, 56] < (56, 73] < (73, 90]]
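
As a rough illustration of the detect / convert / cut / convert-back idea, here is a minimal sketch applied to datetimes. `cut_i8_codes` is a hypothetical helper name, not a pandas API, and the sketch assumes timezone-naive values with no NaT.

```python
import pandas as pd


def cut_i8_codes(values: pd.Series, n_bins: int):
    """Hypothetical helper (not a pandas API): bin via the int64 view.

    Assumes a timezone-naive datetime64 or timedelta64 Series without NaT.
    Returns the integer bin code of each value and the bin edges converted
    back to the original dtype.
    """
    arr = values.to_numpy()   # numpy datetime64 / timedelta64 array
    i8 = arr.view("i8")       # reinterpret the same data as int64
    codes, i8_edges = pd.cut(i8, bins=n_bins, labels=False, retbins=True)
    # Edges come back as floats; round them and view them in the original dtype.
    edges = i8_edges.round().astype("i8").view(arr.dtype)
    return codes, pd.Index(edges)


dates = pd.Series(
    pd.to_datetime(["2016-11-22", "2016-11-23", "2016-11-25", "2016-11-26"])
)
codes, edges = cut_i8_codes(dates, 2)
print(codes)   # integer bin code per value
print(edges)   # bin edges as datetimes
```

Because the edges are viewed back through `arr.dtype`, the same helper should also accept a timedelta Series. Going through float edges can lose precision for very large int64 values, so treat this as an approximation.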

@jreback jreback added this to the Next Major Release milestone Nov 22, 2016
@jreback jreback changed the title ENH: enable pd.cut to handle timedeltas ENH: enable pd.cut to handle i8 convertibles Nov 22, 2016
@jreback jreback added the Datetime Datetime data dtype label Nov 22, 2016
@aileronajay
Contributor

@jreback in the above example should the return type of the final object be 'timedelta64'? (the datatype of the original input)

@jreback
Contributor Author

jreback commented Nov 23, 2016

yes (it should be original dtype)

@aileronajay
Contributor

@jreback Does it currently return strings? I got the output below by printing the type of each member of the returned category object:
Categories (5, object): [(2.915, 20] < (20, 37] < (37, 54] < (54, 71] < (71, 88]]
(2.915, 20] <type 'str'>
(2.915, 20] <type 'str'>
(37, 54] <type 'str'>
(37, 54] <type 'str'>
(37, 54] <type 'str'>
(37, 54] <type 'str'>
(54, 71] <type 'str'>
(54, 71] <type 'str'>
(71, 88] <type 'str'>
(71, 88] <type 'str'>

@jreback
Contributor Author

jreback commented Nov 23, 2016

yes it returns strings
we don't have an interval type currently

@aileronajay
Contributor

@jreback I am a bit confused now. As part of this enhancement we first need to convert to a dtype which cut can handle (timedelta64[ms]) and then return the type (timedelta64[ns]) from which we originally started. The objects returned will still be strings, but will they be strings composed of the object types that we initially passed to cut (timedelta64[ns])?

@jreback
Contributor Author

jreback commented Nov 23, 2016

yes this is a bit tricky. I think you:

  • convert to i8
  • do the binning
  • construct the labels based on the bins / dtype
  • stringify them
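
The four steps above could be sketched roughly as follows for timedeltas. `cut_timedelta_as_strings` is a hypothetical name used only for illustration, the label format is an assumption, and NaT values are not handled.

```python
import pandas as pd


def cut_timedelta_as_strings(values: pd.Series, n_bins: int) -> pd.Categorical:
    """Hypothetical helper (not a pandas API) following the four steps."""
    # 1. convert to i8 (nanoseconds since zero)
    i8 = values.to_numpy().astype("timedelta64[ns]").view("i8")
    # 2. do the binning on plain integers
    codes, edges = pd.cut(i8, bins=n_bins, labels=False, retbins=True)
    # 3. construct the labels based on the bins / dtype
    tds = pd.to_timedelta(edges.round().astype("i8"), unit="ns")
    # 4. stringify them (right-closed intervals, as pd.cut uses by default)
    labels = [f"({lo}, {hi}]" for lo, hi in zip(tds[:-1], tds[1:])]
    return pd.Categorical.from_codes(codes, categories=labels, ordered=True)


s = pd.Series(pd.to_timedelta([5, 7, 10, 17, 23, 43, 45, 47, 65, 90], unit="ms"))
result = cut_timedelta_as_strings(s, 5)
print(result)
```

The categories are still strings, as in the current output, but each one is built from Timedelta values rather than bare numbers.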

@aileronajay
Contributor

@jreback thanks, this is what i was thinking

@jreback
Contributor Author

jreback commented Nov 23, 2016

IF we had an Interval type then this would be very easy (#8625), e.g. Period is an interval type for datetimes (but not actually implemented that way, and has slightly different semantics).

@aileronajay
Contributor

@jreback is there an error in the way I am making this round trip?

I start with s, which is timedelta64[ns]. I then convert it using the astype conversion used earlier here, and convert back to timedelta64[ns] with another astype conversion. But this round trip does not retain the data: r contains only zeros, with no time information.

import numpy as np
import pandas as pd

s = pd.Series(pd.to_timedelta(np.random.randint(0,100,size=10),unit='ms')).sort_values()
print s
p = s.astype('timedelta64[ms]')
r = p.astype(s.dtype)
print r

python data_conversion.py
3 00:00:00.008000
5 00:00:00.015000
8 00:00:00.031000
1 00:00:00.040000
9 00:00:00.045000
6 00:00:00.046000
2 00:00:00.072000
0 00:00:00.082000
4 00:00:00.091000
7 00:00:00.091000
dtype: timedelta64[ns]
3 00:00:00.000000
5 00:00:00.000000
8 00:00:00.000000
1 00:00:00.000000
9 00:00:00.000000
6 00:00:00.000000
2 00:00:00.000000
0 00:00:00.000000
4 00:00:00.000000
7 00:00:00.000000
dtype: timedelta64[ns]

@jreback
Contributor Author

jreback commented Nov 23, 2016

no, you will always convert to ns and always back from ns. Internally you will do

.values.view('i8')

then pd.to_timedelta(result, unit='ns') to convert back.
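
A small sketch of that round trip, assuming a timedelta64[ns] Series (the explicit `astype` is only there to pin the unit, and the values are made up for illustration):

```python
import pandas as pd

s = pd.Series(pd.to_timedelta([8, 15, 31, 40, 45], unit="ms")).astype("timedelta64[ns]")

# View the underlying nanosecond data as int64 without copying.
i8 = s.values.view("i8")

# Convert the integers back to timedeltas, keeping the original index.
restored = pd.Series(pd.to_timedelta(i8, unit="ns"), index=s.index)

print(i8)
print(restored)
```

Unlike the astype('timedelta64[ms]') round trip, this one never changes the unit, so no information is lost.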

@aileronajay
Contributor

aileronajay commented Nov 24, 2016

@jreback does this code block emulate the change we want to make?

import pandas as pd
import numpy as np
import re

s = pd.Series(pd.to_timedelta(np.random.randint(0,100,size=10),unit='ms')).sort_values()
print s
# convert to milliseconds so that cut can handle the values
p = s.astype('timedelta64[ms]')
r = pd.cut(p,5)
for elem in r:
    # each element is a string label such as '(4.915, 22]'
    k = elem.split(',')
    # strip the brackets, keeping only the numeric part of each edge
    a = re.sub('[^0-9.]','', k[0])
    b = re.sub('[^0-9.]','', k[1])
    print pd.to_timedelta(float(a) , unit='ms'),pd.to_timedelta(float(b) , unit='ms')

output

1 00:00:00.012000
6 00:00:00.025000
7 00:00:00.042000
9 00:00:00.043000
3 00:00:00.057000
0 00:00:00.061000
2 00:00:00.071000
5 00:00:00.083000
8 00:00:00.086000
4 00:00:00.099000
dtype: timedelta64[ns]
0 days 00:00:00.011913 0 days 00:00:00.029400
0 days 00:00:00.011913 0 days 00:00:00.029400
0 days 00:00:00.029400 0 days 00:00:00.046800
0 days 00:00:00.029400 0 days 00:00:00.046800
0 days 00:00:00.046800 0 days 00:00:00.064200
0 days 00:00:00.046800 0 days 00:00:00.064200
0 days 00:00:00.064200 0 days 00:00:00.081600
0 days 00:00:00.081600 0 days 00:00:00.099000
0 days 00:00:00.081600 0 days 00:00:00.099000
0 days 00:00:00.081600 0 days 00:00:00.099000

@aileronajay
Contributor

@jreback should s.astype('timedelta64[ns]') on a series s of pandas.Timedelta objects return a series of numpy timedelta64 objects with time in seconds? Currently s.astype('timedelta64[ns]') returns pandas.Timedelta objects instead of the numpy equivalent.

@jreback
Contributor Author

jreback commented Nov 24, 2016

we always return pandas objects (for timedelta / datetime)

@aileronajay
Contributor

@jreback is it a good idea to use infer_dtype (from pandas.lib) to test whether the object we are calling cut on is of type datetime or timedelta, to decide if we need to do a conversion to timedelta64 or datetime64?

@jreback
Contributor Author

jreback commented Nov 25, 2016

no

use needs_i8_conversion
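
needs_i8_conversion is an internal pandas helper whose import path has moved between versions, so the sketch below uses public functions from `pandas.api.types` as a stand-in for the same kind of dtype-based check. `looks_i8_convertible` is a hypothetical name, and unlike the internal helper this stand-in is not meant to cover period data.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_timedelta64_dtype


def looks_i8_convertible(values) -> bool:
    """Hypothetical stand-in: True for datetime64 / timedelta64 data."""
    dtype = getattr(values, "dtype", None)
    if dtype is None:
        return False
    # Check the dtype itself rather than inferring from the elements.
    return bool(is_datetime64_any_dtype(dtype) or is_timedelta64_dtype(dtype))


print(looks_i8_convertible(pd.Series(pd.to_timedelta([1, 2], unit="ms"))))
print(looks_i8_convertible(pd.Series(pd.to_datetime(["2016-11-22"]))))
print(looks_i8_convertible(pd.Series([1.0, 2.0])))
```

Checking the dtype is cheap and does not need to scan the values, which is presumably why it is preferred here over infer_dtype.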

jreback pushed a commit that referenced this issue Dec 22, 2016
xref #14714, follow-on to #14737

Author: Ajay Saxena <aileronajay@gmail.com>

Closes #14798 from aileronajay/cut_timetype_bin and squashes the following commits:

82bffa1 [Ajay Saxena] added method for time type bins in pd cut and modified tests
ac919cf [Ajay Saxena] added test for datetime bin type
355e569 [Ajay Saxena]  allowing datetime and timedelta datatype in pd cut bins
ShaharBental pushed a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016
xref pandas-dev#14714, follow-on to pandas-dev#14737

Author: Ajay Saxena <aileronajay@gmail.com>

Closes pandas-dev#14798 from aileronajay/cut_timetype_bin and squashes the following commits:

82bffa1 [Ajay Saxena] added method for time type bins in pd cut and modified tests
ac919cf [Ajay Saxena] added test for datetime bin type
355e569 [Ajay Saxena]  allowing datetime and timedelta datatype in pd cut bins