pd.Grouper support? #364

naught101 · 2015-03-09T06:25:14Z

In pandas, you can pass a pandas.TimeGrouper object to a .groupby() call, which lets you group by month, year, day, or other time periods without manually creating a new index with those values first. It would be great if you could do this with xray, but at the moment, I get:

/usr/local/lib/python3.4/dist-packages/xray/core/groupby.py in __init__(self, obj, group, squeeze)
     66             if the dimension is squeezed out.
     67         """
---> 68         if group.ndim != 1:
     69             # TODO: remove this limitation?
     70             raise ValueError('`group` must be 1 dimensional')

AttributeError: 'TimeGrouper' object has no attribute 'ndim'

Not sure how this will work though, because pandas.TimeGrouper doesn't appear to work with multi-index dataframes yet anyway, so maybe there needs to be a feature request over there too, or maybe it's better to implement something from scratch...
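For reference, here is a minimal sketch of the pandas behaviour described above. It assumes a recent pandas, where pd.Grouper(freq=...) takes the place of pd.TimeGrouper; the frame, the column name "value" and the month-start frequency "MS" are made up for illustration.

```python
import numpy as np
import pandas as pd

# 90 daily samples covering January to March 2015 (hypothetical data).
idx = pd.date_range("2015-01-01", periods=90, freq="D")
df = pd.DataFrame({"value": np.arange(90.0)}, index=idx)

# Group directly on the DatetimeIndex by calendar month ("MS" = month start),
# without building a separate month column or index first.
monthly = df.groupby(pd.Grouper(freq="MS")).mean()
print(monthly)
```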

shoyer · 2015-03-09T06:31:14Z

I wrote a resample function last week based on TimeGrouper. See the dev docs for more details: http://xray.readthedocs.org/en/latest/whats-new.html

This should go out in the 0.4.1 release, which I'd like to get out later this week (everyone likes faster release cycles if they are backwards compatible).

It would be pretty straightforward to create some sort of API that gives direct access to the resulting GroupBy object. I was considering something like Dataset.groupby_time. I'd love some input here if you have ideas.

naught101 · 2015-03-09T06:49:55Z

Looks good to me. I don't know enough to be able to comment on the API question.

shoyer · 2015-03-09T07:01:11Z

Well, I guess the first question is -- are there uses for TimeGrouper that you can't easily do with resample?

I suppose the simplest (no new method) would be to allow passing a dict where the key is the time dimension and the value is the grouper. Something like ds.groupby({'time': pd.TimeGrouper(...)}) or ds.groupby(time=pd.TimeGrouper(...)).

naught101 · 2015-03-09T07:34:49Z

Unfortunately I'm not familiar enough with pd.resample and pd.TimeGrouper to know the difference in what they can do. resample looks like it would cover my use-cases, although the docs are pretty limited, and don't actually specify the format of the rule argument...

One thing that I would like to be able to do that is not covered by resample, and might be covered by TimeGrouper is to group over month only (not month and year), in order to create a plot of mean seasonal cycle (at monthly resolution), or similarly, a daily cycle at hourly resolution. I haven't figured out if I can do that with TimeGrouper yet though.

shoyer · 2015-03-09T08:54:34Z

Indeed, I need to complete the resample docs before doing a release :).

For your other use case, you just want to group by 'time.month' (or 'time.hour'), e.g.,
http://xray.readthedocs.org/en/v0.4/examples/weather-data.html#probability-of-freeze-by-calendar-month
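A minimal sketch of that kind of grouping, assuming the current xarray package (the renamed xray) and a made-up variable called tmin. Grouping on 'time.month' pools every January together, every February together, and so on, regardless of year.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of hypothetical daily data.
time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
ds = xr.Dataset(
    {"tmin": ("time", np.arange(time.size, dtype=float))},
    coords={"time": time},
)

# 'time.month' is a virtual variable derived from the time coordinate;
# the result has a 12-element 'month' dimension instead of 'time'.
monthly_cycle = ds.groupby("time.month").mean("time")
print(monthly_cycle)
```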

naught101 · 2015-03-09T09:46:15Z

Heh, I meant the pandas docs - they don't specify the rule argument format either

time.month and time.hour do exactly what I need. They aren't mentioned in the docs at http://xray.readthedocs.org/en/stable/groupby.html, and I'm not sure how I'd guess that they exist, so perhaps they should be added to that page? It doesn't appear to be something that exists in pandas..

shoyer · 2015-03-09T17:16:14Z

For pandas resample, see here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling

The doc string could definitely use an update there, too -- see pandas-dev/pandas#5023 (I think I'll try to update this, too)
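As a rough illustration of the rule argument, assuming a recent pandas where resample returns an object you then reduce: the rule is a frequency string such as "D" (calendar day) or "360min" (six hours). The series below is made up.

```python
import numpy as np
import pandas as pd

# 48 hourly samples with values 0..47 (hypothetical data).
idx = pd.date_range("2015-03-09", periods=48, freq="60min")
s = pd.Series(np.arange(48.0), index=idx)

# The rule is an offset alias, optionally prefixed by a multiple.
daily_mean = s.resample("D").mean()
six_hourly_max = s.resample("360min").max()
print(daily_mean)
print(six_hourly_max)
```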

For time.month, see this part of the docs:
http://xray.readthedocs.org/en/stable/data-structures.html#dataarray-coordinates

I'm going to consolidate all the time/date functionality into a new documentation page for the next release of xray, since this is kind of all over the place now.

Also, I should probably break up that monolithic page on "Data structures", perhaps into "Basics" and "Advanced" pages.

naught101 · 2015-03-10T01:16:25Z

Ah, cool, thanks for that link, I missed that in the docs.

One thing that would be nice (in both pandas and xray) is a time.timeofday. I can't figure out how to do it with time.hour and time.minute - I need half-hourly resolution averaging. time.time does something in xray, but it seems to never complete, and it doesn't work at all in pandas.

shoyer · 2015-03-10T02:21:37Z

Hmm. ds['time.time'] appears to be broken in xray (it casts to a datetime64 array). I'll make a new issue for that.

However, it should work in pandas -- you can do df['time'].dt.time:

In [13]: t = pd.date_range('2000-01-01', periods=10000, freq='H')

In [14]: t.time
Out[14]:
array([datetime.time(0, 0), datetime.time(1, 0), datetime.time(2, 0), ...,
       datetime.time(13, 0), datetime.time(14, 0), datetime.time(15, 0)], dtype=object)

The simplest way to do timeofday, though, is probably just to calculate ds.coords['timeofday'] = 60 * ds['time.hour'] + ds['time.minute'] (minutes since midnight). Then you can do ds.groupby('timeofday').
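A minimal sketch of that approach with half-hourly data, assuming the current xarray package and its .dt accessor (used here in place of the 'time.hour' string lookup). The variable name SWdown and the values are placeholders; the key is minutes since midnight.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three days of hypothetical half-hourly samples with values 0..143.
time = pd.date_range("2002-01-01", periods=48 * 3, freq="30min")
ds = xr.Dataset(
    {"SWdown": ("time", np.arange(time.size, dtype=float))},
    coords={"time": time},
)

# Integer key along the time dimension: minutes since midnight.
ds.coords["timeofday"] = 60 * ds["time"].dt.hour + ds["time"].dt.minute

# One group per half-hour slot, averaged over all days.
daily_cycle = ds.groupby("timeofday").mean("time")
print(daily_cycle)
```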

naught101 · 2015-03-10T07:51:45Z

Nice.

Ok, I have hit a stumbling block, and this is much more of a support request, so feel free to direct me elsewhere, but since we're on the topic, I want to do something like:

import datetime

start = 2002
n_years = 4
new_data = []
for year in range(start, start + n_years):
    # 366 days in a leap year (the simple year % 4 test is enough for 2002-2005)
    days = 365 if year % 4 else 366
    for d in range(days):
        day_data = mean + annual_cycle.isel(dayofyear=d) + daily_cycle
        # timedelta takes plural keyword arguments: days, hours, minutes
        day_data.coords['time'] = datetime.datetime(year, 1, 1) + datetime.timedelta(days=d, hours=day_data.timeofday.hour, minutes=day_data.timeofday.minute)
        new_data.append(day_data)
xray.concat(new_data)

where mean, annual_cycle, and daily_cycle are overall mean, annual cycle at daily resolution, and daily cycle at 30 minute resolution (the latter two bias corrected by subtracting the mean). I'm trying to make a synthetic dataset 4 years long that only includes the mean, seasonal, and daily cycles, but no other variability.

The assignment of day_data['time'] fails because the day_data.timeofday.hour (and .minute) don't work. These are datetime.times, is there an efficient way of converting these to datetime.timedeltas, without first manually taking them out of the DataArray?
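One possible way to do the conversion, sketched outside of xray with only the standard library and NumPy. The helper name time_to_timedelta is hypothetical, and the values would still have to be pulled out of the DataArray first.

```python
import datetime

import numpy as np


def time_to_timedelta(t):
    """Convert a datetime.time into the timedelta elapsed since midnight."""
    return datetime.timedelta(
        hours=t.hour, minutes=t.minute, seconds=t.second, microseconds=t.microsecond
    )


times_of_day = [datetime.time(0, 30), datetime.time(13, 0)]

# NumPy converts datetime.timedelta objects to timedelta64 on request.
offsets = np.array([time_to_timedelta(t) for t in times_of_day], dtype="timedelta64[s]")

# Adding the offsets to a day start gives full timestamps.
stamps = np.datetime64("2002-01-01") + offsets
print(stamps)
```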

naught101 · 2015-03-10T11:30:10Z

Dunno if this is related to the ds['time.time'] problem, but I tried creating the daily_cycle using a pandas.Timedelta as the index (timeofday), but it also appeared to just hang indefinitely when doing the data.groupby('timeofday').mean('time') call..

naught101 · 2015-03-11T03:00:03Z

same problem with numpy.timedelta64 too.

shoyer · 2015-03-11T03:08:31Z

I don't think the timeofday issue is related to using Timedeltas in the index (and it's certainly not related to the ds['time.time'] issue). Also, it should be exactly equivalent to use timedelta64 or Timedelta -- both should convert to timedelta64 when put in an xray object.

Here's an example that seems to be working properly (except for uselessly displaying timedeltas in nanoseconds):

In [29]: time = pd.date_range('2000-01-01', freq='H', periods=100)

In [30]: daystart = time.to_period(freq='1D').to_datetime()

In [31]: timeofday = time.values - daystart.values

In [32]: ds = xray.Dataset({'data': ('time', range(100))}, {'time': time, 'timeofday': ('time', timeofday)})

In [33]: ds
Out[33]:
<xray.Dataset>
Dimensions:    (time: 100)
Coordinates:
    timeofday  (time) timedelta64[ns] 0 nanoseconds ...
  * time       (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
    data       (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ...

In [34]: ds.groupby('timeofday').mean('time')
Out[34]:
<xray.Dataset>
Dimensions:    (timeofday: 24)
Coordinates:
  * timeofday  (timeofday) timedelta64[ns] 0 nanoseconds ...
Data variables:
    data       (timeofday) float64 48.0 49.0 50.0 51.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 ...

naught101 · 2015-03-11T06:17:10Z

Ok, weird. That example works for me, but even if I take a really short slice of my data set, the same thing won't work:

In [61]:
d = data.sel(time=slice('2002-01-01','2002-01-03'))
d

Out[61]:

<xray.Dataset>
Dimensions:           (time: 143, timeofday: 70128, x: 1, y: 1, z: 1)
Coordinates:
  * x                 (x) >f8 1.0
  * y                 (y) >f8 1.0
  * z                 (z) >f8 1.0
  * time              (time) datetime64[ns] 2002-01-01T00:30:00 ...
  * timeofday         (timeofday) timedelta64[ns] 1800000000000 nanoseconds ...
Data variables:
    SWdown            (time, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 14.58 ...
    Rainf_qc          (time, y, x) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    SWdown_qc         (time, y, x) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    Tair              (time, z, y, x) float64 282.9 282.9 282.7 282.6 282.4 281.7 281.0 ...
    Tair_qc           (time, y, x) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    LWdown            (time, y, x) float64 296.7 297.3 297.3 297.3 297.2 295.9 294.5 ...
    PSurf_qc          (time, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    latitude          (y, x) float64 -35.66
    Wind              (time, z, y, x) float64 2.2 2.188 1.9 2.2 2.5 2.5 2.5 2.25 2.0 2.35 ...
    LWdown_qc         (time, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Rainf             (time, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Qair_qc           (time, y, x) float64 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    longitude         (y, x) float64 148.2
    PSurf             (time, y, x) float64 8.783e+04 8.783e+04 8.782e+04 8.781e+04 ...
    reference_height  (y, x) float64 70.0
    elevation         (y, x) float64 1.2e+03
    Qair              (time, z, y, x) float64 0.00448 0.004608 0.004692 0.004781 ...
    Wind_qc           (time, y, x) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
Attributes:
    Production_time: 2012-09-27 12:44:42
    Production_source: PALS automated netcdf conversion
    Contact: palshelp@gmail.com
    PALS_fluxtower_template_version: 1.0.2
    PALS_dataset_name: TumbaFluxnet
    PALS_dataset_version: 1.4

In [62]:
d.groupby('timeofday').mean('time')

That last command will not complete - it will run for minutes. Not really sure how to debug that behaviour.

Perhaps it's to do with the long/lat/height variables that really should be coordinates (I'm just using the data as it came, but I can clean that, if necessary)

shoyer · 2015-03-11T07:06:57Z

The problem is that you've created a new timeofday dimension that is gigantic and orthogonal to all the other ones. You want timeofday to be along the time dimension.

d.groupby('timeofday').mean('time') is literally doing the exact same calculation 70128 times. We also implicitly assume the coordinates corresponding to dimensions have unique labels, which explains why we aren't grouping 48 times instead.

Also, unlike pandas, xray currently does the core loop for all groupby operations in pure Python, which means that yes, it will be slow when you have a very large number of groups (and it loops again to handle your 15 different variables). Using something like Cython or Numba to speedup groupby operations is on my to-do list, but I've found this to be less of a barrier than you might expect for multi-dimensional datasets -- individual group members tend to include more elements than in DataFrames.

naught101 · 2015-03-11T10:38:05Z

Ah, yep, making the dimension using data.coords['timeofday'] = ('time', [np.timedelta64(60 * int(h) + int(m), 'm') for h,m in zip(data['time.hour'], data['time.minute'])]) works. Thanks for all the help :)
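A vectorised sketch of the same fix, assuming the current xarray package and made-up Tair values. The important part is the ('time', values) tuple, which attaches timeofday along the existing time dimension instead of creating a new one.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three days of hypothetical half-hourly data starting at 00:30.
time = pd.date_range("2002-01-01T00:30", periods=48 * 3, freq="30min")
data = xr.Dataset(
    {"Tair": ("time", np.linspace(280.0, 290.0, time.size))},
    coords={"time": time},
)

# Offset of each timestamp from the start of its day, as timedelta64 values.
timeofday = (time - time.normalize()).values

# Non-index coordinate that lives on the 'time' dimension.
data.coords["timeofday"] = ("time", timeofday)

daily_cycle = data.groupby("timeofday").mean("time")
print(daily_cycle)
```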

saulomeirelles · 2016-06-27T11:45:09Z

This is a very useful functionality. I am wondering if I can specify the time window, for example, like ds.groupby(time=pd.TimeGrouper('12.42H')). Is there a way to do that in xarray?

shoyer · 2016-07-06T23:15:27Z

@saulomeirelles Nope, this hasn't been added yet, beyond what you can do with the current resample interface. Ideally, we would update resample so that it worked like resample in newer versions of pandas, where it creates a groupby object instead of immediately doing the calculation. But we haven't had time to do this yet -- contributions would certainly be welcome! I believe this would solve most of the remaining use cases for TimeGrouper.

saulomeirelles · 2016-07-07T08:54:46Z

Thanks, @shoyer !

Here is an example of how I circumvented the problem:

import datetime as dt
import time

import numpy as np
import pandas as pd
import xray

data = np.random.rand(24*5)
times = pd.date_range('2000-01-01', periods=24*5, freq='H')
foo = xray.DataArray(data, coords=[times], dims=['time'])
foo = foo.to_dataset(dim=foo.dims,name='foo')

T = time.mktime( dt.datetime(1970,1,1,12+1,25,12).timetuple() ) # 12.42 hours
Tint = [ int( time.mktime( t.timetuple() ) / T ) for t in foo.time.values.astype('datetime64[s]').tolist()]
foo2 = xray.DataArray( Tint, coords=foo.time.coords, dims=foo.time.dims)
foo.merge(foo2.to_dataset(name='Tint'), inplace=True)

foo_grp = foo.groupby('Tint')

foo_grp.group.plot.line()

In my case, the dataset is quite large, so it cost a lot of computational time to merge the new variable Tint.
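A possibly cheaper variant of the same idea, sketched with the current xarray package: compute the window index with NumPy timedelta arithmetic and attach it as a coordinate, which avoids the Python loop and the merge. Measuring windows from the first timestamp (rather than from the epoch) is an assumption made here for simplicity.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2000-01-01", periods=24 * 5, freq="60min")
foo = xr.DataArray(np.random.rand(times.size), coords=[times], dims=["time"], name="foo")

# 12.42 hours expressed in whole seconds.
window = np.timedelta64(44712, "s")

# Integer window index for every timestamp, counted from the first sample.
elapsed = times.values - times.values[0]
foo.coords["Tint"] = ("time", elapsed // window)

foo_grp = foo.groupby("Tint")
window_means = foo_grp.mean("time")
print(window_means)
```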

hazbottles · 2017-11-03T00:40:14Z

Hi, being able to pass a pd.TimeGrouper to .groupby() would be really handy. Here is my use-case and workaround at the moment (.resample() doesn't serve my needs because I need to iterate over the groups):

import pandas as pd
import xarray as xr

dates = pd.DatetimeIndex(['2017-01-01 15:00', '2017-01-02 14:00', '2017-01-02 23:00'])
da = xr.DataArray([1, 2, 3], dims=['time'], coords={'time': dates})
time_grouper = pd.TimeGrouper(freq='24h', base=15)

# digging around the source code for xr.DataArray.resample i found this
grouped = xr.core.groupby.DataArrayGroupBy(da, 'time', grouper=time_grouper)

for _, sub_da in grouped:
    print(sub_da)

which prints:

<xarray.DataArray (time: 2)>
array([1, 2])
Coordinates:
  * time     (time) datetime64[ns] 2017-01-01T15:00:00 2017-01-02T14:00:00
<xarray.DataArray (time: 1)>
array([3])
Coordinates:
  * time     (time) datetime64[ns] 2017-01-02T23:00:00

Would it be possible to add a grouper kwarg to .groupby(), e.g.

da.groupby('time', grouper=time_grouper)

shoyer · 2017-11-03T03:13:14Z

Have you tried iterating over a resample object in the v0.10 release candidate? I believe the new resample API supports iteration.
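A minimal sketch of what that iteration might look like, assuming a recent xarray in which resample takes the dimension as a keyword and accepts an offset argument (used here in place of base=15).

```python
import pandas as pd
import xarray as xr

dates = pd.DatetimeIndex(["2017-01-01 15:00", "2017-01-02 14:00", "2017-01-02 23:00"])
da = xr.DataArray([1, 2, 3], dims=["time"], coords={"time": dates})

# 24-hour bins that start at 15:00 instead of midnight.
resampled = da.resample(time="24h", offset="15h")

# The resample object can be iterated like a groupby: (bin label, sub-array).
groups = [(label, sub_da.values.tolist()) for label, sub_da in resampled]
for label, values in groups:
    print(label, values)
```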

lexual · 2017-11-29T02:47:44Z

pd.TimeGrouper is deprecated in latest pandas release, so I imagine this bug should be closed.

shoyer · 2017-11-29T08:51:19Z

Well, the functionality is still there, it's just recommended that you use pd.Grouper.

stale · 2020-12-25T14:49:52Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

dcherian · 2024-08-27T14:04:44Z

We now support Xarray Grouper objects which are equivalent.
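A hedged sketch of what this can look like, assuming an xarray release recent enough to ship the xarray.groupers module and its TimeResampler class. The data and the 15:00 offset are only illustrative.

```python
import pandas as pd
import xarray as xr
from xarray.groupers import TimeResampler

dates = pd.DatetimeIndex(["2017-01-01 15:00", "2017-01-02 14:00", "2017-01-02 23:00"])
da = xr.DataArray([1, 2, 3], dims=["time"], coords={"time": dates})

# Pass a grouper object to groupby, keyed by the dimension to group over.
grouped = da.groupby(time=TimeResampler(freq="24h", offset="15h"))

# Number of samples that fall in each 24-hour window.
sizes = [sub_da.sizes["time"] for _, sub_da in grouped]
print(sizes)
```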

shoyer mentioned this issue Mar 10, 2015

ds['time.time'] is broken #367

Closed

naught101 closed this as completed Mar 11, 2015

shoyer reopened this Jul 6, 2016

shoyer mentioned this issue Feb 14, 2017

GroupBy like API for resample #1269

Closed

dcherian added the topic-groupby label Jan 22, 2019

stale bot added the stale label Dec 25, 2020

dcherian removed the stale label Apr 9, 2022

dcherian changed the title ~~pd.TimeGrouper support?~~ pd.Grouper support? Apr 9, 2022

dcherian closed this as completed Aug 27, 2024
