REF: remove block access in groupby libreduction Series(Bin)Grouper #40199

jorisvandenbossche · 2021-03-03T13:50:05Z

The Series(Bin)Grouper still has some direct block access, which needs to be removed so that it works with both the BlockManager and the ArrayManager.

jorisvandenbossche · 2021-03-03T13:52:47Z

pandas/_libs/reduction.pyx

+            if self.has_block:
+                object.__setattr__(cached_typ._mgr._block, 'values', vslider.buf)
+                object.__setattr__(cached_typ._mgr._block, 'mgr_locs',
+                                   slice(len(vslider.buf)))
+            else:
+                cached_typ._mgr.arrays[0] = vslider.buf


I could probably replace this full if/else block with a single cached_typ._mgr.set_values(vslider.buf) (if I add the setting of mgr_locs to SingleBlockManager.set_values).

But I assume that, currently, we don't use a plain python attribute setting but object.__setattr__ for performance? Using _mgr.set_values(..) might defeat that purpose?

So the newer commit I pushed does this change, and is thus now using _mgr.set_values.
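As a rough illustration of why a single `set_values` call can replace the if/else, here is a toy sketch in plain Python. It is not the pandas implementation: `ToyBlock`, `ToyBlockManager`, `ToyArrayManager` and `move_window` are hypothetical stand-ins, and only the method name `set_values` is taken from the comment above.

```python
class ToyBlock:
    """Hypothetical stand-in for a block: values plus their placement."""

    def __init__(self, values):
        self.values = values
        self.mgr_locs = slice(len(values))


class ToyBlockManager:
    """Hypothetical single-block manager."""

    def __init__(self, values):
        self._block = ToyBlock(values)

    def set_values(self, values):
        # Update values and placement together so they stay consistent.
        self._block.values = values
        self._block.mgr_locs = slice(len(values))


class ToyArrayManager:
    """Hypothetical single-array manager."""

    def __init__(self, values):
        self.arrays = [values]

    def set_values(self, values):
        self.arrays[0] = values


def move_window(mgr, buf):
    # The caller no longer branches on the manager type.
    mgr.set_values(buf)


block_mgr = ToyBlockManager([1, 2, 3])
array_mgr = ToyArrayManager([1, 2, 3])
for mgr in (block_mgr, array_mgr):
    move_window(mgr, [4, 5])
```

The trade-off raised in the comment still applies: the method call adds one level of indirection compared with setting the attribute directly, which is why it was worth timing.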

I did some timings with a specific function that uses the SeriesGrouper with many labels + a cheap dummy function (adapted from the same benchmark case as I have been using for the other groupby PRs: #40178 (comment)):

ncols = 1000
N = 1000
data = np.random.randn(N, ncols)
labels = np.random.randint(0, 100, size=N)
df = pd.DataFrame(data)
%timeit df.groupby(labels)[0].agg(lambda x: 1)

And repeating this several times switching back and forth between this version and master, I don't see any difference. A representative timing was:

In [15]: %timeit df.groupby(labels)[0].agg(lambda x: 1)
1.18 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- master
1.17 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- PR
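The end-to-end timing above could be complemented by looking at the attribute-setting step in isolation. The following is only a sketch using the standard-library `timeit` module: `Holder` and `time_setters` are hypothetical names, the code is pure Python rather than Cython, and the numbers it prints depend on the machine, so it says nothing definitive about `reduction.pyx` itself.

```python
import timeit


class Holder:
    """Hypothetical object with one attribute and a setter method."""

    def __init__(self):
        self.values = None

    def set_values(self, values):
        self.values = values


def time_setters(number=100_000, repeat=5):
    """Return the best per-call time in seconds for each way of setting."""
    holder = Holder()
    buf = [0.0] * 10

    def via_object_setattr():
        object.__setattr__(holder, "values", buf)

    def via_plain_attribute():
        holder.values = buf

    def via_method():
        holder.set_values(buf)

    candidates = {
        "object.__setattr__": via_object_setattr,
        "plain attribute": via_plain_attribute,
        "set_values method": via_method,
    }
    # Take the minimum over several repeats to reduce noise.
    return {
        name: min(timeit.repeat(func, number=number, repeat=repeat)) / number
        for name, func in candidates.items()
    }


if __name__ == "__main__":
    for name, seconds in time_setters().items():
        print(f"{name}: {seconds * 1e9:.0f} ns per call")
```

Differences at this scale are nanoseconds per call, which would be consistent with the groupby timing showing no measurable change.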

jbrockmendel · 2021-03-03T15:28:52Z

pandas/_libs/reduction.pyx

@@ -66,9 +66,7 @@ cdef class _BaseGrouper:
            object.__setattr__(cached_ityp, '_index_data', islider.buf)
            cached_ityp._engine.clear_mapping()
            cached_ityp._cache.clear()  # e.g. inferred_freq must go
-            object.__setattr__(cached_typ._mgr._block, 'values', vslider.buf)
-            object.__setattr__(cached_typ._mgr._block, 'mgr_locs',


might get a small boost by setting _mgr_locs to BlockPlacement(slice(...)) instead of going through the mgr_locs property
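A toy sketch of the suggestion, assuming the `mgr_locs` property setter wraps its input in a placement object on each assignment (an assumption about the pandas internals, not confirmed in this thread). `Placement` and `PlacedBlock` are hypothetical stand-ins for `BlockPlacement` and the block class.

```python
class Placement:
    """Hypothetical stand-in for BlockPlacement: wraps a slice."""

    def __init__(self, indexer):
        if not isinstance(indexer, slice):
            raise TypeError("this toy only supports slices")
        self.indexer = indexer


class PlacedBlock:
    """Hypothetical block exposing its placement through a property."""

    def __init__(self, n):
        self._mgr_locs = Placement(slice(n))

    @property
    def mgr_locs(self):
        return self._mgr_locs

    @mgr_locs.setter
    def mgr_locs(self, new_locs):
        # The setter checks and wraps its input on every assignment.
        if not isinstance(new_locs, Placement):
            new_locs = Placement(new_locs)
        self._mgr_locs = new_locs


block = PlacedBlock(3)

# Goes through the property setter, which wraps the slice.
block.mgr_locs = slice(5)

# Skips the setter: the caller builds the placement and sets the
# private attribute directly, so it must pass the right type itself.
block._mgr_locs = Placement(slice(7))
```

Writing to the private attribute saves the setter's type check, at the cost of the caller being responsible for passing an already-wrapped placement.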

jreback · 2021-03-04T14:25:12Z

typing issue on ci / checks

jorisvandenbossche · 2021-03-04T14:28:37Z

It's the same failure that was occurring on master yesterday. Will merge master to be sure.

[ArrayManager] Fix groupby libreduction Series(Bin)Grouper

2dfeffb

jorisvandenbossche added the Groupby and Internals (Related to non-user accessible pandas implementation) labels Mar 3, 2021

jorisvandenbossche commented Mar 3, 2021

simplify with SingleManager.set_values

adf5e81

jbrockmendel reviewed Mar 3, 2021

set to _mgr_locs

450e800

jorisvandenbossche mentioned this pull request Mar 3, 2021

[ArrayManager] Add SingleArrayManager to back a Series #40152

Merged

jorisvandenbossche changed the title ~~[ArrayManager] Fix groupby libreduction Series(Bin)Grouper~~ REF: remove block access in groupby libreduction Series(Bin)Grouper Mar 3, 2021

jreback added this to the 1.3 milestone Mar 4, 2021

Merge remote-tracking branch 'upstream/master' into am-groupby-libreduction-series

1df3593

jorisvandenbossche merged commit 81114eb into pandas-dev:master Mar 4, 2021

jorisvandenbossche deleted the am-groupby-libreduction-series branch March 4, 2021 16:45
