-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: remove block access in groupby libreduction Series(Bin)Grouper #40199
REF: remove block access in groupby libreduction Series(Bin)Grouper #40199
Conversation
pandas/_libs/reduction.pyx
Outdated
if self.has_block: | ||
object.__setattr__(cached_typ._mgr._block, 'values', vslider.buf) | ||
object.__setattr__(cached_typ._mgr._block, 'mgr_locs', | ||
slice(len(vslider.buf))) | ||
else: | ||
cached_typ._mgr.arrays[0] = vslider.buf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could probably replace this full if/else block with a single cached_typ._mgr.set_values(vslider.buf)
(if I add the setting of mgr_locs
to SingleBlockManager.set_values
).
But I assume that, currently, we don't use a plain python attribute setting but object.__setattr__
for performance? Using _mgr.set_values(..)
might defeat that purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the newer commit I pushed does this change, and is thus now using _mgr.set_values
.
I did some timings with a specific function that uses the SeriesGrouper with many labels + a cheap dummy function (adapted from the same benchmark case as I have been using for the other groupby PRs: #40178 (comment)):
ncols = 1000
N = 1000
data = np.random.randn(N, ncols)
labels = np.random.randint(0, 100, size=N)
df = pd.DataFrame(data)
%timeit df.groupby(labels)[0].agg(lambda x: 1)
And repeating this several times switching back and forth between this version and master, I don't see any difference. A representative timing was:
In [15]: %timeit df.groupby(labels)[0].agg(lambda x: 1)
1.18 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <-- master
1.17 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <-- PR
@@ -66,9 +66,7 @@ cdef class _BaseGrouper: | |||
object.__setattr__(cached_ityp, '_index_data', islider.buf) | |||
cached_ityp._engine.clear_mapping() | |||
cached_ityp._cache.clear() # e.g. inferred_freq must go | |||
object.__setattr__(cached_typ._mgr._block, 'values', vslider.buf) | |||
object.__setattr__(cached_typ._mgr._block, 'mgr_locs', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might get a small boost by setting _mgr_locs to BlockPlacement(slice(...)) instead of going through the mgr_locs property
typing issue on ci / checks |
It's the failure that was failing on master yesterday. Will merge master to be sure. |
xref #39146
The
Series(Bin)Grouper
still has some direct block access, which needs to be resolved to let it work for both Block and ArrayManager