PERF: Selecting columns from MultiIndex no longer consolidates (pandas 2.0 regression?) #53119
This does not happen in concat; the previous consolidation happened in a self.values call when selecting columns, which was removed. I think this was intended, @jbrockmendel?
Getting rid of silent consolidation was definitely intentional. Where does the self.values lookup occur? We could warn on .values if a consolidation would improve perf (would be ugly, but we could even check for repeated .values lookups). Using CoW gets a 3.5x speedup.
Yes, Copy-on-Write seems like a good solution here. The self.values call happened under the hood in …
That said, this is a very specialised case, e.g. only one dtype. If you add a string column, the non-consolidating operation is actually faster, but both are considerably slower. I'd recommend using CoW; performance is the same with multiple dtypes as with a single one!
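For anyone following along: rather than the option_context used in the benchmark below, Copy-on-Write can be switched on for the whole session. A minimal sketch (pandas 2.x option; CoW becomes the only behaviour in pandas 3.0):

import pandas as pd

# Opt in to Copy-on-Write globally for this session.
pd.set_option('mode.copy_on_write', True)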
I'm guessing this goes through the _reindex_multi path. Two things to try out: 1) get rid of that path altogether, 2) only go through it in already-consolidated cases
Thank you for the quick replies. One thing to note: extending the basic performance test shows worse performance with CoW enabled unless manual consolidation is triggered. I am no expert on the internals of pandas, but if there is a useful/reasonable/hacky change for me to try out to see whether there is a reasonable approach to take, I would be willing to try it out.
(Collapsed benchmark output for 2.0.1.)
Modified code:

import itertools
import timeit

import numpy as np
import pandas as pd


def _debug(force_consolidate, cow):
    # Time repeated column selection under the given CoW setting.
    with pd.option_context('mode.copy_on_write', cow):
        df = _make_column_multilevel_df_via_concat(force_consolidate)
        level_a_cols = df.columns.unique('A')
        print(f'Running once {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
        print(f'Running once again {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
        print(f'Running ten times {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=10))


def _make_column_multilevel_df_via_concat(force_consolidate):
    # Wide frame: 16 top-level 'A' groups x 50 'B' sub-columns over a long
    # business-day index, built via concat so the blocks start unconsolidated.
    a_values = list(range(16))
    b_values = list(range(50))
    idx = pd.bdate_range('1991-01-01', '2023-12-31', name='date')
    template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx, columns=pd.Index(b_values, name='B'))
    df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])
    if force_consolidate:
        # Private API: merge the per-frame blocks into one block per dtype.
        df._consolidate_inplace()
    return df


def select_each_column(df, cols):
    # noinspection PyStatementEffect
    [df[c] for c in cols]


if __name__ == '__main__':
    pd.show_versions()
    for force_consolidate, cow in itertools.product([False, True], [False, True]):
        _debug(force_consolidate=force_consolidate, cow=cow)
        print()
Yikes, I will look into this. This is what I am getting on main:
(Collapsed benchmark output from main.)
Results on 2.0.1 are similar to yours. We should check whether we can backport the commit that improves performance here...
Well, the results on main do indeed look good under CoW! For my specific use case, given the root change was intentional, I am fine to manually consolidate where I need to for now and to spend the time on ensuring my codebase can switch to CoW safely. I do not have a pandas dev env set up, but is there a way I could help track down the commit in question to make a backport easier? I presume git bisect and the like get expensive, needing the full compile etc.?
Ah, I see you've already tracked it down. Thank you again for all the help, the quick replies, and indeed the work on pandas generally!
Thanks for offering. Already found it. Bisects are expensive, yes, but it depends a bit on your hardware.
I'll reopen till we've done the actual backport. Thx
The backport PR has been merged; this will be fixed in 2.0.2 for CoW.
Much appreciated - thank you very much
Do we need to do anything for the non-CoW case?
Comparing vanilla installs of pandas 1.5.3 vs 2.0.1 (and also vs 2.0.0): selecting columns from a DataFrame whose columns form a MultiIndex constructed via pd.concat is noticeably slower in 2.0.0+ unless manual consolidation is forced.
I was unable to fully follow the code differences between the pandas versions, but it appears that in 1.5.3 the first select performs consolidation, whereas 2.0.0+ never automatically consolidates, which makes each subsequent select much slower.
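One way to see this difference is to count the blocks backing the frame. The sketch below pokes at private internals (_mgr and nblocks are not public API and may change), so treat it as illustrative only:

import numpy as np
import pandas as pd

template = pd.DataFrame(np.zeros((10, 5)), columns=pd.Index(range(5), name='B'))
df = pd.concat({a: template for a in range(16)}, axis=1, names=['A'])

# concat keeps one block per input frame instead of merging them.
print(df._mgr.nblocks)  # 16 on 1.5.3 and 2.0.x alike

df[0]  # select one top-level column group

# 1.5.3 consolidates in place as a side effect of the select; 2.0.x does not.
print(df._mgr.nblocks)  # 1 on 1.5.3, still 16 on 2.0.x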
As parts of my production code follow a pattern of creating DataFrames via concat and then looping over column subsets, this introduces a ~100x slowdown in places (due to very wide multilevel DataFrames).
(Collapsed benchmark output for 1.5.3 and 2.0.1.)
You can see that force_consolidate=True makes no difference for 1.5.3 but does for 2.0.0. Further, "Running once" vs "Running once again" in 1.5.3 shows the cost of the consolidation happening where it has not already been forced.
Note that the benchmark above ignores the cost of the consolidation itself in 2.0.0, but I hope the difference in behaviour between versions is clear.
Reading the release notes, I can see a lot of work has gone into the block manager, and I understand that other users might not want the previous behaviour, as the cost of consolidation could outweigh the cost of selecting from the DataFrame.
I am thus unsure whether the change in behaviour is expected or not. If it is expected, is there a cleaner way to force consolidation other than calling the undocumented _consolidate/_consolidate_inplace?
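For reference, the manual workaround used in the benchmark above, as a minimal sketch (both methods are private and undocumented, so they may change without notice):

import numpy as np
import pandas as pd

df = pd.concat({a: pd.DataFrame(np.zeros((10, 5))) for a in range(16)}, axis=1)

# Merge same-dtype blocks in place, mutating df's internal block manager.
df._consolidate_inplace()

# Or leave df untouched and work with a consolidated copy instead.
consolidated = df._consolidate()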
I also feel obliged to say thank you for all of the excellent work in this project -- it is greatly appreciated!
Installed Versions
(Collapsed pd.show_versions() output for 1.5.3, 2.0.0, and 2.0.1.)