PERF: Selecting columns from MultiIndex no longer consolidates (pandas 2.0 regression?) #53119

Closed
muttermachine opened this issue May 6, 2023 · 14 comments
Labels: Closing Candidate, Indexing, Performance

Comments

@muttermachine

Comparing vanilla installs of pandas 1.5.3 and 2.0.1 (and also 2.0.0): when selecting columns from a DataFrame whose columns are a MultiIndex constructed via pd.concat, performance is noticeably slower in 2.0.0+ unless consolidation is forced manually.

I was unable to fully trace the code differences between the pandas versions, but it appears that in 1.5.3 the first select performs consolidation, whereas 2.0.0+ never consolidates automatically, which makes every subsequent select much slower.

Parts of my production code follow a pattern of creating DataFrames via concat and then looping over column subsets, so this introduces a ~100x slowdown in places (due to very wide multilevel DataFrames).

import timeit

import numpy as np
import pandas as pd


def _debug(force_consolidate):
    df = _make_column_multilevel_df_via_concat(force_consolidate)
    level_a_cols = df.columns.unique('A')

    print(f'Running once {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))

    print(f'Running once again {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))

    print(f'Running ten times {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=10))


def _make_column_multilevel_df_via_concat(force_consolidate):
    a_values = list(range(16))
    b_values = list(range(50))
    idx = pd.bdate_range('1991-01-01', '2023-12-31', name='date')
    template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx, columns=pd.Index(b_values, name='B'))
    df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])

    if force_consolidate:
        df._consolidate_inplace()

    return df


def select_each_column(df, cols):
    # noinspection PyStatementEffect
    [df[c] for c in cols]


if __name__ == '__main__':
    pd.show_versions()
    _debug(False)
    print()
    _debug(True)

1.5.3

Running once force_consolidate=False
0.027367000002413988
Running once again force_consolidate=False
0.0026708999648690224
Running ten times force_consolidate=False
0.024182399967685342

Running once force_consolidate=True
0.0028768000192940235
Running once again force_consolidate=True
0.0026675001718103886
Running ten times force_consolidate=True
0.024797999998554587

2.0.1

Running once force_consolidate=False
0.1871038000099361
Running once again force_consolidate=False
0.18837619991973042
Running ten times force_consolidate=False
1.8612669999711215

Running once force_consolidate=True
0.002792700193822384
Running once again force_consolidate=True
0.002532100072130561
Running ten times force_consolidate=True
0.02379040000960231

You can see that force_consolidate=True makes no difference for 1.5.3 but does for 2.0.0. Further, comparing "Running once" with "Running once again" in 1.5.3 shows the cost of the consolidation happening where it has not already been forced.

Note that the benchmark above ignores the cost of the consolidation in 2.0.0, but I hope the difference in behaviour between versions is clear.
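
For context, the non-consolidated layout can be seen directly through the (private, version-dependent) block manager; a minimal sketch:

import numpy as np
import pandas as pd

# Same shape of frame as above: 16 single-block sub-frames concatenated along axis=1.
template = pd.DataFrame(np.zeros((10, 50)), columns=pd.Index(range(50), name='B'))
df = pd.concat({a: template for a in range(16)}, axis=1, names=['A'])

print(df._mgr.nblocks)   # 16 -- one block per concatenated frame
df._consolidate_inplace()
print(df._mgr.nblocks)   # 1 -- a single float64 block after consolidation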

Reading the release notes, a lot of work has gone into the block manager, and I can see that other users might not want the previous behaviour, as the cost of consolidation could outweigh the cost of selecting from the DataFrame.

I am thus unsure whether the change in behaviour is expected or not. If it is expected, is there a cleaner way to force consolidation other than calling the undocumented _consolidate/_consolidate_inplace?
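
For the single-dtype case above, the only alternative I can think of is to rebuild the frame from one ndarray (sketched below with a hypothetical helper; this loses per-column dtypes, so it is not a general answer):

def consolidated_copy(df):
    # Rebuilding from a single 2-D ndarray yields a single-block frame (single dtype only).
    return pd.DataFrame(df.to_numpy(), index=df.index, columns=df.columns)

df = consolidated_copy(_make_column_multilevel_df_via_concat(force_consolidate=False))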

I also feel obliged to say thank you for all of the excellent work in this project -- it is greatly appreciated!

Installed Versions 1.5.3
INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.10.9.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19045
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Canada.1252

pandas : 1.5.3
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

2.0.0

INSTALLED VERSIONS
------------------
commit           : 478d340667831908b5b4bf09a2787a11a14560c9
python           : 3.10.9.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19045
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Canada.1252

pandas           : 2.0.0
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 22.3.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None

2.0.1

INSTALLED VERSIONS
------------------
commit           : 37ea63d540fd27274cad6585082c91b1283f963d
python           : 3.10.9.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19045
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Canada.1252

pandas           : 2.0.1
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 22.3.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None
@phofl
Member

phofl commented May 6, 2023

This does not happen in concat; the previous consolidation happened in a self.values call when selecting columns, which has since been removed. I think this was intended, @jbrockmendel?

@jbrockmendel
Member

Getting rid of silent consolidation was definitely intentional. Where does the self.values lookup occur? We could warn on .values if consolidation would improve perf (it would be ugly, but we could even check for repeated .values accesses).

Using CoW gets a 3.5x speedup.

@phofl
Member

phofl commented May 6, 2023

Yes, Copy-on-Write seems like a good solution here.

The self.values call happened under the hood in _getitem_multilevel, but we have since gotten rid of it; we dispatch to iloc now.
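
Concretely, selecting a top-level key from MultiIndex columns is now roughly equivalent to this simplified sketch (the real _getitem_multilevel also drops the selected level from the result's columns):

key = 0
loc = df.columns.get_loc(key)   # slice (or boolean mask) covering that key's 'B' sub-columns
sub = df.iloc[:, loc]           # purely positional take, no consolidation happens here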

@phofl
Member

phofl commented May 6, 2023

That being said: this is a very specialised case, i.e. a single dtype only. If you add a string column, the non-consolidated version is actually faster, but both are considerably slower than the single-dtype case.

I'd recommend using CoW; performance is the same with multiple dtypes as with a single one!
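
For reference, CoW can be enabled globally in 2.0 (the modified benchmark further down uses option_context instead):

import pandas as pd

pd.set_option("mode.copy_on_write", True)
# equivalently:
pd.options.mode.copy_on_write = True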

@phofl phofl added the Indexing and Closing Candidate labels May 6, 2023
@jbrockmendel
Member

I'm guessing this goes through the _reindex_multi path. Two things to try out: 1) get rid of that path altogether, 2) only go through it in already-consolidated cases

@jbrockmendel jbrockmendel added the Performance label May 6, 2023
@muttermachine
Author

Thank you for the quick replies. One thing to note: extending the basic performance test shows worse performance with CoW enabled, unless manual consolidation is triggered.

I am no expert on the internals of pandas, but if there is a useful/reasonable/hacky change I could try out to see whether there is a reasonable approach here, I would be willing to do so.

2.0.1

Running once force_consolidate=False cow=False
0.18940009991638362
Running once again force_consolidate=False cow=False
0.1838280998636037
Running ten times force_consolidate=False cow=False
1.8617813000455499

Running once force_consolidate=False cow=True
0.29346469999291
Running once again force_consolidate=False cow=True
0.29061580006964505
Running ten times force_consolidate=False cow=True
3.1367222000844777

Running once force_consolidate=True cow=False
0.002467299811542034
Running once again force_consolidate=True cow=False
0.002005200134590268
Running ten times force_consolidate=True cow=False
0.020947899902239442

Running once force_consolidate=True cow=True
0.002265599789097905
Running once again force_consolidate=True cow=True
0.002035900019109249
Running ten times force_consolidate=True cow=True
0.01990539999678731

Modified code:

import itertools
import timeit

import numpy as np
import pandas as pd


def _debug(force_consolidate, cow):
    with pd.option_context('mode.copy_on_write', cow):
        df = _make_column_multilevel_df_via_concat(force_consolidate)
        level_a_cols = df.columns.unique('A')

        print(f'Running once {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))

        print(f'Running once again {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))

        print(f'Running ten times {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=10))


def _make_column_multilevel_df_via_concat(force_consolidate):
    a_values = list(range(16))
    b_values = list(range(50))
    idx = pd.bdate_range('1991-01-01', '2023-12-31', name='date')
    template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx, columns=pd.Index(b_values, name='B'))
    df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])

    if force_consolidate:
        df._consolidate_inplace()

    return df


def select_each_column(df, cols):
    # noinspection PyStatementEffect
    [df[c] for c in cols]


if __name__ == '__main__':
    pd.show_versions()
    for (force_consolidate, cow) in itertools.product([False, True], [False, True]):
        _debug(force_consolidate=force_consolidate, cow=cow)
        print()

@phofl
Member

phofl commented May 6, 2023

Yikes, I will look into this.

This is what I am getting on main:

Running once force_consolidate=False cow=False
0.007546624983660877
Running once again force_consolidate=False cow=False
0.004861583933234215
Running ten times force_consolidate=False cow=False
0.05707850004546344

Running once force_consolidate=False cow=True
0.001773166935890913
Running once again force_consolidate=False cow=True
0.0014914589701220393
Running ten times force_consolidate=False cow=True
0.0145230001071468

Running once force_consolidate=True cow=False
0.0015935830306261778
Running once again force_consolidate=True cow=False
0.0013050410198047757
Running ten times force_consolidate=True cow=False
0.013885915977880359

Running once force_consolidate=True cow=True
0.0016619160305708647
Running once again force_consolidate=True cow=True
0.001402582973241806
Running ten times force_consolidate=True cow=True
0.012952166958712041

Results on 2.0.1 are similar to yours. We should check whether we can backport the commit that improves performance here...

@muttermachine
Author

Well, the results on main do indeed look good under CoW! For my specific use case, given that the root change was intentional, I am fine with manually consolidating where I need to for now and spending the time ensuring my codebase can switch to CoW safely.

I do not have a pandas dev env set up, but is there a way I could help track down the commit in question to make a backport easier? I presume git bisect and the like get expensive, needing a full compile each time?

@muttermachine
Author

Ah, I see you've already tracked it down. Thank you again for all the help, the quick replies, and indeed the work on pandas generally!

@phofl
Member

phofl commented May 6, 2023

Thanks for offering. I already found it. Bisects are expensive, yes, but it depends a bit on your hardware.

@phofl
Member

phofl commented May 6, 2023

I'll reopen until we've done the actual backport. Thanks.

@phofl phofl reopened this May 6, 2023
@phofl
Member

phofl commented May 9, 2023

The backport PR has been merged; this will be fixed in 2.0.2 for CoW.

@phofl phofl closed this as completed May 9, 2023
@muttermachine
Author

Much appreciated - thank you very much

@jbrockmendel
Member

Do we need to do anything for the non-CoW case?
