Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_excel with MultiIndex adds a blank line #27772

Open
stefanopassador opened this issue Aug 6, 2019 · 32 comments
Open

to_excel with MultiIndex adds a blank line #27772

stefanopassador opened this issue Aug 6, 2019 · 32 comments
Labels
Bug IO Excel read_excel, to_excel MultiIndex

Comments

@stefanopassador
Copy link

stefanopassador commented Aug 6, 2019

Code Sample

data = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
multi_index = pd.MultiIndex.from_product([['a', 'b'], ['one', 'two', 'three']])
df = pd.DataFrame(data, columns=multi_index, index=[1, 2])

df.to_excel('test_files/test.xlsx')

Problem description

When exporting an xlsx file from a dataframe with MultiIndex, a blank line is added.

Here a screenshot representing the problem:
image

This is happening with pandas 0.25.0.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.2
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.2.3
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@jmcnamara
Copy link
Contributor

jmcnamara commented Aug 6, 2019

Note, there is a typo in the columns argument. The program should be:

import pandas as pd

data = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
multi_index = pd.MultiIndex.from_product([['a', 'b'], ['one', 'two', 'three']])
df = pd.DataFrame(data, columns=multi_index, index=[1, 2])

df.to_excel('test.xlsx')

And I suppose the expected out should be:

aa_image

The blank line is probably there to allow for the index label. For example:

 import pandas as pd

data = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
multi_index = pd.MultiIndex.from_product([['a', 'b'], ['one', 'two', 'three']])
df = pd.DataFrame(data, columns=multi_index, index=[1, 2])

df.to_excel('test.xlsx', index_label="Foo")

Output:

aa_image

This matches the repl output if the the index name is set:

>>> df.index.name = 'foo'
>>> df
      a             b
    one two three one two three
foo
1     1   2     3   4   5     6
2     7   8     9  10  11    12

It is probably a separate question/feature request on whether it should be something like:

aa_image

@stefanopassador
Copy link
Author

I edited my code snippet to fix the typo.

About my problem, how should I remove that line?
Without MultiIndex I don't encounter the same behavior.

@TomAugspurger TomAugspurger added IO Excel read_excel, to_excel MultiIndex labels Aug 9, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Aug 9, 2019
@yuylyp
Copy link

yuylyp commented Aug 9, 2019

@stefanopassador
i encountered the same question, it confused me for some days.
have you got any idea?

@yuylyp
Copy link

yuylyp commented Aug 9, 2019

I have tried to_csv just now, and the blank row disappear by setting index name to None.
while I turned to to_excel, I encountered it again.
It seems to be an unwanted behavior in to_excel.
In other words, it should be an BUG.

@stefanopassador
Copy link
Author

@yuylyp I haven't had any luck about it.

@yuylyp
Copy link

yuylyp commented Aug 9, 2019

I might get the reason, I found some code in /pandas/io/formats/excel.py as below.
However I don't know how to fix this issue

# MultiIndex columns require an extra row
# with index names (blank if None) for
# unambiguous round-trip, unless not merging,
# in which case the names all go on one row Issue #11328

@yuylyp
Copy link

yuylyp commented Aug 13, 2019

@stefanopassador
I found a workground, please comment the following code in function _format_regular_rows in /pandas/io/formats/excel.py
but the index name will cover the lowest-level column name while both of them are not none or empty.
so I suggests the developer to add more condition in this if-statement.

if isinstance(self.columns, ABCMultiIndex):
    self.rowcounter += 1

@gergab1129
Copy link

gergab1129 commented Aug 16, 2019

I managed to get a workaround to this problem using quite a mix between offsets and writing the DataFrame by segments.

May data be a Dataframe like this:

image

Extracting the level_0 column names and writing the index you can overcome the blank line problem

index_mes = data.columns.levels[0]

for index_num, value in enumerate(data.index):
        worksheet.write(row_offset + index_num + 3,
                        column_offset, value)`

and then you can write the DataFrame splitting by key:

for mes in index_mes:

        worksheet.merge_range(row_offset + 1, column_offset + 1,
                              row_offset + 1, column_offset + 3,
                              mes)
        data[mes].to_excel(writer, nombre_hoja, startrow=row_offset + 2,
                              startcol=column_offset + 1, header=True,
                              index=False)
         column_offset += 3`

However I found a problem, This is the expected output:

image

But what I get is this:

image

You can see that the month are not in order, this problem comes from the line

index_mes = data.columns.levels[0]

This gets you the level_0 columns names in alphabetical order instead of the original order and this I haven't been able to overcome. However if your keys do not depend on orders I see no problem using this approach.

 

@bartkim0426
Copy link

@yuylyp
Can you explain more about the specific condition?
I tried to fix this while the sprint in pycon 2019 Korea. While making a test code, I figured out that I can't reproduce this problem with test code because read_excel doesn't have skip_blanklines.
Since I'm not a hard user of pandas, I can't specify the condition for this problem. It will be very helpful to give a clue.

@yuylyp
Copy link

yuylyp commented Aug 19, 2019

@bartkim0426
for example, while the name of 1-level index or names of multiindex is(are) none/ empty or others which indicates the empty string, the statement (self.rowcounter += 1) shouldn't be executed.
while we add those condition, it will works well as we thought.

@bartkim0426
Copy link

I dug into this issue for a while and I found out that this code can be intended because of index name.

If I add the condition, the problem above is solved. However, some multiple multiindex tests failed (in test_excel_multindex_roundtrip) because of index name.

While read_excel - if there's only row cell is filled and other cells are blank, it's hard to find out that the row is name of index or the blank row.

Example below: Nan for first row

Add blank row

image

In [5]: act
Out[5]:
        C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3 C_l0_g4
        C_l1_g0 C_l1_g1 C_l1_g2 C_l1_g3 C_l1_g4
        C_l2_g0 C_l2_g1 C_l2_g2 C_l2_g3 C_l2_g4
R_l0_g0     NaN     NaN     NaN     NaN     NaN
R_l0_g1    R1C0    R1C1    R1C2    R1C3    R1C4
R_l0_g2    R2C0    R2C1    R2C2    R2C3    R2C4
R_l0_g3    R3C0    R3C1    R3C2    R3C3    R3C4
R_l0_g4    R4C0    R4C1    R4C2    R4C3    R4C4

In [6]: act.index
Out[6]: Index(['R_l0_g0', 'R_l0_g1', 'R_l0_g2', 'R_l0_g3', 'R_l0_g4'], dtype='object')

In [7]: act.shape
Out[7]: (5, 5)

As you can see, R_I0_g0 row treated well as data because there's blank row above - so it's not treated as index name.

Not add blank row

image

In [2]: act
Out[2]:
        C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3 C_l0_g4
        C_l1_g0 C_l1_g1 C_l1_g2 C_l1_g3 C_l1_g4
        C_l2_g0 C_l2_g1 C_l2_g2 C_l2_g3 C_l2_g4
R_l0_g0
R_l0_g1    R1C0    R1C1    R1C2    R1C3    R1C4
R_l0_g2    R2C0    R2C1    R2C2    R2C3    R2C4
R_l0_g3    R3C0    R3C1    R3C2    R3C3    R3C4
R_l0_g4    R4C0    R4C1    R4C2    R4C3    R4C4

In [3]: act.index
Out[3]: Index(['R_l0_g1', 'R_l0_g2', 'R_l0_g3', 'R_l0_g4'], dtype='object', name='R_l0_g0')

In [4]: act.shape
Out[4]: (4, 5)

R_I0_g0 row treated as index name, so real data is started from R_I0_g1.

I figured out that this code is written 4 years ago by Chris, and Cleanup by Will 5 months ago. (#26473)

So I think this issue can be closed without considering an ambiguous read_excel problems.

Summary

  • While to_excel with multiindex without index name, there's a blank line created.
  • If add condition, multiple tests failed because of read_excel treated the first row as index name.
  • Which one is better - closed this issue or change the whole test including read_excel issues.

@WillAyd Can you give an idea about this?

@jmcnamara
Copy link
Contributor

@bartkim0426 Thanks. That is a good summary of the problems involved here and why the solution isn't straightforward.

Another solution might be to have an option to turn off the blank line for cases where the user isn't worried about round-tripping the file through Pandas. In which case the existing tests/behaviour could still be the same.

@bartkim0426
Copy link

@jmcnamara Good idea!
Then to_excel would have an extra parameter. (I don't think the appropriate name) Any ideas?
Also is it too excessive to add an option for the specific situation?

@mroeschke mroeschke added the Bug label May 8, 2020
@lenoqt
Copy link

lenoqt commented Apr 4, 2021

This is still being an issue? I am getting this:
image

I have pandas : 1.1.5

@Seonu-Lim
Copy link

This is still being an issue? I am getting this:
image

I have pandas : 1.1.5

Yes, I have also encountered this issue on pandas 1.2.2. I think no one's working on this...

@mullimanko
Copy link

+1

@Delengowski
Copy link
Contributor

This has always done it, I can remember all the way back to 0.23.4, that blank line is for the names of the multi index levels

@tomwojcik
Copy link

This has always done it, I can remember all the way back to 0.23.4, that blank line is for the names of the multi index levels

I was debugging it yesterday out of curiosity and can confirm it's due to the name of MultiIndex. I'll tackle this bug sometime in the future.

@Delengowski
Copy link
Contributor

This has always done it, I can remember all the way back to 0.23.4, that blank line is for the names of the multi index levels

I was debugging it yesterday out of curiosity and can confirm it's due to the name of MultiIndex. I'll tackle this bug sometime in the future.

I was under the impression it was intended behavior

@mullimanko
Copy link

mullimanko commented May 16, 2021

For now, here is a quick and dirty workaraound after you have done the export:

import pandas as pd
import xlwings as xw

path_save = r"test.xlsx"

# At first, you have a dataframe that you want to export, e.g., df.
df = pd.DataFrame([
    ("foo", "bar"),
    ("", ""),
    ("baz", "qux"),
    ("fred", "thud"),
    ], columns=['col1', 'col2',])

# Then you did:
df.to_excel(path_save)

# After that you could use the following code to get rid of a blank line.
with xw.App(visible=False) as app:
    wb = xw.Book(path_save)
    ws = wb.sheets[0]

    # Delete the row that is blank. In this case it's row 3 (in 1-based notation, i.e. excel notation, not pandas notation).
    wb.sheets[0].range('3:3').api.Delete()
    wb.save(path_save)
    wb.close()

@Taoyuetao
Copy link

It looks xlwings can't work under Linux system, is there any workaround for Linux?

@ant1j
Copy link

ant1j commented Jul 29, 2021

It looks xlwings can't work under Linux system, is there any workaround for Linux?

I guess openpyxl or xlsxwriter (or any Excel managing library) can work this out.

@Taoyuetao
Copy link

@ant1j

Thanks your tips. I found a workaround as

writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='test1')
writer.sheets['test1'].set_row(2, None, None, {'hidden': True})
writer.save()

@chuachengling
Copy link

chuachengling commented Sep 23, 2021

Hi is there a workaround when writing multiple dataframes into one excel sheet? Thanks

Edit:
Nevermind, the method would be to add the method programmatically

writer.sheets['test1'].set_row(2, None, None, {'hidden': True})

Thanks!

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@filipthor
Copy link

For now, here is a quick and dirty workaraound after you have done the export:

df.to_excel(path_save)

# After the to_excel you could add this.
import xlwings as xw

app = xw.App(visible=False)
wb = xw.Book(path_save)

# Delete the row that is blank (in this case row 3).
wb.sheets[0].range('3:3').api.Delete()
wb.save(path_save)
app.quit()

For some reason this solution only worked once for me. The second time I run the code (regenerating the same df and exporting the file) the code gets stuck in running, and nothing happens.

Is there any update on this issue in general, is there any solutions/fixes planned for this?

@mullimanko
Copy link

In the workaround you mention, there might have been a problem that sometimes an invisible excel process in the background stays open, I have updated the above code, so that it is more stable. The problem should be solved now.

@alessiamarcolini
Copy link

Hi @tomwojcik is there any update on this issue? :)

@tomwojcik
Copy link

Hi @tomwojcik is there any update on this issue? :)

I stopped working on it after someone mentioned that's the expected behavior. Even if it wasn't I'm not interested in fixing it anymore.

@sailist
Copy link

sailist commented Aug 14, 2024

I resolved this problem by overriding these classes(pandas 2.0.3):

from pandas.io.formats.style import (
    Styler,
    Sequence,
    Hashable,
    IndexLabel,
    StorageOptions,
)
import numpy as np
from pandas.io.formats.excel import (
    ExcelFormatter,
    ExcelCell,
    Index,
    MultiIndex,
    PeriodIndex,
    CssExcelCell,
    com,
    get_level_lengths,
)
from typing import Iterable


class CustomStyler(Styler):
    def to_excel(
        self,
        excel_writer,
        sheet_name: str = "Sheet1",
        na_rep: str = "",
        float_format: str | None = None,
        columns: Sequence[Hashable] | None = None,
        header: Sequence[Hashable] | bool = True,
        index: bool = True,
        index_label: IndexLabel | None = None,
        startrow: int = 0,
        startcol: int = 0,
        engine: str | None = None,
        merge_cells: bool = True,
        encoding: str | None = None,
        inf_rep: str = "inf",
        verbose: bool = True,
        freeze_panes: tuple[int, int] | None = None,
        storage_options: StorageOptions = None,
    ) -> None:
        formatter = CustomExcelFormatter(
            self,
            na_rep=na_rep,
            cols=columns,
            header=header,
            float_format=float_format,
            index=index,
            index_label=index_label,
            merge_cells=merge_cells,
            inf_rep=inf_rep,
        )
        formatter.write(
            excel_writer,
            sheet_name=sheet_name,
            startrow=startrow,
            startcol=startcol,
            freeze_panes=freeze_panes,
            engine=engine,
            storage_options=storage_options,
        )


class CustomExcelFormatter(ExcelFormatter):

    def _format_regular_rows(self) -> Iterable[ExcelCell]:
        if self._has_aliases or self.header:
            self.rowcounter += 1

        # output index and index_label?
        if self.index:
            # check aliases
            # if list only take first as this is not a MultiIndex
            if self.index_label and isinstance(
                self.index_label, (list, tuple, np.ndarray, Index)
            ):
                index_label = self.index_label[0]
            # if string good to go
            elif self.index_label and isinstance(self.index_label, str):
                index_label = self.index_label
            else:
                index_label = self.df.index.names[0]

            # if isinstance(self.columns, MultiIndex):
            #     self.rowcounter += 1

            if index_label and self.header is not False:
                yield ExcelCell(self.rowcounter - 1, 0, index_label, self.header_style)

            # write index_values
            index_values = self.df.index
            if isinstance(self.df.index, PeriodIndex):
                index_values = self.df.index.to_timestamp()

            for idx, idxval in enumerate(index_values):
                yield CssExcelCell(
                    row=self.rowcounter + idx,
                    col=0,
                    val=idxval,
                    style=self.header_style,
                    css_styles=getattr(self.styler, "ctx_index", None),
                    css_row=idx,
                    css_col=0,
                    css_converter=self.style_converter,
                )
            coloffset = 1
        else:
            coloffset = 0

        yield from self._generate_body(coloffset)

    def _format_hierarchical_rows(self) -> Iterable[ExcelCell]:
        if self._has_aliases or self.header:
            self.rowcounter += 1

        gcolidx = 0

        if self.index:
            index_labels = self.df.index.names
            # check for aliases
            if self.index_label and isinstance(
                self.index_label, (list, tuple, np.ndarray, Index)
            ):
                index_labels = self.index_label

            # MultiIndex columns require an extra row
            # with index names (blank if None) for
            # unambiguous round-trip, unless not merging,
            # in which case the names all go on one row Issue #11328
            # if isinstance(self.columns, MultiIndex) and self.merge_cells:
            #     self.rowcounter += 1

            # if index labels are not empty go ahead and dump
            if com.any_not_none(*index_labels) and self.header is not False:
                for cidx, name in enumerate(index_labels):
                    yield ExcelCell(self.rowcounter - 1, cidx, name, self.header_style)

            if self.merge_cells:
                # Format hierarchical rows as merged cells.
                level_strs = self.df.index.format(
                    sparsify=True, adjoin=False, names=False
                )
                level_lengths = get_level_lengths(level_strs)

                for spans, levels, level_codes in zip(
                    level_lengths, self.df.index.levels, self.df.index.codes
                ):
                    values = levels.take(
                        level_codes,
                        allow_fill=levels._can_hold_na,
                        fill_value=levels._na_value,
                    )

                    for i, span_val in spans.items():
                        mergestart, mergeend = None, None
                        if span_val > 1:
                            mergestart = self.rowcounter + i + span_val - 1
                            mergeend = gcolidx
                        yield CssExcelCell(
                            row=self.rowcounter + i,
                            col=gcolidx,
                            val=values[i],
                            style=self.header_style,
                            css_styles=getattr(self.styler, "ctx_index", None),
                            css_row=i,
                            css_col=gcolidx,
                            css_converter=self.style_converter,
                            mergestart=mergestart,
                            mergeend=mergeend,
                        )
                    gcolidx += 1

            else:
                # Format hierarchical rows with non-merged values.
                for indexcolvals in zip(*self.df.index):
                    for idx, indexcolval in enumerate(indexcolvals):
                        yield CssExcelCell(
                            row=self.rowcounter + idx,
                            col=gcolidx,
                            val=indexcolval,
                            style=self.header_style,
                            css_styles=getattr(self.styler, "ctx_index", None),
                            css_row=idx,
                            css_col=gcolidx,
                            css_converter=self.style_converter,
                        )
                    gcolidx += 1

        yield from self._generate_body(gcolidx)


def pd_to_excel(
    self,
    excel_writer,
    sheet_name: str = "Sheet1",
    na_rep: str = "",
    float_format: str | None = None,
    columns: Sequence[Hashable] | None = None,
    header: Sequence[Hashable] | bool = True,
    index: bool = True,
    index_label: IndexLabel | None = None,
    startrow: int = 0,
    startcol: int = 0,
    engine: str | None = None,
    merge_cells: bool = True,
    encoding: str | None = None,
    inf_rep: str = "inf",
    verbose: bool = True,
    freeze_panes: tuple[int, int] | None = None,
    storage_options: StorageOptions = None,
) -> None:

    formatter = CustomExcelFormatter(
        self,
        na_rep=na_rep,
        cols=columns,
        header=header,
        float_format=float_format,
        index=index,
        index_label=index_label,
        merge_cells=merge_cells,
        inf_rep=inf_rep,
    )
    formatter.write(
        excel_writer,
        sheet_name=sheet_name,
        startrow=startrow,
        startcol=startcol,
        freeze_panes=freeze_panes,
        engine=engine,
        storage_options=storage_options,
    )

@secsilm
Copy link

secsilm commented Oct 29, 2024

@ant1j

Thanks your tips. I found a workaround as

writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter') df.to_excel(writer, sheet_name='test1') writer.sheets['test1'].set_row(2, None, None, {'hidden': True}) writer.save()

It's not working now.

@jmcnamara
Copy link
Contributor

jmcnamara commented Oct 31, 2024

Thanks your tips. I found a workaround as ... It's not working now.

@secsilm It should still work. You just need to change .save() to .close():

import pandas as pd

data = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
multi_index = pd.MultiIndex.from_product([["a", "b"], ["one", "two", "three"]])
df = pd.DataFrame(data, columns=multi_index, index=[1, 2])

writer = pd.ExcelWriter("test.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="test1")

writer.sheets["test1"].set_row(2, None, None, {"hidden": True})

writer.close()

Output (note that row 3 is hidden):

screenshot

Tested with python3.11, pandas 2.2.3 and xlsxwriter 3.2.0.

@secsilm
Copy link

secsilm commented Nov 1, 2024

Thanks your tips. I found a workaround as ... It's not working now.

@secsilm It should still work. You just need to change .save() to .close():

import pandas as pd

data = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
multi_index = pd.MultiIndex.from_product([["a", "b"], ["one", "two", "three"]])
df = pd.DataFrame(data, columns=multi_index, index=[1, 2])

writer = pd.ExcelWriter("test.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="test1")

writer.sheets["test1"].set_row(2, None, None, {"hidden": True})

writer.close()

Output (note that row 3 is hidden):

screenshot

Tested with python3.11, pandas 2.2.3 and xlsxwriter 3.2.0.

Thanks. I tried your code, it works. Maybe there was some mistakes in my code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel MultiIndex
Projects
None yet
Development

No branches or pull requests