ENH: Better handling of MultiIndex with Excel #5423

jmcnamara · 2013-11-03T20:42:24Z

Allow optional formatting of MultiIndex and Hierarchical Rows
as merged cells. closes #5254.

Some notes on this PR:

The required xlsxwriter version was up-revved in ci/requirements*.txt to pick up a fix in that module that makes working with charts from pandas easier.
Added a comment to release.rst.
Modified format.py to format MultiIndex and Hierarchical Rows as merged cells in Excel. The old code path/option is still available. The new formatting must be explicitly invoked via the merge_cells option in to_excel:

df.to_excel('file.xlsx', merge_cells=True)

The merge_cells option can be renamed if required. During development I also used multi_index and merge_range.
Updated the API and docs in frame.py to reflect the new option.
Fixed openpyxl merge handling in excel.py.
Modified the test_excel.py test cases so that they could be used to test the existing dot notation or the new merge_cells options for MultiIndex and Hierarchical Rows handling.
Did some minor PEP8 formatting to the test_excel.py.
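
The merged-cell layout described in the notes above can be pictured as collapsing each run of repeated outer-level labels into a single cell range. The following is a minimal sketch of that idea in plain Python, not the code in format.py; the function name `merge_ranges` and the (first_row, last_row, label) tuple layout are assumptions made for illustration.

```python
from itertools import groupby


def merge_ranges(labels, start_row=0):
    """Return (first_row, last_row, label) for each run of equal labels.

    Hypothetical helper: each tuple describes one block of rows that a
    writer could emit as a single merged cell.
    """
    ranges = []
    row = start_row
    for label, run in groupby(labels):
        length = len(list(run))
        ranges.append((row, row + length - 1, label))
        row += length
    return ranges


# Outer level of a two-level row index, one entry per data row.
outer = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c"]
for first, last, label in merge_ranges(outer, start_row=1):
    print(f"rows {first}-{last}: {label}")
```

A run of length one yields a range whose first and last row are equal, which a writer would presumably emit as an ordinary cell rather than a merge.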

jtratner · 2013-11-03T20:55:14Z

I added your notes to the top of the PR for ease of review.

jtratner · 2013-11-03T20:56:17Z

so it still can't read in the (simpler) merged cells versions, right? Or can it?

jmcnamara · 2013-11-03T21:00:20Z

I added your notes to the top of the PR for ease of review.

Okay, thanks. I'm never sure if that gets converted to a commit message on merge or not. I'm guessing it doesn't?

so it still can't read in the (simpler) merged cells versions, right? Or can it?

Yes, it can still read the old dot format. It can also read the new merge format for simple single row MIs.

jtratner · 2013-11-03T21:01:46Z

so then it's not even necessary to have the keyword argument, right? Or we could at least flip it to be the other direction? (default merge_cells, existing behavior under kwarg)?

Previously it couldn't even roundtrip an MI...

jmcnamara · 2013-11-03T21:33:19Z

Previously it couldn't even roundtrip an MI...

Hmm, perhaps I'm wrong then. I thought that was what the existing tests with multiindex in the name were doing. But maybe not. The tests are like this:

import pandas as pd

df1 = pd.DataFrame([[7] *3, [8] *3, [9] *3])
df1.index.name = 'Foo'

df1.to_excel('merge30.xlsx', merge_cells=True)

xf = pd.ExcelFile('merge30.xlsx')
df2 = xf.parse(xf.sheet_names[0], has_index_names=True)

df1 == df2

>>> df1 == df2
        0     1     2
Foo
0    True  True  True
1    True  True  True
2    True  True  True

So that isn't round-tripping the MI, just the index names.

jtratner · 2013-11-03T21:36:39Z

Check out roundtrip in hemstring (the test that's actually checking for MI-things):

        def roundtrip(df, header=True, parser_hdr=0):

            with ensure_clean(self.ext) as path:
                df.to_excel(path, header=header)
                xf = pd.ExcelFile(path)
                res = xf.parse(xf.sheet_names[0], header=parser_hdr)
                return res

doesn't even check that it's reading it in the same way, and later it just checks that none of it is nan.
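
The gap described here, a round trip that only checks for missing values, can be illustrated without Excel at all. The sketch below uses plain tuples as a stand-in for the rows written and the rows read back; the data and the helper names `has_no_nan` and `same_rows` are hypothetical and are not taken from test_excel.py.

```python
import math


def has_no_nan(rows):
    """Weak check: True when no value in any row is a float NaN."""
    return not any(
        isinstance(value, float) and math.isnan(value)
        for row in rows
        for value in row
    )


def same_rows(written, read_back):
    """Strict check: the rows read back must equal the rows written."""
    return list(written) == list(read_back)


# (outer label, inner label, value) rows, invented for illustration.
written = [("a", 1, 0), ("a", 2, 1), ("b", 1, 2)]
# Stand-in for a lossy read that drops the outer index level.
read_back = [(1, 0), (2, 1), (1, 2)]

print(has_no_nan(read_back))          # weak check passes
print(same_rows(written, read_back))  # strict check fails
```

The weak check accepts the lossy result because nothing in it is NaN, while the strict comparison rejects it.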

jtratner · 2013-11-03T21:38:08Z

bottom line is that this behavior wasn't well-specified before, so if read_excel can now always read in to_excel's output we should go with that by default. (and maybe just warn that it will strip out MI column names?)

jmcnamara · 2013-11-05T00:13:14Z

Rebased Excel MultiIndex PR to the latest master and fixed conflicts.

jtratner · 2013-11-05T02:42:37Z

Why do you need to use __getattribute__ and __setattr__ explicitly? Surprised to see that?

jmcnamara · 2013-11-05T10:50:07Z

Why do you need to use __getattribute__ and __setattr__ explicitly? Surprised to see that?

                           xcell = wks.cell("%s%s" % (colletter, row))
                           for field in style.__fields__:
                               xcell.style.__setattr__(field, \
                                   style.__getattribute__(field))

Yes, that is kind of janky. I copied that from the existing code above those lines. You would imagine that it should just be a function like:

    xcell.style = _convert_cell_style(cell.style)

I might look at refactoring the style/format handling as a separate PR in the 0.14 milestone and re-enable the commented-out test case for cell formats in test_excel.py. It probably isn't worth it for this PR though.

jtratner · 2013-11-05T11:04:32Z

Ah okay, I get why you'd want to match nearby code.

cancan101 · 2013-11-05T13:24:43Z

If need be, there are always the built-in functions getattr and setattr.
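
As a minimal sketch of that suggestion, the loop quoted earlier in the thread could copy fields with the built-ins instead of calling the dunder methods directly. The `Style` class and its `__fields__` tuple below are stand-ins invented for this example; they are not openpyxl's actual style API.

```python
class Style:
    """Hypothetical stand-in for a cell style object."""

    __fields__ = ("font", "borders", "alignment")

    def __init__(self, font=None, borders=None, alignment=None):
        self.font = font
        self.borders = borders
        self.alignment = alignment


def copy_style(source, target):
    """Copy every field listed in source.__fields__ onto target."""
    for field in source.__fields__:
        setattr(target, field, getattr(source, field))
    return target


src = Style(font="bold", borders="thin", alignment="center")
dst = copy_style(src, Style())
print(dst.font, dst.borders, dst.alignment)
```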

jtratner · 2013-11-05T13:31:25Z

This looks fine overall - I think the default should be merge_cells=True though.

jreback · 2013-11-05T13:35:03Z

ci/requirements-2.7.txt

@@ -8,7 +8,7 @@ numexpr==2.1
 tables==2.3.1
 matplotlib==1.1.1
 openpyxl==1.6.2
-xlsxwriter==0.4.3
+xlsxwriter==0.4.6


I assume this is the current version.....not sure if you care about testing with a previous version (or even if it matters)

Yes, there are a few recent changes in XlsxWriter that are worth picking up. They don't affect any of the functionality here though. All tests will pass with versions from 0.4.3 onwards.

that's fine.

jreback · 2013-11-05T13:37:25Z

I think it would be nice to have an example of this in io.rst (and maybe fix up the excel section a bit to have a bit more organization, and several sub-headings). could do in this PR or another, and can use the same example in v0.13.0

jtratner · 2013-11-05T23:11:42Z

pandas/core/frame.py

@@ -1130,7 +1130,8 @@ def to_csv(self, path_or_buf, sep=",", na_rep='', float_format=None,

    def to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='',
                 float_format=None, cols=None, header=True, index=True,
-                 index_label=None, startrow=0, startcol=0, engine=None):
+                 index_label=None, startrow=0, startcol=0, engine=None,
+                 merge_cells=False):


@jmcnamara why did you choose to have it default to False? I was thinking we could enable it given that the previous behavior was not particularly correct (nor at all roundtrip-able).

jmcnamara · 2013-11-05T23:12:46Z

I think the default should be merge_cells=True though.

Ok. I'll change that around.

jtratner · 2013-11-05T23:15:34Z

In terms of docs, I think you could take one of the straightforward examples from the issue and use that in the docs to demonstrate what's going on (and then just add a .. note:: that it may not always be possible to effectively round trip DataFrames with named Index and named columns).

jmcnamara · 2013-11-05T23:20:27Z

I think it would be nice to have an example of this in io.rst (and maybe fix up the excel section a bit to have a bit more organization, and several sub-headings). could do in this PR or another, and can use the same example in v0.13.0

If you like, I could write a document section on Excel handling from Pandas: how to read and write files, explanations of some of the options with some examples and screenshots.

That would be better as a separate PR though.

Also, here is a related document that I've been working on for after the 0.13 release: Using Pandas and XlsxWriter to create Excel charts

jtratner · 2013-11-06T01:28:59Z

okay, so is this ready to go? (and then I'll open an issue about docs)

jtratner · 2013-11-06T01:31:14Z

btw - those docs look nice - when you say that you can use xlsxwriter to create excel charts, do you mean that you can create charts that you can open and edit in Excel?

jmcnamara · 2013-11-06T01:40:20Z

Yes. PR is ready to go.

And yes the charts are in Excel rather than, say, PNGs produced by matplotlib. Give it a try. There are some examples in the latter part. It would be good to get some feedback.

jtratner · 2013-11-06T01:42:20Z

That is really amazing... I would like to get to that at some point (going to PyData this weekend but maybe next weekend or something). That would be so useful at work. Anyways, that's off topic.

jtratner · 2013-11-06T01:42:33Z

thanks for this!

jtratner · 2013-11-06T02:02:58Z

@jmcnamara just tested this out, I get hierarchical rows from this example, but a phantom nan - can you take a look at some point?

this code generates the screenshot below:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": range(10)}, index=[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]])

In [3]: df.to_excel('output.xlsx')

In [7]: df
Out[7]:
     A
a 1  0
  2  1
  3  2
  4  3
b 1  4
  2  5
  3  6
c 1  7
  2  8
  3  9

When I read it back in, I get this:

In [5]: pd.read_excel('output.xlsx', 'Sheet1')
Out[5]:
       A
a   1  0
NaN 2  1
    3  2
    4  3
b   1  4
NaN 2  5
    3  6
c   1  7
NaN 2  8
    3  9

jtratner · 2013-11-06T02:03:44Z

In [8]: _5.index
Out[8]:
MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3, 4]],
           labels=[[0, -1, -1, -1, 1, -1, -1, 2, -1, -1], [0, 1, 2, 3, 0, 1, 2, 0, 1, 2]])

And that is specifically being read in as 'a', followed by 3 nans, then 'b', 2 nans, then 'c', and 2 nans.

jtratner · 2013-11-06T02:11:55Z

To be clear, the MI should be:

MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3, 4]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 3, 0, 1, 2, 0, 1, 2]])
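
One way to picture the repair is a forward fill over the outer level: a merged block stores its label only in the first cell, so blank cells inherit the label above them. The sketch below is plain Python under that assumption; `forward_fill` and `codes_for` are hypothetical helpers, not the parser fix itself.

```python
def forward_fill(labels, missing=None):
    """Replace each missing entry with the most recent non-missing label."""
    filled = []
    last = missing
    for label in labels:
        if label is not missing:
            last = label
        filled.append(last)
    return filled


def codes_for(labels):
    """Map labels to integer positions in their sorted unique values."""
    levels = sorted(set(labels))
    return levels, [levels.index(label) for label in labels]


# Outer level as read back from merged cells: only the first cell of each
# block carries a label (None marks the blanks reported as NaN above).
read_back = ["a", None, None, None, "b", None, None, "c", None, None]

levels, codes = codes_for(forward_fill(read_back))
print(levels)
print(codes)
```

With the blanks filled, the computed codes match the outer-level labels listed in the expected MultiIndex.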

jmcnamara · 2013-11-06T13:54:09Z

I get hierarchical rows from this example, but a phantom nan - can you take a look at some point?

Hmm, that is a little odd. I'm seeing it on one test system but not another. I'll look into it.

jtratner · 2013-11-06T22:14:17Z

I'm guessing it's a Mac excel thing?

jmcnamara · 2013-11-06T23:35:11Z

My bad. I was on a branch that didn't have the MI patch merged.

So unfortunately, reading HRs in the new merged format is broken. It is fixable but it will need some work. I'll address it in the 0.14 timeframe.

jtratner · 2013-11-07T00:24:16Z

Honestly, that's fine, it wasn't working before and MIs weren't roundtripping correctly anyways. Would you mind adding a warning to the docs?

jtratner · 2013-11-07T00:27:44Z

If you get to a fix earlier, feel free to submit it...bugfixes are good.

jmcnamara · 2013-11-07T01:27:34Z

Would you mind adding a warning to the docs?

I put the following in the release note.

diff --git a/doc/source/release.rst b/doc/source/release.rst
index 4b33c20..77d78b2 100644
--- a/doc/source/release.rst
+++ b/doc/source/release.rst
@@ -209,6 +209,11 @@ Improvements to existing features
     by color as expected.
   - ``read_excel()`` now tries to convert integral floats (like ``1.0``) to int
     by default. (:issue:`5394`)
+  - Excel writers now have a default option ``merge_cells`` in ``to_excel()``
+    to merge cells in MultiIndex and Hierarchical Rows. Note: using this
+    option it is no longer possible to round trip Excel files with merged
+    MultiIndex and Hierarchical Rows. Set the ``merge_cells`` to ``False`` to
+    restore the previous behaviour.  (:issue:`5254`)

Hopefully that will do for now, until I either fix the MI parsing or document the issue in the new doc section on using Excel.

jmcnamara mentioned this pull request Nov 4, 2013

BUG: Excel writer doesn't handle "cols" option correctly #5427

Closed

jreback reviewed Nov 5, 2013

jtratner reviewed Nov 5, 2013

ENH: Better handling of MultiIndex with Excel

ae37d22

Added merged cell formatting for MultiIndex and Hierarchical Rows. Issue #5254.

jtratner added a commit that referenced this pull request Nov 6, 2013

Merge pull request #5423 from jmcnamara/enh_excel_multi_index

b139288

ENH: Better handling of MultiIndex with Excel

jtratner merged commit b139288 into pandas-dev:master Nov 6, 2013

jmcnamara mentioned this pull request Nov 10, 2013

multiindex column in to_excel #2701

Closed

ghost mentioned this pull request Feb 7, 2014

BUG: misplaced index_label with DF.to_excel() #6260

Closed
