Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Better handling of MultiIndex with Excel #5423

Merged
merged 1 commit into from
Nov 6, 2013
Merged

ENH: Better handling of MultiIndex with Excel #5423

merged 1 commit into from
Nov 6, 2013

Conversation

jmcnamara
Copy link
Contributor

Allow optional formatting of MultiIndex and Hierarchical Rows
as merged cells. closes #5254.

Some notes on this PR:

  • The required xlsxwriter version was up-revved in ci/requirements*.txtto pick up a fix in that module that makes working with charts from pandas easier.
  • Added a comment to release.rst.
  • Modified format.py to format MultiIndex and Hierarchical Rows as merged cells in Excel. The old code path/option is still available. The new formatting must be explicitly invoked via the merge_cells option in to_excel:
df.to_excel('file.xlsx', merge_cells=True)
  • The merge_cells option can be renamed if required. During development I also used multi_index and merge_range.
  • Updated the API and docs in frame.py to reflect the new option.
  • Fixed openpyxl merge handling in excel.py.
  • Modified the test_excel.py test cases so that they could be used to test the existing dot notation or the new merge_cells options for MultiIndex and Hierarchical Rows handling.
  • Did some minor PEP8 formatting to the test_excel.py.

@jtratner
Copy link
Contributor

jtratner commented Nov 3, 2013

I added your notes to the top of the PR for ease of review.

@jtratner
Copy link
Contributor

jtratner commented Nov 3, 2013

so it still can't read in the (simpler) merged cells versions, right? Or can it?

@jmcnamara
Copy link
Contributor Author

I added your notes to the top of the PR for ease of review.

Okay, thanks. I'm never sure if that gets converted to a commit message on merge or not. I'm guessing it doesn't?

so it still can't read in the (simpler) merged cells versions, right? Or can it?

Yes, it can still read the old dot format. It can also read the new merge format for simple single row MIs.

@jtratner
Copy link
Contributor

jtratner commented Nov 3, 2013

so then it's not even necessary to have the keyword argument, right? Or we could at least flip it to be the other direction? (default merge_cells, existing behavior under kwarg)?

Previously it couldn't even roundtrip an MI...

@jmcnamara
Copy link
Contributor Author

Previously it couldn't even roundtrip an MI...

Hmm, perhaps I'm wrong then. I thought that is what the existing tests with mulitiindex in the name were doing that. But maybe not. The tests are like this:

import pandas as pd

df1 = pd.DataFrame([[7] *3, [8] *3, [9] *3])
df1.index.name = 'Foo'

df1.to_excel('merge30.xlsx', merge_cells=True)

xf = pd.ExcelFile('merge30.xlsx')
df2 = xf.parse(xf.sheet_names[0], has_index_names=True)

df1 == df2

>>> df1 == df2
        0     1     2
Foo
0    True  True  True
1    True  True  True
2    True  True  True

So that isn't round-tripping the MI, just the index names.

@jtratner
Copy link
Contributor

jtratner commented Nov 3, 2013

Check out roundtrip in hemstring (the test that's actually checking for MI-things):

        def roundtrip(df, header=True, parser_hdr=0):

            with ensure_clean(self.ext) as path:
                df.to_excel(path, header=header)
                xf = pd.ExcelFile(path)
                res = xf.parse(xf.sheet_names[0], header=parser_hdr)
                return res

doesn't even check that it's reading it in the same way, and later it just checks that none of it is nan.

@jtratner
Copy link
Contributor

jtratner commented Nov 3, 2013

bottom line is that this behavior wasn't well-specified before, so if read_excel can now always read in to_excel's output we should go with that by default. (and maybe just warn that it will strip out MI column names?)

@jmcnamara
Copy link
Contributor Author

Rebased Excel MultiIndex PR to the latest master and fixed conflicts.

@jtratner
Copy link
Contributor

jtratner commented Nov 5, 2013

Why do you need to use __getattribute__ and __setattr__ explicitly? Surprised to see that?

@jmcnamara
Copy link
Contributor Author

Why do you need to use getattribute and setattr explicitly? Surprised to see that?

                           xcell = wks.cell("%s%s" % (colletter, row))
                           for field in style.__fields__:
                               xcell.style.__setattr__(field, \
                                   style.__getattribute__(field))

Yes, that is kind of janky. I copied that from the existing code above those lines. You would imagine that it should just be a function like:

    xcell.style = _convert_cell_style(cell.style)

I might look at refactoring the style/format handling as a separate PR in the 1.4 milestone and re-enable the commented out testcase for cell formats in test_excel.py. It probably isn't worth it for this PR though.

@jtratner
Copy link
Contributor

jtratner commented Nov 5, 2013

Ah okay, I get why you'd want to match nearby code.

@cancan101
Copy link
Contributor

If if need be there is always the functions getattr and setattr.

@jtratner
Copy link
Contributor

jtratner commented Nov 5, 2013

This looks fine overall - I think the default should be merge_cells=True though.

@@ -8,7 +8,7 @@ numexpr==2.1
tables==2.3.1
matplotlib==1.1.1
openpyxl==1.6.2
xlsxwriter==0.4.3
xlsxwriter==0.4.6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is the current version.....not sure if you care about testing with a previous version (or even if it matters)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are a few recent changes in XlsxWriter that are worth picking up. They don't affect any of the functionality here though. All tests will pass with versions from 0.4.3 onwards.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine.

@jreback
Copy link
Contributor

jreback commented Nov 5, 2013

I think it would be nice to have an example of this in io.rst (and maybe fix up the excel section a bit to have a bit more organization, and several sub-headings). could do in this PR or another, and can use the same example in v0.13.0

@@ -1130,7 +1130,8 @@ def to_csv(self, path_or_buf, sep=",", na_rep='', float_format=None,

def to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='',
float_format=None, cols=None, header=True, index=True,
index_label=None, startrow=0, startcol=0, engine=None):
index_label=None, startrow=0, startcol=0, engine=None,
merge_cells=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmcnamara why did you choose to have it default to False? I was thinking we could enable it given that the previous behavior was not particularly correct (nor at all roundtrip-able).

@jmcnamara
Copy link
Contributor Author

I think the default should be merge_cells=True though.

Ok. I'll change that around.

@jtratner
Copy link
Contributor

jtratner commented Nov 5, 2013

In terms of docs, I think you could take one of the straightforward examples from the issue and use that in the docs to demonstrate what's going on (and then just add a .. note:: that it may not always be possible to effectively round trip DataFrames with named Index and named columns).

@jmcnamara
Copy link
Contributor Author

I think it would be nice to have an example of this in io.rst (and maybe fix up the excel section a bit to have a bit more organization, and several sub-headings). could do in this PR or another, and can use the same example in v0.13.0

If you like, I could write a document section on Excel handling from Pandas: how to read and write files, explanations of some of the options with some examples and screenshots.

That would be better as a separate PR though.

Also, here is a related document that I've been working on for after the 0.13 release: Using Pandas and XlsxWriter to create Excel charts

Added merged cell formatting for MultiIndex and Hierarchical Rows.
Issue #5254.
@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

okay, so is this ready to go? (and then I'll open an issue about docs)

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

btw - those docs look nice - when you say that you can use xlsxwriter to create excel charts, do you mean that you can create charts that you can open and edit in Excel?

@jmcnamara
Copy link
Contributor Author

Yes. PR is ready to go.

And yes the charts are in Excel rather than, say, PNGs produced by matplotlib. Give it a try. There are some examples in the latter part. It would be good to get some feedback.

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

That is really amazing... I would like to get to that at some point (going to PyData this weekend but maybe next weekend or something). That would be so useful at work. Anyways, that's off topic.

jtratner added a commit that referenced this pull request Nov 6, 2013
ENH: Better handling of MultiIndex with Excel
@jtratner jtratner merged commit b139288 into pandas-dev:master Nov 6, 2013
@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

thanks for this!

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

@jmcnamara just tested this out, I get hierarchical rows from this example, but a phantom nan - can you take a look at some point?

this code generates the screenshot below:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": range(10)}, index=[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]])

In [3]: df.to_excel('output.xlsx')

In [7]: df
Out[7]:
     A
a 1  0
  2  1
  3  2
  4  3
b 1  4
  2  5
  3  6
c 1  7
  2  8
  3  9

screen shot 2013-11-05 at 9 01 16 pm

When I read it back in, I get this:

In [5]: pd.read_excel('output.xlsx', 'Sheet1')
Out[5]:
       A
a   1  0
NaN 2  1
    3  2
    4  3
b   1  4
NaN 2  5
    3  6
c   1  7
NaN 2  8
    3  9

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

In [8]: _5.index
Out[8]:
MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3, 4]],
           labels=[[0, -1, -1, -1, 1, -1, -1, 2, -1, -1], [0, 1, 2, 3, 0, 1, 2, 0, 1, 2]])

And that is specifically being read in as 'a', followed by 3 nans, then 'b', 2 nans, then 'c', and 2 nans.

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

To be clear, the MI should be:

MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3, 4]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 3, 0, 1, 2, 0, 1, 2]])

@jmcnamara
Copy link
Contributor Author

I get hierarchical rows from this example, but a phantom nan - can you take a look at some point?

Hmm, that is a little odd. I seeing it on one test system but not another. I'll look into it.

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

I'm guessing it's a Mac excel thing?

@jmcnamara
Copy link
Contributor Author

My bad. I was on a branch that didn't have the MI patch merged.

So unfortunately, reading HRs in the new merged format is broken. It is fixable but it will need some work. I'll address it in the 0.14 timeframe.

@jtratner
Copy link
Contributor

jtratner commented Nov 7, 2013

Honestly, that's fine, it wasn't working before and MIs weren't
roundtripping correctly anyways. Would you mind adding a warning to the
docs?

@jtratner
Copy link
Contributor

jtratner commented Nov 7, 2013

If you get to a fix earlier, feel free to submit it...bugfixes are good.

On Wed, Nov 6, 2013 at 7:23 PM, Jeffrey Tratner
jeffrey.tratner@gmail.comwrote:

Honestly, that's fine, it wasn't working before and MIs weren't
roundtripping correctly anyways. Would you mind adding a warning to the
docs?

@jmcnamara
Copy link
Contributor Author

Would you mind adding a warning to the
docs?

I put the following in the release note.

diff --git a/doc/source/release.rst b/doc/source/release.rst
index 4b33c20..77d78b2 100644
--- a/doc/source/release.rst
+++ b/doc/source/release.rst
@@ -209,6 +209,11 @@ Improvements to existing features
     by color as expected.
   - ``read_excel()`` now tries to convert integral floats (like ``1.0``) to int
     by default. (:issue:`5394`)
+  - Excel writers now have a default option ``merge_cells`` in ``to_excel()``
+    to merge cells in MultiIndex and Hierarchical Rows. Note: using this
+    option it is no longer possible to round trip Excel files with merged
+    MultiIndex and Hierarchical Rows. Set the ``merge_cells`` to ``False`` to
+    restore the previous behaviour.  (:issue:`5254`)

Hopefully, that should do for now until I fix MI parsing or document the issue in the new doc section on using Excel until it is fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Better handling of MultiIndex with Excel
4 participants