-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Better handling of MultiIndex with Excel #5423
Conversation
I added your notes to the top of the PR for ease of review. |
so it still can't read in the (simpler) merged cells versions, right? Or can it? |
Okay, thanks. I'm never sure if that gets converted to a commit message on merge or not. I'm guessing it doesn't?
Yes, it can still read the old dot format. It can also read the new merge format for simple single row MIs. |
so then it's not even necessary to have the keyword argument, right? Or we could at least flip it to be the other direction? (default merge_cells, existing behavior under kwarg)? Previously it couldn't even roundtrip an MI... |
Hmm, perhaps I'm wrong then. I thought that is what the existing tests with import pandas as pd
df1 = pd.DataFrame([[7] *3, [8] *3, [9] *3])
df1.index.name = 'Foo'
df1.to_excel('merge30.xlsx', merge_cells=True)
xf = pd.ExcelFile('merge30.xlsx')
df2 = xf.parse(xf.sheet_names[0], has_index_names=True)
df1 == df2
>>> df1 == df2
0 1 2
Foo
0 True True True
1 True True True
2 True True True So that isn't round-tripping the MI, just the index names. |
Check out roundtrip in hemstring (the test that's actually checking for MI-things): def roundtrip(df, header=True, parser_hdr=0):
with ensure_clean(self.ext) as path:
df.to_excel(path, header=header)
xf = pd.ExcelFile(path)
res = xf.parse(xf.sheet_names[0], header=parser_hdr)
return res doesn't even check that it's reading it in the same way, and later it just checks that none of it is nan. |
bottom line is that this behavior wasn't well-specified before, so if |
Rebased Excel MultiIndex PR to the latest master and fixed conflicts. |
Why do you need to use |
xcell = wks.cell("%s%s" % (colletter, row))
for field in style.__fields__:
xcell.style.__setattr__(field, \
style.__getattribute__(field)) Yes, that is kind of janky. I copied that from the existing code above those lines. You would imagine that it should just be a function like: xcell.style = _convert_cell_style(cell.style) I might look at refactoring the style/format handling as a separate PR in the 1.4 milestone and re-enable the commented out testcase for cell formats in test_ |
Ah okay, I get why you'd want to match nearby code. |
If if need be there is always the functions |
This looks fine overall - I think the default should be merge_cells=True though. |
@@ -8,7 +8,7 @@ numexpr==2.1 | |||
tables==2.3.1 | |||
matplotlib==1.1.1 | |||
openpyxl==1.6.2 | |||
xlsxwriter==0.4.3 | |||
xlsxwriter==0.4.6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is the current version.....not sure if you care about testing with a previous version (or even if it matters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are a few recent changes in XlsxWriter that are worth picking up. They don't affect any of the functionality here though. All tests will pass with versions from 0.4.3 onwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's fine.
I think it would be nice to have an example of this in io.rst (and maybe fix up the excel section a bit to have a bit more organization, and several sub-headings). could do in this PR or another, and can use the same example in v0.13.0 |
@@ -1130,7 +1130,8 @@ def to_csv(self, path_or_buf, sep=",", na_rep='', float_format=None, | |||
|
|||
def to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='', | |||
float_format=None, cols=None, header=True, index=True, | |||
index_label=None, startrow=0, startcol=0, engine=None): | |||
index_label=None, startrow=0, startcol=0, engine=None, | |||
merge_cells=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmcnamara why did you choose to have it default to False? I was thinking we could enable it given that the previous behavior was not particularly correct (nor at all roundtrip-able).
Ok. I'll change that around. |
In terms of docs, I think you could take one of the straightforward examples from the issue and use that in the docs to demonstrate what's going on (and then just add a |
If you like, I could write a document section on Excel handling from Pandas: how to read and write files, explanations of some of the options with some examples and screenshots. That would be better as a separate PR though. Also, here is a related document that I've been working on for after the 0.13 release: Using Pandas and XlsxWriter to create Excel charts |
Added merged cell formatting for MultiIndex and Hierarchical Rows. Issue #5254.
okay, so is this ready to go? (and then I'll open an issue about docs) |
btw - those docs look nice - when you say that you can use xlsxwriter to create excel charts, do you mean that you can create charts that you can open and edit in Excel? |
Yes. PR is ready to go. And yes the charts are in Excel rather than, say, PNGs produced by matplotlib. Give it a try. There are some examples in the latter part. It would be good to get some feedback. |
That is really amazing... I would like to get to that at some point (going to PyData this weekend but maybe next weekend or something). That would be so useful at work. Anyways, that's off topic. |
ENH: Better handling of MultiIndex with Excel
thanks for this! |
@jmcnamara just tested this out, I get hierarchical rows from this example, but a phantom nan - can you take a look at some point? this code generates the screenshot below: In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": range(10)}, index=[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]])
In [3]: df.to_excel('output.xlsx')
In [7]: df
Out[7]:
A
a 1 0
2 1
3 2
4 3
b 1 4
2 5
3 6
c 1 7
2 8
3 9 When I read it back in, I get this: In [5]: pd.read_excel('output.xlsx', 'Sheet1')
Out[5]:
A
a 1 0
NaN 2 1
3 2
4 3
b 1 4
NaN 2 5
3 6
c 1 7
NaN 2 8
3 9 |
In [8]: _5.index
Out[8]:
MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3, 4]],
labels=[[0, -1, -1, -1, 1, -1, -1, 2, -1, -1], [0, 1, 2, 3, 0, 1, 2, 0, 1, 2]]) And that is specifically being read in as 'a', followed by 3 nans, then 'b', 2 nans, then 'c', and 2 nans. |
To be clear, the MI should be: MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3, 4]],
labels=[[0, 0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 3, 0, 1, 2, 0, 1, 2]]) |
Hmm, that is a little odd. I seeing it on one test system but not another. I'll look into it. |
I'm guessing it's a Mac excel thing? |
My bad. I was on a branch that didn't have the MI patch merged. So unfortunately, reading HRs in the new merged format is broken. It is fixable but it will need some work. I'll address it in the 0.14 timeframe. |
Honestly, that's fine, it wasn't working before and MIs weren't |
If you get to a fix earlier, feel free to submit it...bugfixes are good. On Wed, Nov 6, 2013 at 7:23 PM, Jeffrey Tratner
|
I put the following in the release note. diff --git a/doc/source/release.rst b/doc/source/release.rst
index 4b33c20..77d78b2 100644
--- a/doc/source/release.rst
+++ b/doc/source/release.rst
@@ -209,6 +209,11 @@ Improvements to existing features
by color as expected.
- ``read_excel()`` now tries to convert integral floats (like ``1.0``) to int
by default. (:issue:`5394`)
+ - Excel writers now have a default option ``merge_cells`` in ``to_excel()``
+ to merge cells in MultiIndex and Hierarchical Rows. Note: using this
+ option it is no longer possible to round trip Excel files with merged
+ MultiIndex and Hierarchical Rows. Set the ``merge_cells`` to ``False`` to
+ restore the previous behaviour. (:issue:`5254`) Hopefully, that should do for now until I fix MI parsing or document the issue in the new doc section on using Excel until it is fixed. |
Allow optional formatting of MultiIndex and Hierarchical Rows
as merged cells. closes #5254.
Some notes on this PR:
ci/requirements*.txt
to pick up a fix in that module that makes working with charts from pandas easier.release.rst
.format.py
to format MultiIndex and Hierarchical Rows as merged cells in Excel. The old code path/option is still available. The new formatting must be explicitly invoked via themerge_cells
option into_excel
:merge_cells
option can be renamed if required. During development I also usedmulti_index
andmerge_range
.frame.py
to reflect the new option.excel.py
.test_excel.py
test cases so that they could be used to test the existingdot notation
or the newmerge_cells
options for MultiIndex and Hierarchical Rows handling.test_excel.py
.