ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679
Comments
http://pandas.pydata.org/pandas-docs/dev/io.html#reading-columns-with-a-multiindex (but for csv), and this might/probably needs special handling for excel |
only special handling would be converting merged cells into repeated entries like csv, so this is relatively minor. I.e.
just needs to change to something like [['bar', 'bar', 'baz', 'baz', 'baz'], ['A', 'B', 'C', 'D', 'E']] under the hood |
so, really, a function that takes in merged cell and splits it into individual cells all with the same value would be sufficient to take advantage of csv's existing behavior. |
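A minimal sketch of the function described above, in plain Python. It assumes (this is an assumption, not something confirmed in the thread) that a merged header cell arrives from the reader as its value followed by empty cells for the rest of the merged range. The helper name `fill_merged_header` is hypothetical and not part of pandas.

```python
def fill_merged_header(row, empty=("", None)):
    """Repeat each non-empty value across the empty cells that follow it.

    Assumption: a merged cell shows up as one value followed by blanks.
    Leading blanks (nothing to repeat yet) are returned as None.
    """
    filled = []
    last = None
    for cell in row:
        if cell in empty:
            # Continuation of a merged cell: reuse the last seen value.
            filled.append(last)
        else:
            last = cell
            filled.append(cell)
    return filled


# Two header rows as they might come out of a sheet with merged cells.
header_rows = [
    ["bar", "", "baz", "", ""],
    ["A", "B", "C", "D", "E"],
]
arrays = [fill_merged_header(row) for row in header_rows]
```

Here `arrays` has the repeated-entry shape quoted earlier in the thread, which is what the csv-style MultiIndex handling expects.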
@jtratner I think that is right, in your example related is the reverse (in |
@cancan101 interested in implementing this? just a minor modification of your |
I can take a look at this. I am equally interested in solving this for HTML files, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI |
Yeah, that's basically the same thing, you just want to end up with the following arrays.
>>> data = [
['Three months ended April 30','Three months ended April 30',
'Six months ended April 30', 'Six months ended April 30'],
['2013', '2012', '2013', '2012']
]
>>> MultiIndex.from_arrays(data)
MultiIndex
[(u'Three months ended April 30', u'2013'), (u'Three months ended April 30', u'2012'), (u'Six months ended April 30', u'2013'), (u'Six months ended April 30', u'2012')]
So if you have something like:
You want to convert that into 2 cells with text 'Span2' |
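For the HTML case, the conversion could be sketched roughly as follows, using only the standard library's `html.parser`. The class `HeaderCellExpander` and the function `expand_colspans` are hypothetical names for illustration, not pandas APIs, and the sample markup is an assumed simplification of a two-row header. Only `colspan` is handled; `rowspan` is ignored.

```python
from html.parser import HTMLParser


class HeaderCellExpander(HTMLParser):
    """Collect table rows, repeating each cell's text `colspan` times."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._span = None  # colspan of the cell being read, None outside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:
                # Tolerate a cell that appears without an enclosing <tr>.
                self.rows.append([])
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._span is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span is not None:
            text = "".join(self._text).strip()
            # One spanning cell becomes `colspan` cells with the same text.
            self.rows[-1].extend([text] * self._span)
            self._span = None


def expand_colspans(html):
    parser = HeaderCellExpander()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """
<table>
<tr><th colspan="2">Three months ended April 30</th>
    <th colspan="2">Six months ended April 30</th></tr>
<tr><th>2013</th><th>2012</th><th>2013</th><th>2012</th></tr>
</table>
"""
data = expand_colspans(sample)
```

The resulting `data` is a list of two equal-length lists, which is the form that `MultiIndex.from_arrays` accepts.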
Exactly. I am going to create a similar issue to this one for HTML. FWIW, it would be great to merge the IO backends so that functionality like this can be shared. See: #4682. Shoot, closing the other issue. |
@cancan101 well, I believe they mostly are, they just pass to a TextReader which does the majority of the work. (so, for example, the ExcelFile reader has to do some magic to convert all the values to a list of lists that can be passed to text reader). I think you could do both of these in the same issue and then refactor the multiindex creation methods from read_csv out for something they can all use - check out code around here for how it works under the hood (I think): https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L703 |
I believe that I looked at the parsers backends and that some, but not all, of the parsers use |
Now comes the other reason for improving the ExcelParser and/or the HTML parser: parsing hierarchical row indexes. A good example of this would be (different link from above): http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSE. This table has 4 major sections (they can be identified as lines with no other data):
Within the first of those sections are a number of line items and a section total. A good feature of the parser would be to extract this structure from the table. Obviously this is non-trivial. |
@cancan101 hierarchical row indexes are almost the same thing as hierarchical columns - the goal is to turn the input into something similar to what csv takes for hierarchical indexes. You just need to convert rowspan/horizontally-merged cells into repeated horizontal cells. |
@jtratner That is not actually the issue here. In this case there is no explicit indication in the table itself that the rows should be grouped in a certain manner. For example there is no rowspan / merged cell. In this case the user would need to supply some additional information: for example to treat the heading in an empty row as a new level in the hierarchy, |
@cancan101 that's probably outside the scope of what the pandas parser could do [if it's that complicated, user probably should handle the cleaning after that] - I'd suggest going for a very simple implementation to start out with (i.e., take colspan or merge length --> convert it to individual cells --> use read_csv's existing functionality to get a multiindex going) then after you have that working, you can consider what else makes sense to add. |
@jtratner The above is unrelated to the hierarchical columns. That would be another issue. The simplest option might be to split the table on those empty lines. For example, have an option for the above link to split out 3 tables rather than one. Alternatively, the user could attempt to detect this themselves, given that such a row will have a lot of NaNs for values. |
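The "detect it yourself" approach might look something like the sketch below, done on plain row lists after parsing. This is not a pandas feature; `split_on_heading_rows` and `is_blank` are hypothetical helpers, the row labels and numbers are made up for illustration, and the rule (a row whose data cells are all blank starts a new section) is the assumption stated in the comment above.

```python
import math


def is_blank(value):
    """True for None, empty string, or a float NaN."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def split_on_heading_rows(rows):
    """Group rows into (heading, rows) sections.

    Each row is [label, value1, value2, ...]. A row whose values are all
    blank is treated as a section heading. Rows before the first heading
    go into a section with heading None.
    """
    sections = []
    for label, *values in rows:
        if all(is_blank(v) for v in values):
            sections.append((label, []))
        else:
            if not sections:
                sections.append((None, []))
            sections[-1][1].append([label, *values])
    return sections


nan = float("nan")
rows = [
    ["Revenue", nan, nan],      # heading row: no data in any column
    ["Products", 10.0, 9.0],
    ["Services", 5.0, 4.0],
    ["Costs", nan, nan],        # heading row
    ["Products", 7.0, 6.0],
]
sections = split_on_heading_rows(rows)
```

Each section could then be turned into its own table, or the heading could be used as an outer level of a row MultiIndex.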
@cancan101 can you try passing a list to header and see if it converts to mi? it may work already. |
@cancan101
If there's a more specific issue that you've come across, please create a new one just for that. I (and I think the other devs) would prefer to keep the IO backends' issues separate from each other. Here are the issues:
(1 should be done before 3) I'll create 1, 3 is #4683, and I think you should create 2 since you've got the clearest idea of what the issue is 😄. |
@cpcloud That page is an interesting example. Why is this not the first row of the header:
|
@cancan101 That is the first row of the header if you don't pass in the
|
@cpcloud I am not sure if this feature makes sense: auto detecting the number of header rows based upon the table's use of |
@cancan101 Right. There's a couple of things you should know, just so that you understand why/how the header parsing is happening. There's no dependence on header, body, footer = data Since I have control over where I place the You're right that not all tables use the So, it's not that the number of header rows is detected. One thing that there isn't a test case for is multiple header rows, which is partially what some of these issues are about. |
@cpcloud I also believe that some tables use |
@cancan101 Yes, that case is covered. |
This came up here: http://stackoverflow.com/q/23703638/1240268 (with a strange html table example) |
related #4468
Add keyword argument to ExcelFile parser to take an integer / list for the rows to interpret as header rows. If more than one, interpret as hierarchical columns / MultiIndex.
Presumably this would also allow you to round-trip a DataFrame with hierarchical columns.
Basically, if you have something spanning two columns, just convert it to two cells, each holding the data from the original cell, so you end up with exactly what the csv reader needs.