
BUG: read_excel with openpyxl results in empty data frame #39001

Closed · luciodaou opened this issue Jan 6, 2021 · 25 comments · Fixed by #39486

Labels: Bug · IO Excel (read_excel, to_excel) · Regression (functionality that used to work in a prior pandas version)
Milestone: 1.2.2

Comments

luciodaou commented Jan 6, 2021

  • [ X ] I have checked that this issue has not already been reported.

  • [ X ] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Old behavior with xlrd 1.2.0 (last version with XLSX support)

>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='xlrd')
<stdin>:1: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
>>> df.shape
(843, 11)

The file is read perfectly and the DataFrame is fine, with all 11 columns.

Problem description

When using openpyxl as the engine for read_excel:

>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='openpyxl')
>>> print(df)
Empty DataFrame
Columns: [Column1, Column2, Column3]
Index: []
>>> df.shape
(0, 3)

Only the first 3 column headers are read, and nothing else.

However, if the file is opened directly with openpyxl, it works fine. As a workaround, I use openpyxl to open the file and re-save it under a temporary name so that openpyxl works as the engine in pandas:

>>> from openpyxl import load_workbook
>>> wb = load_workbook('infile.xlsx')
>>> wb.save(filename='temp.xlsx')
>>> df = pd.read_excel('temp.xlsx', engine='openpyxl')
>>> df.shape
(843, 11)

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.128-microsoft-standard
Version : #1 SMP Tue Jun 23 12:58:10 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.3.1
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : None

luciodaou added the Bug and Needs Triage labels Jan 6, 2021
phofl (Member) commented Jan 6, 2021

Hi, thanks for your report. Could you provide your file?

phofl added the Needs Info label and removed the Needs Triage label Jan 6, 2021
xmatthias commented Jan 11, 2021

I see exactly the same thing with one file, while another file works just fine.

Unfortunately, in my case I'm unable to share the broken file, as it contains proprietary information.

The file was generated by an automated export system. Excel (and pandas with xlrd==1.2.0) opens the file without error; however, re-saving the file via Excel slightly reduces the file size and fixes the problem with openpyxl.

However, opening and re-saving the file cannot be the correct solution, as this is supposed to be an automated system without me (or Excel) in the loop.

EDIT: I've found another file I can share where the same thing happens.
xxx_test.xlsx

asishm (Contributor) commented Jan 11, 2021

This has to do with how openpyxl implements its parser.

pandas uses openpyxl.load_workbook with read_only=True. Per https://stackoverflow.com/questions/38285364/python-openpyxl-read-only-mode-returns-a-different-row-count

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions.

Some applications set this incorrectly.

Changing to read_only=False would probably fix this, but would likely be a performance hit:

In [76]: def open_openpyxl(path, **kw):
    ...:     import openpyxl
    ...:     default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
    ...:     for k,v in default_kw.items():
    ...:         if k not in kw:
    ...:             kw[k] = v
    ...:     wb = openpyxl.load_workbook(path, **kw)
    ...:     sheet = wb.worksheets[0]
    ...:     print(sheet.calculate_dimension())
    ...:     from pandas.io.excel._openpyxl import OpenpyxlReader
    ...:     convert_cell = OpenpyxlReader(path)._convert_cell
    ...:     data = []
    ...:     for row in sheet.rows:
    ...:          data.append([convert_cell(cell, False) for cell in row])
    ...:     return data
    ...:

In [77]: open_openpyxl(path)
A1:A1
Out[77]: [[' ']]

In [78]: b = open_openpyxl(path, read_only=False)
A1:I169

In [79]: len(b)
Out[79]: 169

luciodaou (Author) commented Jan 13, 2021

Hi, thanks for your report. Could you provide your file?

I'll try to generate a file without sensitive information, but I'm not sure it's going to work.

The file was generated by an automated export system. Excel (and pandas with xlrd==1.2.0) opens the file without error; however, re-saving the file via Excel slightly reduces the file size and fixes the problem with openpyxl.

@xmatthias The opposite happens to me: the file size increases on opening and saving with Excel. My file is generated by a closed-source system backed by some sort of database, as it allows exporting to lots of file formats (CSV, table in Word, table in PDF, XLSX, PNG, among others).

To the devs: pythonexcel.org lists some alternatives to xlrd, like pylightxl, which is actively maintained. I haven't tested it yet, as I prioritized openpyxl until I hit this bug, but pylightxl looks promising for the simple stuff. Would it be possible to add it as a supported engine in pandas?

xmatthias commented:

@luciodaou It seems to depend on the original size; with the sample file above you're right, the file size increases for me too.

Initially, I hit this problem with another, bigger file, which shrank when re-saved with Excel.

As @asishm pointed out, maybe there's a way to fix this within openpyxl, as this looks like a bug on their side to me; but with the deprecation/removal of xlrd support, it surfaces as a pandas problem.

asishm (Contributor) commented Jan 13, 2021

@luciodaou could you try the snippet I posted above and check whether you see a difference for your file?

To clarify - I don't necessarily think what openpyxl is doing is wrong. openpyxl's read-only mode relies on whatever generates the file to provide correct sheet info. The dimension returned for the file @xmatthias shared is A1 (if you unzip the Excel file and parse the XML for that worksheet, you get <dimension ref="A1"/>). I created a random file via Excel and I can see that it saves the dimension as A1:B10. The reason re-saving with openpyxl works is probably that it correctly re-computes the dimensions of the sheet and saves them back.
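
For anyone who wants to check this for their own file, here is a small sketch that reads the declared dimension straight out of the archive. It assumes the first worksheet lives at the conventional xl/worksheets/sheet1.xml path, and the filename is illustrative:

import re
import zipfile

# An .xlsx file is just a zip archive; read the raw worksheet XML.
with zipfile.ZipFile("infile.xlsx") as zf:
    xml = zf.read("xl/worksheets/sheet1.xml").decode("utf-8")

# The <dimension> element holds the declared used range, e.g. <dimension ref="A1:K1239"/>.
match = re.search(r'<dimension ref="([^"]*)"', xml)
print(match.group(1) if match else "no <dimension> element found")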

Currently pandas uses read_only=True by default and doesn't provide an interface for the user to change the parameters being used. A good option here might be to add a parameter to the pandas read_excel API, something like engine_params, that lets the user override the defaults pandas uses.

luciodaou (Author) commented:

I tried to open the file with pylightxl and still got an error; here's the output:

  File "/home/lucio/.local/lib/python3.8/site-packages/pylightxl/pylightxl.py", line 136, in readxl
    data = readxl_scrape(fn, fn_ws, sharedString)
  File "/home/lucio/.local/lib/python3.8/site-packages/pylightxl/pylightxl.py", line 338, in readxl_scrape
    cell_val = sharedString[int(cell_val)]
ValueError: invalid literal for int() with base 10: ''

@asishm I'll test the snippet on a test script.

luciodaou (Author) commented Jan 14, 2021

@asishm you're right, that's the exact issue. See output below.

Stack Overflow test:

>>> load_workbook('infile.xlsx', read_only = True ).active.max_row
1
>>> load_workbook('infile.xlsx', read_only = False ).active.max_row
1239

Snippet Results:

>>> infile = 'infile.xlsx'
>>> def open_openpyxl(path, **kw):
...     import openpyxl
...     default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
...     for k,v in default_kw.items():
...         if k not in kw:
...             kw[k] = v
...     wb = openpyxl.load_workbook(path, **kw)
...     sheet = wb.worksheets[0]
...     print(sheet.calculate_dimension())
...     from pandas.io.excel._openpyxl import OpenpyxlReader
...     convert_cell = OpenpyxlReader(path)._convert_cell
...     data = []
...     for row in sheet.rows:
...          data.append([convert_cell(cell, False) for cell in row])
...     return data
...
>>> open_openpyxl(infile)
A1:C1
[['Column1', 'Column2', 'Column3']]
>>> b = open_openpyxl(infile, read_only=False)
A1:K1239
>>> len(b)
1239
>>>

My considerations on this:

  • Is there a way to pass the openpyxl read_only parameter through the pd.read_excel arguments? If not, could this be implemented in pandas as an additional pd.read_excel parameter, something like openpyxl_readonly=<boolean>?
  • This would be the easiest way to make existing scripts easy to fix: just add these two parameters to the line.
  • IMHO, some information about this issue should be added to the pd.read_excel documentation page as a warning until a solution is found.
  • The Stack Overflow answer suggests using "ws.max_row = ws.max_column = None", but that didn't work for me:
>>> ws = load_workbook('infile.xlsx', read_only = True ).active.max_row
>>> ws.max_row = ws.max_column = None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute 'max_row'
>>>

xmatthias commented:

@luciodaou
The problem with your script is the first line, which assigns max_row (an int) to ws.
Maybe try removing the .max_row before running it.
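
For reference, a corrected version of that snippet might look like this (untested here; on openpyxl 3.x, reset_dimensions() is the supported way to clear the cached bounds of a read-only worksheet):

>>> from openpyxl import load_workbook
>>> ws = load_workbook('infile.xlsx', read_only=True).active   # note: no .max_row here
>>> ws.reset_dimensions()      # clears the cached max_row / max_column
>>> sum(1 for _ in ws.rows)    # the row count is now discovered from the data itself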

WolfgangFellger commented:

Turning read-only mode off has other implications though, in particular higher memory consumption.

According to openpyxl's docs, the recommended solution is to call sheet.reset_dimensions(), i.e. at the beginning of OpenpyxlReader.get_sheet_data():

if sheet.calculate_dimension() == "A1:A1":
    sheet.reset_dimensions()

I monkeypatched this for our application and it seems to work fine.
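
A minimal sketch of what such a monkeypatch could look like, assuming the pandas 1.2 internals (OpenpyxlReader.get_sheet_data is private and its signature may change between versions, so treat this as an illustration rather than a supported API):

from pandas.io.excel._openpyxl import OpenpyxlReader

_original_get_sheet_data = OpenpyxlReader.get_sheet_data

def _patched_get_sheet_data(self, sheet, *args, **kwargs):
    # If the writer left the declared dimensions at the default "A1:A1",
    # drop them so openpyxl rediscovers the real extent of the sheet.
    if hasattr(sheet, "reset_dimensions") and sheet.calculate_dimension() == "A1:A1":
        sheet.reset_dimensions()
    return _original_get_sheet_data(self, sheet, *args, **kwargs)

OpenpyxlReader.get_sheet_data = _patched_get_sheet_data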

simonjayhawkins (Member) commented:

I monkeypatched this for our application and it seems to work fine.

cc @WillAyd

simonjayhawkins added the IO Excel and Regression labels and removed the Needs Info label Jan 19, 2021
simonjayhawkins added this to the 1.2.1 milestone Jan 19, 2021
WillAyd (Member) commented Jan 19, 2021

Thanks to everyone who has commented on this issue - great insights.

* Is there a way to pass the openpyxl read_only parameter through the pd.read_excel arguments? If not, could this be implemented in pandas as an additional pd.read_excel parameter, something like openpyxl_readonly=<boolean>?

This seems reasonable, and I think there is already an issue for it if you search. I think we should have a parameter that can be used to pass engine-specific arguments through.

WillAyd (Member) commented Jan 19, 2021

Hmm, so this is the PR I was thinking of: #26465. I guess that is slightly different and a little outdated. I think a keyword like engine_args, which accepts a dict of arguments, would be fine to add.
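
Purely for illustration, the call shape being proposed might look like the following; engine_args is a hypothetical keyword here and does not exist in pandas 1.2.0:

import pandas as pd

# Hypothetical: engine_args would be forwarded to openpyxl.load_workbook.
df = pd.read_excel(
    "infile.xlsx",
    engine="openpyxl",
    engine_args={"read_only": False},
)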

xmatthias commented Jan 19, 2021

While providing engine_args might be a welcome addition in general (for other use cases), it only moves the problem onto the user, who then has to debug and figure out the correct arguments.

A solution like the one proposed above (#39001 (comment)) would fix this issue directly; otherwise, I think pandas should temporarily revert to the previous default for Excel file loading until a "proper" fix is found.

phofl (Member) commented Jan 19, 2021

Reverting is not really a good option. See #38424

xmatthias commented:

Well, it's always possible to pin xlrd to the last working version until the regression in pandas is fixed or a solution to this is found, so I don't really see that as a blocking argument.

I also downgraded pandas (and xlrd) again after running into this issue.

Honestly, I'd also rather see a proper fix, as otherwise this will keep coming up again and again - but I don't think a fix that moves the responsibility for figuring out the correct parameters onto the user is going to get much love from the community.

WillAyd (Member) commented Jan 19, 2021

How expensive is the reset_dimensions call? At the end of the day, the real issue is with whatever application is generating the file, as it produces incorrect metadata that openpyxl relies on. If it's an expensive call that negatively impacts 99% of use cases, I don't think it's worth adding, but if it's trivial, that could be a compromise.

phofl (Member) commented Jan 19, 2021

The maintainer of xlrd explicitly warns about security vulnerabilities in every version of xlrd; that said, you can always use engine="xlrd".

xmatthias commented Jan 19, 2021

Based on the code above, I don't think this will impact 99% of use cases; it will rather impact maybe 0.5% of them (where people are intentionally reading a sheet with only one filled cell), as it first checks whether the reset_dimensions call is even necessary.

In cases where "A1:A1" is correct, there will be an unnecessary call for sure - but it should be quick, as the sheet is very small in that case anyway.

Therefore, reset_dimensions() will only be called when it really seems necessary - which I wouldn't see as impacting 99% of use cases, unless we assume 99% of files contain wrong metadata, in which case a fix is needed anyway.

if sheet.calculate_dimension() == "A1:A1":
    sheet.reset_dimensions()

source: #39001 (comment)

Obviously, fixing the "writing" application would be ideal, but it's most likely (at least in my case) some reporting system that happens to write Excel files (accidentally or on purpose).

WillAyd (Member) commented Jan 19, 2021 via email

asishm (Contributor) commented Jan 19, 2021

Per #39001 (comment)

if sheet.calculate_dimension() == "A1:A1":
    sheet.reset_dimensions()

This check would fail for @luciodaou's file (they get a dimension of A1:C1). Even if the check were relaxed to only require that both row numbers be 1, there may still be cases where sheet.calculate_dimension() returns a range like A1:B2 that is "visually inaccurate".

Not sure if there's a good way around this.
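
One slightly more lenient variant (still just a heuristic, and still defeated by a wrong-but-plausible declared range such as A1:B2) would be to reset whenever the declared range ends on row 1; as in the earlier snippets, sheet is the openpyxl worksheet:

from openpyxl.utils.cell import range_boundaries

# range_boundaries("A1:C1") -> (min_col, min_row, max_col, max_row)
_, _, _, max_row = range_boundaries(sheet.calculate_dimension())
if max_row == 1 and hasattr(sheet, "reset_dimensions"):
    sheet.reset_dimensions()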

luciodaou (Author) commented:

@asishm @WolfgangFellger Could you give an example script of how to do this check with my file? I've been assigned different tasks for this second half of January, so I have much less time to code properly.

Thanks to all for the attention to this matter. I fully agree that this is not specifically a pandas issue, but many of us rely on faulty closed-source software - in my case it's even harder to ask for changes, as the vendor may charge our company for them.

jorisvandenbossche modified the milestones: 1.2.1 → 1.2.2 Jan 20, 2021
WolfgangFellger commented:

Hmm, sorry, I hadn't realized yours was actually set to A1:C1. That does indeed complicate matters :-/ (also, you really need to post that file now ;)

With both @xmatthias's sample and ours (the files we received were produced by Apache POI, by the way), the dimension element was set to a plain <dimension ref="A1"/>, i.e. the writer just didn't bother to compute it. I'm attaching a (redacted) example.

The only options I can think of, then, are either to always call reset_dimensions() (I wager this is still faster than switching read-only mode off) or to introduce a switch to force the call.

apache-poi-sample.xlsx

rhshadrach (Member) commented:

All benchmarks below are on an Excel file created via pd.DataFrame(np.random.rand(10000, 10)).to_excel("test.xlsx"), looped for 20 iterations. The code for each iteration is pd.read_excel('test.xlsx', engine='openpyxl'), where I am modifying the pandas internals to make sure there aren't any gotchas.

Using read_only=False takes 24.532 seconds, as opposed to 19.897 seconds with read_only=True. Calling reset_dimensions is just the line:

self._max_row = self._max_column = None

However, by resetting the dimensions alone, openpyxl will no longer pad the rows when reading, and this results in the behavior that is the cause of #38956. To get correct results, one must also call calculate_dimension, which unfortunately iterates in Python space. It therefore seems to me that the approach taken in #39486, along with calling reset_dimensions, is the best one. This takes 20.183 seconds.
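
For reference, a rough way to reproduce the flavor of this benchmark from user code; it times openpyxl directly instead of patching pandas internals as described above, so the absolute numbers will differ:

import time
import numpy as np
import pandas as pd
from openpyxl import load_workbook

# Same test file as in the benchmark above.
pd.DataFrame(np.random.rand(10000, 10)).to_excel("test.xlsx")

for read_only in (True, False):
    start = time.perf_counter()
    for _ in range(20):
        wb = load_workbook("test.xlsx", read_only=read_only, data_only=True)
        rows = [[cell.value for cell in row] for row in wb.worksheets[0].rows]
        wb.close()
    print(f"read_only={read_only}: {time.perf_counter() - start:.2f}s")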

ElGarno commented Jun 29, 2022

Sometimes there is also a "hidden sheet" that results from a bad export. In that case, use the sheet_name parameter to select your sheet, or use sheet_name=None; you then get a dict containing the empty DataFrame of the hidden sheet alongside the other data - see the sketch below.
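
As a quick sketch of that approach (file name illustrative; sheet_name=None returns a dict of DataFrames keyed by sheet name):

import pandas as pd

sheets = pd.read_excel("infile.xlsx", engine="openpyxl", sheet_name=None)
non_empty = {name: df for name, df in sheets.items() if not df.empty}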
