
BUG: read_excel with openpyxl results in empty data frame #39001

Closed · luciodaou opened this issue Jan 6, 2021 · 25 comments · Fixed by #39486

Labels: Bug · IO Excel (read_excel, to_excel) · Regression (functionality that used to work in a prior pandas version)
Milestone: 1.2.2

Comments

luciodaou commented Jan 6, 2021

  • [ X ] I have checked that this issue has not already been reported.

  • [ X ] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Old behavior with xlrd 1.2.0 (last version with XLSX support)

>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='xlrd')
<stdin>:1: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
>>> df.shape
(843, 11)

The file is read perfectly and the DataFrame is fine, with all 11 columns.

Problem description

When using openpyxl as the engine for read_excel:

>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='openpyxl')
>>> print(df)
Empty DataFrame
Columns: [Column1, Column2, Column3]
Index: []
>>> df.shape
(0, 3)

Only the first 3 column headers are read, and nothing else.

However, if the file is opened directly with openpyxl, it works fine. As a workaround, I use openpyxl to open the file and re-save it under a temporary name so that openpyxl works as the engine in pandas:

>>> from openpyxl import load_workbook
>>> wb = load_workbook('infile.xlsx')
>>> wb.save(filename='temp.xlsx')
>>> df = pd.read_excel('temp.xlsx', engine='openpyxl')
>>> df.shape
(843, 11)

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.128-microsoft-standard
Version : #1 SMP Tue Jun 23 12:58:10 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.3.1
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : None

luciodaou added the Bug and Needs Triage labels Jan 6, 2021
phofl (Member) commented Jan 6, 2021

Hi, thanks for your report. Could you provide your file?

phofl added the Needs Info label and removed the Needs Triage label Jan 6, 2021
xmatthias commented Jan 11, 2021

I see exactly the same thing with one file, while another file works just fine.

Unfortunately, in my case I'm unable to share the broken file, as it contains proprietary information.

The file was generated by an automated export system. Excel (and pandas with xlrd==1.2.0) opens the file without error; however, re-saving the file via Excel slightly reduces the file size and fixes the problem with openpyxl.

However, opening and re-saving the file cannot be the correct solution, as this is supposed to be an automated system without me (or Excel) in the loop.

EDIT: I've found another file I can share where the same thing happens.
xxx_test.xlsx

asishm (Contributor) commented Jan 11, 2021

This has to do with how openpyxl implements its parser.

pandas uses openpyxl.load_workbook with read_only=True. Per https://stackoverflow.com/questions/38285364/python-openpyxl-read-only-mode-returns-a-different-row-count

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions.

Some applications set this incorrectly.

Changing to read_only=False would probably fix this, but would likely be a performance hit:

In [76]: def open_openpyxl(path, **kw):
    ...:     import openpyxl
    ...:     default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
    ...:     for k,v in default_kw.items():
    ...:         if k not in kw:
    ...:             kw[k] = v
    ...:     wb = openpyxl.load_workbook(path, **kw)
    ...:     sheet = wb.worksheets[0]
    ...:     print(sheet.calculate_dimension())
    ...:     from pandas.io.excel._openpyxl import OpenpyxlReader
    ...:     convert_cell = OpenpyxlReader(path)._convert_cell
    ...:     data = []
    ...:     for row in sheet.rows:
    ...:          data.append([convert_cell(cell, False) for cell in row])
    ...:     return data
    ...:

In [77]: open_openpyxl(path)
A1:A1
Out[77]: [[' ']]

In [78]: b = open_openpyxl(path, read_only=False)
A1:I169

In [79]: len(b)
Out[79]: 169

luciodaou (Author) commented Jan 13, 2021

Hi, thanks for your report. Could you provide your file?

I'll try to generate a file without sensitive information, but I'm not sure it's going to work.

The file was generated by an automated export system. Excel (and pandas with xlrd==1.2.0) opens the file without error; however, re-saving the file via Excel slightly reduces the file size and fixes the problem with openpyxl.

@xmatthias The opposite happens to me: the file size increases on opening and saving with Excel. My file is generated by a closed-source system backed by some sort of database, as it allows exporting to lots of file formats (CSV, table in Word, table in PDF, XLSX, PNG, among others).

To the devs: pythonexcel.org lists some alternatives to xlrd, like pylightxl, which is actively maintained. I haven't tested it yet, as I prioritized openpyxl until I hit this bug, but pylightxl looks promising for the simple stuff. Would it be possible to add it as a supported engine in pandas?

xmatthias commented:

@luciodaou It seems to depend on the original size; with the sample file above you're right, the file size increases for me too.

Initially, I hit this problem with another, bigger file, which shrank when re-saved with Excel.

As @asishm pointed out, maybe there's a way to fix this within openpyxl, as this looks like a bug on their side to me; but with the deprecation/removal of xlrd support, it surfaces as a pandas problem.

asishm (Contributor) commented Jan 13, 2021

@luciodaou could you try the snippet I posted above and check whether you see a difference for your file?

To clarify - I don't necessarily think what openpyxl is doing is wrong. openpyxl's read-only mode relies on whatever generates the file to provide correct sheet info. The dimension returned for the file @xmatthias shared is A1 (if you unzip the Excel file and parse the XML for that worksheet, you get <dimension ref="A1"/>). I created a random file via Excel and I can see that it saves the dimension as A1:B10. The reason re-saving with openpyxl works is probably that it correctly re-computes the dimensions of the sheet and saves them back.
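
For anyone who wants to check this for their own file, here is a small sketch that reads the declared dimension straight out of the archive. It assumes the first worksheet lives at the conventional xl/worksheets/sheet1.xml path, and the filename is illustrative:

import re
import zipfile

# An .xlsx file is just a zip archive; read the raw worksheet XML.
with zipfile.ZipFile("infile.xlsx") as zf:
    xml = zf.read("xl/worksheets/sheet1.xml").decode("utf-8")

# The <dimension> element holds the declared used range, e.g. <dimension ref="A1:K1239"/>.
match = re.search(r'<dimension ref="([^"]*)"', xml)
print(match.group(1) if match else "no <dimension> element found")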

Currently pandas uses read_only=True by default and doesn't provide an interface for the user to change the parameters being used. A good option here might be to add a parameter to the pandas read_excel API, something like engine_params, that lets the user override the defaults pandas uses.

luciodaou (Author) commented:

I tried to open the file with pylightxl and still got an error; here's the output:

  File "/home/lucio/.local/lib/python3.8/site-packages/pylightxl/pylightxl.py", line 136, in readxl
    data = readxl_scrape(fn, fn_ws, sharedString)
  File "/home/lucio/.local/lib/python3.8/site-packages/pylightxl/pylightxl.py", line 338, in readxl_scrape
    cell_val = sharedString[int(cell_val)]
ValueError: invalid literal for int() with base 10: ''

@asishm I'll test the snippet on a test script.

luciodaou (Author) commented Jan 14, 2021

@asishm you're right, that's the exact issue. See output below.

Stack Overflow test:

>>> load_workbook('infile.xlsx', read_only = True ).active.max_row
1
>>> load_workbook('infile.xlsx', read_only = False ).active.max_row
1239

Snippet Results:

>>> infile = 'infile.xlsx'
>>> def open_openpyxl(path, **kw):
...     import openpyxl
...     default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
...     for k,v in default_kw.items():
...         if k not in kw:
...             kw[k] = v
...     wb = openpyxl.load_workbook(path, **kw)
...     sheet = wb.worksheets[0]
...     print(sheet.calculate_dimension())
...     from pandas.io.excel._openpyxl import OpenpyxlReader
...     convert_cell = OpenpyxlReader(path)._convert_cell
...     data = []
...     for row in sheet.rows:
...          data.append([convert_cell(cell, False) for cell in row])
...     return data
...
>>> open_openpyxl(infile)
A1:C1
[['Column1', 'Column2', 'Column3']]
>>> b = open_openpyxl(infile, read_only=False)
A1:K1239
>>> len(b)
1239
>>>

My considerations on this:

  • Is there a way to pass the openpyxl read_only parameter through the pd.read_excel arguments? If not, could this be implemented in pandas as an additional pd.read_excel parameter, something like openpyxl_readonly=<boolean>?
  • This would be the easiest way to make existing scripts easy to fix: just add these two parameters to the line.
  • IMHO, some information about this issue should be added to the pd.read_excel documentation page as a warning until a solution is found.
  • The Stack Overflow answer suggests using "ws.max_row = ws.max_column = None", but that didn't work for me:
>>> ws = load_workbook('infile.xlsx', read_only = True ).active.max_row
>>> ws.max_row = ws.max_column = None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute 'max_row'
>>>

xmatthias commented:

@luciodaou
The problem with your script is the first line, which assigns max_row (an int) to ws.
Maybe try removing the .max_row before running it.
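
For reference, a corrected version of that snippet might look like this (untested here; on openpyxl 3.x, reset_dimensions() is the supported way to clear the cached bounds of a read-only worksheet):

>>> from openpyxl import load_workbook
>>> ws = load_workbook('infile.xlsx', read_only=True).active   # note: no .max_row here
>>> ws.reset_dimensions()      # clears the cached max_row / max_column
>>> sum(1 for _ in ws.rows)    # the row count is now discovered from the data itself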

WolfgangFellger commented:

Turning read-only mode off has other implications though, in particular higher memory consumption.

According to openpyxl's docs, the recommended solution is to call sheet.reset_dimensions(), i.e. at the beginning of OpenpyxlReader.get_sheet_data():

if sheet.calculate_dimension() == "A1:A1":
    sheet.reset_dimensions()

I monkeypatched this for our application and it seems to work fine.
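
A minimal sketch of what such a monkeypatch could look like, assuming the pandas 1.2 internals (OpenpyxlReader.get_sheet_data is private and its signature may change between versions, so treat this as an illustration rather than a supported API):

from pandas.io.excel._openpyxl import OpenpyxlReader

_original_get_sheet_data = OpenpyxlReader.get_sheet_data

def _patched_get_sheet_data(self, sheet, *args, **kwargs):
    # If the writer left the declared dimensions at the default "A1:A1",
    # drop them so openpyxl rediscovers the real extent of the sheet.
    if hasattr(sheet, "reset_dimensions") and sheet.calculate_dimension() == "A1:A1":
        sheet.reset_dimensions()
    return _original_get_sheet_data(self, sheet, *args, **kwargs)

OpenpyxlReader.get_sheet_data = _patched_get_sheet_data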

simonjayhawkins (Member) commented:

I monkeypatched this for our application and it seems to work fine.

cc @WillAyd

simonjayhawkins added the IO Excel and Regression labels and removed the Needs Info label Jan 19, 2021
simonjayhawkins added this to the 1.2.1 milestone Jan 19, 2021
WillAyd (Member) commented Jan 19, 2021

Thanks to everyone who has commented on this issue - great insights.

* Is there a way to pass the openpyxl read_only parameter through the pd.read_excel arguments? If not, could this be implemented in pandas as an additional pd.read_excel parameter, something like openpyxl_readonly=<boolean>?

This seems reasonable, and I think there is already an issue for it if you search. I think we should have a parameter that can be used to pass engine-specific arguments through.

WillAyd (Member) commented Jan 19, 2021

Hmm, so this is the PR I was thinking of: #26465. I guess that is slightly different and a little outdated. I think a keyword like engine_args, which accepts a dict of arguments, would be fine to add.
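
Purely for illustration, the call shape being proposed might look like the following; engine_args is a hypothetical keyword here and does not exist in pandas 1.2.0:

import pandas as pd

# Hypothetical: engine_args would be forwarded to openpyxl.load_workbook.
df = pd.read_excel(
    "infile.xlsx",
    engine="openpyxl",
    engine_args={"read_only": False},
)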

xmatthias commented Jan 19, 2021

While providing engine_args might be a welcome addition in general (for other use cases), it only moves the problem onto the user, who then has to debug and figure out the correct arguments.

A solution like the one proposed above (#39001 (comment)) would fix this issue directly; otherwise, I think pandas should temporarily revert to the previous default for Excel file loading until a "proper" fix is found.

phofl (Member) commented Jan 19, 2021

Reverting is not really a good option. See #38424

xmatthias commented:

Well, it's always possible to pin xlrd to the last working version until the regression in pandas is fixed or a solution to this is found, so I don't really see that as a blocking argument.

I also downgraded pandas (and xlrd) again after running into this issue.

Honestly, I'd also rather see a proper fix, as otherwise this will keep coming up again and again - but I don't think a fix that moves the responsibility for figuring out the correct parameters onto the user is going to get much love from the community.

WillAyd (Member) commented Jan 19, 2021

How expensive is the reset_dimensions call? At the end of the day, the real issue is with whatever application is generating the file, as it produces incorrect metadata that openpyxl relies on. If it's an expensive call that negatively impacts 99% of use cases, I don't think it's worth adding, but if it's trivial, that could be a compromise.

phofl (Member) commented Jan 19, 2021

The maintainer of xlrd explicitly warns about security vulnerabilities in every version of xlrd; that said, you can always use engine="xlrd".

xmatthias commented Jan 19, 2021

Based on the code above, I don't think this will impact 99% of use cases; it will rather impact maybe 0.5% of them (where people are intentionally reading a sheet with only one filled cell), as it first checks whether the reset_dimensions call is even necessary.

In cases where "A1:A1" is correct, there will be an unnecessary call for sure - but it should be quick, as the sheet is very small in that case anyway.

Therefore, reset_dimensions() will only be called when it really seems necessary - which I wouldn't see as impacting 99% of use cases, unless we assume 99% of files contain wrong metadata, in which case a fix is needed anyway.

if sheet.calculate_dimension() == "A1:A1":
    sheet.reset_dimensions()

source: #39001 (comment)

Obviously, fixing the "writing" application would be ideal, but it's most likely (at least in my case) some reporting system that happens to write Excel files (accidentally or on purpose).

WillAyd (Member) commented Jan 19, 2021 via email

asishm (Contributor) commented Jan 19, 2021

Per #39001 (comment)

if sheet.calculate_dimension() == "A1:A1":
    sheet.reset_dimensions()

This check would fail for @luciodaou's file (they get a dimension of A1:C1). Even if the check were relaxed to only require that both row numbers be 1, there may still be cases where sheet.calculate_dimension() returns a range like A1:B2 that is "visually inaccurate".

Not sure if there's a good way around this.
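
One slightly more lenient variant (still just a heuristic, and still defeated by a wrong-but-plausible declared range such as A1:B2) would be to reset whenever the declared range ends on row 1; as in the earlier snippets, sheet is the openpyxl worksheet:

from openpyxl.utils.cell import range_boundaries

# range_boundaries("A1:C1") -> (min_col, min_row, max_col, max_row)
_, _, _, max_row = range_boundaries(sheet.calculate_dimension())
if max_row == 1 and hasattr(sheet, "reset_dimensions"):
    sheet.reset_dimensions()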

luciodaou (Author) commented:

@asishm @WolfgangFellger Could you give an example script of how to do this check with my file? I've been assigned different tasks for this second half of January, so I have much less time to code properly.

Thanks to all for the attention to this matter. I fully agree that this is not specifically a pandas issue, but many of us rely on faulty closed-source software - in my case it's even harder to ask for changes, as the vendor may charge our company for them.

jorisvandenbossche modified the milestones: 1.2.1 → 1.2.2 Jan 20, 2021
WolfgangFellger commented:

Hmm, sorry, I hadn't realized yours was actually set to A1:C1. That does indeed complicate matters :-/ (also, you really need to post that file now ;)

With both @xmatthias's sample and ours (the files we received were produced by Apache POI, by the way), the dimension element was set to a plain <dimension ref="A1"/>, i.e. the writer just didn't bother to compute it. I'm attaching a (redacted) example.

The only options I can think of, then, are either to always call reset_dimensions() (I wager this is still faster than switching read-only mode off) or to introduce a switch to force the call.

apache-poi-sample.xlsx

rhshadrach (Member) commented:

All benchmarks below are on an Excel file created via pd.DataFrame(np.random.rand(10000, 10)).to_excel("test.xlsx"), looped for 20 iterations. The code for each iteration is pd.read_excel('test.xlsx', engine='openpyxl'), where I am modifying the pandas internals to make sure there aren't any gotchas.

Using read_only=False takes 24.532 seconds, as opposed to 19.897 seconds with read_only=True. Calling reset_dimensions is just the line:

self._max_row = self._max_column = None

However, by resetting the dimensions alone, openpyxl will no longer pad the rows when reading, and this results in the behavior that is the cause of #38956. To get correct results, one must also call calculate_dimension, which unfortunately iterates in Python space. It therefore seems to me that the approach taken in #39486, along with calling reset_dimensions, is the best one. This takes 20.183 seconds.
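
For reference, a rough way to reproduce the flavor of this benchmark from user code; it times openpyxl directly instead of patching pandas internals as described above, so the absolute numbers will differ:

import time
import numpy as np
import pandas as pd
from openpyxl import load_workbook

# Same test file as in the benchmark above.
pd.DataFrame(np.random.rand(10000, 10)).to_excel("test.xlsx")

for read_only in (True, False):
    start = time.perf_counter()
    for _ in range(20):
        wb = load_workbook("test.xlsx", read_only=read_only, data_only=True)
        rows = [[cell.value for cell in row] for row in wb.worksheets[0].rows]
        wb.close()
    print(f"read_only={read_only}: {time.perf_counter() - start:.2f}s")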

ElGarno commented Jun 29, 2022

Sometimes there is also a "hidden sheet" that results from a bad export. In that case, use the sheet_name parameter to select your sheet, or use sheet_name=None; you then get a dict containing the empty DataFrame of the hidden sheet alongside the other data - see the sketch below.
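
As a quick sketch of that approach (file name illustrative; sheet_name=None returns a dict of DataFrames keyed by sheet name):

import pandas as pd

sheets = pd.read_excel("infile.xlsx", engine="openpyxl", sheet_name=None)
non_empty = {name: df for name, df in sheets.items() if not df.empty}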
