BUG: read_excel with openpyxl results in empty data frame #39001
Comments
Hi, thanks for your report. Could you provide your file? |
I have the exact same thing with one file, while another file works just fine. Unfortunately, in my case, I'm unable to share the broken file as it contains proprietary information. The file was generated by an automated export system - Excel (and pandas with xlrd==1.2.0) open the file without error - however, resaving the file via Excel slightly reduces the file size and fixes the problem with openpyxl. However, opening and resaving the file cannot be the correct solution, as it's supposed to be an automated system without me (or Excel) in the loop. EDIT: I've been able to find another file I am able to share where the same thing happens. |
this has to do with how pandas uses
Changing to
In [76]: def open_openpyxl(path, **kw):
    ...:     import openpyxl
    ...:     default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
    ...:     for k, v in default_kw.items():
    ...:         if k not in kw:
    ...:             kw[k] = v
    ...:     wb = openpyxl.load_workbook(path, **kw)
    ...:     sheet = wb.worksheets[0]
    ...:     print(sheet.calculate_dimension())
    ...:     from pandas.io.excel._openpyxl import OpenpyxlReader
    ...:     convert_cell = OpenpyxlReader(path)._convert_cell
    ...:     data = []
    ...:     for row in sheet.rows:
    ...:         data.append([convert_cell(cell, False) for cell in row])
    ...:     return data
    ...:

In [77]: open_openpyxl(path)
A1:A1
Out[77]: [[' ']]

In [78]: b = open_openpyxl(path, read_only=False)
A1:I169

In [79]: len(b)
Out[79]: 169 |
I'll try to generate a file without sensitive information, but I'm not sure it's gonna work.
@xmatthias the opposite happens to me: the file size increases after opening and saving with Excel. My file is generated by a closed source system using some sort of database, as it allows exporting to lots of file formats (csv, table in word, table in pdf, xlsx, png, among others). To devs: pythonexcel.org lists some alternatives to xlrd, like pylightxl, which is actively maintained. I still haven't tested it, as I prioritized using openpyxl until I hit this bug. But pylightxl looks promising for the simple stuff. Would it be possible to add it as a supported engine to Pandas? |
@luciodaou It seems to depend on the original size. With the sample file above you're right, the file size also increases for me. Initially, I hit this problem with another, bigger file, which reduced its size when resaving with Excel. As @asishm pointed out, maybe there's a way to fix this within openpyxl ... to me this looks like a bug there, but with the deprecation / removal of xlrd support, it comes up as a pandas problem. |
@luciodaou could you try the snippet I posted above and see if you see a difference for your file? To clarify - I don't necessarily think what Currently pandas uses |
I tried to open the file with pylightxl and still got some error, here's the output:
@asishm I'll test the snippet on a test script. |
@asishm you're right, that's the exact issue. See output below. Stackoverflow test:
Snippet Results:
My considerations about this:
|
@luciodaou |
Turning read-only mode off has other implications though, in particular higher memory consumption. According to openpyxl's docs, the recommended solution is to call sheet.reset_dimensions(). I.e. at the beginning of
if sheet.calculate_dimension() == "A1:A1":
    sheet.reset_dimensions()
I monkeypatched this for our application and it seems to work fine. |
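For readers who want to try the guard outside of pandas, here is a minimal sketch of the idea from the comment above. It is not the pandas implementation; `reset_if_single_cell` and `read_rows` are hypothetical helper names, and the sheet argument is only assumed to offer `calculate_dimension()` and `reset_dimensions()` the way an openpyxl read-only worksheet does.

```python
def reset_if_single_cell(sheet):
    """Reset stale dimension metadata when a sheet claims to span only A1.

    `sheet` is assumed to behave like an openpyxl read-only worksheet,
    i.e. to offer calculate_dimension() and reset_dimensions().
    Returns True if a reset was performed, False otherwise.
    """
    if sheet.calculate_dimension() == "A1:A1":
        sheet.reset_dimensions()
        return True
    return False


def read_rows(path):
    """Hypothetical helper: read the first sheet in read-only mode as lists of values."""
    import openpyxl  # imported lazily so the guard above is usable on its own

    wb = openpyxl.load_workbook(path, read_only=True, data_only=True, keep_links=False)
    try:
        sheet = wb.worksheets[0]
        reset_if_single_cell(sheet)
        # After a reset, rows may come back with differing lengths.
        return [list(row) for row in sheet.iter_rows(values_only=True)]
    finally:
        wb.close()
```

The guard only costs an extra call for sheets that really report A1:A1, which is the trade-off discussed further down the thread.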
cc @WillAyd |
Thanks to all on this issue that have commented - very great insights.
This seems reasonable and I think there is already an issue for it if you can search. I think we should have a parameter that can be used to pass through engine-specific arguments |
Hmm so this is the PR I was thinking of #26465 . I guess that is slightly different and a little outdated. I think a keyword like |
while providing A solution like proposed above ( #39001 (comment) ) would directly fix this issue, otherwise i think pandas should temporarily revert to the previous default for excel file loading until a "proper" fix for this is found. |
Reverting is not really a good option. See #38424 |
Well, it's always possible to version-pin xlrd to the latest working version until the regression in pandas is fixed / a solution to this is found, so I don't really see that as a blocking argument. I also downgraded pandas (and xlrd) again after running into this issue. Honestly, I'd also rather see a proper fix, as otherwise this will keep coming up again and again - but I don't think a fix that moves the responsibility to figure out the correct parameters to the user is going to get much love from the community. |
How expensive is the reset_dimensions call? At the end of the day the real issue is with whatever application is generating the file, as it produces incorrect metadata that openpyxl relies on. If it's an expensive call that negatively impacts 99% of use cases, I don't think it's worth adding, but if it's trivial that could be a compromise |
The maintainer of xlrd explicitly talks about security vulnerabilities in every version of xlrd, you can always use |
Based on the code above, I don't think this will impact 99% of the use cases, but will rather impact maybe 0.5% of them (where people are intentionally reading a sheet with only one filled cell), as it first checks whether the reset_dimensions call is even necessary. In cases where A1:A1 is correct, it'll have an unnecessary call for sure - but it should be quick in these cases, as the sheet is very small anyway.
if sheet.calculate_dimension() == "A1:A1":
    sheet.reset_dimensions()
Obviously, fixing the "writing" application would be ideal, but it's most likely (at least in my case) some reporting system that can (accidentally / on purpose) write excel files. |
If you can benchmark it would be helpful
… On Jan 19, 2021, at 2:25 PM, Matthias ***@***.***> wrote:
based on the code above, i don't think this will impact 99% of the usecases, but will rather impact maybe 0.5% of the usecases (where people are intentionally reading a sheet with 1 filled cell only), as it's first checking if the "reset_dimensions" is even necessary.
In cases where A1:A1 is correct, it'll have an unnecessary call for sure - but it should be quick in these cases, as the sheet is very small in this case anyway.
if sheet.calculate_dimension() == "A1:A1":
sheet.reset_dimensions()
|
Per #39001 (comment)
if sheet.calculate_dimension() == "A1:A1":
    sheet.reset_dimensions()
this check would fail for @luciodaou's file (they get a dimension of A1:C1). Say even if the check gets changed to both the numbers being 1 There may then be cases where Not sure if there's a good way around this. |
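To make the broadened check concrete, here is a rough sketch of "both the numbers being 1" as a string test on the reported dimension. `is_single_row_dimension` is a hypothetical name, and the parsing assumes plain A1-style references such as "A1:C1". As noted above, a sheet that genuinely has a single row would also match, so this only decides when a reset is attempted, not whether the metadata is actually wrong.

```python
import re

# Matches "A1" or "A1:C1"; absolute markers ("$") are stripped beforehand.
_DIMENSION_RE = re.compile(r"([A-Z]+)(\d+)(?::([A-Z]+)(\d+))?")


def is_single_row_dimension(dimension):
    """Return True if a dimension string covers row 1 only (e.g. 'A1:C1')."""
    match = _DIMENSION_RE.fullmatch(dimension.replace("$", "").upper())
    if match is None:
        raise ValueError(f"unrecognised dimension string: {dimension!r}")
    first_row = int(match.group(2))
    # A single-cell reference such as "A1" has no second corner.
    last_row = int(match.group(4)) if match.group(4) else first_row
    return first_row == 1 and last_row == 1
```

With this, both "A1:A1" and "A1:C1" would trigger a reset, while "A1:I169" would not.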
@asishm @WolfgangFellger Could you give an example script of how to do this check with my file? I've been assigned some different tasks in the second half of January, so I have much less time to properly code. Thanks to all for the attention on this matter. I fully agree that this is not a specific Pandas issue, but many of us rely on faulty closed source software - in my case it's even harder to ask for any changes, as they may be charged to our company by the maker. |
Hmm sorry, hadn't realized yours was actually set to A1:A3. That does indeed complicate matters :-/ (also you really need to post that now ;) With both @xmatthias sample and ours (the files we received were produced by Apache POI by the way), they had the dimension element set to plain The only ways I can think of then are either to always call |
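Since the discussion hinges on what the dimension element in the file actually says, it can help to look at it directly without going through openpyxl. The sketch below uses only the standard library and assumes the usual .xlsx layout, with worksheet parts stored under `xl/worksheets/` in the zip archive; `declared_dimensions` is a hypothetical name. It parses each sheet fully, so it is meant for inspection, not for large files.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by worksheet parts.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def declared_dimensions(xlsx):
    """Map each worksheet part in an .xlsx archive to its declared dimension ref.

    `xlsx` may be a path or a binary file-like object. Worksheet parts
    without a <dimension> element map to None.
    """
    result = {}
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            # Assumption: worksheet parts live directly under xl/worksheets/.
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            dimension = root.find(f"{{{MAIN_NS}}}dimension")
            result[name] = dimension.get("ref") if dimension is not None else None
    return result
```

Comparing the output before and after resaving the file in Excel should show whether the exporting application wrote a stale range.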
All benchmarks below are on an excel file created via Using
However, by resetting the dimension alone openpyxl will no longer pad the rows when reading, and this results in the behavior that is the cause of #38956. To get correct results, one must also call |
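To illustrate the padding problem mentioned above: once rows are no longer padded by the reader, they can come back with different lengths, and a caller has to even them out before building a table. The following is a minimal sketch of doing that on the caller's side; `pad_rows` is a hypothetical helper and this is not necessarily how pandas or openpyxl handle it.

```python
def pad_rows(rows, fill=None):
    """Pad ragged rows on the right so every row is as wide as the widest one."""
    rows = [list(row) for row in rows]
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]
```

For example, `pad_rows([[1, 2, 3], [4]])` gives `[[1, 2, 3], [4, None, None]]`.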
Sometimes there is also a "hidden sheet" which results from bad exports. You should use the sheet_name parameter for your sheet then or you could also use |
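If a hidden sheet is suspected, one way to find the right value for sheet_name is to list only the visible sheets first. The sketch below assumes a workbook object like the one openpyxl returns, whose worksheets expose `title` and `sheet_state`; `visible_sheet_names` is a hypothetical helper name.

```python
def visible_sheet_names(workbook):
    """Return the titles of worksheets whose sheet_state is 'visible'."""
    return [ws.title for ws in workbook.worksheets if ws.sheet_state == "visible"]


# Possible usage (assumes openpyxl and pandas are installed, path is your file):
#   import openpyxl, pandas as pd
#   wb = openpyxl.load_workbook(path)
#   names = visible_sheet_names(wb)
#   df = pd.read_excel(path, sheet_name=names[0], engine="openpyxl")
```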
[ X ] I have checked that this issue has not already been reported.
[ X ] I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Old behavior with xlrd 1.2.0 (last version with XLSX support)
File is read perfectly and dataframe is ok, with all 11 columns.
Problem description
When using openpyxl as engine for read_excel:
Only the first 3 headers are read, and then it stops.
However, if the file is opened directly with openpyxl, it works fine. I'm using it to open and save the file under a temporary name so that openpyxl works as the engine in Pandas:
INSTALLED VERSIONS
commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.128-microsoft-standard
Version : #1 SMP Tue Jun 23 12:58:10 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.0
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.3.1
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : None