Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPS: require updated python-dateutil, openpyxl #18182

Merged
merged 7 commits into from
Dec 4, 2017

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Nov 8, 2017

closes #15184

reminder to change: https://github.com/conda-forge/pandas-feedstock on release

@jreback jreback added Dependencies Required and optional dependencies IO Excel read_excel, to_excel labels Nov 8, 2017
@jreback jreback added this to the 0.22.0 milestone Nov 8, 2017
@jreback
Copy link
Contributor Author

jreback commented Nov 8, 2017

cc @neirbowj since you wrote most of the code I just ripped out :>

@neirbowj
Copy link
Contributor

neirbowj commented Nov 8, 2017

👍 It had a good run. I hope folks found it useful. I'm glad to see it retired.

And incidentally, no concerns for the FreeBSD port because the applicable dependencies are already at:

  • textproc/py-openpyxl 2.4.9
  • deve/py-dateutil 2.6.1

@jreback
Copy link
Contributor Author

jreback commented Nov 8, 2017

should we bump openpxl to even newer?

@neirbowj
Copy link
Contributor

neirbowj commented Nov 9, 2017

I'm probably not the right person to answer that question. I'm not an openpyxl user and haven't studied the codebase since the last time I got roped into responding to one of that project's minor updates that broke backwards compatibility. Therefore I cannot say whether a more aggressive version bump would be a good idea, bad idea, or indifferent.

From a port maintainer perspective, is there anything you'd like me to investigate?

@jreback
Copy link
Contributor Author

jreback commented Nov 9, 2017

i just want to make sure that i didn’t remove any relevant tests (that were not tied to earlier versions)

@neirbowj
Copy link
Contributor

neirbowj commented Nov 9, 2017

That would be non-trivial for me to verify, and I'm afraid can't justify spending time on it at the moment unless the test removal impacts the FreeBSD port.

@codecov
Copy link

codecov bot commented Nov 9, 2017

Codecov Report

Merging #18182 into master will increase coverage by 0.15%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18182      +/-   ##
==========================================
+ Coverage    91.4%   91.55%   +0.15%     
==========================================
  Files         163      162       -1     
  Lines       50064    49947     -117     
==========================================
- Hits        45759    45730      -29     
+ Misses       4305     4217      -88
Flag Coverage Δ
#multiple 89.36% <87.5%> (+0.16%) ⬆️
#single 40.42% <62.5%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/excel.py 90.13% <100%> (+9.73%) ⬆️
pandas/compat/__init__.py 58.71% <66.66%> (+0.6%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️
pandas/io/formats/format.py 96.01% <0%> (ø) ⬆️
pandas/core/config_init.py 98.34% <0%> (ø) ⬆️
pandas/core/generic.py 95.72% <0%> (ø) ⬆️
pandas/core/groupby.py 92.04% <0%> (+0.01%) ⬆️
pandas/core/indexes/datetimes.py 95.48% <0%> (+0.09%) ⬆️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f9d4fb...e2d6b53. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 9, 2017

Codecov Report

Merging #18182 into master will increase coverage by 0.14%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18182      +/-   ##
==========================================
+ Coverage   91.44%   91.59%   +0.14%     
==========================================
  Files         157      156       -1     
  Lines       51449    51328     -121     
==========================================
- Hits        47048    47013      -35     
+ Misses       4401     4315      -86
Flag Coverage Δ
#multiple 89.45% <87.5%> (+0.15%) ⬆️
#single 40.66% <62.5%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/excel.py 90.13% <100%> (+9.73%) ⬆️
pandas/compat/__init__.py 58.77% <66.66%> (+0.58%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️
pandas/util/testing.py 82.08% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.25% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd9a3e0...f04637b. Read the comment docs.

@jreback jreback force-pushed the versions branch 6 times, most recently from 8d28316 to 034528a Compare November 13, 2017 11:29
@Themanwithoutaplan
Copy link
Contributor

My suggestion would be to go with >= 2.4.0 as this includes utilities and styles for working with Pandas dataframes which should make integration code a lot easier and reliable. NB. Merged cells are still a problem but openpyxl 2.5 introduces something that, in time, will make these easier to work with.

2.4 also gives you the ws.values property which makes reading XLSX files a lot easier than the current xlrd spaghetti. xlrd is slightly faster on really big files due to a lower memory overhead but my suggestion would be to allow openpyxl in read-only mode (parallelisation of worksheets should be possible).

Would be happy to discuss this with anyone but haven't time to do it myself at the moment.

2.4 is stable and I haven't seen anything 2.3 specific issues for a good long time.

@jreback
Copy link
Contributor Author

jreback commented Nov 14, 2017

@Themanwithoutaplan thanks for the suggestion.

@Themanwithoutaplan
Copy link
Contributor

http://openpyxl.readthedocs.io/en/default/pandas.html explains everything (based on the work we did together in Montreal).

@Themanwithoutaplan
Copy link
Contributor

Also, support for a dedicated datetime celltype is coming 2.5. Removes some of the problems associated with Excel's moronic handling of dates and times.

@jreback
Copy link
Contributor Author

jreback commented Nov 14, 2017

@Themanwithoutaplan is this api supported in >= 2.4? (would you do a PR to change the pandas code)?

@Themanwithoutaplan
Copy link
Contributor

New datetimes are 2.5 only. Not sure when I can do a PR. We are having a sprint here in Düsseldorf next week but I was hoping to be able work on something else…

I would like to get 2.5 out the door. It's just missing the code for adding/removing rows and columns.

@jreback
Copy link
Contributor Author

jreback commented Nov 16, 2017

any comments @jorisvandenbossche @TomAugspurger

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For openpyxl, it seems all code you removed was for versions < 2.2. Is there a reason to not just bump the minimum required to 2.2 instead of directly 2.4 in the PR?

For dateutil, do we win (code cleanup wise) a lot by not leaving the minimum version at eg 2.4 ?

Further:

  • Can you also update the io.rst docs? (there is a mention of the needed openpyxl version somewhere I think)

@jorisvandenbossche
Copy link
Member

For dateutil, do we win (code cleanup wise) a lot by not leaving the minimum version at eg 2.4 ?

Although, by the time 0.22 will be released, dateutil 2.5 will probably be almost 2 years old, so that is probably fine.

@jreback
Copy link
Contributor Author

jreback commented Nov 16, 2017

For openpyxl, it seems all code you removed was for versions < 2.2. Is there a reason to not just bump the minimum required to 2.2 instead of directly 2.4 in the PR?

why? that's a pretty old version, I don't see a reason to not require newer things in developing libraries (esp there is a speedup coming), see the above comments.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 16, 2017

openpyxl 2.4 is just 1 year old, which is not that much. If it doesn't cost us anything to provide users the ability to use older version, I would do that (at the moment it is simply the check of the version that would need to be changed no?). If we would start using those newer features, then sure.

@jreback jreback changed the title DEPS: require updated python-dateutil & openpyxl DEPS: require updated python-dateutil, openpyxl, pytz Nov 23, 2017
@jreback jreback force-pushed the versions branch 3 times, most recently from 258f5e9 to b653b47 Compare November 27, 2017 00:38
@jreback
Copy link
Contributor Author

jreback commented Nov 27, 2017

| openpyxl | 2.4.0 | |
+-----------------+-----------------+----------+

- :func:`Series.fillna` now raises a ``TypeError`` instead of a ``ValueError`` when passed a list, tuple or DataFrame as a ``value`` (:issue:`18293`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge issue here

@jorisvandenbossche
Copy link
Member

Please also update io.rst (there are several mentions of openpyxl and its supported version)

What is the advantage of hard dropping support for older pytz versions? It does not cost us anything? (I mean in code complexity)

Definitely drop support for openpyxl < 2.4. There were changes in styles in all 2.x releases. 2.4 was built to provide an API for Pandas so that the spaghetti code could be cleared out. < 2.4 is not maintained.

But we do not use this API at this point? As we only have special cased code for openpyxl < 2.2. If there have been many changes in style, for me that is a reason more to not drop support (as long as we don't use features of the newer releases), if people want to preserve the actual styling they have right now.

@jreback
Copy link
Contributor Author

jreback commented Nov 28, 2017

What is the advantage of hard dropping support for older pytz versions? It does not cost us anything? (I mean in code complexity)

this is true, yeah just wanted to set some sort of minimum, though I guess its not really needed.

But we do not use this API at this point? As we only have special cased code for openpyxl < 2.2. If there have been many changes in style, for me that is a reason more to not drop support (as long as we don't use features of the newer releases), if people want to preserve the actual styling they have right now.

not the point, there was too much churn in the < 2.4 versions.

@Themanwithoutaplan
Copy link
Contributor

Themanwithoutaplan commented Nov 28, 2017

openpyxl used to have unreliable and incorrect style implementation so backwards-incompatible changes were essential. Not doing users any favours by keeping support for these around, especially as code coverage was never complete I've just managed to update my fork and hope to work on the > 2.4 implementation in the near future. Using the API will make it much easier in future to decide who's responsible.

pytest pandas/tests/io/test_excel.py is all I need to worry about, right?

@Themanwithoutaplan
Copy link
Contributor

I'm not having a lot of fun getting tests to run: segfault with Python 3.6 on Mac OS and unable to run test_excel.py in isolation. I'd better check the docs.

@jreback
Copy link
Contributor Author

jreback commented Nov 29, 2017

@jorisvandenbossche

@Themanwithoutaplan
Copy link
Contributor

I've started work on this.

@Themanwithoutaplan
Copy link
Contributor

Looking at the implementation: is it going to be possible to work directly with a dataframe? openpyxl doesn't need the intermediate ExcelCell object.

@jreback
Copy link
Contributor Author

jreback commented Nov 30, 2017

@Themanwithoutaplan the implementation is orthogonal to us changing versions
what would hapoen if you had a version of openpyxl that can process things more efficiently
then we could dispatch it that way, keeping existing code for older versions
if it passes all tests then that would be fine

so i assume this is not possible currently?

@Themanwithoutaplan
Copy link
Contributor

I'm going to look at interfacing directly from df.to_excel() using openpyxl.utils.dataframe.dataframe_to_rows() This removes the need for an ExcelWriter class entirely. The older code could then be binned entirely.

@jreback jreback force-pushed the versions branch 3 times, most recently from 25594ac to 38a339e Compare December 3, 2017 18:49
@jreback jreback changed the title DEPS: require updated python-dateutil, openpyxl, pytz DEPS: require updated python-dateutil, openpyxl Dec 4, 2017
@jreback jreback merged commit 3e4e4b3 into pandas-dev:master Dec 4, 2017
@Themanwithoutaplan
Copy link
Contributor

After some further investigation It looks to me that the best fit for the modern openpyxl API would be have a drop-in replacement for ExcelFormatter class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dependencies Required and optional dependencies IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DEPS: require modern openpyxl
4 participants