Refactor the `rename()` method for performance improvement #550
Conversation
Codecov Report
```diff
@@           Coverage Diff            @@
##            main     #550     +/-  ##
========================================
- Coverage   93.5%    93.5%    -0.1%
========================================
  Files         48       48
  Lines       5310     5321      +11
========================================
+ Hits        4970     4979       +9
- Misses       340      342       +2
```
Continue to review full report at Codecov.
I have done a bit of testing: a smaller file (35 MB xlsx) with 11 regions renamed, and a large file of ~30 million rows (650 MB csv, in long format) with 335 regions renamed. What it appears is that the time actually scales with the number of regions being renamed, which is a bit strange...
Anyway, I think it's a good improvement - thanks @danielhuppmann!
Great!
Thanks @byersiiasa for doing the more elaborate tests! I'm a bit surprised that you don't see more of a performance improvement... But it does make sense that doing a list comprehension with a large mapping inside is not efficient. So I added two things:
In my example, this gives another 10-20% performance improvement (now testing a one-item dictionary, a several-item dictionary, and an all-items dictionary).
Please confirm that this PR has done the following:
- Name of contributors added to AUTHORS.rst

Description of PR
@byersiiasa asked me about the `rename()` method, which was causing some headache in the AR6 processing workflow. I had a look and realized that it was operating on a full DataFrame rather than the indexed Series (still from before the refactoring last summer). So I refactored it to use the pyam.index-module functions.
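To illustrate the idea, here is a hedged sketch in plain pandas (an assumed illustration, not pyam's actual internals): renaming entries of one index level on a long-format Series touches only that level, instead of rewriting a full wide DataFrame. The `region`/`variable` names mirror pyam's index dimensions:

```python
import pandas as pd

# Hypothetical sketch (plain pandas, not pyam's implementation):
# rename entries of one index level on a long-format Series directly.
index = pd.MultiIndex.from_tuples(
    [("World", "Primary Energy"), ("World", "foo"), ("Europe", "foo")],
    names=["region", "variable"],
)
s = pd.Series([10.0, 1.0, 2.0], index=index)

# `Series.rename` with `level` only touches the "variable" level;
# entries not found in the mapping are left unchanged
renamed = s.rename(index={"foo": "bar"}, level="variable")
```

Operating on the index labels like this leaves the data values untouched, which is part of why it is cheaper than reconstructing the timeseries-data table.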
Performance improvement
Using a 140MB file from the ongoing ENGAGE work, a multi-dimensional rename (touching some regions and some variables) went from 14 seconds to <4 seconds (and I used `pyam.compare()` to assert that the results are identical).

API change
While working on this, I noticed that the current implementation can be used as an equivalent to the `aggregate()` method by mapping several items to the same new name, e.g., `df.rename(variable={"foo": "new", "bar": "new"})`. Currently, this will implicitly apply a groupby-sum, and only raise an error if a variable "new" already exists. (This error can be silenced with `check_duplicates=False`, in which case a groupby-sum is applied to keep the timeseries-data index unique.) The new implementation shows a deprecation warning when renaming to a non-unique index, and this will be switched to an error with release 1.0.
(The override `check_duplicates=False` will continue to work.)
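To make the implicit aggregation concrete, here is a hedged pandas sketch (an assumed illustration of the behaviour, not pyam's code): mapping two variables to the same new name produces duplicate index entries, which a groupby-sum then collapses into a single timeseries:

```python
import pandas as pd

# Hypothetical illustration (not pyam internals): two variables mapped to
# the same new name yield a non-unique index after renaming.
index = pd.MultiIndex.from_tuples(
    [("World", "foo"), ("World", "bar")], names=["region", "variable"]
)
s = pd.Series([1.0, 2.0], index=index)

renamed = s.rename(index={"foo": "new", "bar": "new"}, level="variable")
assert renamed.index.has_duplicates  # both rows are now ("World", "new")

# the implicit groupby-sum restores a unique timeseries-data index
aggregated = renamed.groupby(level=["region", "variable"]).sum()
# aggregated has a single entry: ("World", "new") with the summed value
```

This is why renaming several items to one target behaves like an aggregation, and why renaming to a non-unique index now triggers a deprecation warning unless the override is used.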