Refactor the `rename()` method for performance improvement #550

danielhuppmann · 2021-06-24T11:10:49Z

Please confirm that this PR has done the following:

- Tests Added
- Documentation Added
- ~~Name of contributors Added to AUTHORS.rst~~
- Description in RELEASE_NOTES.md Added

Description of PR

@byersiiasa asked me about the rename() method, which was causing some headache in the AR6 processing workflow. I had a look and realized that it was operating on a full DataFrame rather than the indexed Series (still from before the refactoring last summer).

So I refactored it to use the pyam.index-module functions.
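To illustrate the difference between the two approaches, here is a minimal pandas-only sketch. It is not the pyam implementation and does not call the pyam.index functions; the toy data and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy timeseries data in long format (illustrative values only)
df = pd.DataFrame(
    {
        "region": ["World", "World", "Europe", "Europe"],
        "variable": ["foo", "bar", "foo", "bar"],
        "year": [2020, 2020, 2020, 2020],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)
mapping = {"foo": "baz"}

# Approach A: rename by replacing entries in a column of the full DataFrame
renamed_df = df.assign(variable=df["variable"].replace(mapping))

# Approach B: keep the dimensions in a MultiIndex and rename labels on one
# index level of the Series; the data values themselves are not touched
series = df.set_index(["region", "variable", "year"])["value"]
renamed_series = series.rename(index=mapping, level="variable")

# Both approaches yield the same timeseries data
expected = renamed_df.set_index(["region", "variable", "year"])["value"]
pd.testing.assert_series_equal(renamed_series, expected)
```

The final assertion plays the role that pyam.compare() plays for the real data: it checks that the faster path returns the same result.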

Performance improvement

Using a 140 MB file from the ongoing ENGAGE work, a multi-dimensional rename (touching some regions and some variables) went from 14 seconds to under 4 seconds. I used pyam.compare() to assert that the results are identical.

API change

While working on this, I noticed that the current implementation can be used as an equivalent to the aggregate() method by mapping several items to the same new name, e.g., df.rename(variable={"foo": "new", "bar": "new"}). Currently, this will implicitly apply a groupby-sum, and only raise an error if a variable "new" already exists.

(This error can be silenced with check_duplicates=False, in which case groupby-sum will be applied to have a unique timeseries data index).

The new implementation shows a deprecation warning when renaming to a non-unique index, and it will be switched to an error with release 1.0.

(The override check_duplicates=False will continue to work.)
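The behavior described above can be sketched with plain pandas. This is a simplified stand-in, not pyam code: the function name `rename_level` and the warning text are hypothetical, and the sketch does not reproduce the separate error for a target name that already exists.

```python
import warnings

import pandas as pd


def rename_level(series, level, mapping, check_duplicates=True):
    """Hypothetical stand-in for the behavior described above (not pyam code)."""
    renamed = series.rename(index=mapping, level=level)
    if renamed.index.duplicated().any():
        if check_duplicates:
            # Assumed warning text; planned to become an error with release 1.0
            warnings.warn(
                "Renaming to a non-unique index is deprecated, "
                "use aggregate() or check_duplicates=False instead.",
                DeprecationWarning,
                stacklevel=2,
            )
        # Collapse duplicate index entries with a groupby-sum
        renamed = renamed.groupby(level=list(renamed.index.names)).sum()
    return renamed


index = pd.MultiIndex.from_tuples(
    [("World", "foo", 2020), ("World", "bar", 2020)],
    names=["region", "variable", "year"],
)
data = pd.Series([1.0, 2.0], index=index, name="value")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    summed = rename_level(data, "variable", {"foo": "new", "bar": "new"})

print(summed)  # a single row for ("World", "new", 2020)
print([str(w.message) for w in caught])
```

Passing `check_duplicates=False` to this sketch skips the warning and applies the groupby-sum directly, which mirrors the override mentioned above.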

codecov · 2021-06-24T11:18:59Z

Codecov Report

Merging #550 (6318a9f) into main (979dfe2) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main    #550     +/-   ##
=======================================
- Coverage   93.5%   93.5%   -0.1%     
=======================================
  Files         48      48             
  Lines       5310    5321     +11     
=======================================
+ Hits        4970    4979      +9     
- Misses       340     342      +2

| Impacted Files | Coverage Δ | |
|---|---|---|
| pyam/logging.py | `59.3% <ø> (ø)` | |
| pyam/core.py | `92.5% <100.0%> (-0.3%)` | ⬇️ |
| pyam/index.py | `97.9% <100.0%> (+0.4%)` | ⬆️ |
| tests/test_index.py | `100.0% <100.0%> (ø)` | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 979dfe2...6318a9f. Read the comment docs.

byersiiasa · 2021-06-24T19:08:08Z

I have done a bit of testing:

Smaller file (35 MB xlsx), renaming 11 regions:

v0.11.xxx: ~12s
this version: ~6s

Large file of ~30 million rows (long format, 650 MB csv), renaming 335 regions:

v0.11.xxx: ~53s
this version: ~49s

which is a bit strange -...

It appears that the runtime actually scales with the number of regions being renamed.
Large file (old vs new, in seconds):

335 regions: 53 vs 49
200 regions: 36 vs 34
100 regions: 17 vs 17
10 regions: 16.5 vs 16.5

Anyways - I think it's a good improvement - thanks @danielhuppmann!
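A comparison like the one above can be reproduced in spirit with a small timing harness. The sketch below uses synthetic data and a plain pandas rename as a stand-in, since the original files are not available; the helper name `time_region_rename` and the data dimensions are assumptions, and absolute timings will differ by machine and by pyam version.

```python
import time

import numpy as np
import pandas as pd


def time_region_rename(series, n_regions):
    """Time renaming the first `n_regions` regions of a MultiIndex Series."""
    regions = series.index.get_level_values("region").unique()[:n_regions]
    mapping = {r: f"{r} (renamed)" for r in regions}
    start = time.perf_counter()
    renamed = series.rename(index=mapping, level="region")
    elapsed = time.perf_counter() - start
    return elapsed, renamed


# Synthetic data: 50 regions x 20 variables x 10 years
index = pd.MultiIndex.from_product(
    [
        [f"region_{i}" for i in range(50)],
        [f"variable_{i}" for i in range(20)],
        range(2020, 2030),
    ],
    names=["region", "variable", "year"],
)
data = pd.Series(np.arange(len(index), dtype=float), index=index)

# Measure how the runtime changes with the number of regions being renamed
timings = {n: time_region_rename(data, n)[0] for n in (1, 10, 50)}
for n, seconds in timings.items():
    print(f"{n:>3} regions renamed: {seconds:.4f} s")
```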

byersiiasa

Great!

danielhuppmann · 2021-06-25T05:09:52Z

Thanks @byersiiasa for doing more elaborate tests! Bit surprised that you don't see more of a performance improvement...

But it does make sense that doing a list comprehension with a large mapping inside is not efficient. So I added two things:

- ~~use pd.Series.replace instead of the mapping inside a list comprehension~~
- use the replace_index_values() function in a nested manner
- do a fast-pass if all rows of the timeseries data are touched (working directly on the index codes/levels rather than touching index-level values)
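The idea of a fast-pass on the index levels can be sketched as follows. This is a hedged illustration with plain pandas, not the code of pyam's replace_index_values(): the helper name `replace_level_labels` and its `rows` argument are hypothetical.

```python
import numpy as np
import pandas as pd


def replace_level_labels(index, level, mapping, rows=None):
    """Hypothetical helper (not pyam's replace_index_values()).

    Rename labels on one level of a MultiIndex. `rows` is an optional boolean
    mask selecting the rows to rename; None means all rows.
    """
    n = list(index.names).index(level)
    if rows is None or np.all(rows):
        # Fast path: all rows are touched, so relabel the unique level values
        # and leave the integer codes as they are. This sketch assumes the
        # renamed labels stay unique; set_levels() raises a ValueError otherwise.
        new_labels = index.levels[n].map(lambda x: mapping.get(x, x))
        return index.set_levels(new_labels, level=n)

    # General path: rewrite the level values row by row for the selected rows
    mask = np.asarray(rows, dtype=bool)
    values = index.get_level_values(n).to_numpy(dtype=object).copy()
    values[mask] = [mapping.get(v, v) for v in values[mask]]
    arrays = [
        values if i == n else index.get_level_values(i)
        for i in range(index.nlevels)
    ]
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


index = pd.MultiIndex.from_tuples(
    [("World", "foo"), ("World", "bar"), ("Europe", "foo")],
    names=["region", "variable"],
)

# All rows touched: only the list of unique labels is rewritten
all_rows = replace_level_labels(index, "variable", {"foo": "new"})

# Only rows for region "World" touched: falls back to the general path
only_world = replace_level_labels(
    index,
    "variable",
    {"foo": "new"},
    rows=index.get_level_values("region") == "World",
)
```

The fast path scales with the number of unique labels on the level rather than with the number of rows, which is why it can help on large files.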

In my example, this gives another 10-20% performance improvement (now testing a one-item dictionary, several-items and all-items).

danielhuppmann added 4 commits June 24, 2021 10:07

Generalize append_index_level() and replace_index_values()

8d95649

Refactor rename() to work directly on data index

8020a48

Fix summation

d7c7f38

Reintroduce current behavior, add deprecation warning

4c0e474

danielhuppmann requested a review from byersiiasa June 24, 2021 11:10

danielhuppmann self-assigned this Jun 24, 2021

danielhuppmann changed the title ~~Refactor~~ Refactor the rename() method for performance improvement Jun 24, 2021

danielhuppmann added 2 commits June 24, 2021 13:11

Add to release notes

9a6f782

Appease stickler

084a0ca

danielhuppmann added 2 commits June 24, 2021 13:28

Update the docs

d5f043a

Fix a typo

4c65d10

danielhuppmann marked this pull request as ready for review June 24, 2021 11:51

byersiiasa approved these changes Jun 24, 2021


Add fast-pass for all rows, use pd.Series.replace instead of list

2940e1e

danielhuppmann added 2 commits June 25, 2021 10:35

Use the replace_index_values() function instead of pd.Series.replace

da587a4

Merge branch 'main' into rename/fastpass

6318a9f

danielhuppmann merged commit ed130e3 into IAMconsortium:main Jun 28, 2021

danielhuppmann deleted the rename/fastpass branch June 28, 2021 04:27
