Add `get_data_column()`, refactor filtering by the time domain #562

danielhuppmann · 2021-07-16T15:58:38Z

Please confirm that this PR has done the following:

Tests Added
Documentation Added
~~Name of contributors Added to AUTHORS.rst~~
Description in RELEASE_NOTES.md Added

Description of PR

This PR adds a utility function get_data_column(name) as a shorthand for df.data[name]. It is also more efficient, because it avoids casting the internal _data pd.Series to a pd.DataFrame.

This utility function is then used to make filtering by the time domain more performant.

codecov · 2021-07-16T16:06:18Z

Codecov Report

Merging #562 (4407f85) into main (666da23) will increase coverage by 0.0%.
The diff coverage is 100.0%.

@@          Coverage Diff          @@
##            main    #562   +/-   ##
=====================================
  Coverage   93.7%   93.7%           
=====================================
  Files         50      50           
  Lines       5322    5332   +10     
=====================================
+ Hits        4987    4997   +10     
  Misses       335     335

Impacted Files	Coverage Δ
pyam/core.py	`94.3% <100.0%> (+<0.1%)`	⬆️
tests/test_core.py	`100.0% <100.0%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 666da23...4407f85. Read the comment docs.

danielhuppmann · 2021-07-20T06:25:12Z

While working on this PR, I noticed that pyam also has a getter df[column], which returns the same as the new function df.get_data_column(column) (if column is a dimension of data).

I see three options:

1. mark the pandas-style getter as deprecated, because the implicit choice between a column from data and a column from meta is a bit confusing
2. refactor the getter to use the same speed improvement (avoid casting the internal pd.Series to a DataFrame)
3. use the getter instead of get_data_column()

Any thoughts @gidden or anyone else?

gidden · 2021-07-28T15:34:48Z

For reference, the implementation is reproduced below:

    def __getitem__(self, key):
        _key_check = [key] if isstr(key) else key
        if set(_key_check).issubset(self.meta.columns):
            return self.meta.__getitem__(key)
        else:
            return self.data.__getitem__(key)

In that case, if get_data_column() is indeed faster than self.data.__getitem__(key) and is guaranteed to return the same result in all situations, I would simply replace the last line here.
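The suggested change could look roughly like the sketch below. `ToyFrame` is a hypothetical, stripped-down stand-in for the real class, `isinstance(key, str)` replaces the `isstr` helper, and the data branch handles only a single index level (not a list of keys or the value column), so this is an assumption-laden illustration rather than the merged code.

```python
import pandas as pd


class ToyFrame:
    """Hypothetical, stripped-down stand-in for an IamDataFrame."""

    def __init__(self, data, meta):
        self._data = data  # pd.Series with a MultiIndex of data dimensions
        self.meta = meta  # pd.DataFrame of per-scenario indicators

    def get_data_column(self, name):
        # Read the index level directly; no DataFrame is constructed.
        return pd.Series(self._data.index.get_level_values(name), name=name)

    def __getitem__(self, key):
        _key_check = [key] if isinstance(key, str) else key
        if set(_key_check).issubset(self.meta.columns):
            return self.meta.__getitem__(key)
        # Replaced last line: use the faster getter instead of self.data[key].
        return self.get_data_column(key)


# Made-up sample data for illustration.
data = pd.Series(
    [1.0, 6.0],
    index=pd.MultiIndex.from_tuples(
        [("model_a", "scen_a", "World", 2005), ("model_a", "scen_a", "World", 2010)],
        names=["model", "scenario", "region", "year"],
    ),
    name="value",
)
meta = pd.DataFrame(
    {"category": ["foo"]},
    index=pd.MultiIndex.from_tuples(
        [("model_a", "scen_a")], names=["model", "scenario"]
    ),
)
df = ToyFrame(data, meta)
```

In this toy, `df["category"]` is served from meta, while `df["region"]` falls through to the data getter.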

I would advise against deprecation, as this is a pretty innocuous feature and we have no idea how many users are actually using it. Not to mention, it would be almost impossible to update legacy code...

danielhuppmann · 2021-07-28T17:58:52Z

> I would advise against deprecation, as this is a pretty innocuous feature and we have no idea how many users are actually using it. Not to mention, it would be almost impossible to update legacy code...

Don't quite agree that it's so innocuous... If you do df["region"], it's intuitive what you get (the column from df.data), if you do df["category"], it's also intuitive (the column from df.meta, if it exists). But what do you get if you do df["model"]?

But if you think a DeprecationWarning is too strong, how about a FutureWarning - so this will give users probably a year's worth of time to update any code...?

gidden · 2021-07-30T08:13:13Z

I would strongly prefer not removing/deprecating. Have we heard anyone who has raised an issue with this?

danielhuppmann · 2021-07-30T08:14:58Z

Ok, let's leave it - will simply implement the more efficient approach.

gidden · 2021-07-30T08:18:31Z

Great, thank you!

danielhuppmann · 2021-07-30T09:50:31Z

Implemented the more efficient approach for the direct getter IamDataFrame[<column>] and extended the corresponding unit test for good measure...

gidden · 2021-07-30T10:51:22Z

Looks great, thanks!

danielhuppmann self-assigned this Jul 16, 2021

danielhuppmann changed the title ~~Add get_data_colum(), Rrfactor filtering by the time domain~~ Add get_data_column(), refactor filtering by the time domain Jul 16, 2021

danielhuppmann marked this pull request as ready for review July 16, 2021 16:39

danielhuppmann added 6 commits July 30, 2021 11:35

Add a utility function get_data_column()

b709674

Refactor the filter() function to avoid casting to data

d9428fb

Add to release notes

5344894

Improve the docstring

dfef197

Use new function in unit_mapping

7f171a1

Implement changes per discussion with @gidden, improve the test

4407f85

danielhuppmann force-pushed the performance/filter branch from a41e954 to 4407f85 Compare July 30, 2021 09:45

gidden merged commit a8c60d9 into IAMconsortium:main Jul 30, 2021

danielhuppmann deleted the performance/filter branch July 30, 2021 11:20

danielhuppmann mentioned this pull request Aug 27, 2021

Bugfix for failing getter on value column #575

Merged

1 task
