Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nans in additional columns #351

Closed
Rlamboll opened this issue Mar 24, 2020 · 15 comments · Fixed by #352
Closed

Nans in additional columns #351

Rlamboll opened this issue Mar 24, 2020 · 15 comments · Fixed by #352
Assignees
Labels

Comments

@Rlamboll
Copy link
Collaborator

When a dataframe has additional data created, e.g. by interpolate, additional columns are often given np.nan as the value, rather than, say, an empty string or a 0.0.. This is presumably a choice rather than a bug. However when there are nans in these columns, they will not appear in timeseries, and will also disappear if the dataframe is recalculated by pyam.IamDataFrame(database_with_nans.data). This seems like a bug.

@danielhuppmann
Copy link
Member

Thanks for raising this issue @Rlamboll but I'm not sure which part here you see as a bug and which is bad (or unexpected) design choices - or if this is related to some not-yet-supported aspect of continuous-time data.

My understanding:

  • interpolate() (and the underlying fill_series()) return nan if the requested time is outside of the range of a specific timeseries (doing something else would be not inter- but extrapolation)
  • nan are dropped when initializing a new IamDataFrame because pandas behaves strangely with nan in columns (examples provided @znicholls, see the docs). But calling [timeseries(https://pyam-iamc.readthedocs.io/en/latest/api/iamdataframe.html#pyam.IamDataFrame.timeseries)] fills nan in any time-column where row data is missing (ie. if at least one data point exists for that time index).

Can you specify or provide a short test case that illustrates what you would expect to see?

@Rlamboll
Copy link
Collaborator Author

Rlamboll commented Mar 24, 2020

Perhaps I wasn't clear, all these nans occur in additional columns. Interpolate returns a value correctly in the value column, but not in other columns (not a bug). However I'd expect that initialising a dataframe only drops cases where important columns are nan, not where any column is nan. If it is going to do that, can we infill those columns with something other than nan. Here is an example (apologies for lack of indentation in this format):

import pandas as pd
import pyam

tdb = pd.DataFrame(
[["_ma", "_sa", "World", "CO2", "Gt CO2", 0, 2, 3],
["_ma", "_sa", "World", "CH4", "Gt CH4", 0, 200, 300],
],
columns=["model", "scenario", "region", "variable", "unit", "sub_annual"] + [2050, 2070],
)
df = pyam.IamDataFrame(tdb)
df.interpolate(2060)
df.data # Contains 2060 data, nan in additional column looks innocuous
df.timeseries() # No 2060 appears here
pyam.IamDataFrame(df.data).data # Interpolated data has gone

@danielhuppmann
Copy link
Member

Thanks for the clarification, indeed a bug because the extra_cols feature is not yet tested thoroughly. PR with a bugfix coming soon.

@danielhuppmann danielhuppmann self-assigned this Mar 24, 2020
@Rlamboll
Copy link
Collaborator Author

In terms of how this should work though, should we expect that extra_cols is left out of timeseries? This means that df->timeseries->df loses info. Alternatively, all rows of a model/scenario/variable/region grouping have to have the same additional cols info (and will need to be initialised in this way by interpolate) otherwise they will result in separate rows on the timeseries, with nans shared between rows. Effectively we then have an unspecified (and infinitely extensible) primary key in timeseries format.

danielhuppmann added a commit to danielhuppmann/pyam that referenced this issue Mar 24, 2020
danielhuppmann added a commit to danielhuppmann/pyam that referenced this issue Mar 24, 2020
@danielhuppmann
Copy link
Member

good question, let me bounce it back to you:

if there is dataframe like

TEST_DF = pd.DataFrame([
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/yr', 'foo', 2005, 1],
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/yr', 'bar', 2010, 3],
],
    columns=pyam.IAMC_IDX + ['extra_col', 'year', 'value'],
)

i.e, all index values the same except for extra_col, should there be any interpolation or not?

The bugfix in #352 assumes no (there will not be any interpolation because the first and second line have a different "index" so are different timeseries).

If you think it should yes, how should extra_col in the new row be filled?

@Rlamboll
Copy link
Collaborator Author

Yes, I would like to see interpolation, i.e. extra columns are ignored for this index. E.g. if I have a meta column with 'infilled' at one timepoint, I'd like to infill around it. I'm mostly treating these cols as extra info, not core function. OTOH, I have no strong preference for what to infill. The annoying answer would be check if the two points around it have identical entries, copy them if so and a non-breaking empty value (0 or "" etc) if not. But I also want this function to work quickly and currently don't have a strong use-case, so depends on how much effort it takes to do that. Always using a non-breaking empty value is fine too.

@danielhuppmann
Copy link
Member

Ok, I think we agree. Please review #352, where I just added another test based on the example above to ensure that timeseries with non-matching extra-column values are not interpolated.

@Rlamboll
Copy link
Collaborator Author

OK, will there be an update later to make it fail that test?

@danielhuppmann
Copy link
Member

OK, will there be an update later to make it fail that test?

No, that test shouldn't fail - that's the expected behaviour.

@Rlamboll
Copy link
Collaborator Author

This confuses me because I just said that I did want interpolation in that case.

@danielhuppmann
Copy link
Member

Ok, then we were talking past each other. Can you please specify which output you expect from interpolate(2007) on the following timeseries data:

model scenario region variable unit extra_col 2005 2010
model_a scen_a World Primary Energy EJ/yr foo 1 np.nan
model_a scen_a World Primary Energy EJ/yr bar np.nan 3
model_a scen_a World Primary Energy EJ/yr dummy np.nan 5

In my understanding, because there are no data points with the same index (including extra_col) before and after the time to be interpolated, there should not be any data added to this dataframe.

@Rlamboll
Copy link
Collaborator Author

I didn't realise we were planning to allow that sort of data. I thought that (model x scenario x region x variable x time) was a unique key, hence this is an invalid df. Similar to the multi-unit question in #338, it's massively annoying to work out if a model-scenario is "complete" if I need to ask whether an unknown number of extra columns are also populated in every combination. Can we combine data from extra_col = foo with data from extra_col = bar? Without knowing what those columns are, it seems impossible to know. If the extra column is just a label to say 'infilled', clearly it's silly to predicate anything on the basis of that, and I assumed we'd want to combine infilled and non-infilled data into the same row of a timseries. If it somehow changes the meaning (like it's a unit working under a different conversion metric), we need to consider that. I believe we're trying to set up a zoom chat this week on nomenclature for additional columns.

@danielhuppmann
Copy link
Member

At least we got to the bottom of the misunderstanding...

Yes, this use case was very much the driver for the extra_cols - think of having a variable Temperature|Global Mean and then having an additional column climate_model to specify which climate model generated that timeseries. Rather than what we did previously, having something like Temperature|Global Mean|<Climate Model>.

What you seem to have in mind is more akin to #287 - tracking which operations modified specific subsections of the data. This should, in my opinion, not be part of the data dataframe.

I suggest the following way forward:

  • (with your permission) merge PR bugfix for interpolation with extra columns #352 because it still fixes a bug identified by this conversation and enables the original use of extra_cols
  • have a conversation what is strictly necessary vs. what might be useful for AR6 (with @byersiiasa, @znicholls and others)
  • start a new issue here or in a related repository on how to handle comments, annotations and workflow tracking

@Rlamboll
Copy link
Collaborator Author

The merge fixes the most acute of the problems, so go ahead with that. This issue doesn't currently exist in emissions data since these columns are unused. I think the analogy with how things work there would be Temperature (FaIR model)|Global Mean, since different simulations are effectively different measurement metrics.

@danielhuppmann
Copy link
Member

closed via #352

@Rlamboll Rlamboll mentioned this issue Nov 15, 2020
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants