Nans in additional columns #351

Rlamboll · 2020-03-24T00:19:44Z

When a dataframe has additional data created, e.g. by interpolate, additional columns are often given np.nan as the value, rather than, say, an empty string or a 0.0.. This is presumably a choice rather than a bug. However when there are nans in these columns, they will not appear in timeseries, and will also disappear if the dataframe is recalculated by pyam.IamDataFrame(database_with_nans.data). This seems like a bug.

The text was updated successfully, but these errors were encountered:

danielhuppmann · 2020-03-24T05:43:08Z

Thanks for raising this issue @Rlamboll but I'm not sure which part here you see as a bug and which is bad (or unexpected) design choices - or if this is related to some not-yet-supported aspect of continuous-time data.

My understanding:

interpolate() (and the underlying fill_series()) return nan if the requested time is outside of the range of a specific timeseries (doing something else would be not inter- but extrapolation)
nan are dropped when initializing a new IamDataFrame because pandas behaves strangely with nan in columns (examples provided @znicholls, see the docs). But calling [timeseries(https://pyam-iamc.readthedocs.io/en/latest/api/iamdataframe.html#pyam.IamDataFrame.timeseries)] fills nan in any time-column where row data is missing (ie. if at least one data point exists for that time index).

Can you specify or provide a short test case that illustrates what you would expect to see?

Rlamboll · 2020-03-24T10:32:41Z

Perhaps I wasn't clear, all these nans occur in additional columns. Interpolate returns a value correctly in the value column, but not in other columns (not a bug). However I'd expect that initialising a dataframe only drops cases where important columns are nan, not where any column is nan. If it is going to do that, can we infill those columns with something other than nan. Here is an example (apologies for lack of indentation in this format):

import pandas as pd
import pyam

tdb = pd.DataFrame(
[["_ma", "_sa", "World", "CO2", "Gt CO2", 0, 2, 3],
["_ma", "_sa", "World", "CH4", "Gt CH4", 0, 200, 300],
],
columns=["model", "scenario", "region", "variable", "unit", "sub_annual"] + [2050, 2070],
)
df = pyam.IamDataFrame(tdb)
df.interpolate(2060)
df.data # Contains 2060 data, nan in additional column looks innocuous
df.timeseries() # No 2060 appears here
pyam.IamDataFrame(df.data).data # Interpolated data has gone

danielhuppmann · 2020-03-24T12:06:58Z

Thanks for the clarification, indeed a bug because the extra_cols feature is not yet tested thoroughly. PR with a bugfix coming soon.

Rlamboll · 2020-03-24T12:23:39Z

In terms of how this should work though, should we expect that extra_cols is left out of timeseries? This means that df->timeseries->df loses info. Alternatively, all rows of a model/scenario/variable/region grouping have to have the same additional cols info (and will need to be initialised in this way by interpolate) otherwise they will result in separate rows on the timeseries, with nans shared between rows. Effectively we then have an unspecified (and infinitely extensible) primary key in timeseries format.

danielhuppmann · 2020-03-24T12:38:47Z

good question, let me bounce it back to you:

if there is dataframe like

TEST_DF = pd.DataFrame([
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/yr', 'foo', 2005, 1],
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/yr', 'bar', 2010, 3],
],
    columns=pyam.IAMC_IDX + ['extra_col', 'year', 'value'],
)

i.e, all index values the same except for extra_col, should there be any interpolation or not?

The bugfix in #352 assumes no (there will not be any interpolation because the first and second line have a different "index" so are different timeseries).

If you think it should yes, how should extra_col in the new row be filled?

Rlamboll · 2020-03-24T13:22:20Z

Yes, I would like to see interpolation, i.e. extra columns are ignored for this index. E.g. if I have a meta column with 'infilled' at one timepoint, I'd like to infill around it. I'm mostly treating these cols as extra info, not core function. OTOH, I have no strong preference for what to infill. The annoying answer would be check if the two points around it have identical entries, copy them if so and a non-breaking empty value (0 or "" etc) if not. But I also want this function to work quickly and currently don't have a strong use-case, so depends on how much effort it takes to do that. Always using a non-breaking empty value is fine too.

danielhuppmann · 2020-03-24T13:48:14Z

Ok, I think we agree. Please review #352, where I just added another test based on the example above to ensure that timeseries with non-matching extra-column values are not interpolated.

Rlamboll · 2020-03-24T14:08:44Z

OK, will there be an update later to make it fail that test?

danielhuppmann · 2020-03-24T15:19:53Z

OK, will there be an update later to make it fail that test?

No, that test shouldn't fail - that's the expected behaviour.

Rlamboll · 2020-03-24T15:38:13Z

This confuses me because I just said that I did want interpolation in that case.

danielhuppmann · 2020-03-24T16:21:21Z

Ok, then we were talking past each other. Can you please specify which output you expect from interpolate(2007) on the following timeseries data:

model	scenario	region	variable	unit	extra_col	2005	2010
model_a	scen_a	World	Primary Energy	EJ/yr	foo	1	`np.nan`
model_a	scen_a	World	Primary Energy	EJ/yr	bar	`np.nan`	3
model_a	scen_a	World	Primary Energy	EJ/yr	dummy	`np.nan`	5

In my understanding, because there are no data points with the same index (including extra_col) before and after the time to be interpolated, there should not be any data added to this dataframe.

Rlamboll · 2020-03-24T16:39:31Z

I didn't realise we were planning to allow that sort of data. I thought that (model x scenario x region x variable x time) was a unique key, hence this is an invalid df. Similar to the multi-unit question in #338, it's massively annoying to work out if a model-scenario is "complete" if I need to ask whether an unknown number of extra columns are also populated in every combination. Can we combine data from extra_col = foo with data from extra_col = bar? Without knowing what those columns are, it seems impossible to know. If the extra column is just a label to say 'infilled', clearly it's silly to predicate anything on the basis of that, and I assumed we'd want to combine infilled and non-infilled data into the same row of a timseries. If it somehow changes the meaning (like it's a unit working under a different conversion metric), we need to consider that. I believe we're trying to set up a zoom chat this week on nomenclature for additional columns.

danielhuppmann · 2020-03-24T17:44:41Z

At least we got to the bottom of the misunderstanding...

Yes, this use case was very much the driver for the extra_cols - think of having a variable Temperature|Global Mean and then having an additional column climate_model to specify which climate model generated that timeseries. Rather than what we did previously, having something like Temperature|Global Mean|<Climate Model>.

What you seem to have in mind is more akin to #287 - tracking which operations modified specific subsections of the data. This should, in my opinion, not be part of the data dataframe.

I suggest the following way forward:

(with your permission) merge PR bugfix for interpolation with extra columns #352 because it still fixes a bug identified by this conversation and enables the original use of extra_cols
have a conversation what is strictly necessary vs. what might be useful for AR6 (with @byersiiasa, @znicholls and others)
start a new issue here or in a related repository on how to handle comments, annotations and workflow tracking

Rlamboll · 2020-03-24T18:06:13Z

The merge fixes the most acute of the problems, so go ahead with that. This issue doesn't currently exist in emissions data since these columns are unused. I think the analogy with how things work there would be Temperature (FaIR model)|Global Mean, since different simulations are effectively different measurement metrics.

danielhuppmann · 2020-03-24T20:05:14Z

closed via #352

danielhuppmann added the bug label Mar 24, 2020

danielhuppmann self-assigned this Mar 24, 2020

danielhuppmann added a commit to danielhuppmann/pyam that referenced this issue Mar 24, 2020

add failing test for IAMconsortium#351

97cfced

danielhuppmann added a commit to danielhuppmann/pyam that referenced this issue Mar 24, 2020

add (quick) fix for IAMconsortium#351

82a8c44

danielhuppmann mentioned this issue Mar 24, 2020

bugfix for interpolation with extra columns #352

Merged

2 tasks

danielhuppmann closed this as completed Mar 24, 2020

Rlamboll mentioned this issue Nov 15, 2020

Interp update #456

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nans in additional columns #351

Nans in additional columns #351

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020 •

edited

Loading

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Nans in additional columns #351

Nans in additional columns #351

Comments

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020 • edited Loading

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020

danielhuppmann commented Mar 24, 2020

Rlamboll commented Mar 24, 2020 •

edited

Loading