Nans in additional columns #351
Thanks for raising this issue @Rlamboll, but I'm not sure which part here you see as a bug and which as a bad (or unexpected) design choice - or if this is related to some not-yet-supported aspect of continuous-time data. My understanding:
Can you specify or provide a short test case that illustrates what you would expect to see?
Perhaps I wasn't clear: all these nans occur in additional columns. Interpolate returns a value correctly in the

import pandas as pd
tdb = pd.DataFrame(
Thanks for the clarification, indeed a bug because the
In terms of how this should work though, should we expect that extra_cols is left out of timeseries? This means that df->timeseries->df loses info. Alternatively, all rows of a model/scenario/variable/region grouping have to have the same additional cols info (and will need to be initialised in this way by interpolate), otherwise they will result in separate rows on the timeseries, with nans shared between rows. Effectively we then have an unspecified (and infinitely extensible) primary key in timeseries format.
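The "separate rows with shared nans" outcome above can be sketched in plain pandas (a minimal sketch with illustrative column names, not pyam's actual internals):

```python
import pandas as pd

# Three data points that agree on everything except the extra column
long = pd.DataFrame({
    "model": ["m"] * 3,
    "scenario": ["s"] * 3,
    "extra": ["a", "a", "b"],   # differing extra-column values
    "year": [2010, 2015, 2020],
    "value": [1.0, 2.0, 3.0],
})

# Treating the extra column as part of the index splits what is otherwise
# one timeseries into two rows, each padded with NaN where the other has data
wide = long.set_index(["model", "scenario", "extra", "year"])["value"].unstack("year")
print(wide)  # 2 rows x 3 year columns, with 3 NaN cells in total
```

Every additional distinct extra-column value adds another row to the wide format, which is what makes the key effectively unspecified and infinitely extensible.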
good question, let me bounce it back to you: if there is a dataframe like

i.e., all index values the same except for the extra column - should interpolation apply? The bugfix in #352 assumes no (there will not be any interpolation, because the first and second line have a different "index" and so are different timeseries). If you think it should be yes, how should
Yes, I would like to see interpolation, i.e. extra columns are ignored for this index. E.g. if I have a meta column with 'infilled' at one timepoint, I'd like to infill around it. I'm mostly treating these cols as extra info, not core function. OTOH, I have no strong preference for what to infill. The annoying answer would be to check whether the two points around it have identical entries, copy them if so, and use a non-breaking empty value (0 or "" etc.) if not. But I also want this function to work quickly and currently don't have a strong use-case, so it depends on how much effort that takes. Always using a non-breaking empty value is fine too.
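The "annoying answer" above can be sketched as a tiny helper (hypothetical function, not part of the pyam API):

```python
import pandas as pd

def extra_value_for_interpolated(before, after, empty=""):
    """Pick the extra-column value for an interpolated point: copy the
    neighbours' value if they agree, otherwise fall back to a
    non-breaking empty value. Hypothetical helper, not pyam API."""
    if pd.notna(before) and before == after:
        return before
    return empty

print(extra_value_for_interpolated("infilled", "infilled"))  # "infilled"
print(extra_value_for_interpolated("infilled", "raw"))       # ""
```

The cheap alternative discussed above would simply be to always return `empty`, skipping the neighbour comparison entirely.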
Ok, I think we agree. Please review #352, where I just added another test based on the example above to ensure that timeseries with non-matching extra-column values are not interpolated.
OK, will there be an update later to make it fail that test? |
No, that test shouldn't fail - that's the expected behaviour. |
This confuses me because I just said that I did want interpolation in that case. |
Ok, then we were talking past each other. Can you please specify which output you expect from
In my understanding, because there are no data points with the same index (including
I didn't realise we were planning to allow that sort of data. I thought that (model x scenario x region x variable x time) was a unique key, hence this is an invalid df. Similar to the multi-unit question in #338, it's massively annoying to work out whether a model-scenario is "complete" if I need to ask whether an unknown number of extra columns are also populated in every combination. Can we combine data from extra_col = foo with data from extra_col = bar? Without knowing what those columns are, it seems impossible to know. If the extra column is just a label to say 'infilled', clearly it's silly to predicate anything on the basis of that, and I assumed we'd want to combine infilled and non-infilled data into the same row of a timeseries. If it somehow changes the meaning (like a unit working under a different conversion metric), we need to consider that. I believe we're trying to set up a zoom chat this week on nomenclature for additional columns.
At least we got to the bottom of the misunderstanding... Yes, this use case was very much the driver for the

What you seem to have in mind is more akin to #287 - tracking which operations modified specific subsections of the data. This should, in my opinion, not be part of the

I suggest the following way forward:
The merge fixes the most acute of the problems, so go ahead with that. This issue doesn't currently exist in emissions data since these columns are unused. I think the analogy with how things work there would be
closed via #352 |
When a dataframe has additional data created, e.g. by interpolate, additional columns are often given np.nan as the value rather than, say, an empty string or a 0.0. This is presumably a choice rather than a bug. However, when there are nans in these columns, they will not appear in timeseries, and will also disappear if the dataframe is recalculated by
pyam.IamDataFrame(database_with_nans.data)
. This seems like a bug.
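The vanishing rows can be reproduced with plain pandas (a minimal sketch with illustrative column names, not pyam's actual implementation): `groupby` drops groups whose key contains NaN by default, so a NaN in an index-like extra column silently removes that row from any wide, timeseries-style pivot.

```python
import numpy as np
import pandas as pd

# Long-format data where interpolation left a NaN in an extra column
df = pd.DataFrame({
    "model": ["m", "m"],
    "extra": ["label", np.nan],  # NaN introduced e.g. by interpolation
    "year": [2010, 2015],
    "value": [1.0, 2.0],
})

# groupby drops groups whose key contains NaN (dropna=True is the default),
# so the 2015 row vanishes from the wide representation
wide = df.groupby(["model", "extra", "year"])["value"].sum().unstack("year")
print(wide.shape)  # (1, 1): only the 2010 row survives

# dropna=False keeps the NaN-keyed row
wide_all = (
    df.groupby(["model", "extra", "year"], dropna=False)["value"]
    .sum()
    .unstack("year")
)
print(wide_all.shape)  # (2, 2)
```

This is consistent with the reported behaviour: the value itself is intact in long format, but any operation that uses the extra columns as part of a grouping key loses the NaN rows.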