
Proposal: a new nullable Timestamp data type with support for non-ns resolutions #40932

Closed
jorisvandenbossche opened this issue Apr 13, 2021 · 26 comments
Labels
Closing Candidate, Datetime, Enhancement, NA - MaskedArrays, Needs Discussion, Non-Nano, Roadmap

Comments

@jorisvandenbossche
Member

Motivation for this proposal: for full dtype support with nullable dtypes, we also need a nullable version of the datetime-like dtypes. For backwards compatibility, we need a new dtype (like we did for the other nullable dtypes), and that's what the proposal below describes. And when creating a new dtype, I think it is the perfect opportunity to have a different default resolution (e.g. microsecond instead of nanosecond).

Summary: This proposal puts forward a new TimestampDtype, a nullable extension dtype to hold timestamp data:

  • A new timestamp data type that follows the pattern of the nullable dtypes (e.g. integer, boolean) with consistent missing value behaviour.
  • A parameterized data type with support for multiple resolutions (from seconds through nanoseconds) and optionally time zones (unifying the tz-naive and tz-aware dtypes into a single ExtensionDtype).
  • The new data type can have a better default resolution (e.g. microseconds instead of nanoseconds).
  • I suggest using "timestamp" for the dtype name, because 1) we need a different name to differentiate from "datetime64" anyway and 2) this is then internally consistent with our Timestamp scalar. But an alternative could also be "Datetime64" (capitalized).

Full version at https://docs.google.com/document/d/1uCdxjlYAafdHD7f57kpkPsJV2Q9Oaxkg1a8V5steMBM/edit?usp=sharing

This would address #7307

Small illustrative code snippet:

>>> s
0   2020-01-01 00:00:00
1                  <NA>
2   2020-01-01 02:00:00
dtype: timestamp[us]

>>> s_tz
0   2020-01-01 00:00:00+01:00
1   2020-01-01 01:00:00+01:00
2   2020-01-01 02:00:00+01:00
dtype: timestamp[ns, tz=Europe/Brussels]
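(For illustration, construction might look like the sketch below; the dtype strings are part of this proposal and not existing pandas API, so this is hypothetical.)

>>> s = pd.Series(
...     ["2020-01-01 00:00:00", None, "2020-01-01 02:00:00"],
...     dtype="timestamp[us]",   # proposed dtype string, not yet implemented
... )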

Looking forward to your thoughts / comments.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

@jorisvandenbossche added the Needs Discussion, Roadmap, and NA - MaskedArrays labels Apr 13, 2021
@jbrockmendel
Member

There are two unrelated things here: 1) implementing non-nano resolution, and 2) implementing a MaskedArray wrapper around the existing DatetimeArray/TimedeltaArray.

The former mostly takes place in the cython code and doesn't need to know anything about the np.datetime64 dtype, much less whatever EADtype we want to wrap it with. I've begun work on this.

The latter I don't care about enough to object to, just think it's kind of pointless.

@datapythonista
Member

About supporting different resolutions: I may have funding to work on it later this year, as part of a project which requires year resolution to work on very long periods (i.e. millions of years). It is not confirmed yet; if it happens, I should start working on this in September.

@bashtage
Contributor

bashtage commented Apr 14, 2021

My personal take on timestamps is that 64 bits is just not enough. I think a proper, future-proof timestamp would need to be 96 bits, which would naturally be implemented as an int128/uint128. While this adds some complexity, there are some well-developed libraries that let these types be seamlessly handled across compilers.

96 bits allows for effectively infinite length spans at the ns precision (> 10**12 years, which is about 4 billion times larger than the current range). The remaining 32 bits can then be used for timezone or other purposes.

For comparison, MATLAB uses 96 bits in its timestamp.
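The arithmetic behind those figures checks out; a quick sketch, assuming a signed 96-bit value (so 95 usable magnitude bits) at ns resolution:

NS_PER_YEAR = 365.25 * 24 * 3600 * 10**9

# full span of a signed 96-bit ns timestamp, in years either side of the epoch
print(2**95 / NS_PER_YEAR)  # ~1.3e12 years, i.e. > 10**12

# compared with the signed 64-bit ns range pandas uses today
print(2**95 / 2**63)        # 2**32 ~= 4.3e9, i.e. ~4 billion times larger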

@jorisvandenbossche
Member Author

jorisvandenbossche commented Apr 14, 2021

@bashtage do you think there are many use cases that need both nanosecond resolution and a million+ year time range? For example, with microsecond resolution you already have +/- 2.9e5 years, and with second resolution you get the million+ year range (+/- 2.9e11 years).
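For reference, those ranges follow directly from storing a signed int64 tick count from the epoch; a quick computation:

SECONDS_PER_YEAR = 365.25 * 24 * 3600

for unit, ticks_per_second in [("s", 1), ("ms", 10**3), ("us", 10**6), ("ns", 10**9)]:
    years = 2**63 / (ticks_per_second * SECONDS_PER_YEAR)
    print(f"{unit}: +/- {years:.1e} years")

# s:  +/- 2.9e+11 years
# ms: +/- 2.9e+08 years
# us: +/- 2.9e+05 years
# ns: +/- 2.9e+02 years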

It might well be that there are well-developed libraries to handle int96 types (can you give some examples?), but the core libraries we will typically depend on in the short term (numpy, arrow) only support up to int64. I think that makes it, in the short term, not very practical to go with int96.

Also, I am wondering if MATLAB is rather an outlier here (which doesn't mean they can't have good reasons to do so, of course). Most other systems I am somewhat familiar with do not use int96 (in addition to numpy and Arrow: e.g. R, Julia, Spark, and databases like Postgres or ClickHouse). The fact that Arrow decided to use int64 for its standard in-memory format (with involvement of people from both the database world and the Python data science world) is also a strong argument for me (also in terms of compatibility with Arrow).
The Parquet file format did have an INT96 timestamp, but that is now deprecated in favor of the int64 versions (https://issues.apache.org/jira/browse/PARQUET-323)

@jorisvandenbossche
Member Author

The latter I don't care about enough to object to, just think it's kind of pointless.

@jbrockmendel can you try to explain why you find this pointless? How would you provide a datetime dtype with nullable semantics? (using the existing DatetimeArray?)

as part of a project which requires year resolution to work on very long periods (i.e. millions of years).

@datapythonista Interesting! Now specifically related to the above quote: do you need the resolution to be "year" for that project, or would the time range of "second" resolution also be fine, since that already provides a million+ year range (2.9e11 years, i.e. 290 billion years)?

Because personally, I am not sure we should add a "year" resolution to a timestamp dtype. For me, that's what we have Period for (since a year is a non-fixed amount of time). (see also the last section in the google docs proposal)

@shoyer
Member

shoyer commented Apr 14, 2021

The latter I don't care about enough to object to, just think it's kind of pointless.

@jbrockmendel can you try to explain why you find this pointless? How would you provide a datetime dtype with nullable semantics? (using the existing DatetimeArray?)

One question would be how this differs from the existing datetime64/timedelta64, which already allows for NaT.

Because personally, I am not sure we should add a "year" resolution to a timestamp dtype. For me, that's what we have Period for (since a year is a non-fixed amount of time). (see also the last section in the google docs proposal)

+1 from me. Non-fixed periods introduce a bunch of different issues. In contrast, supporting the second to nanosecond range for precision is just a matter of adjusting conversions by multiples of 1000.
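numpy's datetime64 already works this way between its fixed units; a small illustration:

import numpy as np

t = np.datetime64("2020-01-01T00:00:00", "s")
print(t.astype("datetime64[ms]"))  # same instant, ticks rescaled by 1000
print(t.astype("datetime64[ns]"))  # same instant, ticks rescaled by 10**9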

There are definitely use-cases for different (non-default) time resolutions, but to be honest I'm not sure they need to be in pandas. For example:

  • For climate data analysis, a larger time range is important. But climate users also care about non-Gregorian calendars, like a fixed 365 day year without leap days. Hence we implemented a custom CFTimeIndex in Xarray.
  • For astronomy, both high precision and a long range are desirable. Astropy thus uses two float64s to represent times: https://docs.astropy.org/en/stable/time/

So let's think about what use-cases this functionality would actually solve. The most valuable feature might be allowing for new custom domain-specific time dtypes via extension APIs (especially for indexing & resampling), rather than extending pandas' built-in functionality.

@datapythonista
Member

@datapythonista Interesting! Now specifically related to the above quote: do you need the resolution to be "year" for that project, or would the time range of "second" resolution also be fine, since that already provides a million+ year range (2.9e11 years, i.e. 290 billion years)?

I think that would work too. AFAIK, for the use case I discussed, the first relevant "date" is when the earth was created, around 5 billion years ago, with a granularity of years. Period didn't seem to have enough features for what was required; improving Period support is also an option.

In any case, if the project does happen and I get funding to work on this, I'll discuss it in more detail, and we can decide what's best for pandas and the rest of the community. All the work should be generic and reusable.

So let's think about what use-cases this functionality would actually solve. The most valuable feature might be allowing for new custom domain-specific time dtypes via extension APIs (especially for indexing & resampling), rather than extending pandas' built-in functionality.

This sounds like a good approach. And thanks for all the info about use cases.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Apr 14, 2021

One question would be how this differs from the existing datetime64/timedelta64, which already allows for NaT.

@shoyer NaT has different behaviour than NA (NaT behaves like NaN, e.g. comparisons return False). IMO we want a datetime-like dtype that follows the same semantics as the other nullable dtypes (in comparison and logical operations), and that's one of the main goals of the proposal.
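A quick illustration with current pandas of the two behaviours (datetime64's NaT versus the nullable Int64 semantics the proposal wants to mirror):

>>> s = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
>>> s > pd.Timestamp("2019-01-01")   # NaT silently compares as False, like NaN
0     True
1    False
dtype: bool

>>> i = pd.Series([1, pd.NA], dtype="Int64")
>>> i > 0                            # pd.NA propagates instead
0    True
1    <NA>
dtype: boolean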

@jbrockmendel
Member

How would you provide a datetime dtype with nullable semantics? (using the existing DatetimeArray?)

I would implement a NullableDatetimeArray as a hopefully-thin wrapper around the existing DatetimeArray. (Presumably this would involve refactoring MaskedArray so it could also wrap PeriodArray, TimedeltaArray, ...)

@jorisvandenbossche
Member Author

I would implement a NullableDatetimeArray as a hopefully-thin wrapper around the existing DatetimeArray

That's indeed a good possibility for the implementation, since we of course need some way to share code (whether through composition like this, through both separately calling shared functions, or otherwise).
But for me, that's more of an implementation "detail", which doesn't make any other aspect of the proposal irrelevant (e.g. the nullable semantics, a new dtype class for both tz-naive/aware data, etc.)

(and with "detail", I don't want to imply that it's not important, but rather that it's to be discussed later / not yet covered by the proposal)

@jbrockmendel
Member

But for me, that's more of an implementation "detail", which doesn't make any other aspect of the proposal irrelevant (e.g. the nullable semantics, a new dtype class for both tz-naive/aware data, etc.)

The "but" here suggests you're disagreeing about... something. I'm not clear on what that is.

@bashtage
Contributor

@bashtage do you think there are many use cases that need both nanosecond resolution and a million+ year time range? For example, with microsecond resolution you already have +/- 2.9e5 years, and with second resolution you get the million+ year range (+/- 2.9e11 years).

@jorisvandenbossche It isn't so much about needing ns resolution a billion years ago, but about having a single unit of timestamp that can handle all time horizons without needing to worry about converting. It has the added benefit of matching the resolution that is already in pandas, at the cost of 32 bits (but practically 2x the space, since it is more natural to use an __int128 for the offset to an epoch).

@jbrockmendel
Member

(but practically 2x the space, since it is more natural to use an __int128 for the offset to an epoch).

Do we even have int128 available? I don't think we support it in any of our cython code.

Having worked in memory-constrained environments, the idea of doubling the footprint is unappealing (this is part of why I'm not wild about masked arrays, too).

@jorisvandenbossche
Member Author

jorisvandenbossche commented Apr 21, 2021

But for me, that's more of an implementation "detail", which doesn't make any other aspect of the proposal irrelevant (e.g. the nullable semantics, a new dtype class for both tz-naive/aware data, etc.)

The "but" here suggests you're disagreeing about... something. I'm not clear on what that is.

@jbrockmendel Well, what started our discussion was your comment above about finding the proposed nullable dtype (minus the multiple resolutions) "kind of pointless" (and that's clearly something we are disagreeing about ;)).
I asked for a clarification of that statement (#40932 (comment)), but you may have only answered the sub-question of how to provide a nullable dtype. And your answer to that IMO doesn't contradict the general proposal (it gives a possible implementation path). So what is not clear to me is whether you actually support the proposal or not (or which aspects of it)?

@jbrockmendel
Member

I asked for a clarification of that statement (#40932 (comment)), but you may have only answered the sub-question of how to provide a nullable dtype.

Yes, you asked if I could explain why I found it pointless. I can, but choose not to.

So what is not clear to me is whether you actually support the proposal or not (or which aspects of it)?

The best I can give you is actively supporting the non-nano support and not standing in the way of the rest.

@sterlinm

sterlinm commented Aug 2, 2021

This sounds like it's going to be very useful! What's the current status on this?

@jbrockmendel
Member

What's the current status on this?

I'm planning to implement this once we are able to use cython 3

@sterlinm

sterlinm commented Aug 3, 2021

That's great to hear, thanks! I've got a project where I'd like to use Timestamp but need to use microsecond resolution for the larger range, so it will be very useful. I'm not sure if it's a one-person job or not, but if you think you could use help please let me know.

@sterlinm

Hi @jbrockmendel, this is so closely aligned with one of the projects I'm working on at work that there's some interest in seeing whether it would make sense to work on this in pandas and contribute it, rather than build our own custom solution.

Is cython 3 a requirement for updating pd.Timestamp? We're just interested in getting some sense of the timeline so we can figure out what would make sense for us to potentially contribute. Thanks!

@jbrockmendel
Member

Is cython 3 a requirement for updating pd.Timestamp?

Not a hard requirement, no.

Is your use case about the nullability or the non-nano part of the proposal here? Do you need timezone-awareness? If not, PeriodDtype should do what you need.
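For the tz-naive, coarser-resolution case, a minimal sketch of what PeriodDtype already offers today:

>>> s = pd.Series(pd.period_range("1900-01-01", periods=3, freq="S"))
>>> s
0    1900-01-01 00:00:00
1    1900-01-01 00:00:01
2    1900-01-01 00:00:02
dtype: period[S]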

@sterlinm

We are interested in both the nullability and the non-nano part of the proposal. We don't need timezone-awareness and will probably explicitly avoid any timezone-aware Timestamps.

Performance is another big motivator. Comparisons of scalar timestamps outperform pretty much all of the alternatives I've looked at with the exception of datetime.datetime.

import pandas as pd

period_1, period_2 = pd.Period('1900-01-01', freq='us'), pd.Period('2000-01-01', freq='us')
timestamp_1, timestamp_2 = period_1.to_timestamp(), period_2.to_timestamp()
npdatetime_1, npdatetime_2 = timestamp_1.to_numpy(), timestamp_2.to_numpy()
pydatetime_1, pydatetime_2 = timestamp_1.to_pydatetime(), timestamp_2.to_pydatetime()

# each %%timeit block below is a separate IPython cell
%%timeit
for _ in range(1000000):
    _ = pydatetime_1 < pydatetime_2
# datetime.datetime: 64 ms +/- 1.46 ms per loop

%%timeit
for _ in range(1000000):
    _ = timestamp_1 < timestamp_2
# pd.Timestamp: 95.2 ms +/- 532 us per loop

%%timeit
for _ in range(1000000):
    _ = npdatetime_1 < npdatetime_2
# np.datetime64: 1.64 s +/- 19.9 ms per loop

%%timeit
for _ in range(1000000):
    _ = period_1 < period_2
# pd.Period: 10.7 s +/- 9.52 ms per loop

@jbrockmendel
Member

side-note: a PR that changed the freq-equality check in Period.__richcmp__ from self.freq != other.freq to self.freq._period_dtype_code != other.freq._period_dtype_code would likely improve the performance there quite a bit
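A rough sketch of that suggestion (the surrounding code in pandas/_libs/tslibs/period.pyx is paraphrased here, not quoted):

# inside Period.__richcmp__, before: full DateOffset equality on every comparison
if self.freq != other.freq:
    raise IncompatibleFrequency("period frequencies do not match")

# after: compare the cached integer dtype codes instead, which is much cheaper
if self.freq._period_dtype_code != other.freq._period_dtype_code:
    raise IncompatibleFrequency("period frequencies do not match")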

@sterlinm

side-note: a PR that changed the freq-equality check in Period.__richcmp__ from self.freq != other.freq to self.freq._period_dtype_code != other.freq._period_dtype_code would likely improve the performance there quite a bit

I’m happy to give that a shot!

@josham
Contributor

josham commented Oct 17, 2021

+1 for supporting multiple resolutions. It would be nice to have a proper date dtype, rather than being stuck using datetime.date with object dtype, or using a higher resolution and operating under the assumption that the higher-frequency values will always be 0.

@jbrockmendel
Member

The non-nano part of this is done, and we have Arrow dtypes that support both non-nano and nullable. Can this be closed?

@jbrockmendel added the Closing Candidate label Mar 24, 2023
@mroeschke
Member

Yeah, I believe pd.ArrowDtype with pa.timestamp probably covers nullability with non-ns resolutions, so closing. For follow-ups, I think it would be best to open new issues.
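For reference, the Arrow-backed route looks like this (recent pandas with pyarrow installed):

>>> import pyarrow as pa
>>> s = pd.Series(
...     [pd.Timestamp("2020-01-01"), None],
...     dtype=pd.ArrowDtype(pa.timestamp("us")),
... )
>>> s
0    2020-01-01 00:00:00
1                   <NA>
dtype: timestamp[us][pyarrow]
>>> s > pd.Timestamp("2019-01-01")   # NA propagates, resolution is us
0    True
1    <NA>
dtype: bool[pyarrow]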
