
Proposal: a new nullable Timestamp data type with support for non-ns resolutions #40932

Closed
jorisvandenbossche opened this issue Apr 13, 2021 · 26 comments
Labels
Closing Candidate, Datetime, Enhancement, NA - MaskedArrays, Needs Discussion, Non-Nano, Roadmap

Comments

@jorisvandenbossche
Member

Motivation for this proposal: for full dtype support with nullable dtypes, we also need a nullable version of the datetime-like dtypes. For backwards compatibility, we need a new dtype (like we did for the other nullable dtypes), and that's what the proposal below describes. And when creating a new dtype, I think it is the perfect opportunity to have a different default resolution (e.g. microsecond instead of nanosecond).

Summary: This proposal puts forward a new TimestampDtype, a nullable extension dtype to hold timestamp data:

  • A new timestamp data type that follows the pattern of the nullable dtypes (e.g. integer, boolean) with consistent missing value behaviour.
  • A parameterized data type with support for multiple resolutions (from seconds through nanoseconds) and optionally time zones (unifying the tz-naive and tz-aware dtypes into a single ExtensionDtype).
  • The new data type can have a better default resolution (e.g. microseconds instead of nanoseconds).
  • I suggest using "timestamp" for the dtype name, because 1) we need a different name to differentiate from "datetime64" anyway and 2) this is then internally consistent with our Timestamp scalar. But an alternative could also be "Datetime64" (capitalized).

Full version at https://docs.google.com/document/d/1uCdxjlYAafdHD7f57kpkPsJV2Q9Oaxkg1a8V5steMBM/edit?usp=sharing

This would address #7307

Small illustrative code snippet:

>>> s
0   2020-01-01 00:00:00
1                  <NA>
2   2020-01-01 02:00:00
dtype: timestamp[us]

>>> s_tz
0   2020-01-01 00:00:00+01:00
1   2020-01-01 01:00:00+01:00
2   2020-01-01 02:00:00+01:00
dtype: timestamp[ns, tz=Europe/Brussels]
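(For illustration, construction might look like the sketch below; the dtype strings are part of this proposal and not existing pandas API, so this is hypothetical.)

>>> s = pd.Series(
...     ["2020-01-01 00:00:00", None, "2020-01-01 02:00:00"],
...     dtype="timestamp[us]",   # proposed dtype string, not yet implemented
... )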

Looking forward to your thoughts / comments.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

@jorisvandenbossche added the Needs Discussion, Roadmap, and NA - MaskedArrays labels Apr 13, 2021
@jbrockmendel
Member

There are two unrelated things here: 1) implementing non-nano resolution, and 2) implementing a MaskedArray wrapper around the existing DatetimeArray/TimedeltaArray.

The former mostly takes place in the cython code and doesn't need to know anything about the np.datetime64 dtype, much less whatever EADtype we want to wrap it with. I've begun work on this.

The latter I don't care about enough to object to, just think it's kind of pointless.

@datapythonista
Member

About supporting different resolutions: I may have funding to work on it later this year, as part of a project which requires year resolution to work on very long periods (i.e. millions of years). It is not confirmed yet; if it happens, I should start working on this in September.

@bashtage
Contributor

bashtage commented Apr 14, 2021

My personal take on timestamps is that 64 bits is just not enough. I think a proper, future-proof timestamp would need to be 96 bits, which would naturally be implemented as an int128/uint128. While this adds some complexity, there are some well-developed libraries that let these types be seamlessly handled across compilers.

96 bits allows for effectively infinite length spans at the ns precision (> 10**12 years, which is about 4 billion times larger than the current range). The remaining 32 bits can then be used for timezone or other purposes.

For comparison, MATLAB uses 96 bits in its timestamp.
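The arithmetic behind those figures checks out; a quick sketch, assuming a signed 96-bit value (so 95 usable magnitude bits) at ns resolution:

NS_PER_YEAR = 365.25 * 24 * 3600 * 10**9

# full span of a signed 96-bit ns timestamp, in years either side of the epoch
print(2**95 / NS_PER_YEAR)  # ~1.3e12 years, i.e. > 10**12

# compared with the signed 64-bit ns range pandas uses today
print(2**95 / 2**63)        # 2**32 ~= 4.3e9, i.e. ~4 billion times larger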

@jorisvandenbossche
Member Author

jorisvandenbossche commented Apr 14, 2021

@bashtage do you think there are many use cases that need both nanosecond resolution and a million+ year time range? For example, with microsecond resolution you already have +/- 2.9e5 years, and with second resolution you get the million+ year range (+/- 2.9e11 years).
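For reference, those ranges follow directly from storing a signed int64 tick count from the epoch; a quick computation:

SECONDS_PER_YEAR = 365.25 * 24 * 3600

for unit, ticks_per_second in [("s", 1), ("ms", 10**3), ("us", 10**6), ("ns", 10**9)]:
    years = 2**63 / (ticks_per_second * SECONDS_PER_YEAR)
    print(f"{unit}: +/- {years:.1e} years")

# s:  +/- 2.9e+11 years
# ms: +/- 2.9e+08 years
# us: +/- 2.9e+05 years
# ns: +/- 2.9e+02 years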

It might well be that there are well-developed libraries to handle int96 types (can you give some examples?), but the core libraries we will typically depend on in the short term (numpy, arrow) only support up to int64. I think that makes it, in the short term, not very practical to go with int96.

Also, I am wondering if MATLAB is rather an outlier here (which doesn't mean they can't have good reasons to do so, of course). Most other systems I am somewhat familiar with do not use int96 (in addition to numpy and Arrow: e.g. R, Julia, Spark, and databases like Postgres or ClickHouse). The fact that Arrow decided to use int64 for its standard in-memory format (with involvement of people from both the database world and the Python data science world) is also a strong argument for me (also in terms of compatibility with Arrow).
The Parquet file format did have an INT96 timestamp, but that is now deprecated in favor of the int64 versions (https://issues.apache.org/jira/browse/PARQUET-323)

@jorisvandenbossche
Member Author

The latter I don't care about enough to object to, just think it's kind of pointless.

@jbrockmendel can you try to explain why you find this pointless? How would you provide a datetime dtype with nullable semantics? (using the existing DatetimeArray?)

as part of a project which requires year resolution to work on very long periods (i.e. millions of years).

@datapythonista Interesting! Now specifically related to the above quote: do you need the resolution to be "year" for that project, or would the time range of "second" resolution also be fine, since that already provides a million+ year range (2.9e11 years, i.e. 290 billion years)?

Because personally, I am not sure we should add a "year" resolution to a timestamp dtype. For me, that's what we have Period for (since a year is a non-fixed amount of time). (see also the last section in the google docs proposal)

@shoyer
Member

shoyer commented Apr 14, 2021

The latter I don't care about enough to object to, just think it's kind of pointless.

@jbrockmendel can you try to explain why you find this pointless? How would you provide a datetime dtype with nullable semantics? (using the existing DatetimeArray?)

One question would be how this differs from the existing datetime64/timedelta64, which already allows for NaT.

Because personally, I am not sure we should add a "year" resolution to a timestamp dtype. For me, that's what we have Period for (since a year is a non-fixed amount of time). (see also the last section in the google docs proposal)

+1 from me. Non-fixed periods introduce a bunch of different issues. In contrast, supporting the second to nanosecond range for precision is just a matter of adjusting conversions by multiples of 1000.
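numpy's datetime64 already works this way between its fixed units; a small illustration:

import numpy as np

t = np.datetime64("2020-01-01T00:00:00", "s")
print(t.astype("datetime64[ms]"))  # same instant, ticks rescaled by 1000
print(t.astype("datetime64[ns]"))  # same instant, ticks rescaled by 10**9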

There are definitely use-cases for different (non-default) time resolutions, but to be honest I'm not sure they need to be in pandas. For example:

  • For climate data analysis, a larger time range is important. But climate users also care about non-Gregorian calendars, like a fixed 365 day year without leap days. Hence we implemented a custom CFTimeIndex in Xarray.
  • For astronomy, both high precision and a long range are desirable. Astropy thus uses two float64s to represent times: https://docs.astropy.org/en/stable/time/

So let's think about what use-cases this functionality would actually solve. The most valuable feature might be allowing for new custom domain-specific time dtypes via extension APIs (especially for indexing & resampling), rather than extending pandas' built-in functionality.

@datapythonista
Member

@datapythonista Interesting! Now specifically related to the above quote: do you need the resolution to be "year" for that project, or would the time range of "second" resolution also be fine, since that already provides a million+ year range (2.9e11 years, i.e. 290 billion years)?

I think that would work too. AFAIK, for the use case I discussed, the first relevant "date" is when the earth was created, around 5 billion years ago, with a granularity of years. Period didn't seem to have enough features for what was required; improving Period support is also an option.

In any case, if the project does happen and I get funding to work on this, I'll discuss it in more detail, and we can decide what's best for pandas and the rest of the community. All the work should be generic and reusable.

So let's think about what use-cases this functionality would actually solve. The most valuable feature might be allowing for new custom domain-specific time dtypes via extension APIs (especially for indexing & resampling), rather than extending pandas' built-in functionality.

This sounds like a good approach. And thanks for all the info about use cases.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Apr 14, 2021

One question would be how this differs from the existing datetime64/timedelta64, which already allows for NaT.

@shoyer NaT has different behaviour than NA (NaT behaves like NaN, e.g. comparisons return False). IMO we want a datetime-like dtype that follows the same semantics as the other nullable dtypes (in comparison and logical operations), and that's one of the main goals of the proposal.
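A quick illustration with current pandas of the two behaviours (datetime64's NaT versus the nullable Int64 semantics the proposal wants to mirror):

>>> s = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
>>> s > pd.Timestamp("2019-01-01")   # NaT silently compares as False, like NaN
0     True
1    False
dtype: bool

>>> i = pd.Series([1, pd.NA], dtype="Int64")
>>> i > 0                            # pd.NA propagates instead
0    True
1    <NA>
dtype: boolean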

@jbrockmendel
Member

How would you provide a datetime dtype with nullable semantics? (using the existing DatetimeArray?)

I would implement a NullableDatetimeArray as a hopefully-thin wrapper around the existing DatetimeArray. (Presumably this would involve refactoring MaskedArray so it could also wrap PeriodArray, TimedeltaArray, ...)

@jorisvandenbossche
Member Author

I would implement a NullableDatetimeArray as a hopefully-thin wrapper around the existing DatetimeArray

That's indeed a good possibility for the implementation, since we of course need some way to share code (whether through composition like this, through both separately calling shared functions, or otherwise).
But for me, that's more of an implementation "detail", which doesn't make any other aspect of the proposal irrelevant (e.g. the nullable semantics, a new dtype class for both tz-naive/aware data, etc.)

(and with "detail", I don't want to imply that it's not important, but rather that it's to be discussed later / not yet covered by the proposal)

@jbrockmendel
Member

But for me, that's more of an implementation "detail", which doesn't make any other aspect of the proposal irrelevant (e.g. the nullable semantics, a new dtype class for both tz-naive/aware data, etc.)

The "but" here suggests you're disagreeing about... something. I'm not clear on what that is.

@bashtage
Contributor

@bashtage do you think there are many use cases that need both nanosecond resolution and a million+ year time range? For example, with microsecond resolution you already have +/- 2.9e5 years, and with second resolution you get the million+ year range (+/- 2.9e11 years).

@jorisvandenbossche It isn't so much about needing ns resolution a billion years ago, but about having a single unit of timestamp that can handle all time horizons without needing to worry about converting. It has the added benefit of matching the resolution that is already in pandas, at the cost of 32 bits (but practically 2x the space, since it is more natural to use an __int128 for the offset to an epoch).

@jbrockmendel
Member

(but practically 2x the space, since it is more natural to use an __int128 for the offset to an epoch).

Do we even have int128 available? I don't think we support it in any of our cython code.

Having worked in memory-constrained environments, the idea of doubling the footprint is unappealing (this is part of why I'm not wild about masked arrays, too).

@jorisvandenbossche
Member Author

jorisvandenbossche commented Apr 21, 2021

But for me, that's more of an implementation "detail", which doesn't make any other aspect of the proposal irrelevant (e.g. the nullable semantics, a new dtype class for both tz-naive/aware data, etc.)

The "but" here suggests you're disagreeing about... something. I'm not clear on what that is.

@jbrockmendel Well, what started our discussion was your comment above about finding the proposed nullable dtype (minus the multiple resolutions) "kind of pointless" (and that's clearly something we are disagreeing about ;)).
I asked for a clarification of that statement (#40932 (comment)), but you may have only answered the sub-question of how to provide a nullable dtype. And your answer to that IMO doesn't contradict the general proposal (it gives a possible implementation path). So what is not clear to me is whether you actually support the proposal or not (or which aspects of it)?

@jbrockmendel
Member

I asked for a clarification of that statement (#40932 (comment)), but you may have only answered the sub-question of how to provide a nullable dtype.

Yes, you asked if I could explain why I found it pointless. I can, but choose not to.

So what is not clear to me is whether you actually support the proposal or not (or which aspects of it)?

The best I can give you is actively supporting the non-nano support and not standing in the way of the rest.

@sterlinm

sterlinm commented Aug 2, 2021

This sounds like it's going to be very useful! What's the current status on this?

@jbrockmendel
Member

What's the current status on this?

I'm planning to implement this once we are able to use cython 3

@sterlinm

sterlinm commented Aug 3, 2021

That's great to hear, thanks! I've got a project where I'd like to use Timestamp but need to use microsecond resolution for the larger range, so it will be very useful. I'm not sure if it's a one-person job or not, but if you think you could use help please let me know.

@sterlinm

Hi @jbrockmendel, this is so closely aligned with one of the projects I'm working on at work that there's some interest in seeing whether it would make sense to work on this in pandas and contribute it, rather than build our own custom solution.

Is cython 3 a requirement for updating pd.Timestamp? We're just interested in getting some sense of the timeline so we can figure out what would make sense for us to potentially contribute. Thanks!

@jbrockmendel
Member

Is cython 3 a requirement for updating pd.Timestamp?

Not a hard requirement, no.

Is your use case about the nullability or the non-nano part of the proposal here? Do you need timezone-awareness? If not, PeriodDtype should do what you need.
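For the tz-naive, coarser-resolution case, a minimal sketch of what PeriodDtype already offers today:

>>> s = pd.Series(pd.period_range("1900-01-01", periods=3, freq="S"))
>>> s
0    1900-01-01 00:00:00
1    1900-01-01 00:00:01
2    1900-01-01 00:00:02
dtype: period[S]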

@sterlinm

We are interested in both the nullability and the non-nano part of the proposal. We don't need timezone-awareness and will probably explicitly avoid any timezone-aware Timestamps.

Performance is another big motivator. Comparisons of scalar timestamps outperform pretty much all of the alternatives I've looked at with the exception of datetime.datetime.

import pandas as pd

period_1, period_2 = pd.Period('1900-01-01', freq='us'), pd.Period('2000-01-01', freq='us')
timestamp_1, timestamp_2 = period_1.to_timestamp(), period_2.to_timestamp()
npdatetime_1, npdatetime_2 = timestamp_1.to_numpy(), timestamp_2.to_numpy()
pydatetime_1, pydatetime_2 = timestamp_1.to_pydatetime(), timestamp_2.to_pydatetime()

# each %%timeit block below is a separate IPython cell
%%timeit
for _ in range(1000000):
    _ = pydatetime_1 < pydatetime_2
# datetime.datetime: 64 ms +/- 1.46 ms per loop

%%timeit
for _ in range(1000000):
    _ = timestamp_1 < timestamp_2
# pd.Timestamp: 95.2 ms +/- 532 us per loop

%%timeit
for _ in range(1000000):
    _ = npdatetime_1 < npdatetime_2
# np.datetime64: 1.64 s +/- 19.9 ms per loop

%%timeit
for _ in range(1000000):
    _ = period_1 < period_2
# pd.Period: 10.7 s +/- 9.52 ms per loop

@jbrockmendel
Member

side-note: a PR that changed the freq-equality check in Period.__richcmp__ from self.freq != other.freq to self.freq._period_dtype_code != other.freq._period_dtype_code would likely improve the performance there quite a bit
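A rough sketch of that suggestion (the surrounding code in pandas/_libs/tslibs/period.pyx is paraphrased here, not quoted):

# inside Period.__richcmp__, before: full DateOffset equality on every comparison
if self.freq != other.freq:
    raise IncompatibleFrequency("period frequencies do not match")

# after: compare the cached integer dtype codes instead, which is much cheaper
if self.freq._period_dtype_code != other.freq._period_dtype_code:
    raise IncompatibleFrequency("period frequencies do not match")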

@sterlinm

side-note: a PR that changed the freq-equality check in Period.__richcmp__ from self.freq != other.freq to self.freq._period_dtype_code != other.freq._period_dtype_code would likely improve the performance there quite a bit

I’m happy to give that a shot!

@josham
Contributor

josham commented Oct 17, 2021

+1 for supporting multiple resolutions. It would be nice to have a proper date dtype, rather than being stuck using datetime.date with object dtype, or using a higher resolution and operating under the assumption that the higher-frequency values will always be 0.

@jbrockmendel
Member

The non-nano part of this is done, and we have Arrow dtypes that support both non-nano and nullable. Can this be closed?

@jbrockmendel added the Closing Candidate label Mar 24, 2023
@mroeschke
Member

Yeah, I believe pd.ArrowDtype with pa.timestamp probably covers nullability with non-ns resolutions, so closing. For follow-ups, I think it would be best to open new issues.
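For reference, the Arrow-backed route looks like this (recent pandas with pyarrow installed):

>>> import pyarrow as pa
>>> s = pd.Series(
...     [pd.Timestamp("2020-01-01"), None],
...     dtype=pd.ArrowDtype(pa.timestamp("us")),
... )
>>> s
0    2020-01-01 00:00:00
1                   <NA>
dtype: timestamp[us][pyarrow]
>>> s > pd.Timestamp("2019-01-01")   # NA propagates, resolution is us
0    True
1    <NA>
dtype: bool[pyarrow]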
