Proposal: a new nullable Timestamp data type with support for non-ns resolutions #40932
Comments
There are two unrelated things here: 1) implementing non-nano resolution, and 2) implementing a MaskedArray wrapper around the existing DatetimeArray/TimedeltaArray. The former mostly takes place in the Cython code and doesn't need to know anything about the np.datetime64 dtype, much less whatever EADtype we want to wrap it with. I've begun work on this. The latter I don't care about enough to object to; I just think it's kind of pointless.
About supporting different resolutions: I may have funding to work on it later this year, as part of a project which requires year resolution to work on very long periods (i.e. millions of years). It is not confirmed yet. If it happens, I should start working on this in September.
My personal take on timestamps is that 64 bits is just not enough. I think a proper future-proof timestamp would need to be 96 bits, which would naturally be implemented as an int128/uint128. While this adds some complexity, there are some well-developed libraries that let these types be seamlessly handled across compilers. 96 bits allows for effectively infinite-length spans at ns precision (> 10**12 years, which is about 4 billion times larger than the current range). The remaining 32 bits can then be used for timezone or other purposes. For comparison, MATLAB uses 96 bits in their timestamp.
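A quick sanity check of those range numbers (my own arithmetic, not from the comment above):

```python
# Span representable by a signed 96-bit count of nanoseconds, in years.
span_ns = 2 ** 95                          # one sign bit, 95 magnitude bits
ns_per_year = 1e9 * 60 * 60 * 24 * 365.25  # nanoseconds in a Julian year
print(span_ns / ns_per_year)               # ~1.25e12 years
print(2 ** 95 / 2 ** 63)                   # ~4.3e9: "about 4 billion times" the int64 ns range
```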
@bashtage do you think there are many use cases that need both nanosecond resolution and a million+ year time range? For example, with microsecond resolution you already have +/- 2.9e5 years, and with second resolution you get the million+ year range (+/- 2.9e11 years).

It might well be that there are well-developed libraries to handle int96 types (can you give some examples?), but the core libraries we will typically depend on in the short term (numpy, arrow) only support up to int64. I think that makes it, in the short term, not very practical to go with int96.

Also, I am wondering if MATLAB is rather an outlier here (which doesn't mean they can't have good reasons to do so, of course). Most other systems I am somewhat familiar with do not use int96 (in addition to numpy and Arrow: e.g. R, Julia, Spark, databases like Postgres or ClickHouse). Also, the fact that Arrow decided on int64 for their standard in-memory format (with involvement of people from both the database world and the python data science world) is a strong argument for me (also in terms of compatibility with Arrow).
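The ranges quoted here follow directly from the int64 tick count; one way to see them with numpy (my own illustration):

```python
import numpy as np

# Latest datetime representable by an int64 tick count at each resolution.
for unit in ("ns", "us", "ms", "s"):
    print(unit, np.datetime64(np.iinfo(np.int64).max, unit))
```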
@jbrockmendel can you try to explain why you find this pointless? How would you provide a datetime dtype with nullable semantics? (using the existing DatetimeArray?)
@datapythonista Interesting! Now specifically related to the above quote: do you need the resolution to be "year" for that project, or would the time range of "second" resolution also be fine, since that already provides millions+ of years (2.9e11 years, i.e. 290 billion years)? Because personally, I am not sure we should add a "year" resolution to a timestamp dtype. For me, that's what we have Period for (since a year is a non-fixed amount of time). (see also the last section in the google docs proposal)
One question would be how this differs from the existing datetime64/timedelta64, which already allows for NaT.
+1 from me. Non-fixed periods introduce a bunch of different issues. In contrast, supporting the second to nanosecond range for precision is just a matter of adjusting conversions by multiples of 1000. There are definitely use-cases for different (non-default) time resolutions, but to be honest I'm not sure they need to be in pandas. For example:
So let's think about what use-cases this functionality would actually solve. The most valuable feature might be allowing for new custom domain-specific time dtypes via extension APIs (especially for indexing & resampling), rather than extending pandas' built-in functionality.
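A small illustration of the "multiples of 1000" point from the comment above, using plain numpy (my own example):

```python
import numpy as np

ts_ns = np.datetime64("2021-04-13T12:34:56.789123456")  # nanosecond resolution
ts_us = ts_ns.astype("datetime64[us]")  # one step coarser: ticks divided by 1000
ts_s = ts_ns.astype("datetime64[s]")    # three steps: ticks divided by 1000**3
print(ts_ns, ts_us, ts_s)
```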
I think that would work too. Afaik, for the use case I discussed, the first relevant "date" is when the earth was created, around 5 billion years ago, with a granularity of years. In any case, if the project is finally happening and I've got funding to work on this, I'll discuss it in more detail, and we can decide what's best for pandas and the rest of the community. All the work should be generic and reusable.
This sounds like a good approach. And thanks for all the info about use cases.
@shoyer NaT behaves differently from NA (NaT behaves like NaN, e.g. comparisons return False). IMO we want a datetime-like dtype that follows the same semantics as the other nullable dtypes (in comparison and logical operations), and that's one of the main goals of the proposal.
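For concreteness, the behavioural difference in current pandas (my own example):

```python
import pandas as pd

print(pd.NaT == pd.NaT)  # False -- NaT behaves like NaN in comparisons
print(pd.NA == pd.NA)    # <NA>  -- NA propagates (three-valued Kleene logic)
```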
I would implement a NullableDatetimeArray as a hopefully-thin wrapper around the existing DatetimeArray. (Presumably this would involve refactoring MaskedArray so it could also wrap PeriodArray, TimedeltaArray, ...)
That's indeed a good possibility for the implementation, since we of course need some way to share code (whether it's with composition like this, with both separately calling shared functions, or ...). (and with "detail", I don't want to imply that it's not important, but rather that it's to be discussed later / not yet covered by the proposal)
The "but" here suggests you're disagreeing about... something. I'm not clear on what that is. |
@jorisvandenbossche It isn't so much about needing ns resolution a billion years ago, but about having a single timestamp unit that can handle all time horizons without needing to worry about converting. It has the added benefit of matching the resolution that is already in pandas, at the cost of 32 bits (but practically 2x the space, since it is more natural to use an __int128 to offset to an epoch).
Do we even have int128 available? I don't think we support it in any of our Cython code. Having worked in memory-constrained environments, the idea of doubling the footprint is unappealing (this is part of why I'm not wild about masked arrays too).
@jbrockmendel Well, what started our discussion was your comment above about finding the proposed nullable dtype (minus the multiple resolutions) "kind of pointless" (and that's clearly something we are disagreeing about ;)).
Yes, you asked if I could explain why I found it pointless. I can, but choose not to.
The best I can give you is actively supporting the non-nano work and not standing in the way of the rest.
This sounds like it's going to be very useful! What's the current status on this?
I'm planning to implement this once we are able to use Cython 3
That's great to hear, thanks! I've got a project where I'd like to use Timestamp but need to use microsecond resolution for the larger range, so it will be very useful. I'm not sure if it's a one-person job or not, but if you think you could use help please let me know.
Hi @jbrockmendel, this is so closely aligned with one of the projects I'm working on at work that there's some interest in seeing whether it would make sense to work on this in pandas and contribute it, rather than build our own custom solution. Is Cython 3 a requirement for updating pd.Timestamp? We're just interested in getting some sense of the timeline so we can figure out what would make sense for us to potentially contribute. Thanks!
Not a hard requirement, no. Is your use case about the nullability or the non-nano part of the proposal here? Do you need timezone-awareness? If not, PeriodDtype should do what you need.
We are interested in both the nullability and the non-nano part of the proposal. We don't need timezone-awareness and will probably explicitly avoid any timezone-aware Timestamps. Performance is another big motivator: comparisons of scalar timestamps outperform pretty much all of the alternatives I've looked at, with the exception of plain datetime.datetime (timings below, run as separate IPython cells):

```python
import pandas as pd

period_1, period_2 = pd.Period('1900-01-01', freq='us'), pd.Period('2000-01-01', freq='us')
timestamp_1, timestamp_2 = period_1.to_timestamp(), period_2.to_timestamp()
npdatetime_1, npdatetime_2 = timestamp_1.to_numpy(), timestamp_2.to_numpy()
pydatetime_1, pydatetime_2 = timestamp_1.to_pydatetime(), timestamp_2.to_pydatetime()

# datetime.datetime
# 64 ms +/- 1.46 ms per loop
%%timeit
for _ in range(1000000):
    _ = pydatetime_1 < pydatetime_2

# pd.Timestamp
# 95.2 ms +/- 532 us per loop
%%timeit
for _ in range(1000000):
    _ = timestamp_1 < timestamp_2

# np.datetime64
# 1.64 s +/- 19.9 ms per loop
%%timeit
for _ in range(1000000):
    _ = npdatetime_1 < npdatetime_2

# pd.Period
# 10.7 s +/- 9.52 ms per loop
%%timeit
for _ in range(1000000):
    _ = period_1 < period_2
```
side-note: a PR that changed
I’m happy to give that a shot!
+1 for supporting multiple resolutions. It would be nice to have a proper date dtype, rather than being stuck using datetime.date with object dtype, or using a higher resolution and operating under the assumption that the higher-frequency values will always be 0.
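To show the status quo being described here (my own example), dates currently fall back to object dtype:

```python
import datetime
import pandas as pd

s = pd.Series([datetime.date(2020, 1, 1), datetime.date(2020, 6, 1)])
print(s.dtype)  # object -- each date is a boxed Python object, so no vectorized date ops
```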
The non-nano part of this is done, and we have Arrow dtypes that support both non-nano and nullable. Can this be closed?
Yeah I believe |
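For reference, both pieces now exist in pandas 2.x (my own illustration; exact minimum versions aside):

```python
import pandas as pd

# numpy-backed, non-nanosecond resolution
s1 = pd.Series(pd.to_datetime(["2020-01-01", "2021-06-01"])).astype("datetime64[us]")

# Arrow-backed timestamps: non-nano resolution plus NA-style nullability
s2 = pd.array([pd.Timestamp("2020-01-01"), None], dtype="timestamp[us][pyarrow]")

print(s1.dtype)  # datetime64[us]
print(s2.dtype)  # timestamp[us][pyarrow]
```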
Motivation for this proposal: for full dtype support with nullable dtypes, we also need a nullable version of the datetime-like dtypes. For backwards compatibility we need a new dtype (like we did for the other nullable dtypes), and that's what the proposal below describes. And when creating a new dtype, I think it is the perfect opportunity to have a different default for the resolution (e.g. microsecond unit instead of nanosecond).
Summary: This proposal puts forward a new TimestampDtype, a nullable extension dtype to hold timestamp data.

Full version at https://docs.google.com/document/d/1uCdxjlYAafdHD7f57kpkPsJV2Q9Oaxkg1a8V5steMBM/edit?usp=sharing
This would address #7307
Small illustrative code snippet:
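(A hypothetical sketch of what this might look like; the TimestampDtype name comes from the summary above, while the constructor spelling and unit parameter are assumptions, not an existing pandas API:)

```python
import pandas as pd

# Hypothetical API -- not implemented in pandas today
s = pd.Series(
    ["2021-01-01", "2021-01-02", None],
    dtype=pd.TimestampDtype(unit="us"),  # assumed constructor; default unit under discussion
)
# Missing values would use pd.NA (nullable semantics), not NaT
```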
Looking forward to your thoughts / comments.
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage