clib.conversion._to_numpy: Add tests for pandas.Series with datetime dtypes #3670

Merged (12 commits) on Jan 9, 2025

@seisman (Member) commented Dec 3, 2024

This PR adds tests for pandas.Series with datetime dtypes. Addresses #3600.

In pandas, datetime dtypes can be specified in the following ways:

  1. Via NumPy dtypes, e.g., "datetime64[s]"
  2. Via pandas.DatetimeTZDtype, e.g., pd.DatetimeTZDtype("s", tz="UTC") or "datetime64[s, UTC]"
  3. Via pyarrow timestamp types, e.g., pd.ArrowDtype(pyarrow.timestamp("s", tz="UTC")) or "timestamp[s, UTC][pyarrow]"

The following code helps us understand the default conversion behavior:

>>> import numpy as np
>>> import pandas as pd

>>> # The sample data
>>> data = [pd.Timestamp("2024-01-02T03:04:05"), pd.Timestamp("2024-01-02T03:04:06")]

Via NumPy dtypes. The conversions are done as expected.

>>> series = pd.Series(data, dtype="datetime64[s]")
>>> series
0   2024-01-02 03:04:05
1   2024-01-02 03:04:06
dtype: datetime64[s]
>>> np.ascontiguousarray(series)
array(['2024-01-02T03:04:05', '2024-01-02T03:04:06'],
      dtype='datetime64[s]')

Via pd.DatetimeTZDtype with a time zone. The pandas.Series object is converted to object dtype, so we need to handle the conversion manually. The expected NumPy dtype and time zone information can be accessed via series.dtype.base and series.dtype.tz.

>>> series = pd.Series(data, dtype="datetime64[s, America/New_York]")
>>> series
0   2024-01-02 03:04:05-05:00
1   2024-01-02 03:04:06-05:00
dtype: datetime64[s, America/New_York]
>>> np.ascontiguousarray(series)
array([Timestamp('2024-01-02 03:04:05-0500', tz='America/New_York'),
       Timestamp('2024-01-02 03:04:06-0500', tz='America/New_York')],
      dtype=object)

>>> series.dtype.base
dtype('<M8[s]')
>>> series.dtype.tz
<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>
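
As a minimal sketch of the manual conversion (assuming pandas >= 2.1 so that the unit is preserved, and not necessarily the exact approach taken in this PR), the tz-naive dtype from series.dtype.base can be passed explicitly. The resulting array holds UTC instants, which should match converting to UTC and dropping the time zone first:

```python
import numpy as np
import pandas as pd

data = [pd.Timestamp("2024-01-02T03:04:05"), pd.Timestamp("2024-01-02T03:04:06")]
series = pd.Series(data, dtype="datetime64[s, America/New_York]")

# Request the tz-naive NumPy dtype explicitly instead of the object-dtype default.
array = np.ascontiguousarray(series, dtype=series.dtype.base)

# Equivalent route: convert to UTC, then drop the time zone information.
expected = series.dt.tz_convert("UTC").dt.tz_localize(None).to_numpy()

# Both arrays hold UTC wall-clock times (New York is UTC-5 in January).
assert array.dtype.kind == "M"
assert (array == expected).all()
```

Note that the local wall-clock time is not preserved: the values are shifted to UTC.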

In pandas 2.0, there was a bug (pandas-dev/pandas#52705) where pd.DatetimeTZDtype with any unit was stored with a dtype in ns resolution. The bug was fixed in pandas 2.1 (pandas-dev/pandas#52706), but there is no workaround on our side, so the related tests are marked as xfail for pandas 2.0.
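
A version-conditional xfail of this kind could be sketched as follows. The test name and the assertion are hypothetical and only illustrate the pattern; they are not the tests added in this PR. The sketch assumes pytest and packaging are installed.

```python
import numpy as np
import pandas as pd
import pytest
from packaging.version import Version

# True only for pandas 2.0.x, where tz-aware dtypes were stored with ns resolution.
PANDAS_LT_21 = Version(pd.__version__) < Version("2.1")


@pytest.mark.parametrize(
    "dtype",
    [
        "datetime64[s]",
        pytest.param(
            "datetime64[s, UTC]",
            marks=pytest.mark.xfail(
                PANDAS_LT_21, reason="pandas-dev/pandas#52705"
            ),
        ),
    ],
)
def test_series_keeps_second_unit(dtype):
    """Hypothetical check that the requested unit survives Series creation."""
    series = pd.Series([pd.Timestamp("2024-01-02T03:04:05")], dtype=dtype)
    # .base is the tz-naive NumPy dtype for both NumPy and DatetimeTZDtype dtypes.
    assert series.dtype.base == np.dtype("M8[s]")
```

With the condition passed as the first argument, the xfail mark only takes effect on pandas 2.0 and the test runs normally on newer versions.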


Via pyarrow timestamp types. The pandas.Series object is converted to object dtype, so we need to handle the conversion manually. The expected NumPy datetime dtype and time zone information can be accessed via series.dtype.numpy_dtype and series.dtype.pyarrow_dtype.tz.

>>> series = pd.Series(data, dtype="timestamp[s, America/New_York][pyarrow]")
>>> series
0    2024-01-01 22:04:05-05:00
1    2024-01-01 22:04:06-05:00
dtype: timestamp[s, tz=America/New_York][pyarrow]

>>> np.ascontiguousarray(series)
array([Timestamp('2024-01-01 22:04:05-0500', tz='America/New_York'),
       Timestamp('2024-01-01 22:04:06-0500', tz='America/New_York')],
      dtype=object)
>>> series.dtype.numpy_dtype
dtype('<M8[s]')
>>> series.dtype.pyarrow_dtype.tz
'America/New_York'

In pandas 2.0, series.dtype.numpy_dtype is dtype('O'), and the Series doesn't have the series.dt.tz_convert method.
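
The two time zone aware cases can be folded into one helper. This is a sketch only: the function names are hypothetical, the fallback for an object numpy_dtype builds the dtype from the pyarrow unit, and pyarrow must be installed for the ArrowDtype branch to be reached at all.

```python
import numpy as np
import pandas as pd


def resolve_numpy_dtype(series):
    """Return the NumPy dtype to request, or None to keep NumPy's default."""
    dtype = series.dtype
    if isinstance(dtype, pd.DatetimeTZDtype):
        # Tz-naive datetime64 dtype with the same unit.
        return dtype.base
    if (
        isinstance(dtype, pd.ArrowDtype)
        and getattr(dtype.pyarrow_dtype, "tz", None) is not None
    ):
        numpy_dtype = dtype.numpy_dtype
        if numpy_dtype == np.dtype("O"):
            # pandas 2.0 reports dtype("O"), so build the dtype from the unit.
            numpy_dtype = np.dtype(f"M8[{dtype.pyarrow_dtype.unit}]")
        return numpy_dtype
    return None


def series_to_numpy(series):
    """Convert a Series to a contiguous array, handling tz-aware datetimes."""
    return np.ascontiguousarray(series, dtype=resolve_numpy_dtype(series))
```

For tz-naive NumPy dtypes the helper returns None, so the default conversion shown in the first example is left untouched.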

Base automatically changed from to_numpy/pandas_numeric to main December 12, 2024 01:29
@seisman force-pushed the to_numpy/pandas_datetime branch from 850ac31 to a64e9e3 December 12, 2024 01:31
@seisman force-pushed the to_numpy/pandas_datetime branch from a64e9e3 to 5867999 December 12, 2024 07:22
@seisman force-pushed the to_numpy/pandas_datetime branch from 068c5cb to 56a266d December 12, 2024 07:59
@seisman force-pushed the to_numpy/pandas_datetime branch from 6b7017b to fb14509 December 12, 2024 09:55
@seisman added the "maintenance" label (Boring but important stuff for the core devs) Dec 12, 2024
@seisman marked this pull request as ready for review December 12, 2024 10:00
@seisman added the "needs review" label (This PR has higher priority and needs review.) Dec 12, 2024
@seisman added this to the 0.14.0 milestone Dec 12, 2024
@seisman removed the "needs review" label (This PR has higher priority and needs review.) Dec 19, 2024
@seisman removed this from the 0.14.0 milestone Dec 19, 2024

@weiji14 (Member) left a comment

Looks ok, just one suggestion to add a reminder to remove the pandas<2.1 workaround.

Comment on lines +201 to +203
if Version(pd.__version__) < Version("2.1"):
    # In pandas 2.0, dtype.numpy_dtype is dtype("O").
    numpy_dtype = np.dtype(f"M8[{dtype.pyarrow_dtype.unit}]")  # type: ignore[assignment, attr-defined]

Can you add a TODO here to remove this once we drop support for pandas 2.0? Should be after 2025-08-29 according to https://scientific-python.org/specs/spec-0000/#support-window

@seisman added this to the 0.15.0 milestone Jan 9, 2025
@seisman merged commit 9e912ba into main Jan 9, 2025
18 of 21 checks passed
@seisman deleted the to_numpy/pandas_datetime branch January 9, 2025 02:38
@michaelgrund changed the title from "clib.converison._to_numpy: Add tests for pandas.Series with datetime dtypes" to "clib.conversion._to_numpy: Add tests for pandas.Series with datetime dtypes" Jan 9, 2025