CLN: Single way to get underlying values for Index / Series #19548

Closed
TomAugspurger opened this issue Feb 6, 2018 · 9 comments
@TomAugspurger
Contributor

TomAugspurger commented Feb 6, 2018

(split from #19520)

It'd be convenient to have an internal method for getting the highest-fidelity array stored by a container (Index, Series, Block?).

For Series, this is already what ._values does. Index._values is sometimes different though (Period, DatetimeIndex with TZ), so we'll use a different name.

| dtype | array type |
| --- | --- |
| category | Categorical |
| datetime64ns | ndarray |
| datetime64ns-tz | DatetimeIndex |
| interval | IntervalIndex (eventually IntervalArray) |
| numeric | ndarray |
| period | PeriodIndex (eventually PeriodArray) |
| sparse | SparseArray |
| str | ndarray |

@TomAugspurger added the Internals, Difficulty Intermediate, and Clean labels on Feb 6, 2018
@TomAugspurger added this to the Next Major Release milestone on Feb 6, 2018

@jorisvandenbossche
Member

So currently it seems that Index._values differs from Index.values only for PeriodIndex? (There it returns the integer data.)

I wonder if we can't just use _values for Index as well (and change the return value for the extension types). It depends on what it is used for, of course, but that might be possible? (In the Index class itself it is used to pass through attributes like shape, strides, nbytes, itemsize, ...)

@TomAugspurger
Contributor Author

TomAugspurger commented Feb 6, 2018

(edited the original post to note the differences)

Also DatetimeIndex w/ TZ. That returns an ndarray.

In [6]: pd.DatetimeIndex(['2017'], tz="US/Central")._values
Out[6]: array(['2017-01-01T06:00:00.000000000'], dtype='datetime64[ns]')

In [7]: pd.Series(pd.DatetimeIndex(['2017'], tz="US/Central"))._values
Out[7]: DatetimeIndex(['2017-01-01 00:00:00-06:00'], dtype='datetime64[ns, US/Central]', freq=None)
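
The 06:00 in the first output is the same instant as local midnight in the second, just expressed in UTC with the time zone dropped. A minimal stdlib sketch of that arithmetic follows; it does not use pandas, and the fixed -06:00 offset is an assumption standing in for US/Central, which holds in January when no DST applies.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed -06:00 offset stands in for US/Central (valid in January).
central_standard = timezone(timedelta(hours=-6))

# Local midnight, as shown by the tz-aware representation.
local_midnight = datetime(2017, 1, 1, tzinfo=central_standard)

# Convert to UTC, then drop the tzinfo: this mirrors what a tz-less
# datetime64[ns] array can display.
as_utc = local_midnight.astimezone(timezone.utc)
naive_utc = as_utc.replace(tzinfo=None)

print(naive_utc.isoformat())  # 2017-01-01T06:00:00
```

The instant is unchanged, but the naive value can no longer say which zone it came from, which is the information loss the thread is concerned with.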

@TomAugspurger
Contributor Author

When I looked at changing DatetimeIndex._values to return a DatetimeIndex when it had a tz, those properties you mentioned all broke (but they aren't too hard to fix).

The more difficult thing was in the indexing engines. Those really do need an ndarray of integers (or whatever).

So we need both .values_as_an_ndarray and .values_as_whatever_array, just with better names.
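
As a rough illustration of why two accessors are needed, here is a toy model in plain Python. It is not pandas code: `ToyTzIndex` and `build_engine` are hypothetical names, the accessor names are the placeholders from the comment above, and seconds since the epoch stand in for the nanosecond integers an engine would see.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class ToyTzIndex:
    """Toy stand-in for a tz-aware index (hypothetical, not pandas)."""

    def __init__(self, stamps):
        self._stamps = list(stamps)  # tz-aware datetime objects

    @property
    def values_as_whatever_array(self):
        # Full-fidelity view: the time zone is preserved.
        return list(self._stamps)

    @property
    def values_as_an_ndarray(self):
        # Engine view: plain integers (seconds since the UTC epoch here).
        return [int((s - EPOCH).total_seconds()) for s in self._stamps]


def build_engine(int_values):
    # Stand-in for an indexing engine: maps integer value -> position.
    return {value: position for position, value in enumerate(int_values)}


cst = timezone(timedelta(hours=-6))  # assumption: fixed offset for illustration
idx = ToyTzIndex([datetime(2017, 1, 1, tzinfo=cst),
                  datetime(2017, 1, 2, tzinfo=cst)])

engine = build_engine(idx.values_as_an_ndarray)

# Look up 2017-01-02 00:00 -06:00 via its UTC integer representation.
key = int((datetime(2017, 1, 2, 6, tzinfo=timezone.utc) - EPOCH).total_seconds())
position = engine[key]
```

The engine only ever sees integers, while callers that care about the time zone read the other accessor.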

@TomAugspurger
Contributor Author

On naming, we want something that conveys "This is the best / fullest-information array representation." i.e. we aren't going to drop the TZ and convert to UTC.

@jorisvandenbossche
Member

Ah yes, that one for sure as well. I was only looking at Index.values vs Index._values (with the idea that, if they are almost always the same, we could use _values for the new thing to be consistent with Series).

I think names like ._as_ndarray and ._as_internal_array/._as_pd_array would be fine.

@TomAugspurger
Contributor Author

Ah, I like _pd_array, though I'm not sure about _as, since when I see that I think of a method like asarray(), not a property.

So how about _ndarray and _pdarray?

@TomAugspurger
Contributor Author

TomAugspurger commented Feb 6, 2018

Although implementation-wise, it may be easier to

  1. Rename the current Index._values to ._internal_values
  2. Make Index._values equivalent with Series._values.

I may start with that.
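
The two steps could look roughly like the following toy model. It is only a sketch in plain Python: `ToyPeriodIndex` and `ToySeries` are hypothetical classes, and the (ordinal, freq) tuples merely stand in for a richer period array.

```python
class ToyPeriodIndex:
    """Toy model of the two-step plan (hypothetical, not pandas)."""

    def __init__(self, ordinals, freq):
        self._ordinals = list(ordinals)
        self.freq = freq

    @property
    def _internal_values(self):
        # Step 1: the old meaning of Index._values (integer data)
        # moves to this name.
        return self._ordinals

    @property
    def _values(self):
        # Step 2: _values returns the rich representation,
        # matching what a Series holding the same data returns.
        return [(ordinal, self.freq) for ordinal in self._ordinals]


class ToySeries:
    """Toy container whose _values is the rich representation."""

    def __init__(self, data):
        self._data = data

    @property
    def _values(self):
        return self._data._values


idx = ToyPeriodIndex([564, 565], freq="M")
ser = ToySeries(idx)
```

Code that needs raw integers switches to `_internal_values`, and `_values` then means the same thing on both containers.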

@jorisvandenbossche
Member

Yep, that might be easier for now (to go forward on the other PR). Although it will not decrease the complexity... :-)

@jbrockmendel
Member

> Make Index._values equivalent with Series._values.

+1

See also #19294. Some of the accessors have _data/data/values/_values attributes that could probably be renamed to get closer to "_values always means X"

@jreback modified the milestones: Next Major Release, 0.23.0 on Feb 10, 2018