API/ENH: overhaul/unify/improve .unique #22824
there are multiple issues about changing .unique (for quite some time) - pls show the references
-1 on pd.unique being 2-d, as this will make an already complicated function much more so
I tried best I could. Searching open issues with "unique" yields 185 results which are at first glance completely unrelated - so I'll admit I didn't comb through all of those. (https://github.com/pandas-dev/pandas/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+unique) Searching for "df.unique" or "dataframe.unique" yielded only #21357, and now also this issue. |
if Also, the implementation of |
there was tremendous discussion on this - what should Series.unique return - rather than rehash much better to search for it - follow the links in the first issue |
Changing
For someone who hasn't been through these discussions over the years, it's nigh impossible to find. Not going to go through thousands of closed issues to find some orthogonal discussions that may or may not be hiding in there.
This is but one aspect of this issue (in particular, the rest are independent of it). And still, most of the discussion in #13395 was what index |
@h-vetinari Thanks for all the discussion I think the main previous discussions on the return value of But thanks for looking through the older issues! That is indeed not always easy (even for a core dev that might have participated in them ..) Some replies to the inconsistencies / solutions:
Personally, I think that boat has sailed (for Series, Index already returns an Index). I don't really see the added value in having our users go through such a deprecation cycle for such a core function. Also, returning a Series, i.e. returning an indexed object, gives you more complexity, which is what
Which is not relevant as long as the return value is not an indexed object?
In naming yes, but what is the "natural" behaviour? If you consider the difference between the current
As I said in the other thread, I think there would be not much disagreement on adding
But what would you want it to do? Work on the flattened values, or on a certain axis? (for
For |
So maybe the main question for me is this (limiting it to @h-vetinari's proposal is to unify them, but now they serve an overlapping but still different purpose, and it can also be fine to have two different methods for that (eg |
@jorisvandenbossche
I would argue that they do exactly the same thing (up to and including
No, it IS relevant even now. Pandas strongly advertises that
Add
WRT that, I found another inconsistency,
Finally,
If the change was ok for |
BTW, @h-vetinari in case you would be interested, we are having a dev hangout tomorrow: https://mail.python.org/pipermail/pandas-dev/2018-September/000830.html |
@jorisvandenbossche |
@h-vetinari ah, that wasn't on the agenda yet, will add it to the list of potential topics to discuss, but can't promise we will get to it |
What do you mean by this? I don't see how the output would differ if just an array is being returned. |
Since pandas advertises not sorting
Imagine you want to find out which unique errors occurred in whatever system generated the output, and when they last (or first) occurred - presumably the order of events will have some relevant information. Since
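A minimal sketch of the point being made here, using today's API (the error labels and timestamps are made up for illustration): `Series.unique` preserves order of first appearance but returns a bare array, while `drop_duplicates` keeps the index, so the "when did each error first occur" information survives.

```python
import pandas as pd

# hypothetical event log: error codes indexed by timestamp
vals = pd.Series(
    ["ERR_B", "ERR_A", "ERR_B", "ERR_C", "ERR_A"],
    index=pd.to_datetime(
        ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"]
    ),
)

uniq = vals.unique()  # plain ndarray: order of appearance kept, timestamps lost
print(uniq)

# drop_duplicates returns a Series, so each unique error keeps the
# timestamp of its first occurrence
first_seen = vals.drop_duplicates(keep="first")
print(first_seen)
```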
I fully understand that It's not even a perf-issue, because the hashtable code has the info at which index it is (for Finally, this is how it should look, IMO:
After speaking about this in the dev chat yesterday I'm wondering if it wouldn't make the most sense for Because Index, Series and DataFrame already have drop_duplicates, it's an easier change to add the keyword there than creating new methods. For consistency, the return value of Index
Sorry I couldn't take part in the dev chat yesterday. I understand that the hurdle is larger to change the output from The distinction between
Again, 1.0 is the perfect time for this - it only gets harder later. |
@jorisvandenbossche @TomAugspurger @jreback @WillAyd I wanna pick this up again, please, and need your feedback. In addition to the points mentioned directly above, there's another important point:
Long story short:
I still don't see the point of unifying those two functions. For me, they have clearly different behaviour (and use cases). (personal use case: if I quickly want to check the unique values, I use So I am personally fine with the current distinction between both functions.
So for
Let's start with adding the functionality to the hashtables and to |
@jorisvandenbossche @jreback @TomAugspurger @WillAyd @gfyoung @jschendel @toobaz Return type of
your example looks like a simple groupby and merge, or just a merge_asof for ordered things - why is that not simply the answer?
I tried to make a short enough toy example (as the post is already very long), so obviously some complexity gets brushed under the carpet, which then may allow alternative solutions. But even then, using A merge with (modified) uniques is fundamentally the same as using I had another toy example in mind that does something like "cumsum-only-new-values" (not a simple groupby AFAICT) - I just didn't work that one out as thoroughly:

>>> s = pd.Series([1, 4, 1, 2, 2, 4, 3], index=[x ** 2 for x in range(7)])
>>> uniques, inv = s.unique(return_inverse=True)
>>> # csonv = cumsum-only-new-values
>>> uniques_enh = uniques.to_frame('orig').assign(csonv=uniques.cumsum())
>>> s_enh = uniques_enh.reindex(inv).set_index(inv.index)
>>> s_enh.csonv = s_enh.csonv.expanding().max().astype(int)
>>> s_enh
    orig  csonv
0      1      1
1      4      5
4      1      5
9      2      7
16     2      7
25     4      7
36     3     10
@h-vetinari it might be besides the point, since it is only a simplified example and the below might not be the case in other actual use case, but I would solve the toy problem with
(I won't say this is the most beautiful code, but the annoyances are rather in different things: not being able to set the name of the index in
I see now that you touched, in your last post answering Jeff, on reasons why you don't want the above - because a merge is less efficient? Another way, assuming the uniques and inverse are arrays:
So from that, a |
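For reference, a sketch of the array-based reconstruction being discussed, using today's closest equivalent of "unique plus inverse" (`pd.factorize`; the `per_unique` computation is just an illustrative stand-in for work done once per unique value):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 4, 1, 2, 2, 4, 3], index=[x ** 2 for x in range(7)])

# factorize returns the inverse ("codes") and the uniques in order of
# first appearance; using to_numpy() so `uniques` is a plain ndarray
codes, uniques = pd.factorize(s.to_numpy())
print(uniques)   # order of first appearance: 1, 4, 2, 3

# do some computation once per unique value, then broadcast it back onto
# the original shape via the inverse
per_unique = uniques.cumsum()
reconstructed = pd.Series(per_unique.take(codes), index=s.index)
print(reconstructed)
```

Note that the original index only survives because it is re-attached by hand at the end, which is part of what the thread is arguing about.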
@jorisvandenbossche
Because to merge with the
This was not as cleanly delineated as it could have been:
I'm really -1 on having to deal with both I'm proposing a solution that's pandas-native, and only needs two objects for reconstruction (instead of four). |
@jorisvandenbossche |
@jorisvandenbossche |
@jorisvandenbossche
Copying from #24108:
This is what #24108 is about, since the issue had stalled for >1 month despite several pings.
I disagree with this quite strongly:
Would you mind chiming in here then? |
It's just inconsistencies over inconsistencies (discovered while writing tests for a precursor PR of #24108):
So
Some of those inconsistencies (certainly not all :-)) have historical reasons / justifications. In a similar vein, for Index, we decided to return an Index object, as this then could already naturally retain all type information, and still return something array-like.
I don't think this change (Index.unique from Index->ndarray) is necessarily needed, but note that this is certainly a less breaking change than ndarray -> Series. For example, indexing an Index or an ndarray is much more alike.
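The return-type split under discussion can be seen directly (a small sketch with plain int values):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 1])
i = pd.Index([1, 2, 1])

print(type(s.unique()))           # numpy.ndarray - the Series wrapper is dropped
print(type(i.unique()))           # pandas Index - the caller's type is kept
print(type(s.drop_duplicates()))  # pandas Series, original index preserved
```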
What do you mean here?
It certainly needs more effort from the user for certain use cases, but I don't think it is impossible, and in many cases there are other (maybe even nicer) alternatives available**. To repeat what I said above: I personally don't see the need to change the return type of a Series; I personally like the distinction between ** For your example above, this is another alternative solution:
@jorisvandenbossche
That's the main point to me. Pandas is moving away from numpy arrays in general towards richer and more versatile array formats ( It would just be wrong (IMO) to return an ndarray for these cases (as you've already decided for In fact, I'd argue that probably the main reason why PS. The example I gave can indeed be solved differently, but that's because I had to simplify the example for reasons of space. In practice, I don't have the option of a simple
I don't know what progress has been made regarding the bulk of this issue, but the following bullet point, at least, will be resolved by #27874, FWIW:
@stuarteberg You will see that the approach of #27874 must fail for everything that's a pandas dtype, for example, because numpy does not know them. Said otherwise: maintaining the dtype is part of the reason for aligning the type that's returned by I had a working POC implementation in #24108 (not including tests), but I haven't followed up on this much because the appetite of the core devs for changing the return type was low. I still consider it essential enough to deal with the transition "pain", and would be willing to work on it.
@h-vetinari Ah, I misunderstood. Thanks for the clarification! (Sorry for the noise...) |
repeating from the linked issues @h-vetinari if you'd summarize.
@jreback
It's useful for preserving the index of the data (since pandas prominently advertises that In general, pandas is emancipating itself more and more from numpy, and it is both an artifact** of its development and a burden to usability to have
Factorize is a similar but different method (docs here):
There's actually an ancient issue (#4087, by you, no less ;-)) about adding inverse-handling to
Among other things, all cases where factorize is (mis-)used for getting the faux-inverse. Of course, you'll be able to mock most of such functionality through joins somehow, but the point is that the cython code already has all the necessary information for the inverse, and we just need to spit it out. Additionally, perhaps the following rough example of a clean use case may be easier to imagine:
** it's non-trivial to calculate which index gets preserved after the application of |
you misunderstand - i mean a Pandas Array |
pls answer why .factorize is not just the answer here |
The whole middle block of my comment deals with that...
Because the index information is valuable too, especially since |
this doesn’t answer the question - an Array preserves ordering; the actual index values by definition would disappear, even for a Series
Yes, the current implementation already maintains the order, but it's darn hard to correctly reconstruct the correct index information for the np.array (or pd.Array) the user then has. And factorize does not solve it because it treats all missing values equally, whereas unique distinguishes between
There must be some sort of misunderstanding here? Of course a Series would have an index...? In fact, the result would be the same as |
Ok, maybe the misunderstanding was partly on my side. Obviously the Series has an index (that's what I mistakenly thought might be your question), but more than that, the mapping from the original values to the unique values preserves a meaningful index, precisely because it does not sort - and therefore (the implicit) |
again, this is exactly what factorize does; it returns a tuple of the indexers and the unique values. if there is a specific issue with, say, not preserving a type of null value, i suppose that could be a bug, but i see lots of conflation here
Factorize and unique are different by design: The docs for factorize explicitly say that missing values will be ignored:
This is not what
Here as well, the indexer (into an array!) plus an Index of unique values is not nearly the same as just a thinned-out Series containing the (first) unique values and their correct index from the original
May I ask you to not assume confusion so quickly? I've been on this case for about 1.5 years, and I do not conflate the various methods and their goals.
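The design difference around missing values is easy to demonstrate with today's API (a sketch with plain floats; the thread's further point that `unique` distinguishes different missing-value sentinels is dtype-dependent and not shown here):

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, np.nan, 2.0, np.nan])

# factorize sentinel-codes missing values as -1 and drops them from the uniques
codes, uniques = pd.factorize(arr)
print(codes)     # NaN positions get -1, i.e. "not a unique value"
print(uniques)   # [1. 2.] - no NaN

# unique, by contrast, keeps the missing value as one of the uniques
print(pd.unique(arr))
```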
The state of the various flavours of `.unique` as of v0.23:

- `[pd/Series/Index].unique` does not have a `keep`-kwarg
- `Series.unique` returns an array, `Series.drop_duplicates` returns a `Series`. Returning a plain `np.ndarray` is quite unusual for a `Series` method, and furthermore the differences between these closely-related methods are confusing from a user perspective, IMO
- `Index`
- `DataFrame.unique` does not exist, but is a much more natural candidate (from the behaviour of numpy, resp. `Series/Index`) than `.drop_duplicates`
- `pd.unique` chokes on 2-dimensional data
- no `return_inverse`-kwarg for any of the `.unique` variants; see "API: provide a better way of doing np.unique(return_inverses=True)" #4087 (milestoned since 0.14), "ENH: adding .unique() to DF (or return_inverse for duplicated)" #21357

I originally wanted to add `df.unique(..., return_inverse=True|False)` for #21357, but got directed to add it to `duplicated` instead. After slow progress over 3 months in #21645 (PR essentially finished since 2), @jorisvandenbossche brought up the - justified (IMO) - feedback that:

and

This prompted me to have another look at the situation with `.unique`, and I found the list of the above inconsistencies. To resolve them, I suggest to:

- change the return type of `[Series/Index].unique` to be the same as the caller (deprecation cycle by introducing `raw=None`, which at first defaults to True?)
- add a `keep`-kwarg to `[Series/Index].unique` (make `.unique` a wrapper around `.drop_duplicates`?)
- add `df.unique` (as a thin wrapper around `.drop_duplicates`?)
- add a `keep`-kwarg to `pd.unique` and dispatch to `DataFrame/Series/Index` as necessary
- add a `return_inverse`-kwarg to all of them (and add to the EA interface); under the hood by exposing the same kwarg to `duplicated` and `drop_duplicates` as well
- `np.nan/None` in `df.duplicated` (inconsistent vs. Series behaviour)

Each point is essentially self-contained and independent of the others, but of course they make more sense together.
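The proposed `return_inverse` behaviour can be approximated today with existing methods. A hypothetical sketch (the helper name and its exact return contract are assumptions, not an existing pandas API): the uniques as a `Series` keeping the index labels of their first occurrences, plus an inverse mapping every original label to the label of its first occurrence.

```python
import pandas as pd

def unique_with_inverse(s: pd.Series):
    """Hypothetical sketch of Series.unique(return_inverse=True):
    returns (uniques, inverse), where `uniques` is the deduplicated Series
    (index labels of the first occurrences preserved) and `inverse` maps
    each original index label to the label of its first occurrence."""
    first = s.drop_duplicates(keep="first")  # uniques, index preserved
    codes, _ = pd.factorize(s)               # position of each value among uniques
    inverse = pd.Series(first.index[codes], index=s.index)
    return first, inverse

s = pd.Series([1, 4, 1, 2, 2, 4, 3], index=[x ** 2 for x in range(7)])
uniques, inv = unique_with_inverse(s)
print(uniques)   # values 1, 4, 2, 3 at index labels 0, 1, 9, 36
print(inv)       # each label mapped to the label of its first occurrence
```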