Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn on duplicate names in MI? #19029

Closed
TomAugspurger opened this issue Jan 1, 2018 · 18 comments
Closed

Warn on duplicate names in MI? #19029

TomAugspurger opened this issue Jan 1, 2018 · 18 comments
Labels
API Design Compat pandas objects compatability with Numpy or Python functions Groupby MultiIndex
Milestone

Comments

@TomAugspurger
Copy link
Contributor

Opening a new issue so this isn't lost.

In #18882 banned duplicate names in a MultiIndex. I think this is a good change since allowing duplicates hit a lot of edge cases when you went to actually do something. I want to make sure we understand all the cases that actually produce duplicate names in the MI though, specifically groupby.apply.

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]:     pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
   ...:                         'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]},
   ...:                        index=[0, 1, 3, 5, 6, 8, 9, 9, 9]).set_index("a")
   ...:
   ...:

In [4]: pdf.groupby(pdf.index).apply(lambda x: x.b)

Another, more realistic example: groupwise drop_duplicates:

In [18]: df = pd.DataFrame({"B": [0, 0, 0, 1, 1, 1, 2, 2, 2]}, index=pd.Index([0, 1, 1, 2, 2, 2, 0, 0, 1], name='a'))

In [19]: df
Out[19]:
   B
a
0  0
1  0
1  0
2  1
2  1
2  1
0  2
0  2
1  2

In [20]: df.groupby('a').apply(pd.DataFrame.drop_duplicates)
Out[20]:
     B
a a
0 0  0
  0  2
1 1  0
  1  2
2 2  1

Is it possible to throw a warning on this for now, in case duplicate names are more common than we thought?

@TomAugspurger TomAugspurger added API Design Compat pandas objects compatability with Numpy or Python functions Groupby MultiIndex labels Jan 1, 2018
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jan 1, 2018
@jreback
Copy link
Contributor

jreback commented Jan 1, 2018

cc @toobaz

@toobaz
Copy link
Member

toobaz commented Jan 1, 2018

I don't have a strong opinion on this... we could revert and emit a warning, we could revert and abandon the idea of forbidding duplicated names (and solve otherwise the problems exposed in #18872), we could rename name to name1 (then name2...) when name is already present, or we could just drop the name (reset to None) when it is already present.

My personal preference, at least if we think that our main source of concern should be groupby, is probably for this last solution, which would then (temporarily?) also emit a warning. It is maybe not very elegant, but it's very close to perfect backward compatibility, while still allowing to simplify/clean a bit the code. (But should we do this even when the user sets two levels with the same name simultaneously, or could we at least in that case raise an error?)

This said, in the above examples we would probably better serve the user by dropping one of the two levels, which are exact copies. Certainly we can come up with examples which would still fail, but maybe both things are worth implementing together, so that the "black magic" (reset name to None) is used sparingly, if ever.

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Feb 19, 2018
@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.23.1 May 16, 2018
@TomAugspurger
Copy link
Contributor Author

Hmm, this was supposed to be done for 0.23, but we missed it.

I still think it's worthwhile doing for 0.23.1 (cc @guenteru if you have time to make a PR).

guenteru added a commit to guenteru/pandas that referenced this issue May 21, 2018
As of version 0.23.0 MultiIndex throws an exception in case it contains
duplicated level names. This can happen as a result of various groupby
operations (pandas-dev#21075). This commit changes the behavior of groupby slightly: In
case there are duplicated names contained in the index these names get suffixed by there
corresonding position (i.e. [name,name] => [name0,name1])
guenteru added a commit to guenteru/pandas that referenced this issue May 21, 2018
@jreback jreback modified the milestones: 0.23.1, 0.23.2 Jun 7, 2018
@jorisvandenbossche
Copy link
Member

I think this is actually an important one to decide upon.

The cases we have seen are in my opinion genuine use cases that we should somehow enable (eg the df.groupby(df.index.year, df.index.month)).
I don't know if we have other ways to enable this than to actually allow duplicate index names?

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.2, 0.23.1 Jun 7, 2018
@TomAugspurger
Copy link
Contributor Author

I don't know if we have other ways to enable this than to actually allow duplicate index names?

Mangling the name like ['Date_0', 'Date_1'] when we detect that there's a conflict? It'll still be a breaking change, but less painful?

@TomAugspurger
Copy link
Contributor Author

Though we allow non-string names for names, so mangling isn't always straightforward.

@toobaz
Copy link
Member

toobaz commented Jun 7, 2018

Yeah, mangling wouldn't be a very general solution. I'd rather set problematic names to None.

As an alternative, it should be trivial to add an index_names argument to groupby - it would be required when names would otherwise be duplicated, and optional otherwise.

As I stated, I'm not necessarily against re-allowing duplicate names, but on an index with duplicated names, all level selection by names (e.g. ``mi.get_level_values("string_label")'', but also unstacking) should then just error.

@jorisvandenbossche
Copy link
Member

but on an index with duplicated names, all level selection by names (e.g. ``mi.get_level_values("string_label")'', but also unstacking) should then just error.

This is certainly fine I think

@jreback
Copy link
Contributor

jreback commented Jun 8, 2018

moving this to 0.23.2. there are a number of solutions, need to see an implementation.

@jreback jreback modified the milestones: 0.23.1, 0.23.2 Jun 8, 2018
@jorisvandenbossche
Copy link
Member

but on an index with duplicated names, all level selection by names (e.g. ``mi.get_level_values("string_label")'', but also unstacking) should then just error.

This is certainly fine I think

And we seem to already do this. At least every code that uses mi._get_level_number(name) will raise if name occurs multiple times (eg in stack).

@jorisvandenbossche
Copy link
Member

So to make this more concrete, I put up a PR for the option to again allow duplicate index level names: #21423

IMO, this is the most sensible thing to do for now on the short term. Alternatives:

  • mangling the names (but several good reasons have been given above why this is also not really a good solution).
  • set the clashing names both to None (this might actually be another sensible thing to do, but, requires more custom code inside pandas to detect those cases)
  • introduce a keyword to groupby to set those names (this does not really contradict re-allowing it for now. We can later still add this keyword if we think it is useful, and use it as a way to deprecate the duplicate level names)

@jorisvandenbossche
Copy link
Member

Any feedback on my last comments here / the PR ?

@jreback jreback modified the milestones: 0.23.2, 0.23.3 Jun 26, 2018
@jorisvandenbossche
Copy link
Member

Any feedback here?
If not, I would like to go forward with again allowing duplicate index level names.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.3, 0.23.2 Jun 27, 2018
@TomAugspurger
Copy link
Contributor Author

Will look at the PR now.

@TomAugspurger
Copy link
Contributor Author

I agree that in the short-term, re-allowing duplicate names is the best path forward.

I think we (I) didn't fully appreciate all the cases that can lead to duplicate names. So a sequence of

  1. re-allowing duplicate names
  2. providing ways to avoid getting in a situation with duplicate names
  3. deprecate duplicate names with a warning
  4. disallow duplicate names

seems sensible.

@jorisvandenbossche
Copy link
Member

That sequence seems sensible indeed. I only don't yet really know what "providing ways to avoid getting in a situation with duplicate names" would look like, and if we would find a solution here.
As long as we have not a good solution for that, I am also fine with the "allow duplicate names in se, but raise an error once do anything related to selecting an index level by name" situation.

@toobaz
Copy link
Member

toobaz commented Jun 27, 2018

If not, I would like to go forward with again allowing duplicate index level names.

No objection. I would even dare to say that duplicate index level names are analogous to duplicate elements in axes: not ideal, and we should avoid producing them in our API, but if the user does, fair enough, we will just raise an error any time levels are requested by name. In particular, I don't see a MI with repeated names as more problematic than a MI with no/missing names.

@jorisvandenbossche
Copy link
Member

Closed by #21423

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Compat pandas objects compatability with Numpy or Python functions Groupby MultiIndex
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants