Warn on duplicate names in MI? #19029

TomAugspurger · 2018-01-01T20:21:14Z

Opening a new issue so this isn't lost.

In #18882 we banned duplicate names in a MultiIndex. I think this is a good change, since allowing duplicates hit a lot of edge cases when you went to actually do something with them. I want to make sure we understand all the cases that actually produce duplicate names in the MI though, specifically groupby.apply.

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]:     pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
   ...:                         'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]},
   ...:                        index=[0, 1, 3, 5, 6, 8, 9, 9, 9]).set_index("a")
   ...:
   ...:

In [4]: pdf.groupby(pdf.index).apply(lambda x: x.b)

Another, more realistic example: groupwise drop_duplicates:

In [18]: df = pd.DataFrame({"B": [0, 0, 0, 1, 1, 1, 2, 2, 2]}, index=pd.Index([0, 1, 1, 2, 2, 2, 0, 0, 1], name='a'))

In [19]: df
Out[19]:
   B
a
0  0
1  0
1  0
2  1
2  1
2  1
0  2
0  2
1  2

In [20]: df.groupby('a').apply(pd.DataFrame.drop_duplicates)
Out[20]:
     B
a a
0 0  0
  0  2
1 1  0
  1  2
2 2  1

Is it possible to throw a warning on this for now, in case duplicate names are more common than we thought?
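To make the "warn instead of raise" idea concrete, here is a minimal sketch in plain Python. It works on a bare list of level names rather than on a real MultiIndex. The helper names `duplicated_names` and `warn_on_duplicate_names`, and the choice of `FutureWarning`, are assumptions made for illustration; nothing here is pandas API.

```python
import warnings
from collections import Counter


def duplicated_names(names):
    """Return the non-None names that occur more than once, in first-seen order."""
    counts = Counter(n for n in names if n is not None)
    dupes = []
    for n in names:
        if n is not None and counts[n] > 1 and n not in dupes:
            dupes.append(n)
    return dupes


def warn_on_duplicate_names(names):
    """Hypothetical check: warn (rather than raise) when level names repeat."""
    dupes = duplicated_names(names)
    if dupes:
        warnings.warn(
            f"Duplicated level names: {dupes!r}",
            FutureWarning,
            stacklevel=2,
        )
    # The names are returned unchanged; only the warning is a side effect.
    return list(names)
```

Repeated None entries are deliberately not treated as duplicates, since unnamed levels are not a naming conflict.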

jreback · 2018-01-01T22:14:59Z

cc @toobaz

toobaz · 2018-01-01T23:23:37Z

I don't have a strong opinion on this... we could revert and emit a warning; we could revert and abandon the idea of forbidding duplicated names (and solve the problems exposed in #18872 some other way); we could rename name to name1 (then name2...) when name is already present; or we could just drop the name (reset it to None) when it is already present.

My personal preference, at least if we think that our main source of concern should be groupby, is probably for this last solution, which would then (temporarily?) also emit a warning. It is maybe not very elegant, but it's very close to perfect backward compatibility, while still letting us simplify and clean up the code a bit. (But should we do this even when the user sets two levels with the same name simultaneously, or could we at least raise an error in that case?)

This said, in the above examples we would probably better serve the user by dropping one of the two levels, which are exact copies. Certainly we can come up with examples which would still fail, but maybe both things are worth implementing together, so that the "black magic" (reset name to None) is used sparingly, if ever.
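The "reset to None" option could look roughly like the sketch below. It assumes the reading in which the first occurrence of a name is kept and only later repeats are dropped; the function name `reset_duplicate_names` is hypothetical and the code operates on a plain list of names, not on a MultiIndex.

```python
def reset_duplicate_names(names):
    """Keep the first occurrence of each name; replace later repeats with None."""
    seen = set()
    result = []
    for name in names:
        if name is not None and name in seen:
            # Name already used by an earlier level: drop it.
            result.append(None)
        else:
            if name is not None:
                seen.add(name)
            result.append(name)
    return result
```

For the groupby examples above, this would turn the level names ['a', 'a'] into ['a', None], so selecting the level 'a' by name stays unambiguous.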

TomAugspurger · 2018-05-17T12:45:35Z

Hmm, this was supposed to be done for 0.23, but we missed it.

I still think it's worthwhile doing for 0.23.1 (cc @guenteru if you have time to make a PR).

As of version 0.23.0, MultiIndex throws an exception if it contains duplicated level names. This can happen as a result of various groupby operations (pandas-dev#21075). This commit changes the behavior of groupby slightly: if the index contains duplicated names, these names get suffixed with their corresponding position (i.e. [name,name] => [name0,name1]).

jorisvandenbossche · 2018-06-07T20:59:32Z

I think this is actually an important one to decide upon.

The cases we have seen are in my opinion genuine use cases that we should somehow enable (e.g. df.groupby([df.index.year, df.index.month])).
I don't know if we have other ways to enable this than to actually allow duplicate index names?

TomAugspurger · 2018-06-07T21:05:03Z

> I don't know if we have other ways to enable this than to actually allow duplicate index names?

Mangling the name like ['Date_0', 'Date_1'] when we detect that there's a conflict? It'll still be a breaking change, but less painful?

TomAugspurger · 2018-06-07T21:06:05Z

Though we allow non-string names for names, so mangling isn't always straightforward.
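A small sketch of positional mangling shows where the non-string case gets awkward. The function name `mangle_duplicate_names` and the `_0`, `_1` suffix scheme (taken from the 'Date_0', 'Date_1' suggestion above) are illustrative assumptions. Raising `TypeError` for duplicated non-string names is just one possible choice, made here to make the problem visible.

```python
from collections import Counter


def mangle_duplicate_names(names):
    """Suffix duplicated string names with their position among the duplicates."""
    counts = Counter(n for n in names if n is not None)
    positions = Counter()
    result = []
    for name in names:
        if name is None or counts[name] == 1:
            # Unique or missing names are left alone.
            result.append(name)
        elif isinstance(name, str):
            result.append(f"{name}_{positions[name]}")
            positions[name] += 1
        else:
            # No obvious way to suffix e.g. an int or a tuple.
            raise TypeError(f"cannot mangle non-string name {name!r}")
    return result
```

Even for strings the scheme is not fully safe: a mangled 'Date_0' can collide with a level that was already called 'Date_0', which this sketch does not check.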

toobaz · 2018-06-07T22:26:44Z

Yeah, mangling wouldn't be a very general solution. I'd rather set problematic names to None.

As an alternative, it should be trivial to add an index_names argument to groupby - it would be required when names would otherwise be duplicated, and optional otherwise.

As I stated, I'm not necessarily against re-allowing duplicate names, but on an index with duplicated names, all level selection by names (e.g. mi.get_level_values("string_label"), but also unstacking) should then just error.

jorisvandenbossche · 2018-06-07T22:28:40Z

> but on an index with duplicated names, all level selection by names (e.g. mi.get_level_values("string_label"), but also unstacking) should then just error.

This is certainly fine I think

jreback · 2018-06-08T11:31:08Z

moving this to 0.23.2. there are a number of solutions, need to see an implementation.

jorisvandenbossche · 2018-06-11T12:49:00Z

> but on an index with duplicated names, all level selection by names (e.g. mi.get_level_values("string_label"), but also unstacking) should then just error.

> This is certainly fine I think

And we seem to already do this. At least any code that uses mi._get_level_number(name) will raise if name occurs multiple times (e.g. in stack).
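The "allow duplicates, but error on lookup by name" behavior can be sketched as follows. This is a simplified stand-in written for illustration, not the pandas implementation of `_get_level_number`; the function name `get_level_number`, the exception types, and the messages are assumptions.

```python
def get_level_number(names, level):
    """Resolve a level name to its position, refusing ambiguous names."""
    names = list(names)
    count = names.count(level)
    if count > 1:
        # Duplicated name: the caller has to pick a level by position instead.
        raise ValueError(
            f"The name {level!r} occurs multiple times, use a level number"
        )
    if count == 0:
        raise KeyError(f"Level {level!r} not found")
    return names.index(level)
```

With this rule, names like ['a', 'a'] can exist on an index, but any attempt to select the level 'a' by name fails loudly rather than silently picking one.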

jorisvandenbossche · 2018-06-11T14:27:16Z

So to make this more concrete, I put up a PR for the option to again allow duplicate index level names: #21423

IMO, this is the most sensible thing to do for now in the short term. Alternatives:

- mangling the names (but several good reasons have been given above why this is not really a good solution either).
- setting both clashing names to None (this might actually be another sensible thing to do, but it requires more custom code inside pandas to detect those cases).
- introducing a keyword to groupby to set those names (this does not really contradict re-allowing duplicates for now. We can still add this keyword later if we think it is useful, and use it as a way to deprecate duplicate level names).

jorisvandenbossche · 2018-06-14T18:11:49Z

Any feedback on my last comments here / the PR ?

jorisvandenbossche · 2018-06-27T12:11:16Z

Any feedback here?
If not, I would like to go forward with again allowing duplicate index level names.

TomAugspurger · 2018-06-27T12:44:12Z

Will look at the PR now.

TomAugspurger · 2018-06-27T12:49:13Z

I agree that in the short-term, re-allowing duplicate names is the best path forward.

I think we (I) didn't fully appreciate all the cases that can lead to duplicate names. So a sequence of

1. re-allowing duplicate names
2. providing ways to avoid getting in a situation with duplicate names
3. deprecating duplicate names with a warning
4. disallowing duplicate names

seems sensible.

jorisvandenbossche · 2018-06-27T13:00:29Z

That sequence seems sensible indeed. I just don't yet know what "providing ways to avoid getting in a situation with duplicate names" would look like, or whether we would find a solution there.
As long as we don't have a good solution for that, I am also fine with the "allow duplicate names per se, but raise an error once you do anything related to selecting an index level by name" situation.

toobaz · 2018-06-27T14:15:31Z

If not, I would like to go forward with again allowing duplicate index level names.

No objection. I would even dare to say that duplicate index level names are analogous to duplicate elements in axes: not ideal, and we should avoid producing them in our API, but if the user does, fair enough, we will just raise an error any time levels are requested by name. In particular, I don't see a MI with repeated names as more problematic than a MI with no/missing names.

jorisvandenbossche · 2018-07-02T15:29:10Z

Closed by #21423

TomAugspurger added the API Design, Compat, Groupby, and MultiIndex labels Jan 1, 2018

TomAugspurger added this to the Next Major Release milestone Jan 1, 2018

TomAugspurger mentioned this issue Jan 2, 2018

COMPAT: Pandas 0.23 duplicate names in MI dask/dask#3041

Merged

3 tasks

jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Feb 19, 2018

jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

toobaz mentioned this issue May 16, 2018

groupby breaks when using duplicated level names #21075

Closed

jorisvandenbossche modified the milestones: Next Major Release, 0.23.1 May 16, 2018

guenteru added a commit to guenteru/pandas that referenced this issue May 21, 2018

add groupby testcase (pandas-dev#19029)

fbcc2ab

guenteru mentioned this issue May 22, 2018

BUG: group with multiple named results #21171

Closed

4 tasks

jschendel mentioned this issue May 30, 2018

"Duplicated level name" when using groupby with two different attributes of a datetime #21250

Closed

jreback modified the milestones: 0.23.1, 0.23.2 Jun 7, 2018

jorisvandenbossche modified the milestones: 0.23.2, 0.23.1 Jun 7, 2018

jreback modified the milestones: 0.23.1, 0.23.2 Jun 8, 2018

TomAugspurger mentioned this issue Jun 11, 2018

RLS: 0.23.1 #21312

Closed

jorisvandenbossche mentioned this issue Jun 11, 2018

API: re-allow duplicate index level names #21423

Merged

jreback modified the milestones: 0.23.2, 0.23.3 Jun 26, 2018

jorisvandenbossche modified the milestones: 0.23.3, 0.23.2 Jun 27, 2018

jorisvandenbossche closed this as completed Jul 2, 2018

h-vetinari mentioned this issue Jul 14, 2018

TST/CLN: clean up indexes/multi/test_unique_and_duplicates #21900

Merged

cipri-tom mentioned this issue Sep 14, 2020

BUG: DataFrame.stack does not work when MultiIndex had duplicated names #36353

Closed

3 tasks
