Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/ENH: relabel method #15104

Closed
wants to merge 3 commits into from
Closed

API/ENH: relabel method #15104

wants to merge 3 commits into from

Conversation

chris-b1
Copy link
Contributor

Alt to #15029

small example

In [7]: df = pd.DataFrame({'a': [1,2], 'b': [3, 4]})

In [8]: df.relabel(columns=['J', 'K'])
Out[8]: 
   J  K
0  1  3
1  2  4

@chris-b1 chris-b1 added API Design Indexing Related to indexing on series/frames, not to indexes themselves labels Jan 11, 2017
@jreback
Copy link
Contributor

jreback commented Jan 11, 2017

how would this work with groupby? (just thinking out loud)

@chris-b1
Copy link
Contributor Author

Setup:

df = pd.DataFrame({'key': [1, 1, 2], 'value': [1, 2, 3]})
gb = df.groupby('key')

I can't think of a meaningful answer for what gb.relabel would do? You would of course be able to do:

In [17]: gb.sum()
Out[17]: 
     value
key       
1        3
2        3

In [18]: gb.sum().relabel(columns=['newvalue'])
Out[18]: 
     newvalue
key          
1           3
2           3

Labels to construct new axis from - number of labels
must match the length of the existing axis
copy : boolean, default True
Also copy underlying data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we have this flag (copy) on .rename as well. I think these are confusing, though not sure what a better API is for these. (and copy=True, inplace=True is pretty much meaningless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, not sure - I agree it's confusing. I'd be tempted to deprecate copy altogether, though if you know what you're doing I suppose it's useful.


See Also
--------
pandas.NDFrame.rename
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob best to add pandas.Series.rename, pandas.DataFrame.rename, pandas.Panel.rename here instead

@codecov-io
Copy link

codecov-io commented Jan 11, 2017

Current coverage is 84.75% (diff: 93.75%)

Merging #15104 into master will decrease coverage by <.01%

@@             master     #15104   diff @@
==========================================
  Files           145        145          
  Lines         51220      51263    +43   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43415      43448    +33   
- Misses         7805       7815    +10   
  Partials          0          0          

Powered by Codecov. Last update c71f214...db1779b

@@ -1586,6 +1586,23 @@ If you create an index yourself, you can just assign it to the ``index`` field:

data.index = index

.. versionadded:: 0.20.0

Alternatively, the :meth:`~DataFrame.relabel` can be used to assign new
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these underlined? or is that the diff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the diff - no clue why it looks like that though

@jreback
Copy link
Contributor

jreback commented Jan 12, 2017

what does SQL, Spark, R call this rename/relabel API?

not that we need to follow, but consistency is not a bad thing.

@chris-b1
Copy link
Contributor Author

chris-b1 commented Jan 12, 2017

dplyr and SQL would both do this through the select verb, dplyr also has rename, which basically works the same way, but selects everything else too.

# R
df %>% select(J = a, K = b)
df %>% rename(J = a)  # includes column `b`

# SQL
SELECT a as J, b as K
FROM df

In base R it also looks like you can do label assignment via the names function (not 100% sure about this)

# R
names(df) <- c(J, K)

@chris-b1
Copy link
Contributor Author

I suppose #12392 also needs consideration. From the discussion there, there was some preference towards moving towards a relabel(labels=['J', 'K'], axis=1) API, though I personally prefer the named axis arguments for something like this.

cc @nickeubank @jorisvandenbossche @shoyer

@jorisvandenbossche
Copy link
Member

Yes, we need to take #12392 also in consideration. For now, you only implemented the new feature (renaming all columns using a list-like) in relabel, but I think the original idea was to have the full functionality of rename in relabel (at least, that was my interpretation, but that's certainly open for discussion!). In that case, we should first decide on #12392 before implementing it here.

though I personally prefer the named axis arguments for something like this.

me as well, but I think there was a compromise in #12392 to have both possibilities at the same time. But let's leave that discussion over there. #12392 is something I would like to see happen for 0.20/1.0, so I can try to take a new look at that this weekend.

@jreback
Copy link
Contributor

jreback commented Jan 12, 2017

I view this different than @jorisvandenbossche

This is like .set_index but w/o the 'guessing' whether you are meaning to set by a particular column (we had a PR try to implement, #11944 this but it got complicated). So I view this as complimentary to .rename, and would have both actually (and keep the API consistent until / unless we change in #12392)

@jorisvandenbossche
Copy link
Member

Regarding the relabel name: I personally find it also not ideal, as rename(column={'a':'b'}) flows very naturally for "renaming column 'a' to 'b'".
I think the main problem with rename is the conflict with the fact that a Series has a name attribute? Or were there other reasons to introduce a relabel method?

ibis also calls this relabel: http://docs.ibis-project.org/generated/ibis.expr.api.TableExpr.relabel.html#ibis.expr.api.TableExpr.relabel

cc @pandas-dev/pandas-core

@jorisvandenbossche
Copy link
Member

This is like .set_index but w/o the 'guessing' whether you are meaning to set by a particular column

That is certainly also a possibility, but then I would keep relabel for exactly that (and not being able to also rename individual labels like rename, and then maybe something like set_labels would be a clearer name (that you replace the full set of labels)

@chris-b1
Copy link
Contributor Author

chris-b1 commented Jan 12, 2017

Or were there other reasons to introduce a relabel method?

My thinking was that rename is already too overloaded and can run into corner cases with a list-like #14829 (comment), but can't really be changed for back-compat.

set_labels wouldn't be bad either, although somewhat confusingly we already have a set_axis that does the same thing, but inplace only.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 13, 2017 via email

@shoyer
Copy link
Member

shoyer commented Jan 13, 2017

The name "labels" is nicely unambiguous about referring to DataFrame.columns or DataFrame.index, unlike the highly overloaded "name". So all things being equal, relabel would seem more consistent with setting index/column values than rename.

BUT rename already basically does this and I'm not sure a deprecation cycle is worth it. I do think multiple methods may be a good thing, but they should be distinguished by which names they change, not now how they change them (dict vs list).

I am more inclined toward the solution #14829, which enables list input to rename. I am OK with distinguishing between tuples and lists for dispatching on Series.name. If I had my way, np.array and pd.Series would not coerce tuples (treating them as immutable scalars), only lists.

It's also worth noting that we do have an existing rename_axis method, too. To summarize the current state of affairs:

  • Series.rename sets Series.index (dict/function) or Series.name, depending on the type of input .
  • Series.rename_axis sets Series.index (dict/function) or Series.index.name (others), depending on the type of input.
  • DataFrame.rename sets DataFrame.columns/DataFrame.index
  • DataFrame.rename_axis sets DataFrame.columns/DataFrame.index (dict/function) or DataFrame.columns.name/DataFrame.index.name (others), depending on the type of input

CCing @MaximilianR in case he has ideas.

@jreback
Copy link
Contributor

jreback commented Jan 13, 2017

#14636 here is a proposal to change .set_axis and make it a full-fledged functions (which is similar to .relabel)

@chris-b1
Copy link
Contributor Author

I'm uncertain about the api here, and this specific change isn't solving a significant problem so closing for now. Worth re-evaluating after/ as part of #12392.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: allow DataFrame.rename to take a list-like of colums
6 participants