Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: (explode) Splitting a column content over multiple rows while duplicating other columns content to these rows #16538

Closed
BLMeltdown opened this issue May 30, 2017 · 6 comments · Fixed by #27267
Labels
Enhancement Strings String extension data type and string data

Comments

@BLMeltdown
Copy link

BLMeltdown commented May 30, 2017

Hello
I know it is not a problem per se but I think there should be some pandas built in solution for this problem, as no simple function exists for this.
It is largely discussed here with various ideas (https://stackoverflow.com/questions/12680754/split-pandas-dataframe-string-entry-to-separate-rows) but they are quite tricky actually, not using built in things but hacking normal behavior for the most part.

I believe that could use some strategy drawn from the reindexing/filling functions, but as I am not a Pandas specialist, I am not sure this would use the most efficient function to perform this task.

Let's say we have

df

var1  var2

0 a,b,c x
1 d,e,f y
df.desired_function("var1")

var1 var2
0 a x
1 b x
2 c x
3 d y
4 e y
5 f y

Best regards

@jreback
Copy link
Contributor

jreback commented May 30, 2017

can you post a copy-pastable example. having these in-line in the top description (as well as a link) is useful.

@BLMeltdown
Copy link
Author

BLMeltdown commented May 30, 2017

Done;)

@BLMeltdown
Copy link
Author

BLMeltdown commented May 30, 2017

d=pd.DataFrame(list(zip(*s2.str.split(","))))
Out[93]: 
     0    1    2
0   a2   b2   c2
1   a3   b3   c3

d.stack()
>

0  0     a2
   1     b2
   2     c2
1  0     a3
   1     b3
   2     c3

Is a good starting point, but i guess that reindexing the other columns is not that easy. And anyway, it would be better to use a dedicated algorithm than to make a general call to various functions.

@jreback
Copy link
Contributor

jreback commented May 30, 2017

you are asking for explode: see some other references here: #8517 (comment)

sure I think we could have this as a string method. It is generally useful.

@jreback jreback added Difficulty Intermediate Enhancement Strings String extension data type and string data labels May 30, 2017
@jreback jreback added this to the Next Major Release milestone May 30, 2017
@jreback jreback changed the title Splitting a column content over multiple rows while duplicating other columns content to these rows ENH: (explode) Splitting a column content over multiple rows while duplicating other columns content to these rows May 30, 2017
@BLMeltdown
Copy link
Author

Thanks a lot! yes, this is it. actually the most satisfactory method I found is from SO InoDB (https://stackoverflow.com/users/1894184/inodb)

a=pd.DataFrame({"var1":"a,b,c d,e,f".split(),"var2":[1,2]})
s = a.var1.str.split(",").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
del a['var1']
a.join(s)
var2 var1
0 1 a
0 1 b
0 1 c
1 2 d
1 2 e
1 2 f

(I came up with the stack and before, I was not good with reindexing and joining)

But anyway a general purpose function with adjustable setting will be nice. Thanks!

changhiskhan added a commit to changhiskhan/pandas that referenced this issue Dec 20, 2018
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
changhiskhan added a commit to changhiskhan/pandas that referenced this issue Dec 20, 2018
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
changhiskhan added a commit to changhiskhan/pandas that referenced this issue Dec 20, 2018
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
changhiskhan added a commit to changhiskhan/pandas that referenced this issue Dec 20, 2018
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
changhiskhan added a commit to changhiskhan/pandas that referenced this issue Dec 20, 2018
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
changhiskhan added a commit to changhiskhan/pandas that referenced this issue Dec 20, 2018
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
changhiskhan added a commit to changhiskhan/pandas that referenced this issue Dec 20, 2018
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
changhiskhan added a commit to changhiskhan/pandas that referenced this issue Dec 20, 2018
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
@Dr-Irv
Copy link
Contributor

Dr-Irv commented May 15, 2019

Caught this on pandas-dev-request, and there is a similar feature that I need with respect to indexing.
Here's an example. In this example, I have a Series that has a MultiIndex with names 'ab' and 'ott'. Now I want to repeat that Series for every element in the list xy. So the code I use to do this is as follows:

In [2]: idx = pd.MultiIndex.from_product([['a','b'], [1,2,3]], names=['ab
   ...: ','ott'])
   ...: s = pd.Series(range(6), index=idx)
   ...: s
Out[2]:
ab  ott
a   1      0
    2      1
    3      2
b   1      3
    2      4
    3      5
dtype: int64

In [3]: xy = ['x', 'y']
   ...: (s
   ...:  .to_frame('s')
   ...:  .assign(one=1)
   ...:  .reset_index()
   ...:  .merge(pd.Series(1, index=pd.Index(xy,name='xy'))
   ...:         .to_frame('one').reset_index(), on='one')
   ...:  .set_index(s.index.names+['xy'])
   ...:  .drop(columns=['one'])
   ...: )
Out[3]:
           s
ab ott xy
a  1   x   0
       y   0
   2   x   1
       y   1
   3   x   2
       y   2
b  1   x   3
       y   3
   2   x   4
       y   4
   3   x   5
       y   5

I'd like to "explode" the Series (or a DataFrame ) from the existing MultiIndex by adding a level to the MultiIndex using a list (or any sequence) that I provide. Alternatively, just adding 'xy' as column would work as well.

jreback pushed a commit to jreback/pandas that referenced this issue Jul 6, 2019
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
jreback pushed a commit to jreback/pandas that referenced this issue Jul 7, 2019
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
jreback pushed a commit to jreback/pandas that referenced this issue Jul 8, 2019
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
jreback pushed a commit to jreback/pandas that referenced this issue Jul 11, 2019
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
jreback pushed a commit to jreback/pandas that referenced this issue Jul 11, 2019
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
jreback pushed a commit to jreback/pandas that referenced this issue Jul 17, 2019
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
jreback pushed a commit to jreback/pandas that referenced this issue Jul 17, 2019
…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
jreback added a commit that referenced this issue Jul 18, 2019
* [ENH] Add DataFrame method to explode a list-like column (GH #16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.

* move to Series

* handle generic list-like

* lint on asv

* move is_list_like to cython and share impl

* moar docs

* test larger sides to avoid a segfault

* fix ref

* typos

* benchmarks wrong

* add inversion

* add usecase

* cimport is_list_like

* use cimports

* doc-string

* docs & lint

* isort

* clean object check & update doc-strings

* lint

* test for nested

* better test

* try adding frame

* test for nested EA

* lint

* remove multi subset support

* update docs

* doc-string

* add test for MI

* lint and docs

* ordering

* moar lint

* multi-index column support

* 32-bit compat

* moar 32-bit compat
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Strings String extension data type and string data
Projects
None yet
3 participants