DOC: section on caveats of storing lists inside DataFrame/Series #17027

chris-b1 · 2017-07-19T14:44:51Z

xref to a lot of issues, for example #16864

I think we could use a doc section stating that storing nested lists/arrays inside a pandas object is best avoided, showing the downsides (perf, memory use) and a worked example of an alternative. This seems to be hard-earned knowledge that many have, but I'm not sure we do a good job of stating it clearly.

Closely related: we might also benefit from a short section encouraging the use of Python core data structures when appropriate.

It probably goes here: http://pandas.pydata.org/pandas-docs/stable/gotchas.html
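As a rough illustration of the memory and performance downsides, here is a minimal sketch comparing a column of lists with the same numbers stored one per row. The size `n` and the variable names are made up for this example; the exact byte counts depend on the Python and pandas versions, so none are quoted here.

```python
import pandas as pd

n = 1000  # hypothetical size, chosen only for illustration

# Nested layout: one Python list per cell, so the column has object dtype.
nested = pd.Series([[i, i + 1, i + 2] for i in range(n)], name="values")

# Long layout: the same numbers, one per row, in a single int64 column.
flat = pd.Series(
    [v for i in range(n) for v in (i, i + 1, i + 2)], name="values"
)

# deep=True makes pandas also count the list objects held in the cells
# (not the integers inside them, so the nested figure is still a lower bound).
nested_bytes = nested.memory_usage(index=False, deep=True)
flat_bytes = flat.memory_usage(index=False, deep=True)

# The nested column needs a Python-level call per row; the flat one is vectorized.
nested_total = nested.apply(sum).sum()
flat_total = flat.sum()

print(nested.dtype, flat.dtype)
print(nested_bytes, flat_bytes)
```

On a frame this small the difference hardly matters in practice; the point is only that the object column carries per-cell Python overhead that the flat column does not.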

pdpark · 2017-12-30T05:09:37Z

I'd be happy to take this, but I'm not sure what "a worked out example of an alternative" would look like. I've found a few discussions about storing lists in DataFrame cells, and none of them discouraged it. This discussion on Stack Overflow is the only one I've found with alternatives: https://stackoverflow.com/questions/39661198/optimal-way-to-add-small-lists-to-pandas-dataframe. Which is the best option? Or is there another, better option? Thanks.

jreback · 2017-12-30T13:27:10Z

https://stackoverflow.com/questions/45587778/python-explode-rows-from-panda-dataframe
https://stackoverflow.com/questions/44361160/explode-a-csv-in-python
https://stackoverflow.com/questions/38428796/how-to-do-lateral-view-explode-in-pandas

FYI, the timings in those answers are suspect, of course: the examples don't use a frame large enough for the differences to matter.

#16538

We should make a small section on this. We should also probably just write .explode :< (note that for strings we already have this: it's the expand=True option in .str.split()).
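For the string case mentioned above, a minimal sketch of `.str.split(..., expand=True)` followed by a reshape to one value per row might look like this. The sample data and the name `tags` are hypothetical.

```python
import pandas as pd

# Hypothetical delimited strings, one per row.
s = pd.Series(["a,b,c", "d,e", "f"], name="tags")

# expand=True returns a DataFrame with one column per piece;
# rows with fewer pieces are padded with missing values.
wide = s.str.split(",", expand=True)

# Reshape to one piece per row, keeping the original row label as the index.
long = (
    wide.stack()
    .dropna()  # explicit, since stack's handling of missing values varies by version
    .reset_index(level=1, drop=True)
    .rename("tags")
)
print(long)
```

Later pandas releases (0.25 and up) added `Series.explode` and `DataFrame.explode`, which cover the list-valued case directly.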

jreback · 2017-12-30T13:30:11Z

more refs

#8517

jreback · 2017-12-30T13:32:36Z

This is pretty idiomatic / efficient.

# Assumes df is indexed by ['name', 'opponent'] and nearest_neighbors holds lists.
(pd.melt(df.nearest_neighbors.apply(pd.Series).reset_index(),
         id_vars=['name', 'opponent'],
         value_name='nearest_neighbors')
   .set_index(['name', 'opponent'])
   .drop('variable', axis=1)
   .dropna()
   .sort_index())

pdpark · 2017-12-30T18:50:48Z

I read through the examples in the links, very informative, thanks. I'll put something together and submit a PR.

pdpark · 2018-01-01T03:12:39Z

Just want to clarify something: as I understand it, this issue was opened to document the fact that storing lists in DataFrames is not ideal. However, the examples above are all about how to explode lists that are already stored in DataFrames. Is the recommended approach to create a temporary DataFrame with lists in order to create the preferred DataFrame without lists?

jreback · 2018-01-01T04:09:28Z

No, a long-form DataFrame is ideal from both a performance and an idiomatic perspective. Those examples illustrate what to do if you already have lists.

The point is that you shouldn't have them in the first place; if you do, you invariably need to convert them anyway.
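To make the "long form is idiomatic" point concrete, here is a small sketch with made-up data. The frames `long_df` and `nested_df` and their columns are hypothetical; they hold the same values in the two layouts.

```python
import pandas as pd

# Hypothetical long-form data: one row per (name, score) pair, no lists.
long_df = pd.DataFrame(
    {
        "name": ["a", "a", "a", "b", "b"],
        "score": [1.0, 2.0, 3.0, 10.0, 20.0],
    }
)

# Aggregation and filtering are ordinary vectorized operations.
mean_by_name = long_df.groupby("name")["score"].mean()
high_scores = long_df[long_df["score"] > 2.5]

# The same data with lists in cells needs a Python function call per row.
nested_df = pd.DataFrame(
    {"name": ["a", "b"], "score": [[1.0, 2.0, 3.0], [10.0, 20.0]]}
)
mean_via_apply = nested_df.set_index("name")["score"].apply(
    lambda xs: sum(xs) / len(xs)
)
print(mean_by_name)
print(mean_via_apply)
```

Both layouts give the same means, but filtering individual scores (as in `high_scores`) has no direct equivalent on the nested column without first unpacking the lists.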

pdpark · 2018-01-02T01:30:28Z

This example, also from here: https://stackoverflow.com/a/46161733, seems simpler/easier to understand?

(df.nearest_neighbors.apply(pd.Series)
   .stack()
   .reset_index(level=2, drop=True)
   .to_frame('nearest_neighbors'))

Any reason not to prefer it as the canonical example?
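To compare the melt-based and stack-based pipelines side by side, here is a self-contained sketch on a small sample frame. The frame contents are invented for illustration (the original `df` from the Stack Overflow question is not reproduced here), and the explicit `.dropna()` after `.stack()` is an addition to keep the result the same across pandas versions.

```python
import pandas as pd

# Hypothetical sample frame, indexed by (name, opponent), with a list per cell.
df = pd.DataFrame(
    {
        "name": ["A.J. Price", "A.J. Price"],
        "opponent": ["76ers", "blazers"],
        "nearest_neighbors": [
            ["Zach LaVine", "Jeremy Lin"],
            ["Nate Robinson"],
        ],
    }
).set_index(["name", "opponent"])

# Spread each list across columns; shorter lists are padded with NaN.
wide = df.nearest_neighbors.apply(pd.Series)

# Variant 1: melt the wide frame back to one neighbor per row.
via_melt = (
    pd.melt(wide.reset_index(),
            id_vars=["name", "opponent"],
            value_name="nearest_neighbors")
    .set_index(["name", "opponent"])
    .drop("variable", axis=1)
    .dropna()
    .sort_index()
)

# Variant 2: stack the columns into the index, then drop that level.
via_stack = (
    wide.stack()
    .dropna()  # explicit, since stack's NaN handling differs across versions
    .reset_index(level=2, drop=True)
    .to_frame("nearest_neighbors")
)
print(via_stack)
```

Both variants end up with one neighbor per row under the same (name, opponent) index; the stack version is shorter because it skips the intermediate `variable` column.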

jreback · 2018-01-02T01:32:43Z

yep that prob would be a nice example

pdpark · 2018-01-02T03:17:41Z

Cool, thanks.

pdpark · 2018-01-03T02:09:40Z

I want to include an example of doing an "explosion" without creating an intermediary df with lists in cells. Here's my example - what do you think?

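from collections import OrderedDict

import pandas as pd

# Base frame: one row per (name, opponent) pair, no lists in cells.
df = pd.DataFrame(OrderedDict([
    ('name', ['A.J. Price'] * 3),
    ('opponent', ['76ers', 'blazers', 'bobcats']),
    ('attribute x', ['A', 'B', 'C']),
]))

# Neighbors for each row, kept outside the frame as a plain list of lists.
nn = [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3

# One column per neighbor, aligned with the base frame by position.
df2 = pd.concat([df[['name', 'opponent']], pd.DataFrame(nn)], axis=1)

# Stack the neighbor columns into one row per neighbor.
df3 = (df2.set_index(['name', 'opponent'])
          .stack()
          .reset_index(level=2, drop=True)
          .to_frame('nearest_neighbors'))
df3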

pdpark · 2018-01-05T08:06:52Z

Added this change to existing pull request.

chris-b1 added Difficulty Novice Docs labels Jul 19, 2017

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

jreback added this to the Next Major Release milestone Dec 30, 2017

pdpark mentioned this issue Jan 5, 2018

DOC: Added note about groupby excluding Decimal columns by default #18953

Merged

pdpark pushed a commit to pdpark/pandas that referenced this issue Jan 12, 2018

DOC: Adds example of alternative to storing lists in a Dataframe

e91444e

Restores: pandas-dev#17027

pdpark mentioned this issue Jan 12, 2018

Doc: Adds example of exploding lists into columns instead of storing in dataframe cells #19215

Closed

mgautam98 mentioned this issue Oct 8, 2018

Doc: Adds example of exploding lists into columns instead of storing in dataframe cells #23041

Closed

jbrockmendel removed the Effort Medium label Oct 21, 2019

jbrockmendel added the Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). label Sep 22, 2020

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
