DOC: section on caveats of storing lists inside DataFrame/Series #17027

Open
chris-b1 opened this issue Jul 19, 2017 · 12 comments
Labels
Docs good first issue Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).

Comments

@chris-b1
Contributor

xref to a lot of issues, for example #16864

I think we could use a doc section stating that storing nested lists/arrays inside a pandas object is best avoided, showing the downsides (performance, memory use) and a worked-out example of an alternative. This seems to be hard-earned knowledge that many have, but I'm not sure we do a good job of stating it clearly.

Closely related: we might also benefit from a short section encouraging the use of Python core data structures when appropriate.

probably goes here - http://pandas.pydata.org/pandas-docs/stable/gotchas.html
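
To make the downsides mentioned above a little more concrete, here is a minimal sketch comparing a list-valued column with a long-form equivalent. The data, sizes, and variable names are made up for illustration, and the exact byte counts will vary by platform and pandas version.

```python
import pandas as pd

# Hypothetical data: 1,000 rows, each holding a small list of integers.
nested = pd.Series([[i, i + 1, i + 2] for i in range(1000)], name="values")

# Long-form equivalent: one scalar per row, with the original row label repeated.
flat = pd.Series(
    [v for row in nested for v in row],
    index=[i for i, row in enumerate(nested) for _ in row],
    name="values",
)

# The nested column is object dtype: pandas only stores pointers to Python lists.
print(nested.dtype)
# The long-form column uses a native integer dtype backed by a NumPy array.
print(flat.dtype)

# deep=True also counts the Python list objects behind the pointers.
print(nested.memory_usage(deep=True))
print(flat.memory_usage(deep=True))
```

The object-dtype column cannot use vectorized NumPy operations, so most work on it falls back to Python-level loops.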

@pdpark

pdpark commented Dec 30, 2017

I'd be happy to take this, but I'm not sure what "a worked out example of an alternative" would look like. I've found a few discussions around storing lists in DataFrame cells and none of them discouraged it. This discussion on Stack Overflow is the only one I've found with alternatives: https://stackoverflow.com/questions/39661198/optimal-way-to-add-small-lists-to-pandas-dataframe. Which is the best option? Or is there another, better option? Thanks.

@jreback jreback added this to the Next Major Release milestone Dec 30, 2017
@jreback
Contributor

jreback commented Dec 30, 2017

https://stackoverflow.com/questions/45587778/python-explode-rows-from-panda-dataframe
https://stackoverflow.com/questions/44361160/explode-a-csv-in-python
https://stackoverflow.com/questions/38428796/how-to-do-lateral-view-explode-in-pandas

FYI, the timings are suspect, of course; these examples don't use a large enough frame to actually matter.

#16538

We should make a small section on this. Also, we should prob just write .explode :< (note that for strings we already have this; it's the expand=True option in .str.split()).
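
For the string case mentioned above, a small sketch of `.str.split(expand=True)` followed by a reshape to long form might look like this. The sample strings and index labels are hypothetical, and the explicit `dropna()` is there so the result does not depend on whether `stack()` discards missing values in the installed pandas version.

```python
import pandas as pd

# Hypothetical delimited strings, one cell per row.
s = pd.Series(["Zach LaVine,Jeremy Lin", "Nate Robinson"], index=["a", "b"])

# expand=True returns a DataFrame with one column per split piece,
# padding shorter rows with missing values.
wide = s.str.split(",", expand=True)

# Reshape to one row per piece, dropping the padding explicitly.
long_form = wide.stack().dropna().reset_index(level=1, drop=True)
print(long_form)
```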

@jreback
Contributor

jreback commented Dec 30, 2017

This is pretty idiomatic / efficient.

# Assumes df is indexed by ['name', 'opponent'] and that each
# nearest_neighbors cell holds a list.
(pd.melt(df.nearest_neighbors.apply(pd.Series).reset_index(),
         id_vars=['name', 'opponent'],
         value_name='nearest_neighbors')
   .set_index(['name', 'opponent'])
   .drop('variable', axis=1)
   .dropna()
   .sort_index()
)
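
The snippet above does not define `df`, so here is a self-contained sketch of the same pipeline. The frame is a made-up stand-in, assumed to be indexed by `name` and `opponent` with a list in each `nearest_neighbors` cell. The lists are deliberately of unequal length to show what `dropna()` is for.

```python
import pandas as pd

# Hypothetical frame shaped like the one the snippet assumes.
df = pd.DataFrame(
    {
        "name": ["A.J. Price", "A.J. Price"],
        "opponent": ["76ers", "blazers"],
        "nearest_neighbors": [
            ["Zach LaVine", "Jeremy Lin"],
            ["Nate Robinson"],
        ],
    }
).set_index(["name", "opponent"])

# apply(pd.Series) spreads each list across columns 0, 1, ...;
# melt then folds those columns back into rows.
long_df = (
    pd.melt(
        df.nearest_neighbors.apply(pd.Series).reset_index(),
        id_vars=["name", "opponent"],
        value_name="nearest_neighbors",
    )
    .set_index(["name", "opponent"])
    .drop("variable", axis=1)
    .dropna()  # removes the padding added for the shorter list
    .sort_index()
)
print(long_df)
```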

@pdpark

pdpark commented Dec 30, 2017

I read through the examples in the links, very informative, thanks. I'll put something together and submit a PR.

@pdpark

pdpark commented Jan 1, 2018

Just want to clarify something: this issue was opened with the intent, as I understand it, to document the fact that storing lists in dataframes is not ideal. However, the examples above are all about how to explode lists stored in data frames. Is the recommended approach to create a temporary data frame with lists in order to create the preferred dataframe without lists?

@jreback
Contributor

jreback commented Jan 1, 2018

No, a long-form DataFrame is ideal from both a performance and an idiomatic perspective. Those examples illustrate what to do if you already have lists.

The point is that you shouldn't have them in the first place; if you do, then you invariably need to convert them anyway.
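
As a rough illustration of why the long form is convenient, the sketch below runs a few common operations on a long-form frame. The frame and its values are hypothetical, chosen to match the names used elsewhere in this thread.

```python
import pandas as pd

# Hypothetical long-form frame: one row per (name, opponent, neighbor).
long_df = pd.DataFrame(
    {
        "name": ["A.J. Price"] * 4,
        "opponent": ["76ers", "76ers", "blazers", "blazers"],
        "nearest_neighbors": [
            "Zach LaVine",
            "Jeremy Lin",
            "Zach LaVine",
            "Nate Robinson",
        ],
    }
)

# Filtering is a plain boolean mask, with no per-cell list scanning.
has_lin = long_df[long_df["nearest_neighbors"] == "Jeremy Lin"]

# Counting neighbors per opponent is an ordinary groupby.
counts = long_df.groupby("opponent")["nearest_neighbors"].count()

# How often each neighbor appears across all rows.
freq = long_df["nearest_neighbors"].value_counts()
```

With lists stored in cells, each of these would typically need an `apply` with a Python function per row.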

@pdpark

pdpark commented Jan 2, 2018

This example, also from here: https://stackoverflow.com/a/46161733, seems simpler/easier to understand?

(df.nearest_neighbors.apply(pd.Series)
   .stack()
   .reset_index(level=2, drop=True)
   .to_frame('nearest_neighbors'))
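
For anyone who wants to run the stack-based version, here is a self-contained sketch on made-up data. It assumes a two-level (`name`, `opponent`) index, which is what `level=2` in `reset_index` relies on. The `dropna()` call is an addition to the snippet above, so that rows padded for shorter lists are removed regardless of how the installed pandas version's `stack()` treats missing values.

```python
import pandas as pd

# Hypothetical list-valued Series with a (name, opponent) MultiIndex.
nn = pd.Series(
    [["Zach LaVine", "Jeremy Lin"], ["Nate Robinson"]],
    index=pd.MultiIndex.from_tuples(
        [("A.J. Price", "76ers"), ("A.J. Price", "blazers")],
        names=["name", "opponent"],
    ),
    name="nearest_neighbors",
)

long_df = (
    nn.apply(pd.Series)               # one column per list position
    .stack()                          # list position becomes index level 2
    .dropna()                         # drop padding for shorter lists
    .reset_index(level=2, drop=True)  # discard the list position
    .to_frame("nearest_neighbors")
)
print(long_df)
```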

Any reason not to prefer it as the canonical example?

@jreback
Contributor

jreback commented Jan 2, 2018

yep that prob would be a nice example

@pdpark

pdpark commented Jan 2, 2018

Cool, thanks.

@pdpark

pdpark commented Jan 3, 2018

I want to include an example of doing an "explosion" without creating an intermediary df with lists in cells. Here's my example - what do you think?

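from collections import OrderedDict

import pandas as pd

df = pd.DataFrame(OrderedDict([
    ('name', ['A.J. Price'] * 3),
    ('opponent', ['76ers', 'blazers', 'bobcats']),
    ('attribute x', ['A', 'B', 'C']),
]))

# One list of neighbors per row of df.
nn = [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3

# Spread the lists across columns directly, without storing them in cells.
df2 = pd.concat([df[['name', 'opponent']], pd.DataFrame(nn)], axis=1)

# Stack the neighbor columns into a single long-form column.
df3 = (df2.set_index(['name', 'opponent'])
          .stack()
          .reset_index(level=2, drop=True)
          .to_frame('nearest_neighbors'))
df3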

@pdpark

pdpark commented Jan 5, 2018

Added this change to existing pull request.

pdpark pushed a commit to pdpark/pandas that referenced this issue Jan 12, 2018
@jbrockmendel jbrockmendel added the Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). label Sep 22, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
6 participants