DOC: section on caveats of storing lists inside DataFrame/Series #17027
Comments
I'd be happy to take this, but I'm not sure what "a worked-out example of an alternative" would look like. I've found a few discussions about storing lists in DataFrame cells, and none of them discouraged it. This discussion on Stack Overflow is the only one I've found with alternatives: https://stackoverflow.com/questions/39661198/optimal-way-to-add-small-lists-to-pandas-dataframe. Which is the best option? Or is there another, better option? Thanks. |
https://stackoverflow.com/questions/45587778/python-explode-rows-from-panda-dataframe FYI, the timings are suspect, of course: these examples don't use a frame large enough to actually matter. We should make a small section on this. Also should prob just write |
more refs |
This is pretty idiomatic / efficient.
|
I read through the examples in the links, very informative, thanks. I'll put something together and submit a PR. |
Just want to clarify something: as I understand it, this issue was opened to document the fact that storing lists in DataFrames is not ideal. However, the examples above are all about how to explode lists already stored in DataFrames. Is the recommended approach to create a temporary DataFrame with lists in order to create the preferred DataFrame without lists? |
No, a long-form DataFrame is ideal from a performance and idiomatic perspective. Those examples illustrate what to do if you already have lists. The point is that you shouldn't have them in the first place; if you do, then you invariably need to convert them anyway. |
This example, also from here: https://stackoverflow.com/a/46161733, seems simpler and easier to understand: df.nearest_neighbors.apply(pd.Series). Any reason not to prefer it as the canonical example? |
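To make the long-form point concrete, here is a minimal sketch of converting list cells into one row per element. It assumes a pandas version that provides `DataFrame.explode` (0.25 or later, which postdates this thread); the frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each cell of "nearest_neighbors" holds a Python list.
wide = pd.DataFrame({
    "name": ["A", "B"],
    "nearest_neighbors": [["x", "y"], ["z"]],
})

# Long form: one row per (name, neighbor) pair, with no lists left in cells.
long_form = wide.explode("nearest_neighbors").reset_index(drop=True)
print(long_form)
```

If the data can be built in long form from the start, the conversion step is unnecessary.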
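A small sketch of what that expression does, using a one-row frame invented for illustration (the player names are borrowed from later in this thread). Each list becomes a row of a new DataFrame with one integer-labelled column per list position.

```python
import pandas as pd

# Illustrative frame with a list stored in each "nearest_neighbors" cell.
df = pd.DataFrame({
    "name": ["A.J. Price"],
    "nearest_neighbors": [["Zach LaVine", "Jeremy Lin", "Nate Robinson"]],
})

# Turn every list into a Series; the results are aligned into columns 0, 1, 2.
expanded = df.nearest_neighbors.apply(pd.Series)
print(expanded)
```

Note that `apply(pd.Series)` builds one Series per row, so it is readable but may be slow on large frames.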
yep that prob would be a nice example |
Cool, thanks. |
I want to include an example of doing an "explosion" without creating an intermediary df with lists in cells. Here's my example - what do you think?
df = (pd.DataFrame(OrderedDict([('name', ['A.J. Price']*3),
nn = [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']]*3
df2 = pd.concat([df[['name','opponent']], pd.DataFrame(nn)], axis=1)
df3 = (df2.set_index(['name', 'opponent']) |
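The snippet above is cut off in this copy of the thread, so what follows is only a sketch in the same spirit, not the author's original code. The `opponent` values are placeholders, the neighbor list is shortened to three names, and the `stack`/`rename` steps are an assumption about how the chain might continue.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A.J. Price"] * 3,
    "opponent": ["opp_1", "opp_2", "opp_3"],  # placeholder values, not from the thread
})
nn = [["Zach LaVine", "Jeremy Lin", "Nate Robinson"]] * 3

# Build the neighbors as ordinary columns (0, 1, 2) rather than as list cells.
df2 = pd.concat([df[["name", "opponent"]], pd.DataFrame(nn)], axis=1)

# Stack the neighbor columns into rows to get a long-form frame.
df3 = (
    df2.set_index(["name", "opponent"])
    .stack()
    .reset_index(level=2, drop=True)
    .rename("nearest_neighbor")
    .reset_index()
)
print(df3)
```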
Added this change to existing pull request. |
xref to a lot of issues, for example #16864
I think we could use a doc section stating that storing nested lists/arrays inside a pandas object is best avoided, showing the downsides (performance, memory use) and a worked-out example of an alternative. This seems to be earned knowledge that many have, but I'm not sure we do a good job of stating it clearly.
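As a rough illustration of the downside (a sketch with made-up data, not a benchmark): a Series of lists falls back to `object` dtype, so per-row work goes through Python-level calls, while the same values in long form keep a numeric dtype and can use built-in aggregations.

```python
import pandas as pd

# Nested: each cell is a Python list, so the Series has object dtype.
nested = pd.Series([[1, 2, 3], [4, 5]])

# Long form: one value per row; the index records which original row it came from.
flat = pd.Series([1, 2, 3, 4, 5], index=[0, 0, 0, 1, 1])

print(nested.dtype)  # object
print(flat.dtype)    # an integer dtype

# Per-row sums: a Python-level apply versus a built-in groupby aggregation.
nested_sums = nested.apply(sum)
flat_sums = flat.groupby(level=0).sum()
```

Both approaches give the same sums here; the difference is expected to show up in speed and memory only on frames much larger than this one.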
Closely related, might also benefit from a little section encouraging use of Python core data structures when appropriate.
probably goes here - http://pandas.pydata.org/pandas-docs/stable/gotchas.html