
CLN: spin msgpack support to [new] pandas-msgpack #15841

Closed
jreback opened this issue Mar 29, 2017 · 16 comments
jreback (Contributor) commented Mar 29, 2017

similar to the split-off for pandas-gbq.

Would simplify the main pandas codebase a bit by making this a separately maintained package.

jreback added this to the Next Major Release milestone Mar 29, 2017
jreback (Contributor, Author) commented Mar 29, 2017

cc @llllllllll @ssanderson

llllllllll (Contributor) commented:

Thanks for the heads up. We mainly use this with blaze so we just need to add that as a dependency to blaze[server]. I take it we would just need to import loads/dumps from pandas_msgpack instead of pandas, right?
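
A minimal sketch of the kind of import shim blaze[server] could use, assuming the spun-off package keeps the same loads/dumps entry points; the pandas_msgpack module path is an assumption, since the package does not exist yet at this point in the thread:

# Hypothetical compatibility shim: prefer the standalone package, fall back to
# the copy bundled with pandas 0.x. The pandas_msgpack path is an assumption.
try:
    from pandas_msgpack import loads, dumps
except ImportError:
    from pandas.msgpack import loads, dumps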

jreback (Contributor, Author) commented Mar 29, 2017

@llllllllll I was just proposing this to see if people think it's a good idea :>

but yes, you could be explicit about the dependency.

My reasons for proposing this are severalfold, really (a rough sketch of the affected API follows below):

  • it removes some code from main pandas and lets issues in this package be fixed independently of the pandas release cycle (not a big deal, as this code is pretty mature).
  • this won't be deprecated in pandas 1.0, but I don't think it will be supported in pandas 2, mainly because we are bringing on stream more 'standardized' formats (parquet / arrow / feather). These DO support wire formats (arrow), so that covers this use case.
  • the internal msgpack format is pretty pandas-specific; not that this is a problem, but it makes interop with other systems a bit more tricky.
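
For context, a rough sketch of the round trip that would move out of core pandas; the DataFrame.to_msgpack / read_msgpack calls are the existing pandas 0.x API, while the pandas_msgpack names are only an assumption about what a straight spin-off would look like:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# existing pandas 0.x API that this proposal would spin off
df.to_msgpack("frame.msg")
roundtripped = pd.read_msgpack("frame.msg")

# prospective standalone equivalent (names assumed; the package is only a
# proposal at this point in the thread):
# from pandas_msgpack import to_msgpack, read_msgpack
# to_msgpack("frame.msg", df)
# roundtripped = read_msgpack("frame.msg")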

jreback (Contributor, Author) commented Mar 29, 2017

cc @wesm

wesm (Member) commented Mar 29, 2017

+1 on splitting off this code

wesm (Member) commented Mar 29, 2017

I also think splitting off pandas-json would be a good idea (I would like to consider deprecating that in favor of a native RapidJSON-based C++ reader via Arrow sometime in the next year).

ssanderson (Contributor) commented:

the internal msgpack format is pretty pandas-specific; not that this is a problem, but it makes interop with other systems a bit more tricky.

This to me feels like an argument against splitting off this code into a separate package. If the pandas-msgpack format is tied to implementation details of pandas objects, then you would usually want to upgrade pandas-msgpack and pandas in lockstep, or else you risk breakage because your pandas-msgpack version is incompatible with your pandas version. Having the msgpack format in the pandas codebase itself ensures that you don't have to worry as much about version compatibility issues.

wesm (Member) commented Mar 30, 2017

With continuous integration tools, we can automate the testing, so that doesn't seem like an issue. I think pandas's monolithic nature has made it harder for the community to make progress on components that may evolve at a different pace vs. the rest of the project.

ssanderson (Contributor) commented:

I think pandas's monolithic nature has made it harder for the community to make progress on components that may evolve at a different pace vs. the rest of the project.

I agree with this, and as someone building an application on top of pandas, I appreciate the value of potentially slimming down the core distribution. At the same time, I suspect that the monolithic nature of pandas is actually a feature for many of pandas' users, since it means they don't have to worry about or manage a constellation of pandas-* packages; they can just pip install pandas and get everything they might use.

One way to alleviate this concern might be to have something like a pandas-extra "meta-package", which would install the core pandas library along with related packages like pandas-msgpack. This would allow users to pick and choose which pieces they want if they care to do so, while still allowing users to install the whole world if they can't be bothered to figure out exactly which sub-packages they need.
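
A minimal sketch of what such a meta-package's setup.py could look like; the pandas-extra name and the dependency list are illustrative only, not an agreed plan:

# setup.py for a hypothetical "pandas-extra" meta-package: it ships no code of
# its own; its only job is to pull in pandas plus the companion packages.
from setuptools import setup

setup(
    name="pandas-extra",
    version="0.1.0",
    description="Meta-package: pandas plus commonly used companion packages",
    install_requires=[
        "pandas",
        "pandas-msgpack",  # the spin-off discussed in this issue
        "pandas-gbq",
        "numexpr",         # optional performance dependencies
        "bottleneck",
    ],
)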

jreback (Contributor, Author) commented Mar 30, 2017

actually a metapackage would be great for this

e.g. right now we should have a

pandas-perf which includes numexpr and bottleneck

you could actually go nuts with this:

pandas-io (excel, sql, hdf5, json)
pandas-io-extras (html, big query, msgpack)

so a big question is how to organize this to make it useful and not confusing

jreback (Contributor, Author) commented Mar 30, 2017

morning project!

https://github.com/pydata/pandas-msgpack
http://pandas-msgpack.readthedocs.io/en/latest/

not released yet.
is there an easy way to build wheels all at once (rather than using my separate envs to do this)? could/should have travis do it I suppose.

wikiped commented Mar 31, 2017

@jreback

so a big question is how to organize this to make it useful and not confusing

Just to add a "lamer's" perspective on this. Not long ago ipython notebook / jupyter went through a similar split, so now there are a bunch of 'sub-packages' available for installation. For example in conda:

jupyter
jupyter_client
jupyter_cms
jupyter_console
jupyter_contrib_core
jupyter_contrib_nbextensions
jupyter_core
jupyter_dashboards
jupyter_highlight_selected_word
jupyter_kernel_gateway
jupyter_latex_envs
jupyter_nbextensions_configurator
jupyter_sphinx

When the split happened, my initial reaction was indeed "what should I install now?". Well, it turned out that conda install jupyter was sufficient to get all the functionality that was available before the split. Any "extra" feature (one that was not historically part of the core ipython notebook) had to be added by installing that "extra" package.

I might be wrong, but it is my understanding that jupyter is not a package per se, but a script that installs all the required sub-packages to provide the expected functionality (i.e. what users have gotten used to).

Without doubt, clearly documented changes are indispensable.
And because pandas is an exemplarily well-documented library, its docs also largely 'define' the perception of 'what pandas consists of'. So the closer the parts of the split are to what is in the docs, the easier it might be for users to understand 'what's going on and where it's going'.

I.e. why should performance_package not be part of core pandas functionality? It's subjective, but I would expect conda install pandas to deliver pandas with all its mighty powers and not some stripped-down version. If for some reason a user decides that's not required, a "lighter" version through installation of the right set of sub-packages might be appropriate.

jreback (Contributor, Author) commented Mar 31, 2017

@wikiped jupyter is itself a metapackage. It looks and feels exactly like a regular package (with versioning), and its purpose is simply to install dependencies.

pandas just tries to install the minimum dependencies ATM. IOW, it does not install numexpr or bottleneck (or scipy).

For example, if you are using SQL, there is a myriad of options to install; pandas cannot know what to do here (sure, we could install sqlalchemy, I suppose).

A possible future path is to make pandas a meta-package, with sub-packages like:

pandas-core
pandas-io
pandas-perf
pandas-...

etc. The questions are:

  • is this worth it from a complexity point of view? (surely there is less core pandas code, but it's not necessarily simpler, and it now has to deal with the various sub-package APIs that could change)
  • from a user point of view, is this wanted? most people probably just want 'pandas' and don't care where code/packages live. Though, for example, if they never use pandas-gbq or pandas-msgpack, then it's fine to have these as non-core.
  • using conda this is really simple, but with pip it is harder to make dependencies work like this (though it has gotten better recently). Our installed base is a big user of both, so this is a consideration.

My feeling at this point is that it is ok to split off non-core functionality and direct the user to install it if they want it.

jorisvandenbossche (Member) commented:

For pip there is the possibility to use extras_require to have something like pip install pandas[io-extras] (dask also makes use of this: http://dask.pydata.org/en/latest/install.html).
But I am not sure if conda has a similar mechanism.
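
A sketch of that extras_require approach; the group names and their contents here are illustrative assumptions, not the actual pandas setup:

# excerpt from a hypothetical pandas setup.py exposing optional dependency groups
from setuptools import setup

setup(
    name="pandas",
    version="0.20.0",
    install_requires=["numpy", "python-dateutil", "pytz"],
    extras_require={
        "io-extras": ["pandas-msgpack", "pandas-gbq", "lxml", "html5lib"],
        "perf": ["numexpr", "bottleneck"],
    },
)

Users who want the optional pieces would then run pip install "pandas[io-extras]", while a plain pip install pandas stays minimal.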

wikiped commented Mar 31, 2017

I also think making pandas a metapackage is the right way to go, and I would say it would be beneficial for both developers (easier "develo-gement", i.e. development and management, of the sub-parts) and users (easier selection of required features and better control of the installation footprint).

Complexity might be constrained / driven by:

  1. Speed of overall library development (new features, etc.).
    I.e. does spinning off the sub-part help speed up development of the library as a whole? (msgpack would probably score high on this)
  2. The already-mentioned development / management benefits. I.e. does the spin-off make the development process of the sub-part and of the library easier? (msgpack might also score high on this)
  3. How "essential" is the sub-part to the core functionality? I.e. will pandas lose its core benefits from a user's POV if this sub-part has to be installed separately? (msgpack would perhaps score lower on this)
  4. The already-mentioned usability / "installability" of sub-parts. I.e. how confusing is it for the end user to understand whether he/she needs to install this particular sub-part, and how easy is it to do? (msgpack would also score high on this)

I can hardly speak for the whole pandas user community, and perhaps it would be best to get broader feedback on this from users, though I am not sure what the right way to handle that would be. Perhaps some warning message inviting users to give feedback in whatever the right channel might be?

Part of this exercise is to make a good list of sub-parts to facilitate good feedback. And it might be good to go a bit nuts about it and have a longer-than-needed list of parts to select from:

pandas_core
pandas_io_excel
pandas_io_sql
pandas_io_hdf5
pandas_io_hadoop / hdfs
pandas_statistics
pandas_machine_learning
pandas_perf
pandas_parallel
pandas_notebook
...

The final list of sub-parts might be re-grouped depending on feedback and where it makes sense.

simonjayhawkins (Member) commented:

msgpack is deprecated #30112
