feat: allow for creating new_series / from_dict without specifying a backend #876
I was thinking https://github.com/altair-viz/vega_datasets might be a good candidate for experimenting with this?
yup, definitely, thanks @dangotbanned!
I think it's a nice idea, and it would apply to cases where your default backend isn't available. For instance, if we were to change our dependency from one backend to another.
The use case I was thinking of was that a Polars user could use Fairlearn without needing to install pandas (nor, probably, PyArrow - maybe not in 2024, but I'd be surprised if it didn't become a required pandas dependency eventually).
So imagine you're a library that requires some sort of dataframe backend. It would be very impractical not to have any default dataframe backend in your dependencies.

On the other hand, as the package maintainers we could make the backend dependency optional.

So basically, on a fresh env, users would still need to install at least one backend themselves.
interesting, thanks... and it's kinda unfortunate that there's not a way to do "pip install minus" to exclude a dependency 🤔

Not really sure if this could work at all in Fairlearn then? Unless your only reason for using it was a stable API across pandas versions 😉
So I think on the
that could make sense, thanks for explaining.

Just out of interest, if you're choosing a backend, why not use Ibis? Presumably it's that they still require PyArrow for all backends? If so, a feature request to them to not require PyArrow for all backends would probably be a good idea 😉
Never really gave Ibis a try.
@adrinjalali can you explain the rationale behind:
@lostmygithubaccount pyarrow has had an interesting history, and my personal decision for not wanting to depend on it is multi-faceted. I remember in 2018-2019 being quite excited about it. We had discussions about supporting it in scikit-learn, and at some point I was happy to even work on the library. But then it became a massive piece of software, the kind of dependency that you'd only want to add if you really have to; otherwise you'd rather keep your environments lightweight. At least on conda-forge:

```
$ micromamba install pyarrow
conda-forge/noarch 16.4MB @ 4.2MB/s 4.0s
conda-forge/linux-64 38.0MB @ 5.2MB/s 7.5s
Transaction
Prefix: /home/adrin/micromamba/envs/delete
Updating specs:
- pyarrow
Package Version Build Channel Size
─────────────────────────────────────────────────────────────────────────────────────────────────
Install:
─────────────────────────────────────────────────────────────────────────────────────────────────
+ libstdcxx 14.1.0 hc0a3c3a_1 conda-forge 4MB
+ libutf8proc 2.8.0 h166bdaf_0 conda-forge Cached
+ c-ares 1.33.1 heb4867d_0 conda-forge 183kB
+ libssh2 1.11.0 h0841786_0 conda-forge Cached
+ keyutils 1.6.1 h166bdaf_0 conda-forge Cached
+ aws-c-common 0.9.28 hb9d3cd8_0 conda-forge 236kB
+ s2n 1.5.2 h7b32b05_0 conda-forge 352kB
+ libbrotlicommon 1.1.0 hb9d3cd8_2 conda-forge 69kB
+ python_abi 3.12 5_cp312 conda-forge 6kB
+ libiconv 1.17 hd590300_2 conda-forge Cached
+ libevent 2.1.12 hf998b51_1 conda-forge Cached
+ libev 4.33 hd590300_2 conda-forge Cached
+ libgfortran5 14.1.0 hc5f4f2c_1 conda-forge 1MB
+ libedit 3.1.20191231 he28a2e2_2 conda-forge Cached
+ libstdcxx-ng 14.1.0 h4852527_1 conda-forge 52kB
+ aws-checksums 0.1.18 h756ea98_11 conda-forge 50kB
+ aws-c-cal 0.7.4 hfd43aa1_1 conda-forge 48kB
+ aws-c-compression 0.2.19 h756ea98_1 conda-forge 19kB
+ aws-c-sdkutils 0.1.19 h756ea98_3 conda-forge 56kB
+ libbrotlienc 1.1.0 hb9d3cd8_2 conda-forge 282kB
+ libbrotlidec 1.1.0 hb9d3cd8_2 conda-forge 33kB
+ libgfortran 14.1.0 h69a702a_1 conda-forge 52kB
+ icu 75.1 he02047a_0 conda-forge 12MB
+ lz4-c 1.9.4 hcb278e6_0 conda-forge Cached
+ gflags 2.2.2 he1b5a44_1004 conda-forge Cached
+ libnghttp2 1.58.0 h47da74e_1 conda-forge Cached
+ krb5 1.21.3 h659f571_0 conda-forge 1MB
+ libthrift 0.20.0 h0e7cc3e_1 conda-forge 417kB
+ libabseil 20240116.2 cxx17_he02047a_1 conda-forge 1MB
+ libcrc32c 1.1.2 h9c3ff4c_0 conda-forge Cached
+ zstd 1.5.6 ha6fb4c9_0 conda-forge Cached
+ snappy 1.2.1 ha2e4443_0 conda-forge 42kB
+ aws-c-io 0.14.18 hc2627b9_9 conda-forge 159kB
+ libgfortran-ng 14.1.0 h69a702a_1 conda-forge 52kB
+ libxml2 2.12.7 he7c6b58_4 conda-forge 707kB
+ glog 0.7.1 hbabe93e_0 conda-forge 143kB
+ libre2-11 2023.09.01 h5a48ba9_2 conda-forge Cached
+ libprotobuf 4.25.3 h08a7969_0 conda-forge Cached
+ libcurl 8.10.0 hbbe4b11_0 conda-forge 425kB
+ aws-c-event-stream 0.4.3 h235a6dd_1 conda-forge 54kB
+ aws-c-http 0.8.9 h5e77a74_0 conda-forge 198kB
+ libopenblas 0.3.27 pthreads_hac2b453_1 conda-forge Cached
+ re2 2023.09.01 h7f4b329_2 conda-forge Cached
+ orc 2.0.2 h669347b_0 conda-forge 1MB
+ azure-core-cpp 1.13.0 h935415a_0 conda-forge 338kB
+ aws-c-mqtt 0.10.5 h0009854_0 conda-forge 194kB
+ aws-c-auth 0.7.30 hec5e740_0 conda-forge 107kB
+ libblas 3.9.0 23_linux64_openblas conda-forge Cached
+ libgrpc 1.62.2 h15f2491_0 conda-forge Cached
+ azure-identity-cpp 1.8.0 hd126650_2 conda-forge 200kB
+ azure-storage-common-cpp 12.7.0 h10ac4d7_1 conda-forge 143kB
+ aws-c-s3 0.6.5 hbaf354b_4 conda-forge 113kB
+ libcblas 3.9.0 23_linux64_openblas conda-forge Cached
+ liblapack 3.9.0 23_linux64_openblas conda-forge Cached
+ libgoogle-cloud 2.29.0 h435de7b_0 conda-forge 1MB
+ azure-storage-blobs-cpp 12.12.0 hd2e3451_0 conda-forge 523kB
+ aws-crt-cpp 0.28.2 h6c0439f_6 conda-forge 350kB
+ numpy 2.1.1 py312h58c1407_0 conda-forge 8MB
+ libgoogle-cloud-storage 2.29.0 h0121fbd_0 conda-forge 782kB
+ azure-storage-files-datalake-cpp 12.11.0 h325d260_1 conda-forge 274kB
+ aws-sdk-cpp 1.11.379 h5a9005d_9 conda-forge 3MB
+ libarrow 17.0.0 hc80a628_14_cpu conda-forge 9MB
+ libarrow-acero 17.0.0 h5888daf_14_cpu conda-forge 608kB
+ libparquet 17.0.0 h39682fd_14_cpu conda-forge 1MB
+ pyarrow-core 17.0.0 py312h9cafe31_1_cpu conda-forge 5MB
+ libarrow-dataset 17.0.0 h5888daf_14_cpu conda-forge 585kB
+ libarrow-substrait 17.0.0 hf54134d_14_cpu conda-forge 550kB
+ pyarrow 17.0.0 py312h9cebb41_1 conda-forge 26kB
Summary:
Install: 93 packages
Total download: 66MB
─────────────────────────────────────────────────────────────────────────────────────────────────
Confirm changes: [Y/n]
```

As a maintainer of a library which has nothing to do with cloud computing, why on earth would I want to have AWS AND Azure libraries as transitive dependencies? Even if I were doing cloud stuff, I'd probably be working with one of them, not both. That's an insane amount of bloat installed when pulling in pyarrow.

On top of that, we've had the time where

And the cherry on top is your employer firing pyarrow maintainers, including some of my friends, who have been working on the project for a while. Not only does that not make me want to have the lib as a dependency, it also doesn't give me confidence in the future of the project.
out of curiosity, what's your bar for a lightweight Python environment? Some number of MBs? I don't personally use conda, but it does seem like you can get similarly sized installations as from PyPI: https://arrow.apache.org/docs/python/install.html#python-conda-differences

I know there are ongoing efforts to reduce the installation size of PyArrow further (and reduce dependencies).
@MarcoGorelli this looks fun :-) Do you think this is ready to accept a PR? May I give it a go during the sprint tomorrow?
Thanks @Cheukting! I'm not 100% sure about this one, as I'd originally misunderstood the Fairlearn use case. Maybe we can punt on it for the time being.

We'll open some issues later which we've reserved specially for the sprint, though, so there'll be plenty of interesting things to work on 😎
Ok, thanks @MarcoGorelli, but in the future, if this is needed, I am happy to help too.
Would it be feasible for `from_dict` to accept an existing DataFrame to specify the backend? Something like:

```python
values_df = nw.from_dict({'a': [1, 2, 3], 'b': [4, 5, 6]}, like=other_df)
```

Or perhaps there is a better representation of a backend that could be used instead of the DataFrame itself. The motivation would be to make it easy to have a function that accepts and returns a Narwhals DataFrame using the same backend, but where the resulting DataFrame isn't computed directly from the input DataFrame.

I can create a separate issue for this, but for my actual use case in VegaFusion, I'd actually want something like
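For illustration, one way a hypothetical `like=` argument could work is to infer the backend from the top-level module that defines the input object's type. This is only a sketch of the idea, not Narwhals API; `backend_name` and `from_dict_like` are made-up names:

```python
import importlib


def backend_name(obj) -> str:
    # e.g. pandas.core.frame.DataFrame -> "pandas", polars.DataFrame -> "polars"
    return type(obj).__module__.split(".", 1)[0]


def from_dict_like(data: dict, like):
    # Import the top-level package of `like` and use its DataFrame constructor.
    # This assumes the backend exposes a top-level DataFrame class, which holds
    # for pandas and Polars.
    backend = importlib.import_module(backend_name(like))
    return backend.DataFrame(data)
```

Note that for a wrapped Narwhals DataFrame you'd first have to unwrap to the native object, since the wrapper's own module is `narwhals`.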
Hey @jonmmease, if I've understood the request, I think you can do this already:

```python
In [13]: other_df = nw.from_native(pl.DataFrame({'a': [2, 3]}))

In [14]: values_df = nw.from_dict({'a': [1, 2, 3], 'b': [4, 5, 6]}, native_namespace=nw.get_native_namespace(other_df))

In [15]: values_df
Out[15]:
┌───────────────────────────────────────┐
| Narwhals DataFrame                    |
| Use `.to_native` to see native output |
└───────────────────────────────────────┘

In [16]: values_df.to_native()
Out[16]:
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘
```
Ah, nice. I had missed the `native_namespace` argument.
We could allow users to do something like:
without specifying which backend they want to use. Then, we use whatever they have installed, but with some priority list, like:
If there's demand, we could allow users to customise the priority order (e.g. first try pandas, then Polars, then PyArrow...)
Use case: in Fairlearn, they use dataframes internally to do some calculations, but this is hidden from the user. The user shouldn't care which dataframe Fairlearn uses internally to do those calculations, so long as it's something they have installed
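The fallback described above could look roughly like the following sketch. `resolve_backend` and the priority order shown are illustrative assumptions, not an actual Narwhals implementation:

```python
import importlib.util

# Hypothetical default priority; the actual order is up for discussion.
DEFAULT_PRIORITY = ["pandas", "polars", "pyarrow"]


def resolve_backend(priority=DEFAULT_PRIORITY) -> str:
    """Return the first backend in `priority` that is installed."""
    for name in priority:
        # find_spec checks availability without actually importing the package
        if importlib.util.find_spec(name) is not None:
            return name
    raise ModuleNotFoundError(
        f"No supported dataframe backend found; tried {priority!r}"
    )
```

Customising the priority order would then just be a matter of letting users pass their own `priority` list.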
cc @adrinjalali in case you have comments / requests 🙏