WIP: Initial split for libarrow components #1175
Conversation
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR (…).
First look - thanks for tackling this!
recipe/meta.yaml
Outdated
- {{ pin_subpackage('libarrow', exact=True) }}
- {{ pin_subpackage('libarrow-dataset', exact=True) }}
- {{ pin_subpackage('libarrow-flight', exact=True) }}
- {{ pin_subpackage('libarrow-gandiva', exact=True) }}
- {{ pin_subpackage('libarrow-substrait', exact=True) }}
I guess there's a question hiding in there whether we want to provide a libarrow-all or something like that, which is just a metapackage that pulls together all the lib* components, and then pyarrow would host-depend on that.
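A hypothetical sketch of such a metapackage output in meta.yaml (the output names and exact pins are assumptions based on the split discussed in this PR, not a final design):

```yaml
  - name: libarrow-all
    # metapackage: nothing is built here, it only aggregates the components
    requirements:
      run:
        - {{ pin_subpackage('libarrow', exact=True) }}
        - {{ pin_subpackage('libarrow-acero', exact=True) }}
        - {{ pin_subpackage('libarrow-dataset', exact=True) }}
        - {{ pin_subpackage('libarrow-flight', exact=True) }}
        - {{ pin_subpackage('libarrow-gandiva', exact=True) }}
        - {{ pin_subpackage('libarrow-substrait', exact=True) }}
```

pyarrow could then host-depend on libarrow-all instead of enumerating every component itself.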
@kou @jorisvandenbossche do you have any idea why only some of the registered functions for compute like
and I can see:
but some of the registered functions via
In case you are curious, the CI jobs I am trying to fix first are the Linux 64 ones. (edit: added link to CI)
The issue seems to be due to basic kernels where we do not require
Another quick look; I don't think we need the full current test suite (for libarrow) duplicated to all outputs, can probably cut that down a tad.
{% set libs = (cuda_compiler_version != "None") * ["arrow_cuda"] + [
    "arrow", "arrow_acero"
] %}
What's the situation around libarrow_cuda.so? It's currently listed as an expected library in a bunch of outputs; if indeed those bits get mashed together in a single library, then these outputs would clobber each other...
Isn't this just because I am using libarrow as a run dependency? I can remove libarrow from the run dependencies, but as libarrow also contains all headers at the moment, I should probably stop testing for headers in that case.
Potentially dumb question: could we "just" build Arrow C++ once with all components enabled, and then have a way to have each actual output package ship one of the shared libraries that was created with the initial build?
Yes, that works from a package-building POV, and is actually what we should do, because we don't have the (CI) time to rebuild everything X times.
I was planning on investigating why / how to reuse the previously built component so we don't rebuild again from scratch, but I can investigate how this "selection" of which shared libraries to ship per package can be done in the conda recipe. @h-vetinari do you know of any other project that does something similar where I could take inspiration? Otherwise I'll investigate how I can just pick the specific .so + headers for the specific package. It's my first time working on a conda recipe :)
I think the best pattern for this (or at least: my personal favourite) is the one with installing into a temporary prefix, and then copying the required bits per output. See for example llvmdev. The structure would then be:
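A minimal, self-contained sketch of that "install once into a staging prefix, then copy per output" pattern (directory names and file layout here are illustrative assumptions, not the feedstock's actual structure):

```shell
# Simulate the pattern: the main build installs everything into a
# staging prefix ONCE; each output's install script then copies only
# its own shared library (and headers) into $PREFIX.
SRC_DIR="$(mktemp -d)"
PREFIX="$(mktemp -d)"

# step 1: pretend the full Arrow C++ build installed into a staging prefix
mkdir -p "${SRC_DIR}/arrow_staging/lib" "${SRC_DIR}/arrow_staging/include/arrow/dataset"
touch "${SRC_DIR}/arrow_staging/lib/libarrow.so.1300" \
      "${SRC_DIR}/arrow_staging/lib/libarrow_dataset.so.1300"
touch "${SRC_DIR}/arrow_staging/include/arrow/dataset/api.h"

# step 2: the hypothetical libarrow-dataset output copies only its bits
mkdir -p "${PREFIX}/lib" "${PREFIX}/include/arrow"
cp "${SRC_DIR}"/arrow_staging/lib/libarrow_dataset.so* "${PREFIX}/lib/"
cp -r "${SRC_DIR}/arrow_staging/include/arrow/dataset" "${PREFIX}/include/arrow/"

ls "${PREFIX}/lib"   # contains only libarrow_dataset.so.1300
```

In the real recipe, step 1 would live in the shared build script and step 2 in a small per-output install script referenced from each output's `script:` key, as llvmdev does.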
If copying the headers becomes a hassle, we can also consider having a
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR (…). I do have some suggestions for making it better though... For recipe:
Force-pushed from 4987581 to 88fb08d
Please reduce the host dependencies per output to the necessary minimum. The link check helps you on this. For example:
WARNING (libarrow-dataset): run-exports library package conda-forge::aws-crt-cpp-0.23.1-hf7d0843_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::aws-sdk-cpp-1.11.156-he6c2984_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::bzip2-1.0.8-h7f98852_4 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::glog-0.6.0-h6f12383_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libabseil-20230802.1-cxx17_h59595ed_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libbrotlidec-1.1.0-hd590300_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libbrotlienc-1.1.0-hd590300_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libgoogle-cloud-2.12.0-h8d7e28b_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libthrift-0.19.0-h8fd135c_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libutf8proc-2.8.0-h166bdaf_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libzlib-1.2.13-hd590300_5 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::lz4-c-1.9.3-h9c3ff4c_1 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::orc-1.9.0-h52d3b3c_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::re2-2023.03.02-h8c504da_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::snappy-1.1.10-h9fff704_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::ucx-1.14.0-h3484d09_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::zstd-1.5.5-hfc55251_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
This means that libarrow-dataset has a large number of unnecessary host dependencies.
This cuts both ways though; for example, libparquet also depends on openssl (presumably for the encryption facilities):
WARNING (libarrow,lib/libparquet.so.1300.0.0): Needed DSO lib/libcrypto.so.3 found in ['openssl']
WARNING (libarrow,lib/libparquet.so.1300.0.0): .. but ['openssl'] not in reqs/run, (i.e. it is overlinking) (likely) or a missing dependency (less likely)
(note that - despite the error message talking about reqs/run - this is still a question about host-dependencies, because if openssl is a host-dependency, the respective run-export will ensure it becomes a run-dependency as well).
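As a hedged illustration of how those two kinds of warnings translate into recipe changes (the package lists below are assumptions; the real sets must come from the link-check output):

```yaml
  - name: libarrow-dataset
    requirements:
      host:
        # keep only what libarrow_dataset.so actually links against;
        # drop the unused entries (bzip2, glog, snappy, ...) that the
        # link check flagged as overdepending
        - {{ pin_subpackage('libarrow', exact=True) }}

  - name: libparquet
    requirements:
      host:
        - {{ pin_subpackage('libarrow', exact=True) }}
        # openssl was flagged as overlinking: adding it to host lets its
        # run-export turn it into a run-dependency automatically
        - openssl
```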
- xsimd
- zlib
- zstd
- __cuda >=11.2  # [cuda_compiler_version != "None"]
@h-vetinari this was raising an error locally if I used __cuda >={{ cuda_compiler_version_min }}  # [cuda_compiler_version != "None"] here. The error when running python build-locally.py linux_64_cuda_compiler_versionNone:

RuntimeError: Received dictionary as spec. Note that pip requirements are not supported in conda-build meta.yaml. Error message: Invalid version '>=': invalid operator

Do you know of any issue where the global host dependency syntax would behave differently with this templating syntax?
edit: syntax
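If the root cause is that {{ cuda_compiler_version_min }} renders empty on non-CUDA variants (leaving a bare `__cuda >=` behind before the selector is applied), one possible workaround (an assumption, not verified against this feedstock) is to gate the line with jinja instead of relying only on the selector:

```yaml
        # hypothetical: emit the constraint only when the variable is
        # defined, so non-CUDA variants never render an invalid "__cuda >=" spec
{% if cuda_compiler_version != "None" %}
        - __cuda >={{ cuda_compiler_version_min }}
{% endif %}
```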
…w-flight, libarrow-flight-sql, libarrow-gandiva, libarrow-substrait and use them on pyarrow
Force-pushed from 0ecc9ee to d2f91ef
The dependencies are beginning to look better and better 👍 substrait is still missing a libprotobuf dependency, and pyarrow needs more (all?) of the libarrow subpackages. For finding these, I recommend opening the "raw log" in azure and then searching for
@h-vetinari I think I've fixed the different ones, but I am not sure why pyarrow fails with finding the different libarrow-related DSOs, for example: shouldn't the … and pyarrow, having both … and …, contain the DSOs?
It's enough to have the DSOs (and so the package will be functional), but the metadata is off, because pyarrow does not run-depend on the outputs that contain the DSOs, and so the metadata is incomplete. In this case, it wouldn't be an actual issue because pyarrow depends exactly on libarrow-all, which depends exactly on the right libs for the same build, but conda cannot assume that such transitive dependencies are actually ABI-safe. In any case, the solution is simple: we need to add run-exports for all the sub-libraries to
In turn, you'll be able to remove the run-dependence on libarrow-all.
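Those run-exports could be sketched like this (hypothetical; attaching them to libarrow-all and the `x.x.x` pin are my assumptions from the surrounding discussion):

```yaml
  - name: libarrow-all
    build:
      run_exports:
        # any package host-depending on libarrow-all then gets pinned
        # run-dependencies on the actual DSO-carrying outputs
        - {{ pin_subpackage('libarrow', max_pin='x.x.x') }}
        - {{ pin_subpackage('libarrow-acero', max_pin='x.x.x') }}
        - {{ pin_subpackage('libarrow-dataset', max_pin='x.x.x') }}
        # ... and likewise for the other sub-libraries
```

With that in place, pyarrow would keep libarrow-all in host only, and the run-dependencies on the individual libs would be injected automatically.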
@raulcd, I took your changes here and rebased them on main, plus slimmed down the refactor where possible: #1201 This should more-or-less be ready to go for arrow 14 (should be in the final rounds of polishing). However, I noticed one crucial thing w.r.t. #1035 - the new
Thanks @h-vetinari! And many thanks for your support while I was working on it!! I know there are some conversations about creating a libarrow-fs to split the filesystems away from Arrow core, which would allow us to remove
We have indeed chatted about that privately (and exactly with the motivation to be able to remove the large aws and libgoogle-cloud dependencies from the core libarrow package here). I opened apache/arrow#38309 to have a public record of this idea.
Initial split for libarrow, libarrow-acero, libarrow-dataset, libarrow-flight, libarrow-flight-sql, libarrow-gandiva, libarrow-substrait and use them on pyarrow.
Checklist
- Reset the build number to 0 (if the version changed)
- Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)

This is still under development; pending tasks are:
Closes #1035