[Python] Stop initializing s3 upon import #38364
I believe @pentschev has put together a patch demonstrating how this could be done.
@westonpace has worked on the problem of S3 initialization in the past, and it seems we can either init eagerly and avoid some issues, or init lazily and suffer bugs. Any solution here needs to consider the issues that led to the eager-init solution.
Reference: #33858
If Weston is able to review PR #38375, we would appreciate hearing his insights 🙂
### Rationale for this change

In accordance with #38364, we believe that for various reasons (shortening import time, preventing unnecessary resource consumption, and avoiding potential bugs in the S3 library) it is appropriate to avoid initializing S3 resources at import time and to defer that step to first use.

### What changes are included in this PR?

- Remove the calls to `ensure_s3_initialized()` that were until now executed during `import pyarrow.fs`;
- Move the `ensure_s3_initialized()` calls to the `python/pyarrow/_s3fs.pyx` module;
- Add a global flag to mark whether S3 has been previously initialized and the `atexit` handlers registered.

### Are these changes tested?

Yes, existing S3 tests check whether it has been initialized, failing with a C++ exception otherwise.

### Are there any user-facing changes?

No. The behavior is now slightly different, with S3 initialization no longer happening immediately after `pyarrow.fs` is imported, but no changes are expected for users relying on the public API alone.

**This PR contains a "Critical Fix".** A bug in aws-sdk-cpp, reported in aws/aws-sdk-cpp#2681, causes segmentation faults under specific circumstances when Python processes shut down, observed so far with Dask+GPUs (we have been unable to pinpoint the exact correlation between Dask, GPUs, and S3). While this definitely doesn't seem to affect all users and is not rooted in Arrow itself, it may affect use cases that do not depend on S3 at all, which is particularly problematic in CI, where all tests pass but the process crashes at shutdown.

* Closes: #38364

Lead-authored-by: Peter Andreas Entschev <peter@entschev.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pentschev Can you post a message on this issue, so that we can assign it to you?
Sure, please assign the issue to me, @pitrou.
The aws-sdk pinning is proving far too problematic to maintain. It causes conflicts in many environments because of its common usage across many other packages in the conda-forge ecosystem, which have since updated their pinnings to require versions newer than the 1.10.* that we've pinned to. This reversion will unblock most of RAPIDS CI. We will search for alternative fixes to the dask-cuda/distributed issues that we're observing (in particular, resolution of the underlying issues apache/arrow#38364 and aws/aws-sdk-cpp#2681).

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Ray Douglass (https://github.com/raydouglass)
- Bradley Dice (https://github.com/bdice)
- Lawrence Mitchell (https://github.com/wence-)
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #14319
Describe the enhancement requested
Currently pyarrow initializes the S3 filesystem when `pyarrow.fs` is imported. This leads to AWS consuming resources on startup that may never be used if the user is not actually taking advantage of that support. Ideally, S3 initialization would instead be deferred to first use, so that AWS does not spin up unnecessary threads or do work on pyarrow import.

Making this change would also sidestep a bug present in newer versions of aws-sdk-cpp that occasionally leads to segfaults simply from using the AWS APIs, at least for the majority of users who are not using the S3 filesystem.
Component(s)
C++, Python