[Python] Stop initializing s3 upon import #38364
I believe @pentschev has put together a patch demonstrating how this could be done.
@westonpace has worked on the problem of S3 initialization in the past, and it seems we can either init eagerly and avoid some issues, or init lazily and suffer bugs. Any solution here needs to consider the issues that led to the eager-init solution.
Reference: #33858
If Weston is able to review PR #38375, we would appreciate hearing his insights 🙂
### Rationale for this change

In accordance with #38364, we believe that for various reasons (shortening import time, preventing unnecessary resource consumption, and avoiding potential bugs in the S3 library) it is appropriate to avoid initializing S3 resources at import time and to defer that step to first use.

### What changes are included in this PR?

- Remove the calls to `ensure_s3_initialized()` that were until now executed during `import pyarrow.fs`;
- Move the `ensure_s3_initialized()` calls to the `python/pyarrow/_s3fs.pyx` module;
- Add a global flag to mark whether S3 has been previously initialized and the `atexit` handlers registered.

### Are these changes tested?

Yes, existing S3 tests check whether it has been initialized, failing with a C++ exception otherwise.

### Are there any user-facing changes?

No. The behavior is now slightly different, with S3 initialization no longer happening immediately after `pyarrow.fs` is imported, but no changes are expected for users relying on the public API alone.

**This PR contains a "Critical Fix".** A bug in aws-sdk-cpp, reported in aws/aws-sdk-cpp#2681, causes segmentation faults under specific circumstances when Python processes shut down, observed so far with Dask+GPUs (we have been unable to pinpoint the exact correlation between Dask, GPUs, and S3). While this definitely doesn't seem to affect all users and is not rooted in Arrow itself, it may affect use cases that do not depend on S3 at all, which is particularly problematic in CI, where all tests pass but the process crashes at shutdown.

* Closes: #38364

Lead-authored-by: Peter Andreas Entschev <peter@entschev.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pentschev Can you post a message on this issue, so that we can assign it to you?
Sure, please assign the issue to me, @pitrou.
The aws-sdk pinning is proving far too problematic to maintain. It causes conflicts in many environments because of its common usage across many other packages in the conda-forge ecosystem, which have since updated their pinnings to require versions newer than the 1.10.* that we've pinned to. This reversion will unblock most of RAPIDS CI. We will search for alternative fixes to the dask-cuda/distributed issues that we're observing (in particular, resolution of the underlying issues apache/arrow#38364 and aws/aws-sdk-cpp#2681).

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Ray Douglass (https://github.com/raydouglass)
- Bradley Dice (https://github.com/bdice)
- Lawrence Mitchell (https://github.com/wence-)
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #14319
Describe the enhancement requested
Currently pyarrow initializes the S3 filesystem when `pyarrow.fs` is imported. This leads to AWS consuming resources on startup that may never be used if the user is not actually taking advantage of that support. Ideally, S3 initialization would instead be deferred to first use, so that AWS does not spin up unnecessary threads or do work on pyarrow import.

Making this change would also sidestep a bug present in newer versions of aws-sdk-cpp that occasionally leads to segfaults simply from using the AWS APIs, at least for the majority of users who are not using the S3 filesystem.
Component(s)
C++, Python