Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS S3 IO through KvikIO #16499

Merged
merged 15 commits into from
Nov 7, 2024
Merged

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Aug 6, 2024

Implement remote IO read using KvikIO's S3 backend.

For now, this is an experimental feature for parquet read only. Enable by defining CUDF_KVIKIO_REMOTE_IO=ON.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@madsbk madsbk added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 6, 2024
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Aug 6, 2024
@madsbk madsbk force-pushed the kvikio-remote-io branch 2 times, most recently from dd1b46d to 55cc220 Compare August 12, 2024 12:52
@madsbk madsbk mentioned this pull request Aug 13, 2024
3 tasks
@github-actions github-actions bot removed the pylibcudf Issues specific to the pylibcudf package label Aug 22, 2024
@madsbk madsbk changed the base branch from branch-24.10 to branch-24.12 October 2, 2024 08:00
@github-actions github-actions bot added pylibcudf Issues specific to the pylibcudf package CMake CMake build issue labels Oct 2, 2024
@madsbk madsbk force-pushed the kvikio-remote-io branch 2 times, most recently from ad2e46d to ea983e1 Compare October 3, 2024 11:34
@github-actions github-actions bot removed the CMake CMake build issue label Oct 3, 2024
@madsbk madsbk force-pushed the kvikio-remote-io branch 2 times, most recently from 4f94346 to 22c0222 Compare October 3, 2024 17:32
@github-actions github-actions bot added the CMake CMake build issue label Oct 3, 2024
@madsbk madsbk force-pushed the kvikio-remote-io branch 3 times, most recently from d8e0c8d to a1a7ee1 Compare October 10, 2024 09:48
@madsbk madsbk force-pushed the kvikio-remote-io branch 3 times, most recently from 50f30c2 to f6c710a Compare October 22, 2024 10:29
*/
static bool is_supported_remote_url(std::string const& url)
{
// Regular expression to match "<s3|http|https>://"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering about long-term: If we're given an http or https URI, how would we tell whether that's on amazon s3 vs some other cloud service?

Copy link
Contributor

@pmattione-nvidia pmattione-nvidia Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. if users rely on https automatically working for s3, then we need to change how https works in the future, could their code break.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I have limited the support to s3:// urls for now.
KvikIO also supports regular http servers but let's address that in a later PR.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great stuff!
few small comments

cpp/src/io/utilities/datasource.cpp Outdated Show resolved Hide resolved
cpp/src/io/utilities/datasource.cpp Show resolved Hide resolved
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me!

cpp/src/io/utilities/datasource.cpp Show resolved Hide resolved
@madsbk
Copy link
Member Author

madsbk commented Nov 7, 2024

@rjzamora do you have anything else?

Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @madsbk (very exciting)!

@madsbk
Copy link
Member Author

madsbk commented Nov 7, 2024

Thanks all!

@madsbk
Copy link
Member Author

madsbk commented Nov 7, 2024

/merge

@rapids-bot rapids-bot bot merged commit 08e4853 into rapidsai:branch-24.12 Nov 7, 2024
105 checks passed
@madsbk madsbk deleted the kvikio-remote-io branch November 9, 2024 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants