[air] Pass on KMS-related kwargs for s3fs #35938
Signed-off-by: Kai Fricke <kai@anyscale.com>
option_map = {
    "ServerSideEncryption": "ServerSideEncryption",
    "SSEKMSKeyId": "SSEKMSKeyId",
    "GrantFullControl": "GrantFullControl",
It's hard to come up with a dynamic approach to cover all possible options here without churning the APIs too much. I'd like to revisit this when we think about custom upload logic and Syncer deprecation, and for now go with manually selected properties to forward.
Yeah, I agree that we should revisit this. I was searching for other possibilities and ACL is another param that can be passed into s3_additional_kwargs. Then, GCS also has a bunch of query params that users can specify.
Another thing to note: with the current Syncer abstraction, users could technically just create an s3fs client and customize all of these settings there.
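To illustrate that last point, a user-built s3fs client could be configured roughly as in the sketch below. It assumes s3fs is installed; the helper names and the key ID are hypothetical and not part of this PR. s3fs forwards s3_additional_kwargs to the underlying S3 API calls.

```python
from typing import Any, Dict, Optional


def build_s3_additional_kwargs(
    kms_key_id: Optional[str] = None, acl: Optional[str] = None
) -> Dict[str, str]:
    """Collect extra S3 request parameters (hypothetical helper)."""
    kwargs: Dict[str, str] = {}
    if kms_key_id:
        # "aws:kms" selects KMS-managed server-side encryption.
        kwargs["ServerSideEncryption"] = "aws:kms"
        kwargs["SSEKMSKeyId"] = kms_key_id
    if acl:
        kwargs["ACL"] = acl
    return kwargs


def make_s3fs_client(**options: Any):
    """Create an s3fs filesystem that sends the extra parameters with its requests."""
    import s3fs  # imported lazily so the helper above works without s3fs

    return s3fs.S3FileSystem(
        s3_additional_kwargs=build_s3_additional_kwargs(**options)
    )
```

A call such as make_s3fs_client(kms_key_id="my-key-id") would then return a client whose uploads request KMS encryption with that (placeholder) key.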
Oh, I see, the custom Syncer doesn't affect Checkpoint.to_uri at all.
Looks good to me!
I do have some questions about getting rid of the cache:
- Should we calculate some before/after timing of the benefit this cache actually had?
- Regarding getting rid of the cache being a "requirement for options to be parsed again": how do users actually pass in these options again? They only get to specify storage_path once at the beginning, so even if the KeyID or access_key gets refreshed, the query params would still be the same as before, and we'd run into the same problem. Am I missing something here?
I'll do some small-scale benchmarking before merge, but I don't expect any significant drawdowns as we call
It's mostly for calls to e.g. |
Before:
After:
Looks like no significant speed reduction.
Hm, some tests are failing because we re-create the mock filesystem. It looks like we need to at least cache that. A few options:
From these options I believe 1 is the best. We can keep the timeout relatively small (say a minute or so). I'll update the PR.
Ah I see, so mock:// is something provided by pyarrow, and it gets wiped if we don't cache it for tests.
We currently parse and pass only a limited selection of options to s3fs. One recent request concerned passing KMS settings. This PR extends the S3 URI string so that the signature version, SSE, SSE key ID, and ACLs can be configured in S3 URIs when s3fs is used. This PR also changes the filesystem caching logic, which is required for options to be parsed again, e.g. when a key ID changes between calls. Filesystem cache keys now include the query string, and cache items become stale after 5 minutes and are re-created. As a side effect, this should fix problems that come with cached filesystems, e.g. expiring credentials. Signed-off-by: Kai Fricke <kai@anyscale.com>
#35938 introduced a pyarrow typehint, though pyarrow is being lazily imported. Signed-off-by: Matthew Deng <matt@anyscale.com>
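One common way to keep a type hint for a lazily imported module is to import it only under typing.TYPE_CHECKING and quote the annotation. The sketch below illustrates that general pattern; it is not necessarily the fix applied in the follow-up, and describe_fs is a hypothetical function.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Evaluated by static type checkers only; skipped at runtime,
    # so pyarrow does not need to be importable here.
    import pyarrow.fs


def describe_fs(fs: Optional["pyarrow.fs.FileSystem"] = None) -> str:
    """Return a short description of the filesystem (hypothetical example)."""
    if fs is None:
        return "no filesystem"
    return type(fs).__name__
```

Because the annotation is a string, it is never evaluated at import time, so the module loads even when pyarrow is not installed.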
Why are these changes needed?
We currently parse and pass only a limited selection of options to s3fs. One recent request concerned passing KMS settings. This PR extends the S3 URI string so that the signature version, SSE, SSE key ID, and ACLs can be configured in S3 URIs when s3fs is used.
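As a minimal sketch of the idea, query parameters in an S3 URI can be translated into extra request parameters with a fixed mapping. The three mapping entries are the ones visible in the review diff; the function name, the example URI, and the assumption that the query parameter names match the mapping keys are illustrative only.

```python
from typing import Dict
from urllib.parse import parse_qs, urlparse

# Query parameter name -> S3 request parameter name. Only the entries shown
# in the review diff are listed; the real mapping may contain more.
OPTION_MAP = {
    "ServerSideEncryption": "ServerSideEncryption",
    "SSEKMSKeyId": "SSEKMSKeyId",
    "GrantFullControl": "GrantFullControl",
}


def s3_kwargs_from_uri(uri: str) -> Dict[str, str]:
    """Extract the forwarded options from a URI query string (hypothetical helper)."""
    query = parse_qs(urlparse(uri).query)
    return {
        target: query[source][0]  # use the first value if a key is repeated
        for source, target in OPTION_MAP.items()
        if source in query
    }


if __name__ == "__main__":
    # Placeholder bucket and key ID; unknown parameters are ignored.
    print(s3_kwargs_from_uri("s3://bucket/path?ServerSideEncryption=aws:kms&SSEKMSKeyId=abc"))
```

A fixed mapping like this forwards only manually selected properties, which matches the trade-off discussed in the review.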
This PR also changes the filesystem caching logic, which is required for options to be parsed again, e.g. when a key ID changes between calls. Filesystem cache keys now include the query string, and cache items become stale after 5 minutes and are re-created. As a side effect, this should fix problems that come with cached filesystems, e.g. expiring credentials.
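The caching behavior described here can be sketched as follows. This is an assumption-laden illustration rather than the PR's code: get_cached_fs, the create_fs factory, and the exact cache key layout are hypothetical; only the 5-minute staleness and the inclusion of the query string in the key come from the description.

```python
import time
from typing import Any, Callable, Dict, Tuple
from urllib.parse import urlparse

CACHE_TTL_S = 300.0  # cache items are stale after 5 minutes

# Cache key -> (creation timestamp, filesystem object).
_fs_cache: Dict[Tuple[str, str, str], Tuple[float, Any]] = {}


def get_cached_fs(
    uri: str,
    create_fs: Callable[[str], Any],
    now: Callable[[], float] = time.monotonic,
) -> Any:
    """Return a cached filesystem for the URI, re-creating it once stale."""
    parsed = urlparse(uri)
    # Including the query string means changed options yield a new filesystem.
    key = (parsed.scheme, parsed.netloc, parsed.query)
    current = now()
    entry = _fs_cache.get(key)
    if entry is not None and current - entry[0] < CACHE_TTL_S:
        return entry[1]
    fs = create_fs(uri)
    _fs_cache[key] = (current, fs)
    return fs
```

With this layout, two URIs on the same bucket with identical query strings share one filesystem object, while a changed key ID or an expired entry triggers a fresh one, which is also what would pick up refreshed credentials.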
Related issue number
Closes #35561
Might close #35519 (pending investigation)