Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3: force get_object to return Vector{UInt8}? #224

Closed
christopher-dG opened this issue Oct 17, 2020 · 9 comments
Closed

S3: force get_object to return Vector{UInt8}? #224

christopher-dG opened this issue Oct 17, 2020 · 9 comments
Labels
enhancement v2 Issue tracking before the release of v2

Comments

@christopher-dG
Copy link
Member

The return type of get_object seems to depend on what type of file you're looking for. In the case of strings vs bytes it's not so bad, but IMO getting a Dict or Vector for a JSON file is not great, since you lose all formatting (and who says I want it parsed anyways?). Is this behaviour intended? Is there a way to always get a byte vector?

julia> using AWS: @service

julia> @service S3;

julia> S3.get_object("cdg-scratch", "x.txt")
 "hi"

julia> S3.get_object("cdg-scratch", "x.png")
 1245160-element Array{UInt8,1}:
 

julia> S3.get_object("cdg-scratch", "x.json")  # {}
 OrderedCollections.LittleDict{String,Any,Array{String,1},Array{Any,1}}()

julia> S3.get_object("cdg-scratch", "y.json")  # []
 Any[]
@mattBrzezinski
Copy link
Member

This functions the same way as AWSCore.jl does, you do have some control over how the response is returned;

response_dict_type::Type{<:AbstractDict}=LittleDict

Maybe we change the response_dict_type::Type{<:AbstractDict} to response_type::Type and then give users free reign over how they want things returned, defaulting to what already exists?

@christopher-dG
Copy link
Member Author

IMO the "right" default is to return the raw file contents, but it's less disruptive to default to the existing method. But I think changing response_dict_type -> response_type is a good move!

@iamed2
Copy link
Member

iamed2 commented Oct 28, 2020

Boto3 returns a dict with metadata and the readable stream under the 'Body' key. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_object

What's the source of the difference here?

@iamed2
Copy link
Member

iamed2 commented Oct 28, 2020

Found the difference: https://github.com/boto/botocore/blob/develop/botocore/data/s3/2006-03-01/service-2.json#L3860

Boto3 respects the input shape and output shape specifications, while AWS.jl applies a heuristic to determine how to return data. In the long term, we can also use the shape specifications. In the short term, I think AWSS3.jl should be made able to provide the interface that you want, and AWS.jl should just need to be configurable to facilitate that.

@c42f
Copy link
Contributor

c42f commented Nov 6, 2020

Yeah, I don't think AWS.jl is the right place to have fancy heuristics for the return type. IMO a higher layer should deal with that.

Boto3 respects the input shape and output shape specifications

Forgive my ignorance but what's the "shape" refer to? Is it the set of keys ContentType, ContentEncoding etc?

To the extent that the MIME type, etc, is stored explicitly on the S3 side, interpreting that into a Julia type seems somewhat reasonable. But I think there's many problems in mapping serialized data to an in-memory representation, primarily because there's a many-to-many mapping between the serialized form and the julia type system.

Actually I've been thinking very hard about this exact problem in DataSets.jl: https://github.com/JuliaComputing/DataSets.jl and I think I'm closing in on some kind of solution there. (It involves recognizing that data on disk has a kind of ad-hoc structural type system, that this can be mapped into Julia types in many ways, and that the mapping involves many choices which could have defaults but must ultimately be configurable.)

Back to the issue at hand, I think AWS.jl can't solve the mapping problem systematically unless it wants to depend on the wide variety of libraries used for parsing all the various MIME types (eg, I assume we'll never depend on PNGFiles.jl just so we can parse image/png).

So I think it'll always be limited to a few "special" types like JSON. But in that case, it's better to default to Vector{UInt8}, and have a way to turn on heuristics or interpretation of the MIME type only on demand.

@mattBrzezinski
Copy link
Member

Yeah, I don't think AWS.jl is the right place to have fancy heuristics for the return type. IMO a higher layer should deal with that.

Boto3 respects the input shape and output shape specifications

Forgive my ignorance but what's the "shape" refer to? Is it the set of keys ContentType, ContentEncoding etc?

The aws-sdk-js defines various input / output shapes such as, https://github.com/aws/aws-sdk-js/blob/master/apis/s3-2006-03-01.normal.json#L4549

@mattBrzezinski
Copy link
Member

Just wanted to pop-in and acknowledge this has not been forgotten. It seems that the proposed change of response_dict_type -> response_type is well favoured by the community.

At a future date I can make this change, unless someone else would like to take the lead. In which case I can advise and do the code review on it going forward. It should be a relatively simple change to make.

@laborg
Copy link
Contributor

laborg commented Mar 30, 2021

In the context of this issue this might help others: Specify "return_raw" as true to circumvent special casing of e.g. JSON files:

S3.get_object("bucket","file.json",Dict("return_raw"=>true) # returns a Vector{UInt8}

@mattBrzezinski mattBrzezinski added the v2 Issue tracking before the release of v2 label Apr 26, 2021
bors bot added a commit that referenced this issue Sep 29, 2021
384: Use `AWS.Response` to handle streaming/raw/parsing r=omus a=omus

The idea is to use a struct in AWS.jl that can be used to handle the automatic parsing that is currently used to turn XML/JSON into a dict while also giving the option of accessing the raw output as a string or stream without all the keywords currently needed to be specified.

Depends on:
- #457

Related:
- #346
- #348
- #438

Closes:
- #224
- #433
- #468 (when using `use_response_type` the passed in I/O is not closed)
- #471
- #433

Update: The tests in this PR run using the deprecated behaviour. Mainly I did that here to prove this change is non-breaking. For the updated tests see #458 

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
@omus
Copy link
Member

omus commented Sep 29, 2021

As of AWS.jl 1.63.0 (#384) you can now do:

@service S3 use_response_type=true
response = S3.get_object("bucket","file.json")
response.body::Vector{UInt8}

@omus omus closed this as completed Sep 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement v2 Issue tracking before the release of v2
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants