Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Valkyrie::StorageAdapter::File - insert a higher abstraction to support cloud providers? #954

Open
dchandekstark opened this issue May 2, 2024 · 20 comments
Labels

Comments

@dchandekstark
Copy link
Member

I am working on a storage adapter for S3 (in a somewhat different fashion from https://github.com/stkenny/valkyrie-storage-s3), and I find that Valkyrie::StorageAdapter::File feels awkward in a couple of spots. I wonder whether we could insert a higher level abstraction above it to support find_by in a more "natural" way for cloud providers.

First, #disk_path IMO extends the responsibility of V::SA::F too far, and StreamFile feels too opinionated (and appears to create but not clean up temp files?). Likewise rewind, close, and read seem bound to Ruby's IO class -- nothing wrong with that as such, but if you're dealing with a cloud resource, it's not really "on disk" and it's not really an IO either. What I might suggest for a higher level abstraction is a class that minimally implements #each to yield chunks of the file data; I could also see #stream to get an IO. #size is probably fine too, fwiw.

Here's a partial illustration, where the io used to instantiate the class is an instance of Aws::S3::Object

def each
  io.get do |chunk, _headers|
    yield chunk
  end
end

# @return [IO]
def stream
   io.get.body
end

Interested in any and all feedback. Thanks!

@tpendragon
Copy link
Collaborator

@dchandekstark Do you have a reference library that handles file abstraction the way you're proposing? We've leaned on IO because it's how folks expect to handle files, and on #disk_path because it's often just so necessary to be able to get that file locally for derivative processing.

I'm interested in StreamFiles not cleaning up, that wasn't the intention if it's true. A brief look at the code does look like that's the case though - folks would have to clear out that tmp dir occasionally.

@tpendragon
Copy link
Collaborator

@dchandekstark Have you looked at valkyrie-shrine for S3 support? That's largely what we've done.

@dchandekstark
Copy link
Member Author

@dchandekstark Have you looked at valkyrie-shrine for S3 support? That's largely what we've done.

@tpendragon I haven't. At this stage, I'm looking for the "simplest thing that could possibly work"(tm), but I'll definitely check it out. I actually contemplated trying to use ActiveStorage in a more or less similar way, but it felt over-engineered -- like I should just switch to using ActiveStorage instead of trying to integrate it into Valkyrie as another storage adapter.

@tpendragon
Copy link
Collaborator

@dchandekstark It's pretty simple. In figgy we have this:

https://github.com/pulibrary/figgy/blob/c575619bfeb42a35ed0545644b21034214f62a75/config/initializers/valkyrie.rb#L129-L145

then that adapter works just like the disk one.

@dchandekstark
Copy link
Member Author

@dchandekstark Do you have a reference library that handles file abstraction the way you're proposing? We've leaned on IO because it's how folks expect to handle files, and on #disk_path because it's often just so necessary to be able to get that file locally for derivative processing.

I'm thinking of something along the lines of the Rack spec? -- not that I'm well versed in it -- but at least the low-level notion of an enumerable.

I get the disk_path thing, but if stored file doesn't have a path natively, then I think it's the responsibility of the party that wants a path to download the data. Where and how that needs to happen can vary, and I don't really want to embed the logic into the adapter.

@tpendragon
Copy link
Collaborator

Where and how that needs to happen can vary, and I don't really want to embed the logic into the adapter.

I get that, but the motivation is that I want to suddenly be able to decide "Actually I don't want to be on disk, I want to be in S3" and have that be as easy and painless a transition as possible.

@dchandekstark
Copy link
Member Author

I get that, but the motivation is that I want to suddenly be able to decide "Actually I don't want to be on disk, I want to be in S3" and have that be as easy and painless a transition as possible.

That's fair. I do wish the temp file stuff in StreamFile could be addressed, but I'm not sure how without impacting disk_path in a way that undermines the purpose of it. It's sort of indeterminate how long the path is supposed to persist.

@tpendragon
Copy link
Collaborator

We could use Tempfile maybe, or disk_path could accept a block and clean up after itself when the block ends 🤔

@dchandekstark
Copy link
Member Author

We could use Tempfile maybe, or disk_path could accept a block and clean up after itself when the block ends 🤔

Tempfile + optional block would work I think.

@dchandekstark
Copy link
Member Author

@tpendragon One more consideration on disk_path: I want to use x-sendfile to have my web server stream files that are actually on locally mounted file systems. In that case, I don't want to inadvertently download the S3 file to disk and then send it. Is there a workaround?

@cgalarza
Copy link
Contributor

cgalarza commented May 2, 2024

Hello! I thought I would chime in because I recently ran into a problem where I was using disk_path and had assume that it was a temp file that would get cleaned up on its own, but it didn't and then my server ran out of space, etc etc. I think part of my issue occurred because we deploying in a containerized environment and can't depend on using a tmp directory that a server might cleanup (but I am far from an expert on this).

I ended up fixing the problem by creating a method just like @tpendragon described. It accepts a block and cleans up once the block ends. At the time spent some time reading the TempFile documentation and it seemed like it is really encouraging manually closing/removing a TempFile. Another possible option I considered was creating two methods, one to create a temp file and another one to close it.

My suggestion would be keep disk_path as is, but add a method to explicitly close the file and add an additional method to take a block.

Just my two cents.

@tpendragon
Copy link
Collaborator

tpendragon commented May 2, 2024

@tpendragon One more consideration on disk_path: I want to use x-sendfile to have my web server stream files that are actually on locally mounted file systems. In that case, I don't want to inadvertently download the S3 file to disk and then send it. Is there a workaround?

That's interesting! We do this for disk files and it does fail for our cloud ones. I think you could either proxy through nginx or apache or whatever for these, or I've considered adding a download_url to StreamFile that no-ops - I think Shrine does something similar?

@tpendragon
Copy link
Collaborator

I ended up fixing the problem by creating a method just like @tpendragon described. It accepts a block and cleans up once the block ends. At the time spent some time reading the TempFile documentation and it seemed like it is really encouraging manually closing/removing a TempFile. Another possible option I considered was creating two methods, one to create a temp file and another one to close it.

Ahhh this is great, good work @cgalarza !

@tpendragon
Copy link
Collaborator

I think I'm gonna look at adding a block to disk_path and make StreamFile clean up using it, and make close on StreamFile ensure any local tmp file is deleted.

@tpendragon
Copy link
Collaborator

@dchandekstark @cgalarza How's #955 look?

@tpendragon
Copy link
Collaborator

tpendragon commented May 2, 2024

@dchandekstark Years ago I converted Figgy to cloud only as an experiment and I remember doing a thing for Downloading - pulibrary/figgy@main...figgy_in_the_cloud might be useful to you?

(We've since decided to keep files local, but it was a good experiment in how flexible Valkyrie really is 😀 )

@dchandekstark
Copy link
Member Author

dchandekstark commented May 2, 2024

@dchandekstark Years ago I converted Figgy to cloud only as an experiment and I remember doing a thing for Downloading - pulibrary/figgy@main...figgy_in_the_cloud might be useful to you?

Yes. It also illustrates (in the respond_to? tests) that path is not always (easily) available.

@dchandekstark
Copy link
Member Author

@dchandekstark @cgalarza How's #955 look?

@tpendragon Seems like a good start! I do confess a little concern that explicit temp file naming is potentially unsafe (e.g., what happens when separate threads/processes act on the same file?).

@dchandekstark
Copy link
Member Author

@tpendragon I feel that I should add to this thread that I appreciate what Valkyrie provides with respect to storage. It's simple, yet effective. I can see the attraction to Shrine and even ActiveStorage, but you don't always need or want that much extra functionality associated with your storage capability. With that said, I thought I would try to fit a cloud provider (in this case S3) into the Valkyrie framework more or less directly and adhere to the storage adapter contracts.

@tpendragon
Copy link
Collaborator

I'm all for more options! Looking forward to it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants