
Low-level interface for operations #74

Open
coleshaw opened this issue May 20, 2020 · 0 comments

A unified interface to handle file / folder / bucket operations, plus validity checking for each request. See #73 for some history on why we think this would be useful.

In going through #69, @graft and I have worked through several use cases and their pitfalls, and I'll try to consolidate that thinking in this issue.

First of all, most of the thought process was driven by a bulk-action scenario, in which a user or script loads thousands of data entries into Metis in bulk and then wants to copy that data through Magma. The steps might be:

  1. Upload many files into multiple buckets on Metis (e.g. migrating from S3, or a lab's already-analyzed data).
  2. Update Magma with metadata records.
  3. Copy each of the uploaded files to the appropriate Magma record.

On the Metis side of things, when doing a bulk operation like the above, we want to check the validity of every operation before executing any of them: the batch runs only if all operations are valid. This all-or-nothing behavior prevents data mismatch between Magma and Metis -- where some records would be updated and files copied while others are not -- and ensures Magma never updates its file information prematurely.
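A minimal sketch of the all-or-nothing pattern, assuming a simple `Revision` value object that carries its own validation errors (the real Metis revision classes will differ):

```ruby
# Hypothetical Revision object; `errors` holds any validation failures.
Revision = Struct.new(:source, :dest, :errors) do
  def valid?
    errors.empty?
  end
end

def execute_bulk(revisions)
  # Validate every revision before touching any data.
  invalid = revisions.reject(&:valid?)
  unless invalid.empty?
    # Refuse the whole batch and report every error at once.
    return { executed: false, errors: invalid.flat_map(&:errors) }
  end
  # All revisions are valid -- perform the copies here.
  { executed: true, errors: [] }
end
```

Collecting all errors (rather than failing on the first) also gives the bulk caller a complete report in one round trip.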

Shared validation code

Many of the Metis operations perform validations on the requested action, including:

  • Is the user authorized to perform the action? (e.g. the user has access to the bucket.)
  • Does the state of the file or folder allow the action? (e.g. a read-only file cannot be deleted.)
  • Is the user proposing a valid action? (e.g. the file path exists, the bucket exists, etc.)

We should be able to create a low-level operations interface that performs these checks and can also be shared across all of the Metis actions.
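A hedged sketch of what a shared check could look like, covering the three validations above; the hashes and keys here are illustrative stand-ins, not Metis's actual models:

```ruby
# Hypothetical shared validation, returning a list of error strings.
# An empty list means the action may proceed.
def validate_action(user:, bucket:, file:, action:)
  errors = []
  # 1. Authorization: does the user have access to the bucket?
  unless bucket[:allowed_users].include?(user)
    errors << "#{user} cannot access bucket #{bucket[:name]}"
  end
  # 2. State: does the file's state permit this action?
  if action == :remove && file[:read_only]
    errors << "#{file[:path]} is read-only"
  end
  # 3. Validity: does the target actually exist?
  unless file[:exists]
    errors << "#{file[:path]} does not exist"
  end
  errors
end
```

Because each action shares this one entry point, a new check (or a fix to an existing one) lands in every operation at once.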

Performance

However, these bulk operations could span multiple source / destination buckets and folders, to an arbitrary depth of folder hierarchy. Validating these operations can be very expensive, especially when recursively checking a nested folder hierarchy, so we want to minimize that cost as much as possible.

Authorizations

One of the challenges is that authorizations are handled at both the bucket and project levels, so we always need to check whether a user is allowed via bucket.allowed? -- we cannot fold the check into the database query directly (though maybe we could via a regex on access roles + e-mail?). A user might have access to a bucket based on project permissions, or based on e-mail access (like a collaborator). We also have to pass in the hmac from the request for this check.
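The two authorization paths described above might be sketched like this; the data shapes (`:access_roles`, `:access_emails`) are assumptions for illustration, not Metis's schema:

```ruby
# Hypothetical allowed? check: a user passes if EITHER their project
# role matches one of the bucket's access roles, OR their e-mail was
# granted access directly (e.g. a collaborator).
def bucket_allowed?(bucket, user_email, user_project_roles)
  role_ok  = (bucket[:access_roles] & user_project_roles).any?
  email_ok = bucket[:access_emails].include?(user_email)
  role_ok || email_ok
end
```

The OR between two differently-sourced grants is what makes this hard to express as a single SQL predicate.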

Files and Folders

Furthermore, folders and files belong to buckets, so we have to fetch the bucket for each revision path before we can fetch any further objects. Folder hierarchies are essentially node graphs -- each folder points to a single parent folder -- and this recursive database search can get expensive. Since Metis doesn't have structured data like Magma does, we may want to consider switching to a node-graph datastore instead of a SQL database. That is one consideration for future optimization.

Files and folders are two distinct entities, and you cannot just "get" a file without knowing its folder. So you have to fetch a file's folder (which might be the bucket's root folder) before fetching the file. You also don't know if a given path is a file or folder a priori -- you have to query the database to check if it exists as a given type.
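An illustrative resolver for that lookup order: walk the hierarchy from the bucket's root to the containing folder, then check whether the final path segment exists as a file or as a folder. The nested-hash layout is an assumption for the sketch, standing in for the per-level database queries:

```ruby
# Hypothetical path resolution. Returns [:file, obj], [:folder, obj],
# or nil if any segment of the path does not exist.
def resolve(bucket, path)
  *parents, name = path.split("/")
  folder = bucket[:root]
  # Each hop here corresponds to a database fetch in the real system.
  parents.each do |part|
    folder = folder[:folders][part]
    return nil unless folder # broken path: no such folder
  end
  # The last segment's type is unknown a priori -- try both.
  return [:file, folder[:files][name]] if folder[:files].key?(name)
  return [:folder, folder[:folders][name]] if folder[:folders].key?(name)
  nil
end
```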

Current approach

To minimize database hits (especially when finding the folder hierarchy), we currently batch-parse all of the revisions to extract the unique buckets and folder paths across them, then set those objects on each revision. It's unclear whether further optimizations are possible, or whether this approach will run into performance issues, so we will deploy and refactor later if needed.
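The batching step above can be sketched as follows, assuming revision paths of the form "bucket_name/folder/.../file" (an assumption for illustration):

```ruby
# Hypothetical batch extraction: one pass over all revision paths yields
# the unique bucket names and folder paths, so each set can be fetched
# in a single query instead of one lookup per revision.
def batch_keys(revision_paths)
  buckets = revision_paths.map { |p| p.split("/").first }.uniq
  folder_paths = revision_paths
    .map { |p| p.split("/")[1...-1].join("/") } # drop bucket and file name
    .reject(&:empty?)                           # files at the bucket root
    .uniq
  { buckets: buckets, folder_paths: folder_paths }
end
```

With thousands of revisions sharing a handful of buckets and folders, deduplicating first collapses the query count from O(revisions) to O(unique buckets + folders).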
