A unified interface to handle file / folder / bucket operations, plus checking for validity of a request. See #73 for some history on why we think this would be useful.
In going through #69, @graft and I hashed out several use cases, pitfalls, etc., and I'll try to consolidate that thinking into this issue.
First of all, most of the thought process was driven by a bulk-action scenario, in which a user or script might be loading thousands of files into Metis and then wanting to attach that data to records through Magma. The steps might be:

1. Upload tons of files into multiple buckets on Metis (e.g. migrating from S3, or a lab's already-analyzed data).
2. Update Magma with metadata records.
3. Copy each of the uploaded files to the appropriate Magma record.
On the Metis side, when doing a bulk operation like the above, we would like to check the validity of every operation before executing any of them, so that the batch only runs if all operations are valid. This all-or-nothing behavior prevents data mismatch between Magma and Metis, where some records get updated and files copied while others do not, and it means Magma won't update its file information prematurely.
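A minimal sketch of that all-or-nothing flow, assuming hypothetical revision objects with `validation_errors` and `execute!` methods (none of these names are actual Metis code):

```ruby
# Illustrative sketch only -- apply_bulk_revisions, validation_errors and
# execute! are hypothetical names, not actual Metis code.
def apply_bulk_revisions(revisions)
  # Validate every revision up front.
  errors = revisions.flat_map(&:validation_errors)

  # All-or-nothing: if any single revision is invalid, nothing runs, so
  # Magma and Metis never end up half-updated.
  return errors unless errors.empty?

  revisions.each(&:execute!)
  []
end
```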
Shared validation code
Many of the Metis operations perform validations on the action, including things like:

- Is the user authorized to perform the action? e.g. the user has access to the bucket.
- Does the state of the file or folder allow the action? e.g. a read-only file cannot be deleted.
- Is the user proposing a valid action? e.g. the file path exists, the bucket exists, etc.
We should be able to create a low-level operations interface that performs these checks and can also be shared across all of the Metis actions.
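One possible shape for that interface, as a sketch; every class, method, and column name here is an assumption for illustration, not existing Metis code:

```ruby
# Hypothetical shared low-level validator covering the three checks above.
class RevisionValidator
  def initialize(revision, user, hmac)
    @revision = revision
    @user = user
    @hmac = hmac
  end

  # Returns a list of error strings; an empty list means the revision is valid.
  def errors
    errs = []
    errs << "Forbidden bucket: #{@revision.bucket_name}" unless authorized?
    errs << "File is read-only: #{@revision.source_path}" if read_only_violation?
    errs << "Invalid path: #{@revision.source_path}" unless path_exists?
    errs
  end

  private

  # Is the user authorized to act on this bucket (project role or e-mail access)?
  # The allowed?(user, hmac) signature is a guess based on the description below.
  def authorized?
    @revision.bucket.allowed?(@user, @hmac)
  end

  # Does the state of the file or folder permit this action?
  def read_only_violation?
    @revision.destructive? && @revision.source_file&.read_only?
  end

  # Does the proposed source path actually exist in the bucket?
  def path_exists?
    !@revision.source_file.nil? || !@revision.source_folder.nil?
  end
end
```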
Performance
However, these bulk operations could span multiple source / destination buckets and folders, to an arbitrary depth of folder hierarchy. Validating these operations can be very expensive, especially when it means recursively checking a nested folder hierarchy, so we want to minimize that cost as much as possible.
Authorizations
One of the challenges is that authorizations are handled at both the bucket and project levels, so we always need to check whether a user is allowed via `bucket.allowed?` -- we cannot fold this into the database query directly (maybe it could be done via a regex on access roles + e-mail?). A user might have access to a bucket based on project permissions, or based on e-mail access (like a collaborator). We have to pass the `hmac` from the request into this check.
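A rough sketch of what that check amounts to, based purely on the description above; `user.role`, `hmac.valid?`, and the comma-separated access column are assumptions standing in for whatever Metis actually stores:

```ruby
# Illustrative only -- not the real bucket.allowed? implementation.
def bucket_allowed?(bucket, user, hmac)
  # An hmac-signed request (e.g. from another trusted service) is allowed outright.
  return true if hmac&.valid?

  # The bucket's access list may hold project roles and/or e-mail addresses.
  access_list = bucket.access.to_s.split(',').map(&:strip)

  # Role-based access via the user's permissions on the bucket's project...
  return true if access_list.include?(user.role(bucket.project_name).to_s)

  # ...or direct e-mail access, e.g. a named collaborator.
  access_list.include?(user.email)
end
```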
Files and Folders
Furthermore, folders and files belong to buckets, so we have to fetch the bucket for each revision path before we can fetch any further objects. Folder hierarchies are essentially node graphs (each folder points to a single parent folder), and walking them recursively in the database can get expensive. We may want to consider switching to a node-graph datastore instead of a SQL database for this reason, since Metis doesn't have structured data the way Magma does. That is one consideration for future optimization.
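To make the cost concrete: resolving one path like `a/b/c/d` naively walks one level at a time, issuing one query per path segment. A sketch, where `Folder` stands in for a Sequel-style model and the `bucket_id` / `folder_id` / `folder_name` columns are assumptions:

```ruby
# One query per segment -- this is the per-path cost we want to avoid repeating.
def resolve_folder(bucket, path)
  parent_id = nil   # nil parent means the bucket's root
  folder = nil

  path.to_s.split('/').reject(&:empty?).each do |segment|
    folder = Folder.where(
      bucket_id: bucket.id,
      folder_id: parent_id,     # parent folder reference
      folder_name: segment
    ).first

    return nil if folder.nil?   # an intermediate folder is missing
    parent_id = folder.id
  end

  folder   # nil here means the path was empty, i.e. the bucket root itself
end
```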
Files and folders are two distinct entities, and you cannot just "get" a file without knowing its folder. So you have to fetch a file's folder (which might be the bucket's root folder) before fetching the file. You also don't know a priori whether a given path is a file or a folder -- you have to query the database to check whether it exists as either type.
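A sketch of that two-step lookup: resolve the containing folder first, then check whether the final path segment exists as a file or as a folder inside it. `MetisFile`, `Folder`, and their columns are illustrative stand-ins, not the real models:

```ruby
def resolve_path(bucket, path)
  *folder_segments, basename = path.split('/')

  parent = resolve_folder(bucket, folder_segments.join('/'))   # sketch above
  return nil if parent.nil? && !folder_segments.empty?         # missing parent folder
  parent_id = parent&.id                                       # nil => bucket root

  # A path is not known a priori to be a file or a folder, so both tables get checked.
  file = MetisFile.where(bucket_id: bucket.id, folder_id: parent_id, file_name: basename).first
  return [:file, file] if file

  folder = Folder.where(bucket_id: bucket.id, folder_id: parent_id, folder_name: basename).first
  return [:folder, folder] if folder

  nil
end
```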
Current approach
To minimize database hits (especially for finding the folder hierarchy), we are currently batch-parsing all of the revisions to extract the unique buckets and folder paths within them, and then setting those objects on each revision. It's unclear whether any further optimizations are possible, or whether we will run into performance issues with this approach, so we will deploy it and refactor later if needed.
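A sketch of that batching idea, with illustrative names only (`bucket_names`, `folder_paths`, and `set_objects` are hypothetical helpers): gather the distinct buckets and folder paths across the whole batch, look each one up once, then hand the preloaded objects to every revision before validation.

```ruby
def preload_revisions(revisions)
  bucket_names = revisions.flat_map(&:bucket_names).uniq
  buckets = bucket_names.map { |name| [name, Bucket.where(name: name).first] }.to_h

  folder_paths = revisions.flat_map(&:folder_paths).uniq   # [bucket_name, path] pairs
  folders = folder_paths.map do |bucket_name, path|
    [[bucket_name, path], resolve_folder(buckets[bucket_name], path)]
  end.to_h

  # Each revision now validates against in-memory objects instead of issuing
  # its own bucket and folder queries.
  revisions.each { |revision| revision.set_objects(buckets, folders) }
end
```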