Commit

Merge pull request #1707 from craddm/multi-prov
Add additional multiple data provider guidance to docs
JimMadge authored Feb 2, 2024
2 parents f5a954c + 9ae01d9 commit 22325a0
Showing 3 changed files with 33 additions and 0 deletions.
9 changes: 9 additions & 0 deletions docs/source/processes/data_ingress.md
@@ -29,6 +29,15 @@ If data sets consist of multiple files, collect them in uniquely named directori

If there are multiple data providers uploading data for a single work package, each provider should use a uniquely named directory, or prepend their files with a unique name.

### Avoiding data leakage

If all data providers are uploading to the same storage container, then they may be able to see the files uploaded by other data providers.

Although they will not be able to access or download these files, a potential issue is that sensitive information may be visible in either the file names or directory structure of the uploaded data.

If possible, data providers should avoid the use of any identifying information in the filenames or directory structure of the data that they upload.
This is not always possible, since some data providers may require identifying information to be part of filenames or directory structures.
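
As an illustration of the advice above, a data provider could replace identifying file names with opaque ones before upload and keep the mapping privately. The following is a minimal sketch, not part of the Data Safe Haven tooling: the function names `opaque_name` and `build_mapping`, the secret key, and the example paths are all hypothetical. It flattens the directory structure and keeps only the file extension, which is itself an assumption that the extension is not sensitive.

```python
import hashlib
import hmac
from pathlib import PurePosixPath


def opaque_name(path: str, key: bytes) -> str:
    """Derive a non-identifying file name from a path (hypothetical helper).

    A keyed hash (HMAC-SHA256) is used so that someone who can list the
    container cannot confirm a guessed name without knowing the key.
    """
    suffix = PurePosixPath(path).suffix  # keep only the extension, e.g. ".csv"
    digest = hmac.new(key, path.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{digest[:16]}{suffix}"


def build_mapping(paths: list[str], key: bytes) -> dict[str, str]:
    """Map each original path to its opaque upload name.

    The mapping should stay with the data provider; only the opaque
    names are uploaded.
    """
    mapping = {path: opaque_name(path, key) for path in paths}
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("opaque names collide; use a longer digest prefix")
    return mapping
```

For example, `build_mapping(["clinic_smith/patient_0042.csv"], key)` would return a single entry whose value is a 16-character hexadecimal string followed by `.csv`, with no trace of the original directory or file name.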

### Describe the data

Explaining the structure and format of the data will help researchers be most effective.
2 changes: 2 additions & 0 deletions docs/source/roles/project_manager/data_ingress.md
@@ -28,3 +28,5 @@ If ingress of new data would change the classification of a project, we suggest

At the end of this process they should have classified the work package into one of the Data Safe Haven security tiers.
Follow the guide to [data ingress](data_ingress.md) to bring all necessary code and data into the secure research environment.

If there are multiple data providers, please see the guidance for {ref}`roles_system_manager_multiple_providers`.
22 changes: 22 additions & 0 deletions docs/source/roles/system_manager/manage_data.md
@@ -34,6 +34,28 @@ The following steps show how to generate a temporary write-only upload token tha
- The data provider should now be able to upload data by following {ref}`these instructions <process_data_ingress>`
- You can validate successful data ingress by logging into the SRD for the SRE and checking the `/data` volume, where you should be able to view the data that the data provider has uploaded

(roles_system_manager_multiple_providers)=

## Multiple data providers

In some projects, there may be more than one data provider responsible for uploading data. Two potential issues that may occur are _name clashes_ and _data leakage_.

If all data providers are uploading to the same storage container, then name clashes may occur. There is no protection against overwriting files during upload.
Thus, if more than one data provider uploads files with the same path, then the earlier upload will be overwritten.
This can be avoided by providing each data provider with their own subfolder on the storage container and ensuring that each uploads only to their subfolder.
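
The subfolder approach can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `namespaced_path` and `find_clashes` and the provider names are hypothetical, and the code works on lists of intended upload paths rather than on the storage container itself.

```python
from pathlib import PurePosixPath


def namespaced_path(provider: str, relative_path: str) -> str:
    """Place a file under the given provider's own subfolder (hypothetical helper)."""
    rel = PurePosixPath(relative_path)
    # Reject paths that could escape the provider's subfolder.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"unsafe upload path: {relative_path!r}")
    return str(PurePosixPath(provider) / rel)


def find_clashes(uploads: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return every destination path claimed by more than one provider."""
    owners: dict[str, list[str]] = {}
    for provider, paths in uploads.items():
        for path in paths:
            owners.setdefault(str(PurePosixPath(path)), []).append(provider)
    return {path: who for path, who in owners.items() if len(who) > 1}
```

If two providers both intend to upload `data/records.csv` to the container root, `find_clashes` reports that path with both provider names. Once each path has been passed through `namespaced_path`, the destinations become `provider_a/data/records.csv` and `provider_b/data/records.csv`, and no clash remains.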

If all data providers are uploading to the same storage container, then they may be able to see the files uploaded by other data providers.
Although they will not be able to access or download these files, a potential issue is that sensitive information may be visible in either the file names or directory structure of the uploaded data.

If possible, data providers should avoid using any identifying information in the filenames or directory structure of the data that they upload.
This is not always possible, since some data providers may require identifying information to be part of filenames or directory structures.

An alternative is to provide separate storage containers for upload for each data provider.
These containers should have all the same access restrictions as used for a single ingress storage container.

After the data has been uploaded, the {ref}`role_system_manager` can transfer the uploaded data to a single storage container accessible to {ref}`role_researcher`s from the relevant SRE, as per the normal data ingress process.
The data-provider-specific containers should be deleted once the data has been transferred.

(roles_system_manager_software_ingress)=

## Software Ingress