Merge pull request #702 from uc-cdis/feat/dbgap-options
Feat/dbgap options
Avantol13 authored Oct 22, 2019
2 parents aeed885 + 0d6fad0 commit eccddc1
Showing 6 changed files with 463 additions and 65 deletions.
69 changes: 69 additions & 0 deletions docs/dbgap_info.md
@@ -0,0 +1,69 @@
# dbGaP Information (as understood by Gen3)

The [Database for Genotypes and Phenotypes (dbGaP)](https://www.ncbi.nlm.nih.gov/gap/) is used to "archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in Humans".

> NOTE: For official details about dbGaP please visit the official site/documentation.

The largest unit of data that can be submitted to dbGaP is a *Study*. Studies can have sub-studies. Each study is identified by a unique study number (AKA phsid AKA study accession) and additional information (like version), which may look something like `phs001826.v1.p1.c1`. The `.` delimits the various pieces of information:

* `phs001826`: unique study identifier
* `v1`: data version
* `p1`: participant set version
* `c1`: consent group

The combination of these fields is known as a *dbGaP Accession Number*.

More information about this can be found in [this NCBI article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2031016/).
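
As a rough illustration of the field layout described above, an accession number can be split on `.` and each piece identified by its one-letter prefix. This is only a sketch, not Fence code; the helper name `parse_accession` and the returned dictionary keys are assumptions made for this example.

```python
from typing import Dict, Optional


def parse_accession(accession: str) -> Dict[str, Optional[str]]:
    """Split a dbGaP accession such as 'phs001826.v1.p1.c1' into named parts.

    Hypothetical helper for illustration; missing parts are returned as None.
    """
    parts = accession.split(".")
    fields: Dict[str, Optional[str]] = {
        "phsid": parts[0],
        "version": None,
        "participant_set": None,
        "consent": None,
    }
    # The remaining pieces are identified by their one-letter prefix.
    prefixes = {"v": "version", "p": "participant_set", "c": "consent"}
    for part in parts[1:]:
        name = prefixes.get(part[:1])
        if name is not None:
            fields[name] = part
    return fields


print(parse_accession("phs001826.v1.p1.c1"))
```

Shortened forms such as `phs001826.c1` are handled the same way; the version and participant set are simply left as `None`.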

## Authorization

Fence is capable of syncing user access information from dbGaP's *Telemetry Files* (AKA study whitelists). These are typically provided on an SFTP server as CSV or TXT files. You can see an example of the format in the Fence unit tests (`/tests/dbgap_sync`).

A single *Telemetry File* represents the access allowed for a given dbGaP study accession. The file contains rows of eRA Commons user IDs and other information. Given credentials for the SFTP server, Fence can parse all *Telemetry Files* stored on it.

## Consent Groups

Fence contains a configuration for whether or not to parse consent codes (at the time of writing, it is `parse_consent_code` in the `dbGaP` block).

> NOTE: Reference the `config-default.yaml` for current configuration options and further details.

When parsing consent codes, the authorization resources a user is given access to will be in the form `study_id.consent_group` (ex: `phs001826.c2`).
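
The effect of this option can be sketched as follows. This is a simplified illustration rather than the Fence implementation; the function name `authz_resource` is hypothetical, and it assumes (as the sync code in this commit does) that the consent group is the last `.`-separated piece of the accession.

```python
def authz_resource(accession: str, parse_consent_code: bool = True) -> str:
    """Return the resource name derived from a dbGaP accession.

    Hypothetical helper: keeps only the study ID, plus the consent group
    when consent parsing is enabled. Version and participant set are dropped.
    """
    parts = accession.split(".")
    study_id = parts[0]
    if parse_consent_code and len(parts) > 1:
        # Assumption: the last piece is the consent group (e.g. "c2").
        return f"{study_id}.{parts[-1]}"
    return study_id


print(authz_resource("phs001826.v1.p1.c2"))         # consent parsing on
print(authz_resource("phs001826.v1.p1.c2", False))  # consent parsing off
```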

### Consent Group `c999` Handling

The consent group `c999` is interpreted as meaning the user should implicitly have access
to **all** available consent groups within the study. It can additionally be interpreted
as providing access to that study's "exchange area" (in addition to the parent study's
"common exchange area").

Fence will consolidate all known consents for a given study and then provide any user
with `c999` access to all those consents (including `.c999` explicitly, to represent
that study's exchange area).

Fence allows configuring whether or not you want to handle the "common exchange area" logic
mentioned above (at the time of writing, it is `enable_common_exchange_area_access` in the `dbGaP` block).

When turned on, you can provide a list of study identifiers (ex: `phs000123`, `phs000456`) and the resource you want to represent their parent study's common exchange area (ex: `123_and_456_common_exchange_area`) in Fence's configuration file (at the time of writing, it is `study_common_exchange_areas` in the `dbGaP` block).

> NOTE: Again, please see the `config-default.yaml` for more information about available configurations.

For example, `c999` would be handled slightly differently based on configuration. Below, assume a user has access to the `c999` consent group:

| | **Consent Codes Cfg == True** | **Consent Codes Cfg == False** |
|---|---|---|
| **Common Exchange Cfg == True** | access to: common exchange area (if phsid in cfg mapping) + study-specific exchange area + all consent codes | c999 ignored, access to phsid w/o consent |
| **Common Exchange Cfg == False** | access to: study-specific exchange area + all consent codes | c999 ignored, access to phsid w/o consent |

So the access granted to a user with `phs000123.c999` (assuming `phs000123.c1` and
`phs000123.c2` also exist) would be:

| | **Consent Codes Cfg == True** | **Consent Codes Cfg == False** |
|---|---|---|
| **Common Exchange Cfg == True** | `test_common_exchange_area` + `phs000123.c999` + `phs000123.c1`, `phs000123.c2` | `phs000123` |
| **Common Exchange Cfg == False** | `phs000123.c999` + `phs000123.c1`, `phs000123.c2` | `phs000123` |
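
The two tables can also be read as a small decision function. The sketch below is illustrative only and is not how Fence implements it; the function name `resources_for_c999` is hypothetical, and its parameters are simply modeled on the configuration option names mentioned in this document.

```python
from typing import Dict, List, Set


def resources_for_c999(
    study_id: str,
    known_consents: List[str],
    parse_consent_code: bool,
    enable_common_exchange_area_access: bool,
    study_common_exchange_areas: Dict[str, str],
) -> Set[str]:
    """Resources granted to a user who has `<study_id>.c999` (hypothetical)."""
    if not parse_consent_code:
        # c999 is ignored; access is to the study without any consent suffix.
        return {study_id}

    # Study-specific exchange area plus every known consent group.
    granted = {f"{study_id}.c999"}
    granted.update(f"{study_id}.{consent}" for consent in known_consents)

    # The common exchange area is added only if the study is in the mapping.
    if enable_common_exchange_area_access and study_id in study_common_exchange_areas:
        granted.add(study_common_exchange_areas[study_id])
    return granted


mapping = {"phs000123": "test_common_exchange_area"}
print(sorted(resources_for_c999("phs000123", ["c1", "c2"], True, True, mapping)))
```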

> NOTE: On the resource level, `phs000123.c999` should refer to resources that exist in that study's specific exchange area. Resources in the parent's common exchange area should be controlled via `test_common_exchange_area`.

## Version Updates

When a study is updated, its participants and consent groups may change, and the version number (ex: `v1`) is bumped. At the moment, Fence does not handle these versions, so authorization is effectively either at the study level or at the study+consent level.
50 changes: 43 additions & 7 deletions fence/config-default.yaml
@@ -383,15 +383,51 @@ dbGaP:
proxy_user: ''
protocol: 'sftp'
decrypt_key: ''
# parse out the consent from the dbgap accession number such that something
# like "phs000123.v1.p1.c2" becomes "phs000123.c2".
#
# NOTE: when this is "false" the above would become "phs000123"
parse_consent_code: true
# A mapping from the dbgap study to which authorization namespaces the actual data
# lives in. For example, `studyX` data may exist in multiple organizations, so
# we need to know to map authorization to all orgs resources
# A consent of "c999" can indicate access to that study's "exchange area data"
# and when a user has access to one study's exchange area data, they
# have access to the parent study's "common exchange area data" that is not study
# specific. The following config is whether or not to parse/handle "c999" codes
# for access to the common exchange area data
#
# NOTE: When enabled you MUST also provide a mapping to the
# `study_common_exchange_areas` from study -> parent common exchange area resource
enable_common_exchange_area_access: false
# The below configuration is a mapping from studies to their "common exchange area data"
# Fence project name a user gets access to when parsing c999 exchange area codes (and
# subsequently gives access to an Arborist resource representing this common area
# as well)
study_common_exchange_areas:
'example': 'test_common_exchange_area'
# 'studyX': 'test_common_exchange_area'
# 'studyY': 'test_common_exchange_area'
# 'studyZ': 'test_common_exchange_area'
# A mapping from the dbgap study / Fence project to which authorization namespaces the
# actual data lives in. For example, `studyX` data may exist in multiple organizations, so
# we need to know how to map authorization to all orgs resources
study_to_resource_namespaces:
'_default': ['/']
'studyX': ['/orgA/', '/orgB/']
'studyY': ['/orgB/', '/orgC/']
'studyZ': ['/orgD/']
'test_common_exchange_area': ['/dbgap/']
# above are for default support and exchange area support
# below are further examples
#
# 'studyX': ['/orgA/', '/orgB/']
# 'studyX.c2': ['/orgB/', '/orgC/']
# 'studyZ': ['/orgD/']

# Regex to match an accession number that has consent information in forms like:
# phs00301123.c999
# phs000123.v3.p1.c3
# phs000123.c3
# phs00301123.v3.p4.c999
# Will NOT MATCH forms like: phs000123
#
# WARNING: Do not change this without consulting the code that uses it
DBGAP_ACCESSION_WITH_CONSENT_REGEX: '(?P<phsid>phs[0-9]+)(.(?P<version>v[0-9]+)){0,1}(.(?P<participant_set>p[0-9]+)){0,1}.(?P<consent>c[0-9]+)'
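
To see what this pattern accepts, it can be exercised directly with Python's `re` module. The snippet below is a standalone sketch: the pattern string is copied verbatim from the config above, while the helper name `consent_parts` is made up for this example. Note that the `.` separators in the pattern are unescaped, so they match any single character rather than only a literal dot.

```python
import re

# Pattern copied verbatim from the config value above.
DBGAP_ACCESSION_WITH_CONSENT_REGEX = (
    "(?P<phsid>phs[0-9]+)(.(?P<version>v[0-9]+)){0,1}"
    "(.(?P<participant_set>p[0-9]+)){0,1}.(?P<consent>c[0-9]+)"
)
matcher = re.compile(DBGAP_ACCESSION_WITH_CONSENT_REGEX)


def consent_parts(accession):
    """Return the named groups for an accession, or None if it has no consent."""
    match = matcher.match(accession)
    return match.groupdict() if match else None


for example in ["phs000123.v3.p1.c3", "phs00301123.c999", "phs000123"]:
    print(example, "->", consent_parts(example))

# Because "." is unescaped, a non-dot separator is also accepted:
print(consent_parts("phs000123-c3"))
```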

# //////////////////////////////////////////////////////////////////////////////////////
# STORAGE BACKENDS AND CREDENTIALS
@@ -667,4 +703,4 @@ DREAM_CHALLENGE_TEAM: 'DREAM'
DREAM_CHALLENGE_GROUP: 'DREAM'
SYNAPSE_URI: 'https://repo-prod.prod.sagebase.org/auth/v1/'
SYNAPSE_DISCOVERY_URL:
SYNAPSE_AUTHZ_TTL: 86400
167 changes: 138 additions & 29 deletions fence/sync/sync_users.py
@@ -6,6 +6,7 @@
import subprocess as sp
import tempfile
import yaml
import copy
from contextlib import contextmanager
from csv import DictReader
from io import StringIO
@@ -299,6 +300,10 @@ def __init__(
self.protocol = dbGaP["protocol"]
self.dbgap_key = dbGaP["decrypt_key"]
self.parse_consent_code = dbGaP.get("parse_consent_code", True)
self.enable_common_exchange_area_access = dbGaP.get(
"enable_common_exchange_area_access", False
)
self.study_common_exchange_areas = dbGaP.get("study_common_exchange_areas", {})
self.session = db_session
self.driver = SQLAlchemyDriver(DB)
self.project_mapping = project_mapping or {}
@@ -457,43 +462,62 @@ def _parse_csv(self, file_dict, sess, encrypted=True):
dbgap_project = phsid[0]
if len(phsid) > 1 and self.parse_consent_code:
consent_code = phsid[-1]
if consent_code != "c999":
dbgap_project += "." + consent_code

# c999 indicates full access to all consents and access
# to a study-specific exchange area
# access to at least one study-specific exchange area implies access
# to the parent study's common exchange area
#
# NOTE: Handling giving access to all consents is done at
# a later time, when we have full information about possible
# consents
self.logger.debug(
f"got consent code {consent_code} from dbGaP project "
f"{dbgap_project}"
)
if (
consent_code == "c999"
and self.enable_common_exchange_area_access
and dbgap_project in self.study_common_exchange_areas
):
self.logger.info(
"found study with consent c999 and Fence "
"is configured to parse exchange area data. Giving user "
f"{username} {privileges} privileges in project: "
f"{self.study_common_exchange_areas[dbgap_project]}."
)
self._add_dbgap_project_for_user(
self.study_common_exchange_areas[dbgap_project],
privileges,
username,
sess,
user_projects,
)

dbgap_project += "." + consent_code

display_name = row.get("user name", "")
tags = {"dbgap_role": row.get("role", "")}

# some dbgap telemetry files have information about a researchers PI
if "downloader for" in row:
tags["pi"] = row["downloader for"]

# prefer name over previous "downloader for" if it exists
if "downloader for names" in row:
tags["pi"] = row["downloader for names"]

user_info[username] = {
"email": row.get("email", ""),
"display_name": display_name,
"phone_number": row.get("phone", ""),
"tags": {"dbgap_role": row.get("role", "")},
"tags": tags,
}

if dbgap_project not in self.project_mapping:
if dbgap_project not in self._projects:

self.logger.debug(
"creating Project in fence from dbGaP study: {}".format(
dbgap_project
)
)

project = self._get_or_create(
sess, Project, auth_id=dbgap_project
)

# need to add dbgap project to arborist
if self.arborist_client:
self._add_dbgap_study_to_arborist(dbgap_project)

if project.name is None:
project.name = dbgap_project
self._projects[dbgap_project] = project
phsid_privileges = {dbgap_project: set(privileges)}
if username in user_projects:
user_projects[username].update(phsid_privileges)
else:
user_projects[username] = phsid_privileges

self._add_dbgap_project_for_user(
dbgap_project, privileges, username, sess, user_projects
)
for element_dict in self.project_mapping.get(dbgap_project, []):
try:
phsid_privileges = {
@@ -513,6 +537,33 @@ def _parse_csv(self, file_dict, sess, encrypted=True):
self.logger.info(e)
return user_projects, user_info

    def _add_dbgap_project_for_user(
        self, dbgap_project, privileges, username, sess, user_projects
    ):
        """
        Helper function for csv parsing that adds a given dbgap project to Fence/Arborist
        and then updates the dictionary containing all user's project access
        """
        if dbgap_project not in self._projects:
            self.logger.debug(
                "creating Project in fence for dbGaP study: {}".format(dbgap_project)
            )

            project = self._get_or_create(sess, Project, auth_id=dbgap_project)

            # need to add dbgap project to arborist
            if self.arborist_client:
                self._add_dbgap_study_to_arborist(dbgap_project)

            if project.name is None:
                project.name = dbgap_project
            self._projects[dbgap_project] = project
        phsid_privileges = {dbgap_project: set(privileges)}
        if username in user_projects:
            user_projects[username].update(phsid_privileges)
        else:
            user_projects[username] = phsid_privileges

@staticmethod
def sync_two_user_info_dict(user_info1, user_info2):
"""
@@ -934,6 +985,10 @@ def _sync(self, sess):

self.logger.info("dbgap files: {}".format(dbgap_file_list))
permissions = [{"read-storage"} for _ in dbgap_file_list]
if self.parse_consent_code and self.enable_common_exchange_area_access:
self.logger.info(
f"using study to common exchange area mapping: {self.study_common_exchange_areas}"
)
user_projects, user_info = self._parse_csv(
dict(list(zip(dbgap_file_list, permissions))), encrypted=True, sess=sess
)
@@ -977,10 +1032,15 @@ def _sync(self, sess):
self.sync_two_phsids_dict(user_projects_csv, user_projects)
self.sync_two_user_info_dict(user_info_csv, user_info)

# privileges in yaml files override ones in csv files
self.sync_two_phsids_dict(user_yaml.projects, user_projects)
self.sync_two_user_info_dict(user_yaml.user_info, user_info)

if self.parse_consent_code:
self._grant_all_consents_to_c999_users(
user_projects, user_yaml.project_to_resource
)

if user_projects:
self.logger.info("Sync to db and storage backend")
self.sync_to_db_and_storage_backend(user_projects, user_info, sess)
@@ -1017,6 +1077,53 @@ def _sync(self, sess):
)
exit(1)

    def _grant_all_consents_to_c999_users(
        self, user_projects, user_yaml_project_to_resources
    ):
        access_number_matcher = re.compile(config["DBGAP_ACCESSION_WITH_CONSENT_REGEX"])
        # combine dbgap/user.yaml projects into one big list (in case not all consents
        # are in either)
        all_projects = set(
            list(self._projects.keys()) + list(user_yaml_project_to_resources.keys())
        )

        self.logger.debug(f"all projects: {all_projects}")

        # construct a mapping from phsid (without consent) to all accessions with consent
        consent_mapping = {}
        for project in all_projects:
            phs_match = access_number_matcher.match(project)
            if phs_match:
                accession_number = phs_match.groupdict()

                # TODO: This is not handling the .v1.p1 at all
                consent_mapping.setdefault(accession_number["phsid"], set()).add(
                    ".".join([accession_number["phsid"], accession_number["consent"]])
                )

        self.logger.debug(f"consent mapping: {consent_mapping}")

        # go through existing access and find any c999's and make sure to give access to
        # all accessions with consent for that phsid
        for username, user_project_info in copy.deepcopy(user_projects).items():
            for project, _ in user_project_info.items():
                phs_match = access_number_matcher.match(project)
                if phs_match and phs_match.groupdict()["consent"] == "c999":
                    # give access to all consents
                    all_phsids_with_consent = consent_mapping.get(
                        phs_match.groupdict()["phsid"], []
                    )
                    self.logger.info(
                        f"user {username} has c999 consent group for: {project}. "
                        f"Granting access to all consents: {all_phsids_with_consent}"
                    )
                    # NOTE: Only giving read-storage at the moment (this is same
                    # permission we give for other dbgap projects)
                    for phsid_with_consent in all_phsids_with_consent:
                        user_projects[username].update(
                            {phsid_with_consent: {"read-storage"}}
                        )
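
The method above works in two passes: first it builds a mapping from each study ID to every known `study.consent` project, then it walks a copy of the users' access and expands any `c999` entry. The standalone sketch below mirrors that flow outside the syncer class so it can be run in isolation. It is an approximation under stated assumptions: the function name `grant_all_consents` and the sample users are invented for illustration, and the pattern is the one from `config-default.yaml`.

```python
import copy
import re

# Same pattern as DBGAP_ACCESSION_WITH_CONSENT_REGEX in config-default.yaml.
ACCESSION_WITH_CONSENT = re.compile(
    "(?P<phsid>phs[0-9]+)(.(?P<version>v[0-9]+)){0,1}"
    "(.(?P<participant_set>p[0-9]+)){0,1}.(?P<consent>c[0-9]+)"
)


def grant_all_consents(user_projects, all_projects):
    """Expand c999 access in place (hypothetical standalone version)."""
    # Pass 1: study ID -> every known "study.consent" project.
    consent_mapping = {}
    for project in all_projects:
        match = ACCESSION_WITH_CONSENT.match(project)
        if match:
            consent_mapping.setdefault(match["phsid"], set()).add(
                f"{match['phsid']}.{match['consent']}"
            )

    # Pass 2: iterate over a deep copy so the real dict can be mutated safely.
    for username, projects in copy.deepcopy(user_projects).items():
        for project in projects:
            match = ACCESSION_WITH_CONSENT.match(project)
            if match and match["consent"] == "c999":
                for with_consent in consent_mapping.get(match["phsid"], ()):
                    user_projects[username][with_consent] = {"read-storage"}
    return user_projects


users = {
    "alice": {"phs000123.c999": {"read-storage"}},
    "bob": {"phs000123.c1": {"read-storage"}},
}
known = ["phs000123.c1", "phs000123.c2", "phs000123.c999", "phs000456.c1"]
print(grant_all_consents(users, known))
```

Iterating over a deep copy matters here because adding keys to a dictionary while iterating over it raises a `RuntimeError` in Python.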

def _update_arborist(self, session, user_yaml):
"""
Create roles, resources, policies, groups in arborist from the information in
@@ -1294,6 +1401,8 @@ def _add_dbgap_study_to_arborist(self, dbgap_study):
.get(dbgap_study, default_namespaces)
)

self.logger.debug(f"dbgap study namespaces: {namespaces}")

arborist_resource_namespaces = [
namespace.rstrip("/") + "/programs/" for namespace in namespaces
]
9 changes: 8 additions & 1 deletion tests/dbgap_sync/conftest.py
@@ -120,7 +120,14 @@ def syncer(db_session, request):
"phstest": [{"name": "Test", "auth_id": "Test"}],
}

dbGap = {}
dbGap = yaml_load(
open(
os.path.join(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
"test-fence-config.yaml",
)
)
).get("dbGaP")
test_db = yaml_load(
open(
os.path.join(