Merge pull request #702 from uc-cdis/feat/dbgap-options

Feat/dbgap options
uc-cdis · Oct 22, 2019 · eccddc1
2 parents aeed885 + 0d6fad0
commit eccddc1

Showing 6 changed files with 463 additions and 65 deletions.
diff --git a/docs/dbgap_info.md b/docs/dbgap_info.md
@@ -0,0 +1,69 @@
+# dbGaP Information (as understood by Gen3)
+
+The [Database of Genotypes and Phenotypes (dbGaP)](https://www.ncbi.nlm.nih.gov/gap/) is used to "archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in Humans".
+
+> NOTE: For official details about dbGaP please visit the official site/documentation.
+
+The largest unit of data that can be submitted to dbGaP is a *Study*. Studies can have sub-studies. Each study is identified by a unique study number (AKA phsid AKA study accession) and additional information (like version), which may look something like `phs001826.v1.p1.c1`. The `.` characters delimit the individual pieces of information:
+
+* `phs001826`: unique study identifier
+* `v1`: data version
+* `p1`: participant set version
+* `c1`: consent group
+
+The combination of these fields is known as a *dbGaP Accession Number*.
+
+More information about this can be found in [this NCBI article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2031016/).
+
+## Authorization
+
+Fence is capable of syncing user access information from dbGaP's *Telemetry Files* (AKA study whitelists). These are typically provided on an SFTP server as CSV or TXT files. You can see an example of the format in the Fence unit tests (`/tests/dbgap_sync`).
+
+A single *Telemetry File* represents the access allowed for a given dbGaP study accession. The file will contain rows of eRA Commons user IDs and other information. Fence is able to parse all *Telemetry Files* in an SFTP server (given credentials for the SFTP server).
+
+## Consent Groups
+
+Fence contains a configuration for whether or not to parse consent codes (at the time of writing, it is `parse_consent_code` in the `dbGaP` block).
+
+> NOTE: Reference the `config-default.yaml` for current configuration options and further details.
+
+When parsing consent codes, the authorization resources a user is given access to will be in the form `study_id.consent_group` (ex: `phs001826.c2`).
+
+### Consent Group `c999` Handling
+
+The consent group `c999` is interpreted as meaning the user should implicitly have access
+to **all** available consent groups within the study. It can additionally be interpreted
+as providing access to that study's "exchange area" (in addition to the parent study's
+"common exchange area").
+
+Fence will consolidate all known consents for a given study and then provide any user
+with `c999` access to all those consents (including `.c999` explicitly, to represent
+that study's exchange area).
+
+Fence allows configuring whether or not you want to handle the "common exchange area" logic
+mentioned above (at the time of writing, it is `enable_common_exchange_area_access` in the `dbGaP` block).
+
+When turned on, you can provide a list of study identifiers (ex: `phs000123`, `phs000456`) and the resource you want to represent their parent study's common exchange area (ex: `123_and_456_common_exchange_area`) in Fence's configuration file (at the time of writing, it is `study_common_exchange_areas` in the `dbGaP` block).
+
+> NOTE: Again, please see the `config-default.yaml` for more information about available configurations.
+
+`c999` is handled slightly differently depending on the configuration. The tables below assume a user has access to the `c999` consent group:
+
+| | **Consent Codes Cfg == True** | **Consent Codes Cfg == False** |
+|---| ------------- | ------------- |
+| **Common Exchange Cfg == True** | access to: common exchange area (if phsid in cfg mapping) + study-specific exchange area + all consent codes | c999 ignored, access to phsid w/o consent |
+| **Common Exchange Cfg == False** | access to: study-specific exchange area + all consent codes | c999 ignored, access to phsid w/o consent |
+
+Concretely, for a user with `phs000123.c999` (assuming `phs000123.c1` and `phs000123.c2`
+also exist, and `phs000123` is mapped to `test_common_exchange_area`), the access granted is:
+
+| | **Consent Codes Cfg == True** | **Consent Codes Cfg == False** |
+|---| ------------- | ------------- |
+| **Common Exchange Cfg == True** | `test_common_exchange_area` + `phs000123.c999` + `phs000123.c1`, `phs000123.c2` | `phs000123` |
+| **Common Exchange Cfg == False** | `phs000123.c999` + `phs000123.c1`, `phs000123.c2` | `phs000123` |
+
+> NOTE: On the resource level, `phs000123.c999` should refer to resources that exist in that study's specific exchange area. Resources in the parent's common exchange area should be controlled via `test_common_exchange_area`.
+
+## Version Updates
+
+When a study is updated, its participants and consent groups may change, and the data version (`v1`) is incremented. At the moment, Fence does not handle these versions, so authorization is effectively either study level or study+consent level.
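
The `c999` tables in `docs/dbgap_info.md` above can be read as a small decision function. The following is only a sketch of that documented behavior, not code from this PR: `resources_for_c999` and its parameters are hypothetical names chosen to mirror the `dbGaP` config keys, and the list of known consents is assumed to be supplied by the caller.

```python
def resources_for_c999(
    phsid,
    known_consents,
    parse_consent_code=True,
    enable_common_exchange_area_access=False,
    study_common_exchange_areas=None,
):
    """Return the resources a user holding `<phsid>.c999` would end up with.

    Hypothetical helper mirroring the documentation tables; it is not part
    of Fence itself.
    """
    if not parse_consent_code:
        # Consent codes are ignored entirely: access is study-level only.
        return {phsid}

    # Study-specific exchange area plus every known consent group.
    granted = {f"{phsid}.c999"}
    granted.update(f"{phsid}.{consent}" for consent in known_consents)

    # The common exchange area is added only when the feature is enabled
    # and the study appears in the configured mapping.
    mapping = study_common_exchange_areas or {}
    if enable_common_exchange_area_access and phsid in mapping:
        granted.add(mapping[phsid])
    return granted


if __name__ == "__main__":
    print(
        sorted(
            resources_for_c999(
                "phs000123",
                ["c1", "c2"],
                enable_common_exchange_area_access=True,
                study_common_exchange_areas={
                    "phs000123": "test_common_exchange_area"
                },
            )
        )
    )
```

Each cell of the second table corresponds to one combination of the two boolean arguments.
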
diff --git a/fence/config-default.yaml b/fence/config-default.yaml
@@ -383,15 +383,51 @@ dbGaP:
     proxy_user: ''
   protocol: 'sftp'
   decrypt_key: ''
+  # parse out the consent from the dbgap accession number such that something
+  # like "phs000123.v1.p1.c2" becomes "phs000123.c2".
+  #
+  # NOTE: when this is "false" the above would become "phs000123"
   parse_consent_code: true
-  # A mapping from the dbgap study to which authorization namespaces the actual data
-  # lives in. For example, `studyX` data may exist in multiple organizations, so
-  # we need to know to map authorization to all orgs resources
+  # A consent of "c999" can indicate access to that study's "exchange area data"
+  # and when a user has access to one study's exchange area data, they
+  # have access to the parent study's "common exchange area data" that is not study
+  # specific. The following config is whether or not to parse/handle "c999" codes
+  # for access to the common exchange area data
+  #
+  # NOTE: When enabled you MUST also provide a mapping to the
+  # `study_common_exchange_areas` from study -> parent common exchange area resource
+  enable_common_exchange_area_access: false
+  # The configuration below maps each study to the Fence project name of its
+  # "common exchange area data". A user gets access to that project when a c999
+  # exchange area code is parsed for the study (and subsequently gets access to the
+  # Arborist resource representing this common area as well)
+  study_common_exchange_areas:
+    'example': 'test_common_exchange_area'
+    # 'studyX': 'test_common_exchange_area'
+    # 'studyY': 'test_common_exchange_area'
+    # 'studyZ': 'test_common_exchange_area'
+  # A mapping from the dbgap study / Fence project to which authorization namespaces the
+  # actual data lives in. For example, `studyX` data may exist in multiple organizations, so
+  # we need to know how to map authorization to all orgs resources
   study_to_resource_namespaces:
     '_default': ['/']
-    'studyX': ['/orgA/', '/orgB/']
-    'studyY': ['/orgB/', '/orgC/']
-    'studyZ': ['/orgD/']
+    'test_common_exchange_area': ['/dbgap/']
+    # above are for default support and exchange area support
+    # below are further examples
+    #
+    # 'studyX': ['/orgA/', '/orgB/']
+    # 'studyX.c2': ['/orgB/', '/orgC/']
+    # 'studyZ': ['/orgD/']
+
+# Regex to match an accession number that has consent information in forms like:
+#   phs00301123.c999
+#   phs000123.v3.p1.c3
+#   phs000123.c3
+#   phs00301123.v3.p4.c999
+# Will NOT MATCH forms like: phs000123
+#
+# WARNING: Do not change this without consulting the code that uses it
+DBGAP_ACCESSION_WITH_CONSENT_REGEX: '(?P<phsid>phs[0-9]+)(.(?P<version>v[0-9]+)){0,1}(.(?P<participant_set>p[0-9]+)){0,1}.(?P<consent>c[0-9]+)'
 
 # //////////////////////////////////////////////////////////////////////////////////////
 # STORAGE BACKENDS AND CREDENTIALS
@@ -667,4 +703,4 @@ DREAM_CHALLENGE_TEAM: 'DREAM'
 DREAM_CHALLENGE_GROUP: 'DREAM'
 SYNAPSE_URI: 'https://repo-prod.prod.sagebase.org/auth/v1/'
 SYNAPSE_DISCOVERY_URL:
-SYNAPSE_AUTHZ_TTL: 86400
+SYNAPSE_AUTHZ_TTL: 86400
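
To illustrate what `DBGAP_ACCESSION_WITH_CONSENT_REGEX` captures, here is a minimal sketch that applies the pattern, copied verbatim from the config hunk above, to the example accessions listed in its comment. `split_accession` is a hypothetical helper, not a Fence function.

```python
import re

# Pattern copied verbatim from DBGAP_ACCESSION_WITH_CONSENT_REGEX.
ACCESSION_WITH_CONSENT = re.compile(
    r"(?P<phsid>phs[0-9]+)(.(?P<version>v[0-9]+)){0,1}"
    r"(.(?P<participant_set>p[0-9]+)){0,1}.(?P<consent>c[0-9]+)"
)


def split_accession(accession):
    """Return the named parts of an accession, or None if there is no consent.

    Hypothetical helper for illustration only.
    """
    match = ACCESSION_WITH_CONSENT.match(accession)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    for example in ("phs000123.v3.p1.c3", "phs000123.c3", "phs000123"):
        print(example, "->", split_accession(example))
```

Parts that are absent (for example the version in `phs000123.c3`) come back as `None`. Because the `.` separators in the pattern are unescaped, they match any single character, so a string such as `phs000123xc3` is accepted as well.
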
diff --git a/fence/sync/sync_users.py b/fence/sync/sync_users.py
@@ -6,6 +6,7 @@
 import subprocess as sp
 import tempfile
 import yaml
+import copy
 from contextlib import contextmanager
 from csv import DictReader
 from io import StringIO
@@ -299,6 +300,10 @@ def __init__(
             self.protocol = dbGaP["protocol"]
             self.dbgap_key = dbGaP["decrypt_key"]
         self.parse_consent_code = dbGaP.get("parse_consent_code", True)
+        self.enable_common_exchange_area_access = dbGaP.get(
+            "enable_common_exchange_area_access", False
+        )
+        self.study_common_exchange_areas = dbGaP.get("study_common_exchange_areas", {})
         self.session = db_session
         self.driver = SQLAlchemyDriver(DB)
         self.project_mapping = project_mapping or {}
@@ -457,43 +462,62 @@ def _parse_csv(self, file_dict, sess, encrypted=True):
                     dbgap_project = phsid[0]
                     if len(phsid) > 1 and self.parse_consent_code:
                         consent_code = phsid[-1]
-                        if consent_code != "c999":
-                            dbgap_project += "." + consent_code
+
+                        # c999 indicates full access to all consents and access
+                        # to a study-specific exchange area
+                        # access to at least one study-specific exchange area implies access
+                        # to the parent study's common exchange area
+                        #
+                        # NOTE: Handling giving access to all consents is done at
+                        #       a later time, when we have full information about possible
+                        #       consents
+                        self.logger.debug(
+                            f"got consent code {consent_code} from dbGaP project "
+                            f"{dbgap_project}"
+                        )
+                        if (
+                            consent_code == "c999"
+                            and self.enable_common_exchange_area_access
+                            and dbgap_project in self.study_common_exchange_areas
+                        ):
+                            self.logger.info(
+                                "found study with consent c999 and Fence "
+                                "is configured to parse exchange area data. Giving user "
+                                f"{username} {privileges} privileges in project: "
+                                f"{self.study_common_exchange_areas[dbgap_project]}."
+                            )
+                            self._add_dbgap_project_for_user(
+                                self.study_common_exchange_areas[dbgap_project],
+                                privileges,
+                                username,
+                                sess,
+                                user_projects,
+                            )
+
+                        dbgap_project += "." + consent_code
 
                     display_name = row.get("user name", "")
+                    tags = {"dbgap_role": row.get("role", "")}
+
+                    # some dbgap telemetry files have information about a researcher's PI
+                    if "downloader for" in row:
+                        tags["pi"] = row["downloader for"]
+
+                    # prefer name over previous "downloader for" if it exists
+                    if "downloader for names" in row:
+                        tags["pi"] = row["downloader for names"]
+
                     user_info[username] = {
                         "email": row.get("email", ""),
                         "display_name": display_name,
                         "phone_number": row.get("phone", ""),
-                        "tags": {"dbgap_role": row.get("role", "")},
+                        "tags": tags,
                     }
 
                     if dbgap_project not in self.project_mapping:
-                        if dbgap_project not in self._projects:
-
-                            self.logger.debug(
-                                "creating Project in fence from dbGaP study: {}".format(
-                                    dbgap_project
-                                )
-                            )
-
-                            project = self._get_or_create(
-                                sess, Project, auth_id=dbgap_project
-                            )
-
-                            # need to add dbgap project to arborist
-                            if self.arborist_client:
-                                self._add_dbgap_study_to_arborist(dbgap_project)
-
-                            if project.name is None:
-                                project.name = dbgap_project
-                            self._projects[dbgap_project] = project
-                        phsid_privileges = {dbgap_project: set(privileges)}
-                        if username in user_projects:
-                            user_projects[username].update(phsid_privileges)
-                        else:
-                            user_projects[username] = phsid_privileges
-
+                        self._add_dbgap_project_for_user(
+                            dbgap_project, privileges, username, sess, user_projects
+                        )
                     for element_dict in self.project_mapping.get(dbgap_project, []):
                         try:
                             phsid_privileges = {
@@ -513,6 +537,33 @@ def _parse_csv(self, file_dict, sess, encrypted=True):
                             self.logger.info(e)
         return user_projects, user_info
 
+    def _add_dbgap_project_for_user(
+        self, dbgap_project, privileges, username, sess, user_projects
+    ):
+        """
+        Helper function for csv parsing that adds a given dbgap project to Fence/Arborist
+        and then updates the dictionary containing all users' project access
+        """
+        if dbgap_project not in self._projects:
+            self.logger.debug(
+                "creating Project in fence for dbGaP study: {}".format(dbgap_project)
+            )
+
+            project = self._get_or_create(sess, Project, auth_id=dbgap_project)
+
+            # need to add dbgap project to arborist
+            if self.arborist_client:
+                self._add_dbgap_study_to_arborist(dbgap_project)
+
+            if project.name is None:
+                project.name = dbgap_project
+            self._projects[dbgap_project] = project
+        phsid_privileges = {dbgap_project: set(privileges)}
+        if username in user_projects:
+            user_projects[username].update(phsid_privileges)
+        else:
+            user_projects[username] = phsid_privileges
+
     @staticmethod
     def sync_two_user_info_dict(user_info1, user_info2):
         """
@@ -934,6 +985,10 @@ def _sync(self, sess):
 
         self.logger.info("dbgap files: {}".format(dbgap_file_list))
         permissions = [{"read-storage"} for _ in dbgap_file_list]
+        if self.parse_consent_code and self.enable_common_exchange_area_access:
+            self.logger.info(
+                f"using study to common exchange area mapping: {self.study_common_exchange_areas}"
+            )
         user_projects, user_info = self._parse_csv(
             dict(list(zip(dbgap_file_list, permissions))), encrypted=True, sess=sess
         )
@@ -977,10 +1032,15 @@ def _sync(self, sess):
         self.sync_two_phsids_dict(user_projects_csv, user_projects)
         self.sync_two_user_info_dict(user_info_csv, user_info)
 
-        # privilleges in yaml files overide ones in csv files
+        # privileges in yaml files override ones in csv files
         self.sync_two_phsids_dict(user_yaml.projects, user_projects)
         self.sync_two_user_info_dict(user_yaml.user_info, user_info)
 
+        if self.parse_consent_code:
+            self._grant_all_consents_to_c999_users(
+                user_projects, user_yaml.project_to_resource
+            )
+
         if user_projects:
             self.logger.info("Sync to db and storage backend")
             self.sync_to_db_and_storage_backend(user_projects, user_info, sess)
@@ -1017,6 +1077,53 @@ def _sync(self, sess):
                 )
                 exit(1)
 
+    def _grant_all_consents_to_c999_users(
+        self, user_projects, user_yaml_project_to_resources
+    ):
+        access_number_matcher = re.compile(config["DBGAP_ACCESSION_WITH_CONSENT_REGEX"])
+        # combine dbgap/user.yaml projects into one big list (in case not all consents
+        # are in either)
+        all_projects = set(
+            list(self._projects.keys()) + list(user_yaml_project_to_resources.keys())
+        )
+
+        self.logger.debug(f"all projects: {all_projects}")
+
+        # construct a mapping from phsid (without consent) to all accessions with consent
+        consent_mapping = {}
+        for project in all_projects:
+            phs_match = access_number_matcher.match(project)
+            if phs_match:
+                accession_number = phs_match.groupdict()
+
+                # TODO: This is not handling the .v1.p1 at all
+                consent_mapping.setdefault(accession_number["phsid"], set()).add(
+                    ".".join([accession_number["phsid"], accession_number["consent"]])
+                )
+
+        self.logger.debug(f"consent mapping: {consent_mapping}")
+
+        # go through existing access and find any c999's and make sure to give access to
+        # all accessions with consent for that phsid
+        for username, user_project_info in copy.deepcopy(user_projects).items():
+            for project, _ in user_project_info.items():
+                phs_match = access_number_matcher.match(project)
+                if phs_match and phs_match.groupdict()["consent"] == "c999":
+                    # give access to all consents
+                    all_phsids_with_consent = consent_mapping.get(
+                        phs_match.groupdict()["phsid"], []
+                    )
+                    self.logger.info(
+                        f"user {username} has c999 consent group for: {project}. "
+                        f"Granting access to all consents: {all_phsids_with_consent}"
+                    )
+                    # NOTE: Only giving read-storage at the moment (this is same
+                    #       permission we give for other dbgap projects)
+                    for phsid_with_consent in all_phsids_with_consent:
+                        user_projects[username].update(
+                            {phsid_with_consent: {"read-storage"}}
+                        )
+
     def _update_arborist(self, session, user_yaml):
         """
         Create roles, resources, policies, groups in arborist from the information in
@@ -1294,6 +1401,8 @@ def _add_dbgap_study_to_arborist(self, dbgap_study):
             .get(dbgap_study, default_namespaces)
         )
 
+        self.logger.debug(f"dbgap study namespaces: {namespaces}")
+
         arborist_resource_namespaces = [
             namespace.rstrip("/") + "/programs/" for namespace in namespaces
         ]
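
The effect of `_grant_all_consents_to_c999_users` can be seen in isolation with a standalone sketch. This is an approximation under stated assumptions, not the method itself: `grant_all_consents` is a hypothetical function, it takes the set of known projects as an argument instead of reading `self._projects` and the user.yaml mapping, and it returns a new dictionary rather than mutating its input.

```python
import copy
import re

# Pattern copied verbatim from DBGAP_ACCESSION_WITH_CONSENT_REGEX.
ACCESSION_WITH_CONSENT = re.compile(
    r"(?P<phsid>phs[0-9]+)(.(?P<version>v[0-9]+)){0,1}"
    r"(.(?P<participant_set>p[0-9]+)){0,1}.(?P<consent>c[0-9]+)"
)


def grant_all_consents(user_projects, all_projects):
    """Give every user with `<phsid>.c999` access to all consents of that study.

    Hypothetical standalone version of the c999 expansion step.
    """
    # Map each phsid to every known "<phsid>.<consent>" accession.
    consent_mapping = {}
    for project in all_projects:
        match = ACCESSION_WITH_CONSENT.match(project)
        if match:
            parts = match.groupdict()
            consent_mapping.setdefault(parts["phsid"], set()).add(
                f"{parts['phsid']}.{parts['consent']}"
            )

    # Work on a copy so the input can be iterated safely while adding entries.
    updated = copy.deepcopy(user_projects)
    for username, projects in user_projects.items():
        for project in projects:
            match = ACCESSION_WITH_CONSENT.match(project)
            if match and match.group("consent") == "c999":
                for accession in consent_mapping.get(match.group("phsid"), ()):
                    # Same privilege the sync gives other dbGaP projects.
                    updated[username][accession] = {"read-storage"}
    return updated


if __name__ == "__main__":
    users = {
        "alice": {"phs000123.c999": {"read-storage"}},
        "bob": {"phs000123.c1": {"read-storage"}},
    }
    known = {"phs000123.c1", "phs000123.c2", "phs000123.c999", "phs000456.c1"}
    for name, access in grant_all_consents(users, known).items():
        print(name, sorted(access))
```

Only users holding a `c999` accession are expanded, and only to consents of the same phsid; consents of other studies are untouched.
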

diff --git a/tests/dbgap_sync/conftest.py b/tests/dbgap_sync/conftest.py
@@ -120,7 +120,14 @@ def syncer(db_session, request):
         "phstest": [{"name": "Test", "auth_id": "Test"}],
     }
 
-    dbGap = {}
+    dbGap = yaml_load(
+        open(
+            os.path.join(
+                os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
+                "test-fence-config.yaml",
+            )
+        )
+    ).get("dbGaP")
     test_db = yaml_load(
         open(
             os.path.join(