chore(authz-migration): add migration script #218

Merged
merged 14 commits into from
Jun 7, 2019
191 changes: 191 additions & 0 deletions bin/migrate_acl_authz.py
@@ -0,0 +1,191 @@
"""
This script is used to migrate the `acl` field in indexd to the new `authz` field which
will be used in combination with arborist to handle access control on indexd records.

The `authz` field should consist of a list of resource tags (as defined by
arborist---see arborist's readme at https://github.com/uc-cdis/arborist for more info),
with the meaning that a user trying to access the data file pointed to by this record
must have access to all the resources listed. These resources may be projects or consent
codes or some other mechanism for specifying authorization.

In terms of the migration, it isn't discernible from indexd itself whether the items
listed in the `acl` are programs or projects. For this reason we need access to the
sheepdog tables with "core data"/"metadata", so we can look up which is which. Then, if
the record previously had both a program and a project, since the authz field requires
access to all the listed items, only the project should end up in `authz` (since
requiring the program would deny access to users who can access only the project).

Furthermore, there are two ways to represent the arborist resources that go into
`authz`: the path (human-readable string) and the tag (random string, pseudo-UUID). The
tags are what we want to ultimately put into the `authz` field, since these are
persistent whereas the path could change if resources are renamed.
"""

import argparse
import sys

from cdislogging import get_logger
import requests
from sqlalchemy.engine import create_engine
from sqlalchemy.exc import OperationalError

from indexd.index.drivers.alchemy import IndexRecord, IndexRecordAuthz


logger = get_logger("migrate_acl_authz")


def main():
    args = parse_args()
    sys.path.append(args.path)
    try:
        from local_settings import settings
    except ImportError:
        logger.info("Can't import local_settings, import from default")
        from indexd.default_settings import settings
    driver = settings["config"]["INDEX"]["driver"]
    try:
        acl_converter = ACLConverter(args.sheepdog, args.arborist)
    except EnvironmentError:
        logger.error("can't continue without database connection")
        sys.exit(1)
    with driver.session as session:
        records = session.query(IndexRecord)
Contributor

what's going to happen if there are almost 2 million records? DCF prod has like 1.7 million. do we need to stream them through differently in chunks or something?

Contributor

oh wait, maybe this is some fancy generator ? session.query

Contributor Author

to document the current discussion: looking into options for performant updates on a large number of records, maybe using something like this
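
As a rough illustration of the chunked streaming being discussed (not what this PR implements), one option is keyset pagination: fetch a fixed-size page ordered by the primary key, then resume after the last key seen. The sketch below uses the standard-library `sqlite3` module and a stand-in `index_record` table with a single `did` column so that it runs on its own; the table layout, the `iter_record_ids` helper, and the chunk size are all assumptions made for the example. On the SQLAlchemy side, `Query.yield_per` is another route worth evaluating.

```python
import sqlite3


def iter_record_ids(conn, chunk_size=1000):
    """Yield record ids in primary-key order, one bounded query per chunk.

    Hypothetical helper: only `chunk_size` rows are held in memory at a time.
    """
    last_did = ""  # every non-empty id sorts after the empty string
    while True:
        rows = conn.execute(
            "SELECT did FROM index_record WHERE did > ? ORDER BY did LIMIT ?",
            (last_did, chunk_size),
        ).fetchall()
        if not rows:
            return
        for (did,) in rows:
            yield did
        # resume the next query after the last id of this chunk
        last_did = rows[-1][0]


# Stand-in table so the sketch is runnable; this is NOT the real indexd schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_record (did TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO index_record (did) VALUES (?)",
    [("did-{:04d}".format(i),) for i in range(2500)],
)
streamed = list(iter_record_ids(conn, chunk_size=1000))
```

Because each page is a separate query keyed on the last id, a restarted job could in principle resume from a saved `last_did` instead of starting over.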

        for record in records:
            if not record.acl:
                logger.info(
                    "record {} has no acl, setting authz to empty"
                    .format(record.did)
                )
                record.authz = []
                continue
            try:
                record.authz = acl_converter.acl_to_authz(record)
Contributor

could perhaps skip records that already have authz field filled out? I'm thinking of a situation where this script fails halfway through and want to continue where it left off

Contributor Author

I'm thinking it should go through the whole thing anyways, that would let us reset the authz fields if necessary, and for the cases where it's already cached the resource from arborist the time to do the conversion is minimal, just some string manipulation

Contributor Author

Something I am concerned about is that with the current implementation, it doesn't commit anything until everything is done, so if the job gets killed early then it has to go through the whole thing again. Maybe commit every N records or something?

Contributor

that's a good point, and made me think of load handling for this, since DCF has over a million records. But yes, I agree, we should commit every so often
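
A minimal sketch of the "commit every N records" idea, assuming the conversion and the commit are passed in as callables. The `migrate_in_batches` helper, its parameters, and the fake records are hypothetical and only illustrate the control flow; in the script, the commit callable would presumably be the session's `commit` method.

```python
def migrate_in_batches(records, convert, commit, batch_size=1000):
    """Apply convert() to each record and call commit() after every batch.

    Returns the number of records processed. If the job is killed, at most
    `batch_size` records of work are lost.
    """
    processed = 0
    for record in records:
        convert(record)
        processed += 1
        if processed % batch_size == 0:
            commit()
    if processed % batch_size != 0:
        # commit the final, partially filled batch
        commit()
    return processed


# Stand-ins for the real session and records, so the sketch runs on its own.
commits = []
records = [{"did": i, "authz": None} for i in range(2500)]


def fake_convert(record):
    record["authz"] = ["/open"]


total = migrate_in_batches(
    records, fake_convert, lambda: commits.append(1), batch_size=1000
)
```

With 2500 records and a batch size of 1000, this commits after record 1000, after record 2000, and once more for the remaining 500.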

                session.add(record)
                # record.authz is a list of IndexRecordAuthz, so log each resource
                logger.info(
                    "updated authz for {} to {}"
                    .format(record.did, [a.resource for a in record.authz])
                )
            except EnvironmentError as e:
                msg = "problem adding authz for record {}: {}".format(record.did, e)
                logger.error(msg)
    logger.info("finished migrating")
    return


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--path", default="/var/www/indexd/", help="path to find local_settings.py",
    )
    parser.add_argument(
        "--sheepdog-db", dest="sheepdog", help="URI for the sheepdog database"
    )
    parser.add_argument(
        "--arborist-url", dest="arborist", help="URL for the arborist service"
    )
    return parser.parse_args()


class ACLConverter(object):
    def __init__(self, sheepdog_db, arborist_url):
        self.arborist_url = arborist_url.rstrip("/")
        self.programs = set()
        self.projects = dict()
        # map resource paths to tags in arborist so we can save http calls
        self.arborist_resources = dict()

        engine = create_engine(sheepdog_db, echo=False)
        try:
            connection = engine.connect()
        except OperationalError:
            raise EnvironmentError(
                "couldn't connect to sheepdog db using the provided URI"
            )

        result = connection.execute("SELECT _props->>'name' as name from node_program;")
        for row in result:
            self.programs.add(row["name"])

        result = connection.execute("""
            SELECT
                project._props->>'name' AS name,
                program._props->>'name' AS program
            FROM node_project AS project
            JOIN edge_projectmemberofprogram AS edge ON edge.src_id = project.node_id
            JOIN node_program AS program ON edge.dst_id = program.node_id;
        """)
        for row in result:
            self.projects[row["name"]] = row["program"]

        connection.close()
        return

    def is_program(self, acl_item):
        return acl_item in self.programs

    def acl_to_authz(self, record):
        path = None
        for acl_object in record.acl:
            acl_item = acl_object.ace
            if acl_item == "*":
                path = "/open"
            elif self.is_program(acl_item):
                # only fall back to the program if nothing more specific is set,
                # so a project takes precedence regardless of the acl order
                if not path:
                    path = "/programs/{}".format(acl_item)
            else:
                if acl_item not in self.projects:
                    raise EnvironmentError(
                        "program or project {} does not exist".format(acl_item)
                    )
                # self.projects maps project name -> program name
                path = "/programs/{}/projects/{}".format(
                    self.projects[acl_item], acl_item
                )

        if not path:
            logger.error(
                "couldn't get `authz` for record {} from {}; setting as empty"
                .format(record.did, record.acl)
            )
            return []

        if path not in self.arborist_resources:
            url = "{}/resource/".format(self.arborist_url)
            try:
                resource = {"path": path}
                response = requests.post(url, timeout=5, json=resource)
            except requests.exceptions.Timeout:
                # `response` is unbound after a timeout, so stop here rather
                # than falling through to the status code checks below
                raise EnvironmentError(
                    "couldn't hit arborist to look up resource (timed out): {}".format(url)
                )
            failed = False
            tag = None
            try:
                if response.status_code == 409:
                    # resource is already there, so we'll just take the tag
                    tag = response.json()["tag"]
                elif response.status_code != 201:
                    logger.error(
                        "couldn't hit arborist at {} to create resource (got {}): {}".format(
                            url, response.status_code, response.json()
                        )
                    )
                    failed = True
                else:
                    # just created the resource for the first time
                    tag = response.json()["created"]["tag"]
            except (ValueError, KeyError) as e:
                raise EnvironmentError(
                    "couldn't understand response from arborist: {}".format(e)
                )

            if failed or not tag:
                raise EnvironmentError("couldn't reach arborist")
            self.arborist_resources[path] = tag

        tag = self.arborist_resources[path]
        return [IndexRecordAuthz(did=record.did, resource=tag)]


if __name__ == "__main__":
    main()
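
The program-versus-project rule from the module docstring can be sketched in isolation, without sheepdog or arborist. The `acl_to_path` function below is a hypothetical stand-in that takes the program set and the project-to-program mapping as plain arguments; the names `DEV` and `test` are made up for the example. It assumes that a project should win over its program whatever order the `acl` entries arrive in, and otherwise mirrors the loop in the script.

```python
def acl_to_path(acl_items, programs, projects):
    """Return the resource path for a list of acl entries, or None if empty.

    programs: set of program names
    projects: dict mapping project name -> program name
    """
    path = None
    for item in acl_items:
        if item == "*":
            path = "/open"
        elif item in projects:
            # a project is more specific than its program, so it overrides
            path = "/programs/{}/projects/{}".format(projects[item], item)
        elif item in programs:
            # keep the program only if nothing else has been chosen yet
            if path is None:
                path = "/programs/{}".format(item)
        else:
            raise ValueError("program or project {} does not exist".format(item))
    return path


programs = {"DEV"}
projects = {"test": "DEV"}
both = acl_to_path(["DEV", "test"], programs, projects)
reversed_order = acl_to_path(["test", "DEV"], programs, projects)
```

Both orderings resolve to the project path, which matches the docstring's point that requiring the program as well would lock out users who can access only the project.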