Cohort Building Using LOINC Terms in the MIDRC Data Commons

This notebook briefly demonstrates how to use the open MIDRC APIs to build a cohort of MIDRC imaging studies using LOINC properties derived from MIDRC's LOINC harmonization process, which maps imaging studies to the LOINC Playbook.

All cohort selection possible in the MIDRC data explorer UI can also be achieved programmatically using API requests. In this notebook, we'll select a small cohort of imaging studies based on LOINC properties.

by Chris Meyer, PhD, Director of Data Services and Scientific Support at the Center for Translational Data Science at the University of Chicago

Presented at the 2024 LOINC Conference on September 20, 2024.
1) Set up Python environment

Download an API key file containing your credentials

- Navigate to the MIDRC data portal in your browser: https://data.midrc.org
- Read and accept the DUA (if you haven't already).
- Navigate to the user profile page: https://data.midrc.org/identity
- Click the "Create API Key" button and save the credentials.json file somewhere safe.
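After saving the key file, a quick sanity check can catch a truncated or mis-saved download before you try to authenticate. This is an optional sketch (the helper name is ours, not part of the Gen3 SDK); a Gen3 API key file is a small JSON document that contains an api_key and a key_id field.

```python
import json

def check_credentials(path):
    """Parse a Gen3 API key file and verify the fields Gen3Auth expects."""
    with open(path) as f:
        creds = json.load(f)
    missing = {"api_key", "key_id"} - creds.keys()
    if missing:
        raise ValueError("credentials file is missing fields: {}".format(missing))
    return creds

# e.g. check_credentials("/Users/christopher/Downloads/midrc-credentials.json")
```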
Set local variables

Change the following cred variable path to point to the credentials file you downloaded from the MIDRC data portal following the instructions above.

cred = "/Users/christopher/Downloads/midrc-credentials.json" # location of your MIDRC credentials, downloaded from https://data.midrc.org/identity by clicking the "Create API Key" button and saving credentials.json locally
api = "https://data.midrc.org" # the base URL of the data commons being queried; this shouldn't change for MIDRC
Install / Import Python Packages and Scripts

## Uncomment and run any of the installation commands below if the imports in the subsequent cells fail in your environment.

import sys
#!{sys.executable} -m pip install --upgrade pandas
#!{sys.executable} -m pip install --upgrade --ignore-installed PyYAML
#!{sys.executable} -m pip install --upgrade pip
#!{sys.executable} -m pip install --upgrade gen3
#!{sys.executable} -m pip install pydicom
#!{sys.executable} -m pip install --upgrade Pillow
#!{sys.executable} -m pip install psmpy
#!{sys.executable} -m pip install python-gdcm --upgrade
#!{sys.executable} -m pip install pylibjpeg --upgrade
## Import Python packages and scripts

import os, subprocess
import pandas as pd
import numpy as np
import pydicom
from PIL import Image
import glob
#import gdcm
#import pylibjpeg

# import some Gen3 packages
import gen3
from gen3.auth import Gen3Auth
from gen3.query import Gen3Query
Initiate instances of the Gen3 SDK classes using the credentials file for authentication

Again, make sure the cred variable reflects the path to your credentials file (set above).

auth = Gen3Auth(api, refresh_file=cred) # authentication class
query = Gen3Query(auth) # query class
2) Build Cohorts by Sending Queries to the MIDRC Search APIs

General notes on sending queries:

- There are many ways to query and access metadata for cohort building in MIDRC, but this notebook focuses on the Gen3 graphQL query service "guppy". This is the backend query service that MIDRC's data explorer GUI uses, so anything you can do in the explorer GUI you can also do with guppy queries, and more!
- The guppy graphQL service has more functionality than is demonstrated in this simple example. You can find extensive documentation in GitHub here in case you'd like to build your own queries from scratch.
- The Gen3 SDK (initialized as query above in this notebook) has Python wrapper scripts to make sending queries to the guppy graphQL API simpler. The guppy SDK package can be viewed in GitHub here.
- Guppy queries focus on a particular type of data (cases, imaging studies, files, etc.), which corresponds to the major tabs in MIDRC's data explorer GUI.
- Queries include arguments that are akin to selecting filter values in MIDRC's data explorer GUI.
- To see more documentation about how to use and combine filters with various operator logic (like AND/OR/IN, etc.), see this page.
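To make the query cells below easier to read, here is the shape of a guppy filter_object on its own (the field names are purely for illustration). Each leaf clause is an {operator: {field: value}} mapping, and "AND" or "OR" keys combine lists of such clauses:

```python
# Sketch of the guppy filter grammar used in the queries below.
# Leaf clauses pair an operator with a {field: value} mapping;
# "AND" (or "OR") combines a list of such clauses.
filter_object = {
    "AND": [
        {"=": {"loinc_method": "XR"}},         # exact match
        {"IN": {"race": ["Asian", "White"]}},  # value is in a list
        {">=": {"age_at_index": 70}},          # numeric comparison
    ]
}
```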
Set query parameters

- Here, we'll send a query to the imaging_study guppy index, which corresponds to the "Imaging Studies" tab of MIDRC's data explorer GUI.
- The filters defined below can be modified to return different subsets of imaging studies. Here, we'll use a combination of LOINC method (Modality), system (body part), and long common name (description) to narrow our selected imaging studies and to show the diversity of study descriptions for a single LOINC code.
- If our query request is successful, the API response should be in JSON format, and it should contain a list of imaging study records along with any other data we ask for.
### Set some "imaging_study" query parameters to select chest X-ray (CXR) imaging studies in MIDRC

## Here we select imaging studies with a LOINC System of "Chest", which is the harmonized BodyPartExamined
loinc_system = "Chest"

## Here we select imaging studies with a LOINC Method of "XR", which is the harmonized Modality
#loinc_method = "CT" # an alternative filter value, left here for experimentation
loinc_method = "XR"

## Here we select imaging studies with a specific LOINC Long Common Name, which is the harmonized StudyDescription
#loinc_long_common_name = "CT Chest W contrast IV" # an alternative filter value, left here for experimentation
loinc_long_common_name = "XR Chest Single view"
## Note: the "fields" option defines what fields we want the query to return. If set to "None", returns all available fields.

imaging_studies = query.raw_data_download(
    data_type="imaging_study",
    fields=None,
    filter_object={
        "AND": [
            {"=": {"loinc_method": loinc_method}},
            {"=": {"loinc_system": loinc_system}},
            {"=": {"loinc_long_common_name": loinc_long_common_name}},
        ]
    },
    sort_fields=[{"submitter_id": "asc"}]
)

if len(imaging_studies) > 0 and "submitter_id" in imaging_studies[0]:
    imaging_studies_ids = [i['submitter_id'] for i in imaging_studies] ## make a list of the imaging study IDs returned
    case_count = len(set([i['case_ids'][0] for i in imaging_studies]))
    print("Query returned {} imaging studies for {} cases.".format(len(imaging_studies), case_count))
    print("Data is a list with rows like this:\n\t {}".format(imaging_studies[0:1]))
else:
    print("Your query returned no data! Please check that the query parameters are valid.")

imaging_studies_df = pd.DataFrame(imaging_studies)
display(imaging_studies_df)

## Look at the diversity of original DICOM imaging study descriptions
print("For these LOINC Long Common Names: {}\nthere are {} unique study descriptions:".format(
    list(set(imaging_studies_df['loinc_long_common_name'])),
    len(set(imaging_studies_df['study_description']))))
list(set(imaging_studies_df['study_description']))
Add some patient demographics to our query in order to narrow down the selection

## LOINC terms
loinc_system = "Chest"
loinc_method = "XR"
loinc_long_common_name = "XR Chest Single view"

## Case filters: we will select Hispanic or Latino males 70 years of age and older whose race is Asian or Black or African American
ethnicity = "Hispanic or Latino"
race = ["Asian","Black or African American"]
sex = "Male"
age_threshold = 70
## Note: the "fields" option defines what fields we want the query to return. If set to "None", returns all available fields.

imaging_studies = query.raw_data_download(
    data_type="imaging_study",
    fields=None,
    filter_object={
        "AND": [
            {"=": {"loinc_method": loinc_method}},
            {"=": {"loinc_system": loinc_system}},
            {"=": {"loinc_long_common_name": loinc_long_common_name}},
            {"=": {"sex": sex}},
            {"=": {"ethnicity": ethnicity}},
            {"IN": {"race": race}},
            {">=": {"age_at_index": age_threshold}},
        ]
    },
    sort_fields=[{"submitter_id": "asc"}]
)

if len(imaging_studies) > 0 and "submitter_id" in imaging_studies[0]:
    imaging_studies_ids = [i['submitter_id'] for i in imaging_studies] ## make a list of the imaging study IDs returned
    case_count = len(set([i['case_ids'][0] for i in imaging_studies]))
    print("Query returned {} imaging studies for {} cases.".format(len(imaging_studies), case_count))
    print("Data is a list with rows like this:\n\t {}".format(imaging_studies[0:1]))
else:
    print("Your query returned no data! Please check that the query parameters are valid.")

imaging_studies_df = pd.DataFrame(imaging_studies)
display(imaging_studies_df)
print("For these LOINC Long Common Names: {}\nthere are {} unique study descriptions: {}".format(
    list(set(imaging_studies_df['loinc_long_common_name'])),
    len(set(imaging_studies_df['study_description'])),
    list(set(imaging_studies_df['study_description']))))
3) Send another query to get data file details for our cohort

The object_id field in each imaging study record above contains the file identifiers for all files associated with that imaging study, which could include files like third-party annotations. If we simply want to access all files associated with our list of cases, we can use those object_ids.

However, in this example, we'll ask for specific types of files and get more detailed information about each of the files. This is achieved by querying the data_file guppy index, which corresponds to the "Data Files" tab of the MIDRC data explorer GUI.

All MIDRC data files, including both images and annotations, are listed in the guppy index "data_file", which is queried in a similar manner to our query of the imaging_study index above. The query parameter data_type below determines which guppy (Elasticsearch) index we're querying.

To get only data_file records that correspond to the imaging study cohort built previously, we'll use the list of study UIDs as a query filter.
Set 'data_file' query parameters

Here, we'll use the property source_node to filter the list of files for our cohort to only those matching the types of files we're interested in. In this example, we ask only for CR and DX (X-ray) images, which will exclude any other types of files, like annotations.

We're also using the property study_uid as a filter to restrict the data_file records returned to those associated with the imaging studies in our cohort built above.
# Build a list of study UIDs to use as a filter in our data_file query
study_uids = [i['study_uid'] for i in imaging_studies]
study_uids

# Choose the types of data we want using "source_node" as a filter
source_nodes = ["cr_series_file","dx_series_file"]

## Search for specific files associated with our cohort by adding "study_uid" as a filter
# * Note: "fields" is set to "None" in this query, which by default returns all the properties available
data_files = query.raw_data_download(
    data_type="data_file",
    fields=None,
    filter_object={
        "AND": [
            {"IN": {"study_uid": study_uids}},
            {"IN": {"source_node": source_nodes}},
        ]
    },
    sort_fields=[{"submitter_id": "asc"}]
)

if len(data_files) > 0:
    object_ids = [i['object_id'] for i in data_files if 'object_id' in i] ## make a list of the file object_ids returned by our query
    print("Query returned {} data files with {} object_ids.".format(len(data_files), len(object_ids)))
    print("Data is a list with rows like this:\n\t {}".format(data_files[0:1]))
else:
    print("Your query returned no data! Please check that the query parameters are valid.")

# object_id (AKA "data GUID") is a globally unique file identifier that points to an actual file object in cloud storage. We'll use the object_ids along with the gen3 command-line tool to download the files these object_ids point to.
object_ids
4) Access data files using their object_id (data GUID)

In order to download files stored in MIDRC, users need to reference the file's object_id (AKA data GUID, for Globally Unique IDentifier).

Once we have a list of GUIDs we want to download, we can use either the gen3-client or the gen3 SDK to download the files. You can also access individual files in your browser after logging in by entering the GUID after the files/ endpoint, as in the URL https://data.midrc.org/files/GUID, where GUID is the actual GUID, e.g.: https://data.midrc.org/files/dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec

For instructions on how to install and use the gen3-client, please see the MIDRC quick-start guide, which is linked here and in the MIDRC data portal header as "Get Started".

Below we use the gen3 SDK command gen3 drs-pull object, which is documented in detail here.
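As a side note, the browser-access pattern described above can also be scripted. This small sketch (using the api variable set earlier and the example GUID from the text) simply builds the files/ URL:

```python
# Build a browser download link for a file by appending its object_id
# (data GUID) to the /files/ endpoint, as described above.
api = "https://data.midrc.org"
object_id = "dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec"  # example GUID from the text
file_url = "{}/files/{}".format(api, object_id)
print(file_url)  # https://data.midrc.org/files/dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec
```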
Use the Gen3 SDK command gen3 drs-pull object to download an individual file
## Make a new directory for downloaded files
os.system("rm -rf downloads")
os.system("mkdir -p downloads")

## We can use a simple loop to download all files and keep track of successes and failures

success, failure, other = [], [], []
count, total = 0, len(object_ids)
for object_id in object_ids:
    count += 1
    cmd = "gen3 --auth {} --endpoint data.midrc.org drs-pull object {} --output-dir downloads".format(cred, object_id)
    stout = subprocess.run(cmd, shell=True, capture_output=True)
    print("Progress ({}/{}): {}".format(count, total, stout.stdout))
    if "failed" in str(stout.stdout):
        failure.append(object_id)
    elif "successfully" in str(stout.stdout):
        success.append(object_id)
    else:
        other.append(object_id)

# Get a list of all downloaded .dcm files
image_files = glob.glob(pathname='**/*.dcm', recursive=True)
image_files
View the DICOM images

Here we'll use the Python package pydicom to view the downloaded DICOM images.

Note that some of the files may contain compressed pixel data that require other packages to view; for this demo, we'll simply skip over those using the try/except in the following loop.
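If you'd rather detect the compressed files up front instead of catching the exception, the transfer syntax tells you how the pixel data is encoded. This is a stdlib-only sketch that lists the common uncompressed transfer syntaxes; when reading a real file, you would compare against ds.file_meta.TransferSyntaxUID:

```python
# DICOM stores its encoding in the Transfer Syntax UID; the uncompressed
# syntaxes form a small fixed set, so anything else needs an extra decoder
# (e.g. pylibjpeg or python-gdcm). Abridged sketch for illustration.
UNCOMPRESSED_SYNTAXES = {
    "1.2.840.10008.1.2",    # Implicit VR Little Endian
    "1.2.840.10008.1.2.1",  # Explicit VR Little Endian
    "1.2.840.10008.1.2.2",  # Explicit VR Big Endian
}

def pixel_data_is_compressed(transfer_syntax_uid):
    """Return True if the transfer syntax is not a plain uncompressed one."""
    return str(transfer_syntax_uid) not in UNCOMPRESSED_SYNTAXES
```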
for image_file in image_files:
    print(image_file)
    ds = pydicom.dcmread(image_file)
    try:
        new_image = ds.pixel_array.astype(float)
        scaled_image = (np.maximum(new_image, 0) / new_image.max()) * 255.0 # scale pixel values to 0-255
        scaled_image = np.uint8(scaled_image)
        final_image = Image.fromarray(scaled_image)
        print(type(final_image))
        display(final_image)
    except Exception as e:
        print("Couldn't view {}: {}.".format(image_file, e))
View the DICOM headers

DICOM files have metadata elements embedded in the images. These can also be read and viewed using the pydicom package.

ds = pydicom.dcmread(image_files[0], force=True)
display(ds)

# Access individual elements
display(ds.file_meta)
display(ds.ImageType)
display(ds[0x0008, 0x0016]) # SOP Class UID
# View the DICOM metadata for all files as a DataFrame
dfs = []
for image_file in image_files:
    ds = pydicom.dcmread(image_file)
    df = pd.DataFrame(ds.values())
    df[0] = df[0].apply(lambda x: pydicom.dataelem.DataElement_from_raw(x) if isinstance(x, pydicom.dataelem.RawDataElement) else x)
    df['name'] = df[0].apply(lambda x: x.name)
    df['value'] = df[0].apply(lambda x: x.value)
    df = df[['name', 'value']]
    df = df.set_index('name').T.reset_index(drop=True)
    df['filename'] = image_file
    df.drop(columns=['Pixel Data'], inplace=True) # drop the pixel data, as it's too large and nonsensical to store in a DataFrame
    dfs.append(df)

# Make a master DataFrame for all images using only the headers present in all DataFrames
headers = list(set.intersection(*map(set, dfs)))
df = pd.concat([df[headers] for df in dfs])
df.set_index('filename', inplace=True)

display(df)

## Export the file metadata as a TSV file
filename = "MIDRC_DICOM_metadata.tsv"
df.to_csv(filename, sep='\t')
The End

If you have any questions related to this notebook, don't hesitate to reach out to the MIDRC Helpdesk at midrc-support@gen3.org or to the author directly at cgmeyer@uchicago.edu.

Happy data wrangling!