Commit
Merge pull request #129 from monarch-initiative/develop
Develop
pnrobinson authored Aug 10, 2024
2 parents 212daa0 + 2de632b commit 88b936d
Showing 28 changed files with 890 additions and 453 deletions.
3 changes: 0 additions & 3 deletions docs/api/creation/iso_age.md

This file was deleted.

Binary file added docs/img/kmf_esrd.png
Binary file added docs/img/kmf_plot_vstatus.png
3 changes: 3 additions & 0 deletions docs/visualization/index.md
@@ -0,0 +1,3 @@
# Visualization

pyphetools offers several options to visualize cohorts.
72 changes: 72 additions & 0 deletions docs/visualization/kaplan_meier_visualizer.md
@@ -0,0 +1,72 @@
# Kaplan Meier Visualization

A brief introduction to Kaplan-Meier analysis is available [here](https://pubmed.ncbi.nlm.nih.gov/20723767/). The following text is excerpted from that article.

Time-to-event is a clinical course duration variable for each subject having a beginning and an end anywhere along the time line of the complete study. For example, it may begin when the subject is enrolled into a study or when treatment begins, and ends when the end-point (event of interest) is reached or the subject is censored from the study. In preparing Kaplan-Meier survival analysis, each subject is characterized by three variables: 1) their serial time, 2) their status at the end of their serial time (event occurrence or censored), and 3) the study group they are in. For the construction of survival time probabilities and curves, the serial times for individual subjects are arranged from the shortest to the longest, without regard to when they entered the study. By this maneuver, all subjects within the group begin the analysis at the same point and all are surviving until something happens to one of them. The two things that can happen are: 1) a subject can have the event of interest or 2) they are censored.




| SUBJECT | SERIAL TIME (years) | STATUS AT SERIAL TIME (1=event; 0=censored) | Group (1 or 2) |
|---------|---------------------|---------------------------------------------|----------------|
| B | 1 | 1 | 1 |
| E | 2 | 1 | 1 |
| F | 3 | 1 | 1 |
| A | 4 | 1 | 1 |
| D | 4.5 | 1 | 1 |
| C | 5 | 0 | 1 |
| U | 0.5 | 1 | 2 |
| Z | 0.75 | 1 | 2 |
| W | 1 | 1 | 2 |



Censoring means the total survival time for that subject cannot be accurately determined. This can happen when something negative for the study occurs, such as the subject drops out, is lost to follow-up, or required data is not available or, conversely, something good happens, such as the study ends before the subject had the event of interest occur, i.e., they survived at least until the end of the study, but there is no knowledge of what happened thereafter. Thus censoring can occur within the study or terminally at the end.
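
To make the table concrete, the following is a minimal sketch of the product-limit (Kaplan-Meier) calculation for group 1, written in plain Python. It is an illustration only and is not how pyphetools or lifelines computes the curve; the function name `kaplan_meier` is hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] for each distinct time at which an event occurred.

    times  -- serial time of each subject
    events -- 1 if the subject had the event, 0 if the subject was censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # subjects leaving the risk set at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = event_counts.get(t, 0)
        if d > 0:
            # multiply by the fraction of at-risk subjects who survive time t
            survival *= (at_risk - d) / at_risk
            curve.append((t, survival))
        # events and censored subjects both leave the risk set
        at_risk -= leaving[t]
    return curve


# Group 1 from the table above (subjects B, E, F, A, D, C)
group1_times = [1, 2, 3, 4, 4.5, 5]
group1_events = [1, 1, 1, 1, 1, 0]

for t, s in kaplan_meier(group1_times, group1_events):
    print(t, f"{s:.3f}")
# 1 0.833
# 2 0.667
# 3 0.500
# 4 0.333
# 4.5 0.167
```

Subject C is censored at 5 years, so the curve does not drop at that time; it stays at 1/6 rather than reaching zero.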

Currently, pyphetools plots a single survival curve for the entire cohort. There are two options. In the first, the event is the onset of a chosen HPO term, and the time to event is the age of onset of that term. The corresponding Python code is:

```python
from pyphetools.visualization import KaplanMeierVisualizer, PhenopacketIngestor, SimplePatient
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt  # only needed to save the figure to a file

phenopackets_dir = "../phenopackets/"  # directory containing the phenopackets to plot
ingestor = PhenopacketIngestor(indir=phenopackets_dir)
ppkt_list = ingestor.get_phenopacket_list()
simple_pt_list = [SimplePatient(ppkt) for ppkt in ppkt_list]
hpo_id = "HP:0003774"  # TermId of the HPO term for the KM plot (Stage 5 chronic kidney disease)
kmv = KaplanMeierVisualizer(simple_patient_list=simple_pt_list, target_tid=hpo_id)
T, E = kmv.get_time_and_event()  # times and event indicators passed to lifelines
# plot the Kaplan-Meier curve
kmf = KaplanMeierFitter()
kmf.fit(T, E, label="Age at stage 5 kidney disease")
ax = kmf.plot_survival_function()
ax.set_xlabel("Years")
plt.savefig("kmf_plot.png", format="png")  # optional
```


<figure markdown>
![Validation results](../img/kmf_esrd.png){ width="1000" }
<figcaption>Kaplan Meier Survival Plot of a cohort of individuals with pathogenic variants in the UMOD gene with respect to age of onset of stage 5 kidney failure.
</figcaption>
</figure>

It is also possible to plot a survival curve that uses the VitalStatus message of the phenopackets. The code is the same as above, except that the `target_tid` argument is omitted.
```python
# same as above
kmv = KaplanMeierVisualizer(simple_patient_list=simple_pt_list)
# same as above except that we change the title of the plot
kmf.fit(T, E, label="Survival")
```

<figure markdown>
![Validation results](../img/kmf_plot_vstatus.png){ width="1000" }
<figcaption>Kaplan Meier Survival Plot of a cohort of individuals with pathogenic variants in the UMOD gene.
</figcaption>
</figure>
4 changes: 3 additions & 1 deletion mkdocs.yml
@@ -52,6 +52,9 @@ nav:
- Cohort encoder: 'tabular/cohort_encoder.md'
- Validation: 'tabular/validation.md'
- Visualization: 'tabular/visualization.md'
- Visualization:
- Overview: 'visualization/index.md'
- Kaplan Meier: 'visualization/kaplan_meier_visualizer.md'
- Developers:
- For developers: 'developers/developers.md'
- Installation: 'developers/installation.md'
@@ -75,7 +78,6 @@ nav:
- HpoParser: "api/creation/hpo_parser.md"
- HpTerm: "api/creation/hp_term.md"
- Individual: "api/creation/individual.md"
- IsoAge: "api/creation/iso_age.md"
- MetaData: "api/creation/metadata.md"
- OptionColumnMapper: "api/creation/option_column_mapper.md"
- SexColumnMapper: "api/creation/sex_column_mapper.md"
2 changes: 1 addition & 1 deletion src/pyphetools/__init__.py
@@ -5,7 +5,7 @@
from . import validation


__version__ = "0.9.97"
__version__ = "0.9.98"


__all__ = [
4 changes: 2 additions & 2 deletions src/pyphetools/creation/__init__.py
@@ -23,7 +23,7 @@
from .metadata import MetaData
from .mode_of_inheritance import Moi
from .option_column_mapper import OptionColumnMapper
from .pyphetools_age import PyPheToolsAge, IsoAge, HpoAge, GestationalAge, HPO_ONSET_TERMS
from .pyphetools_age import PyPheToolsAge, AgeSorter, HPO_ONSET_TERMS
from .sex_column_mapper import SexColumnMapper
from .simple_column_mapper import SimpleColumnMapper
from .scm_generator import SimpleColumnMapperGenerator
@@ -62,7 +62,7 @@
"Individual",
"MetaData",
"OptionColumnMapper",
"PyPheToolsAge", "IsoAge", "HpoAge", "GestationalAge", "HPO_ONSET_TERMS",
"PyPheToolsAge", "AgeSorter", "HPO_ONSET_TERMS",
"SexColumnMapper",
"SimpleColumnMapper",
"SimpleColumnMapperGenerator",
70 changes: 33 additions & 37 deletions src/pyphetools/creation/age_column_mapper.py
@@ -1,13 +1,13 @@
import re
from collections import defaultdict
from enum import Enum
import abc
import math
import pandas as pd

import typing
from .age_isoformater import AgeIsoFormater
from .pyphetools_age import HPO_ONSET_TERMS, PyPheToolsAge, IsoAge, NoneAge, GestationalAge, HpoAge
from .pyphetools_age import HPO_ONSET_TERMS, PyPheToolsAge
from .constants import Constants
from ..pp.v202 import TimeElement as TimeElement202

ISO8601_REGEX = r"^P(\d+Y)?(\d+M)?(\d+D)?"
# e.g., 14 y 8 m or 8 y
@@ -32,8 +32,6 @@ class AgeColumnMapper(metaclass=abc.ABCMeta):

def __init__(self, column_name:str, string_to_iso_d=None) -> None:
"""
:param ageEncodingType: Formatting convention used to represent the age
:type ageEncodingType: one of Year (e.g. 42), ISO 8601 (e.g. P42Y2M), year/month (e.g. 42y2m)
:param column_name: Name of the Age column in the original table
:type column_name: str
:param string_to_iso_d: dictionary from free text (input table) to ISO8601 strings
@@ -140,14 +138,14 @@ class Iso8601AgeColumnMapper(AgeColumnMapper):
def __init__(self, column_name) -> None:
super().__init__(column_name=column_name)

def map_cell(self, cell_contents) -> PyPheToolsAge:
def map_cell(self, cell_contents) -> typing.Optional[TimeElement202]:
contents = self._clean_contents(cell_contents=cell_contents)
match = re.search(ISO8601_REGEX, contents)
if match:
return IsoAge.from_iso8601(contents)
return PyPheToolsAge.get_age_pp201(age_string=contents)
else:
self._erroneous_input_counter[contents] += 1
return NoneAge(contents)
return None


class YearMonthAgeColumnMapper(AgeColumnMapper):
@@ -157,28 +155,28 @@ class YearMonthAgeColumnMapper(AgeColumnMapper):
def __init__(self, column_name) -> None:
super().__init__(column_name=column_name)

def map_cell(self, cell_contents) -> PyPheToolsAge:
def map_cell(self, cell_contents) -> typing.Optional[TimeElement202]:
contents = self._clean_contents(cell_contents=cell_contents)
try:
match = re.search(YEAR_AND_MONTH_REGEX, contents)
if match:
years = int(match.group(1))
months = int(match.group(2))
age_string = f"P{years}Y{months}M"
return IsoAge(y=years, m=months, age_string=age_string)
return PyPheToolsAge.get_age_pp201(age_string=age_string)
match = re.search(YEAR_REGEX, contents)
if match:
years = int(match.group(1))
age_string = f"P{years}Y"
return IsoAge(y=years, age_string=age_string)
return PyPheToolsAge.get_age_pp201(age_string=age_string)
match = re.search(MONTH_REGEX, contents)
if match:
months = int(match.group(1))
age_string = f"P{months}M"
return IsoAge(m=months, age_string=age_string)
return PyPheToolsAge.get_age_pp201(age_string=age_string)
except ValueError as verr:
print(f"Could not parse {cell_contents} as year/month: {verr}")
return NoneAge(contents)
return None

class MonthAgeColumnMapper(AgeColumnMapper):
"""Mapper for entries such as P1Y2M (ISO 8601 period to represent age)
@@ -187,15 +185,15 @@ class MonthAgeColumnMapper(AgeColumnMapper):
def __init__(self, column_name) -> None:
super().__init__(column_name=column_name)

def map_cell(self, cell_contents) -> PyPheToolsAge:
def map_cell(self, cell_contents) -> typing.Optional[TimeElement202]:
# assume month encoded by integer or float.
contents = self._clean_contents(cell_contents=cell_contents)
month = str(contents)
if month.isdigit():
full_months = int(month)
days = 0
age_string = AgeIsoFormater.from_numerical_month(full_months)
return IsoAge(m=full_months, age_string=age_string)
return PyPheToolsAge.get_age_pp201(age_string=age_string)
elif month.replace('.', '', 1).isdigit() and month.count('.') < 2:
# a float such as 0.9 (months)
months = float(month)
@@ -205,15 +203,15 @@ def map_cell(self, cell_contents) -> PyPheToolsAge:
days = int(months * avg_num_days_in_month)
full_months = 0
age_string = f"P{days}D"
return IsoAge(d=days, age_string=age_string)
return PyPheToolsAge.get_age_pp201(age_string=age_string)
else:
remainder = months - floor_months
full_months = int(months - remainder)
days = int(remainder * avg_num_days_in_month)
age_string = f"P{full_months}M{days}D"
return IsoAge(m=full_months, d=days, age_string=age_string)
return PyPheToolsAge.get_age_pp201(age_string=age_string)
else:
return NoneAge("na")
return None



@@ -222,33 +220,34 @@ class YearAgeColumnMapper(AgeColumnMapper):
def __init__(self, column_name) -> None:
super().__init__(column_name=column_name)

def map_cell(self, cell_contents) -> PyPheToolsAge:
def map_cell(self, cell_contents) -> typing.Optional[TimeElement202]:
"""
Extract an iso8601 string for age recorded as a year (either an int such as 4 or a float such as 4.25 for P4Y3M)
:param age: an int representing years or a float such as 2.5 for two and a half years
:return: an ISO 8601 string such as P2Y6M
"""
if isinstance(cell_contents, int):
return IsoAge(y=cell_contents, age_string=contents)
age_str = f"P{cell_contents}Y"
return PyPheToolsAge.get_age_pp201(age_string=age_str)
elif isinstance(cell_contents, float):
age = str(age)
age = str(cell_contents)
elif not isinstance(cell_contents, str):
raise ValueError(f"Malformed agestring {age}, type={type(age)}")
raise ValueError(f"Malformed agestring {cell_contents}, type={type(cell_contents)}")
contents = self._clean_contents(cell_contents=cell_contents)
int_or_float = r"(\d+)(\.\d+)?"
p = re.compile(int_or_float)
results = p.search(contents).groups()
if len(results) != 2:
return NoneAge(contents)
return None
if results[0] is None:
return NoneAge(contents)
return None
y = int(results[0])
if results[1] is None:
return IsoAge(y=y, age_string=f"P{y}Y")
return PyPheToolsAge.get_age_pp201(age_string=f"P{y}Y")
else:
m = float(results[1]) # something like .25
months = round(12 * m)
return IsoAge(y=y, m=months, age_string=f"P{y}Y{months}M")
return PyPheToolsAge.get_age_pp201(age_string=f"P{y}Y{months}M")


class CustomAgeColumnMapper(AgeColumnMapper):
@@ -260,12 +259,12 @@ def __init__(self, column_name:str, string_to_iso_d) -> None:
super().__init__(column_name=column_name)
self._string_to_iso_d = string_to_iso_d

def map_cell(self, cell_contents) -> PyPheToolsAge:
def map_cell(self, cell_contents) -> typing.Optional[TimeElement202]:
if cell_contents not in self._string_to_iso_d:
print(f"[WARNING] Could not find \"{cell_contents}\" in custom dictionary")
return NoneAge(cell_contents)
return None
iso8601 = self._string_to_iso_d.get(cell_contents, Constants.NOT_PROVIDED)
return IsoAge.from_iso8601(iso8601)
return PyPheToolsAge.get_age_pp201(age_string=iso8601)

class NotProvidedAgeColumnMapper(AgeColumnMapper):
"""Mapper if there is no information
@@ -274,11 +273,8 @@ class NotProvidedAgeColumnMapper(AgeColumnMapper):
def __init__(self, column_name:str) -> None:
super().__init__(column_name=column_name)

def map_cell(self, cell_contents) -> str:
if cell_contents is None or math.isnan(cell_contents):
cell_contents = "na"
contents = self._clean_contents(cell_contents=cell_contents)
return NoneAge(age_string=contents)
def map_cell(self, cell_contents) -> typing.Optional[TimeElement202]:
return None


class HpoAgeColumnMapper(AgeColumnMapper):
@@ -290,10 +286,10 @@ class HpoAgeColumnMapper(AgeColumnMapper):
def __init__(self, column_name:str) -> None:
super().__init__(column_name=column_name)

def map_cell(self, cell_contents) -> PyPheToolsAge:
def map_cell(self, cell_contents) -> typing.Optional[TimeElement202]:
contents = self._clean_contents(cell_contents=cell_contents)
if contents in HPO_ONSET_TERMS:
return HpoAge(hpo_onset_label=contents)
return PyPheToolsAge.get_age_pp201(age_string=contents)
else:
self._erroneous_input_counter[contents] += 1
return NoneAge(cell_contents)
return None
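
The year-to-ISO 8601 conversion performed by `YearAgeColumnMapper.map_cell` in the hunk above can be sketched in isolation as follows. This is a standalone illustration under stated assumptions, not the pyphetools implementation: the function name `decimal_years_to_iso8601` is hypothetical, it returns a plain string instead of a `TimeElement`, and the handling of an unparsable cell and of a fraction that rounds up to 12 months are additions of this sketch that the diff does not show.

```python
import re
from typing import Optional


def decimal_years_to_iso8601(value) -> Optional[str]:
    """Convert an age in years (e.g. 4, 4.25, "2.5") to an ISO 8601 period.

    Returns None if no number can be found in the input.
    """
    match = re.search(r"(\d+)(\.\d+)?", str(value))
    if match is None:
        # assumption of this sketch: an unparsable cell maps to None
        return None
    years = int(match.group(1))
    fraction = match.group(2)  # something like ".25", or None
    if fraction is None:
        return f"P{years}Y"
    months = round(12 * float(fraction))
    if months == 12:
        # assumption of this sketch: carry a full year instead of emitting 12M
        years, months = years + 1, 0
    if months == 0:
        return f"P{years}Y"
    return f"P{years}Y{months}M"


print(decimal_years_to_iso8601(4.25))   # P4Y3M
print(decimal_years_to_iso8601("2.5"))  # P2Y6M
print(decimal_years_to_iso8601(4))      # P4Y
```

Checking the result of `re.search` before reading its groups means a cell such as `"na"` returns `None` instead of raising an `AttributeError`.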
14 changes: 7 additions & 7 deletions src/pyphetools/creation/age_of_death_mapper.py
@@ -4,9 +4,9 @@
from .age_isoformater import AgeIsoFormater
from .constants import Constants

from pyphetools.pp.v202 import VitalStatus as pptVitalStatus
from pyphetools.pp.v202 import TimeElement as pptTimeElement
from pyphetools.pp.v202 import Age as pptAge
from pyphetools.pp.v202 import VitalStatus as VitalStatus202
from pyphetools.pp.v202 import TimeElement as TimeElement202
from pyphetools.pp.v202 import Age as Age202


class AgeOfDeathColumnMapper:
@@ -26,7 +26,7 @@ def __init__(self, column_name, string_to_iso_d=None) -> None:
self._column_name = column_name
self._string_to_iso_d = string_to_iso_d

def map_cell_to_vital_status(self, cell_contents) -> Optional[pptVitalStatus]:
def map_cell_to_vital_status(self, cell_contents) -> Optional[VitalStatus202]:

"""
Map a single cell of the table
@@ -39,9 +39,9 @@ def map_cell_to_vital_status(self, cell_contents) -> Optional[pptVitalStatus]:
if contents not in self._string_to_iso_d:
return None
# Wrap the Age (iso8601) in a TimeElement.
iso_age = pptAge(self._string_to_iso_d.get(contents))
telem = pptTimeElement(iso_age)
vstatus = pptVitalStatus(status=pptVitalStatus.Status.DECEASED, time_of_death=telem)
iso_age = Age202(self._string_to_iso_d.get(contents))
telem = TimeElement202(iso_age)
vstatus = VitalStatus202(status=VitalStatus202.Status.DECEASED, time_of_death=telem)
return vstatus

@property