-
Notifications
You must be signed in to change notification settings - Fork 83
Ependymoma subtyping #490
Ependymoma subtyping #490
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tkoganti , thank you for this contribution!
I think you have added all the data in the ticket, so the output file looks like it is in pretty good shape (pending updated data for some things)
My suggestions are mostly style and efficiency, with an eye toward making it easier for somebody coming in to easily find and change any sections that might need updating later with minimal effort.
My first big suggestion is that you move the file names out of the individual scripts, and into the master shell script that runs them: run-molecular-subtyping-EPN.sh
, specifying the files needed for each script and the output files via command line options. The reason for this is that some of these files are likely to move or be renamed as new releases come in, so having all of the file names in one place can be very helpful when debugging.
If you are familiar with argparse
, that is generally the most standard way to add command line options to a python script. One example of its use is here: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/copy_number_consensus_call/scripts/merged_to_individual_files.py
My other suggestions are mostly about reducing complexity by taking advantage of some built-in functions to reduce the number of lines of code. Some of these are pretty simple, like using string join
methods, and others are more complex, like the pandas "merge" method.
I am happy to help out with any changes you want to implement but might not have experience with. Just let me know!
|
||
EP = pbta_histologies[pbta_histologies["disease_type_new"]=="Ependymoma"] | ||
EP_rnaseq_samples = EP[EP["experimental_strategy"] == "RNA-Seq"][["Kids_First_Biospecimen_ID", "primary_site"]] | ||
EP_rnaseq_samples["disease_group"] = ["infratentorial" if "Posterior Fossa" in primary else "infratentorial" if "Optic" in primary else "supratentorial" if "Frontal Lobe" in primary else "supratentorial" if "Parietal Lobe" in primary else "infratentorial" if "Spinal Cord" in primary else "supratentorial" if "Occipital Lobe" in primary else "infratentorial" if "Tectum" in primary else "infratentorial" if "Spine" in primary else "supratentorial" if "Temporal Lobe" in primary else "infratentorial" if "Spinal" in primary else "None" for primary in EP_rnaseq_samples["primary_site"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is very long, making it hard to parse what is going on.
I would suggest that you break the logic of this line out into a separate function, and then calling that in the list comprehension. If you were to name the function group_disease()
you could reduce this line to:
EP_rnaseq_samples["disease_group"] = [group_disease(primary) for primary in EP_rnaseq_samples["primary_site"]]
Which would make it much more clear what this step is doing.
You can also make use of or
statements to make things a bit more discoverable within the function. Something like:
if("Posterior Fossa" in primary or "Optic" in primary):
return "intratentorial"
elif("Frontal Lobe" in primary ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an R script replacing this now
import rpy2.robjects as robjects | ||
from rpy2.robjects import pandas2ri | ||
pandas2ri.activate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already imported rpy2.robects, so you don't need to do this again. I would also suggest that you don't really need to import pandas2ri
separately, as you only use it once, so
robjects.pandas2ri.activate()
should be sufficient.
from rpy2.robjects import pandas2ri | ||
pandas2ri.activate() | ||
|
||
# Reading in pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds file to subset (All ependymoma samples are stranded, so ignoring polyA gene expression file in subsetting ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would appreciate if you could split this comment across multiple lines (Strive for lines no longer than ~90 characters, with 100 as an upper bound. It makes code revisions much easier to track, and improves readability)
|
||
|
||
#Reading in pbta histologies file to subset just the ependymoma samples | ||
pbta_histologies = pd.read_csv("data/pbta-histologies.tsv", sep="\t") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case these filenames change or the files move, it might be a good idea to have file names be passed into the scripts here as arguments, set in the shell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an R script for this now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably remove this version of the script to avoid confusion.
|
||
|
||
|
||
|
||
# In[158]: | ||
|
||
|
||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to clear the blank lines?
|
||
|
||
# Looping through EPN sample file with both RNA and DNA BSID's | ||
with open("analyses/molecular-subtyping-EPN/results/EPN_molecular_subtype.tsv", "r") as notebook: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here too, I think you may be more efficient to read the whole file in with pandas, rather than going through it line by line; that can also save you a lot of casting strings to ints that you are doing manually now.
CNA=pd.read_csv(zip.open('2019-12-10-gistic-results-cnvkit/broad_values_by_arm.txt'), sep="\t") | ||
CNA = CNA.set_index("Chromosome Arm") | ||
|
||
count=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this count
used anywhere?
I see you adding to it later, but nowhere that the final value is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this now
WGS_dnaseqsamples = WGSPT[["Kids_First_Biospecimen_ID", "Kids_First_Participant_ID", "sample_id"]] | ||
|
||
outnotebook.write("Kids_First_Participant_ID\tsample_id\tKids_First_Biospecimen_ID_DNA\tKids_First_Biospecimen_ID_RNA\tsubtype\n") | ||
count =0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this count used anywhere?
else: | ||
line.append("0") | ||
# Adding all the data collected so far for every sample with RNA and DNA BSID to "line" | ||
line.extend((nfkb_gsva_score, C11orf95_RELA, LTBP3_RELA, PTEN_TAS2R1, C11orf95_YAP1, YAP1_MAMLD1, YAP1_FAM118B, C11orf95_MAML2, breaks_density, RELA_zscore, L1CAM_zscore,ARL4D_zscore, CLDN1_zscore, CXorf67_zscore, TKTL1_zscore, GPBP1_zscore, IFT46_zscore)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to break up the contents of the extend here into logical groups that you can separate onto their own lines.
Something like the below, just to keep the line lengths in check and improve readability. If you do the same with the headers it will be easier to figure out if you accidentally change the order and the header and contents get out of sync.
line.extend((nfkb_gsva_score, C11orf95_RELA, LTBP3_RELA, PTEN_TAS2R1, C11orf95_YAP1, YAP1_MAMLD1, YAP1_FAM118B, C11orf95_MAML2, breaks_density, RELA_zscore, L1CAM_zscore,ARL4D_zscore, CLDN1_zscore, CXorf67_zscore, TKTL1_zscore, GPBP1_zscore, IFT46_zscore)) | |
line.extend((nfkb_gsva_score, C11orf95_RELA, LTBP3_RELA, PTEN_TAS2R1, | |
C11orf95_YAP1, YAP1_MAMLD1, YAP1_FAM118B, C11orf95_MAML2, breaks_density, | |
RELA_zscore, L1CAM_zscore, ARL4D_zscore, CLDN1_zscore, | |
CXorf67_zscore, TKTL1_zscore, GPBP1_zscore, IFT46_zscore)) |
for i in line: | ||
outfile.write(str(i)+"\t") | ||
outfile.write("\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these lines, you can do the same as I suggested before, but if you have non-string elements in line (which I assume you do with the call to str()
you will want to apply the str function to each element as well, which you can do with python's map
function.
"\t".join(map(str, line))
Hi @tkoganti ! I just added the script for this file to CI, so make sure you pull down the latest changes before pushing anything back to the server. That will help us catch errors that might come up with edits or updates to the data files. |
Hi again @tkoganti! Your script was failing CI, and after some investigation, it seems like there is a problem with the version of rpy2 that we are using in that it doesn't support some of the features you were using. I tried to fix it by manually converting to a pandas dataframe, and by using pyreadr, but neither of those worked. So I decided to break out the big guns and just the subset step in R and create a gzipped tsv that pandas can read in directly. While I was changing file names, I also made a few changes that should make things a bit more flexible as files change, moving all the file paths to the shell script. I also added a line to the shell script so that all of the paths can be specified relative the location of the file, no matter where the shell script is invoked from. That means that I implemented the |
Hi @jashapiro I am trying to run this script with the changes you made ( |
Hi @tkoganti - can I ask if you are using the project Docker image for development? I am wondering if it is a software version issue. The subset step is not run in CI because we use files in CI that contain a limited amount of samples to save on download time and RAM and those samples may not overlap with this histology very much or at all. The subsetting step will eventually need to be run in the project Docker container. The idea is to run all steps for the project on AWS once we have a final freeze of the data. For that reason, we will want to ensure it runs with the version of the software that is available on that Docker image. Edit: Sorry, I was responding to the earlier version of the comment I read in my email! |
In the changes I made, I also changed the EPR expression rds file that is generated by the Really we should delete the .rds subset file from the repo, as it isn’t used at all anymore. Does that all make sense? Let me know if I need to explain anything in more detail. |
@jashapiro I missed there was a |
Hi @jaclyn-taroni I was not using docker for this development. I saw that all the modules I was using are already in the docker file here (https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/Dockerfile) Please let me know if I need to update any versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jashapiro I only have reviewed the Rscript. The code itself looks fine, there's some documentation things that need updating. Otherwise 👍
c("-i", "--histology"), | ||
type = "character", | ||
default = NULL, | ||
help = "hisology file tsv", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help = "hisology file tsv", | |
help = "histology file tsv", |
disease_type_new == "Ependymoma") %>% | ||
pull(Kids_First_Biospecimen_ID) | ||
|
||
# Subsetting expression columns with column names/BSIDs that are in the list of ependymoma samples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Subsetting expression columns with column names/BSIDs that are in the list of ependymoma samples | |
# Subsetting expression columns with column names/BSIDs that are in the list of ependymoma samples |
c("-o", "--outfile"), | ||
type = "character", | ||
default = NULL, | ||
help = "output file" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help = "output file" | |
help = "File path and name for output tsv file" |
# -o, --output_file : path for output file | ||
# | ||
# example invocation: | ||
# Rscript scripts/bed_to_segfile.R \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example doesn't seem to be updated for this script and options.
Hi @tkoganti and @jashapiro. Great job on this so far! Sorry it took me a while to get to this, but I just have a few comments:
|
Thanks for the feedback @jharenza!
|
I wanted to note that we have not used the GISTIC gene data in other subtyping analyses, but have instead used the results of |
# Adding breakpoints density for chromosomal instability to the dataframe | ||
EPN_notebook["breaks_density-chromosomal_instability"] = EPN_notebook.apply(lambda x: breakpoint_density.loc[x["Kids_First_Biospecimen_ID_DNA"], "breaks_density"] | ||
if x["Kids_First_Biospecimen_ID_DNA"] is not np.nan else "NA", axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file we were using here ../chromosomal-instability/breakpoint-data/union_of_breaks_densities.tsv
has been removed from that analysis (which is why CI is currently failing). There are still files for CNV and SV separately at ../chromosomal-instability/breakpoint-data/cnv-breaks-densities.tsv
and ../chromosomal-instability/breakpoint-data/sv-breaks-densities.tsv
, respectively. Having two data columns for chromosomal instability seems reasonable at this stage, and they can perhaps be useful for determining reliability of this measure.
Note: The file name is set in run-molecular-subtyping-EPN.sh
@jaclyn-taroni good point - @tkoganti can you explore both focal files? |
Hi @jashapiro and @jaclyn-taroni I tried to implement CNV and SV file with breaks density score and I found that these samples were missing from the file - BS_0W8AWY10', |
I think those are likely samples with 0 (or NA) breaks by CNV consensus, which may have been accidentally excluded in the current versions of those files. If so, I think this is fixed by #532, which should be merged into master very soon. Tagging @cansavvy who worked on that analysis. In the mean time, and in case that is not the sole cause of the missing samples, it probably makes sense to use NA for such missing data. |
Hi @jashapiro This file also had missing BSID's from DNA samples - |
I just merged the changes in master into your branch. For some reason doing it on github wasn't working, so I had to do it manually. I don't think the changes will fix the missing data problem, as I don't think the update has hit the master branch yet, but lets see if it gets us past the CI failure. |
Add line continuation characters to bash script Remove no longer used --breakpoints option
I would not have expected the focal-cn file file to have NA in it, but I could be wrong about its contents. I am going to tag @jaclyn-taroni who was more involved in its creation to see if she has thoughts. |
Hi @tkoganti - there will be no neutral calls in the |
I made some substantial changes here in structure, but the results should be largely unchanged. I did some transposing when constructing data frames so we can use the same function (fill_df) to extract data more often, and moved the ID column specification out of the function so that RNA and DNA-derived data are handled the same way. The function then allows a set of samples to be specified, and if the request is for a a sample that does not fall in there, it is set as NA for that column in the output data, otherwise it is filled in with a default value.
Hi @tkoganti- I made some more updates/revisions to the code here to deal with the missing data questions, and while I was doing it I ended up doing a bit of extra refactoring. I did some transposing when constructing data frames so we can use the same function (
I think all of it is correct, but please do check carefully that it all makes sense to you and that the results are as expected. |
Purpose/implementation
ependymoma subtyping
What scientific question is your analysis addressing?
Create file with fusion results, gene expression, GISTIC, breaks density and gsva scores got ependymoma samples according to the ticket here (#245)
What was your approach?
pbta-histologies.tsv
.Ventricles
as primary site)sample_id
column inpbta-histologies.tsv
Which areas should receive a particularly close look?
Results
OpenPBTA/OpenPBTA-analysis/analyses/molecular-subtyping-EPN/results/EPN_all_data.tsv
What types of results are included (e.g., table, figure)?
tables
What is your summary of the results?
The information in final EPN_all_data.tsv will help subgroup samples into different subtypes