Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Ependymoma subtyping #490

Merged
merged 34 commits into from
Feb 18, 2020
Merged

Conversation

tkoganti
Copy link
Collaborator

Purpose/implementation

ependymoma subtyping

What scientific question is your analysis addressing?

Create file with fusion results, gene expression, GISTIC, breaks density and gsva scores got ependymoma samples according to the ticket here (#245)

What was your approach?

  1. Take only ependymoma samples from pbta-histologies.tsv.
  2. Based on primary site, classify them as supratentorial/ infratentorial (Some as "NA" since they s ay Ventricles as primary site)
  3. Match corresponding RNA and DNA sample BSID's based on common sample_id column in pbta-histologies.tsv
  4. Based on the files mentioned above in scientific question, fill up table for ependymoma samples based on specific fusion, gene expression, broad CNA etc.

Which areas should receive a particularly close look?

  1. All scripts should run from OpenPBTA folder. Any feedback on that?
  2. Column headers and file formats for final files okay?

Results

OpenPBTA/OpenPBTA-analysis/analyses/molecular-subtyping-EPN/results/EPN_all_data.tsv

What types of results are included (e.g., table, figure)?

tables

What is your summary of the results?

The information in final EPN_all_data.tsv will help subgroup samples into different subtypes

@jaclyn-taroni jaclyn-taroni requested review from jashapiro and jaclyn-taroni and removed request for jaclyn-taroni January 30, 2020 18:16
@jaclyn-taroni jaclyn-taroni added the molecular subtyping Related to molecular subtyping of tumors label Jan 30, 2020
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tkoganti , thank you for this contribution!

I think you have added all the data in the ticket, so the output file looks like it is in pretty good shape (pending updated data for some things)

My suggestions are mostly style and efficiency, with an eye toward making it easier for somebody coming in to easily find and change any sections that might need updating later with minimal effort.

My first big suggestion is that you move the file names out of the individual scripts, and into the master shell script that runs them: run-molecular-subtyping-EPN.sh, specifying the files needed for each script and the output files via command line options. The reason for this is that some of these files are likely to move or be renamed as new releases come in, so having all of the file names in one place can be very helpful when debugging.

If you are familiar with argparse, that is generally the most standard way to add command line options to a python script. One example of its use is here: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/copy_number_consensus_call/scripts/merged_to_individual_files.py

My other suggestions are mostly about reducing complexity by taking advantage of some built-in functions to reduce the number of lines of code. Some of these are pretty simple, like using string join methods, and others are more complex, like the pandas "merge" method.

I am happy to help out with any changes you want to implement but might not have experience with. Just let me know!


EP = pbta_histologies[pbta_histologies["disease_type_new"]=="Ependymoma"]
EP_rnaseq_samples = EP[EP["experimental_strategy"] == "RNA-Seq"][["Kids_First_Biospecimen_ID", "primary_site"]]
EP_rnaseq_samples["disease_group"] = ["infratentorial" if "Posterior Fossa" in primary else "infratentorial" if "Optic" in primary else "supratentorial" if "Frontal Lobe" in primary else "supratentorial" if "Parietal Lobe" in primary else "infratentorial" if "Spinal Cord" in primary else "supratentorial" if "Occipital Lobe" in primary else "infratentorial" if "Tectum" in primary else "infratentorial" if "Spine" in primary else "supratentorial" if "Temporal Lobe" in primary else "infratentorial" if "Spinal" in primary else "None" for primary in EP_rnaseq_samples["primary_site"]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is very long, making it hard to parse what is going on.

I would suggest that you break the logic of this line out into a separate function, and then calling that in the list comprehension. If you were to name the function group_disease() you could reduce this line to:

EP_rnaseq_samples["disease_group"] = [group_disease(primary) for primary in EP_rnaseq_samples["primary_site"]]

Which would make it much more clear what this step is doing.

You can also make use of or statements to make things a bit more discoverable within the function. Something like:

if("Posterior Fossa" in primary or "Optic" in primary):
  return "intratentorial"
elif("Frontal Lobe" in primary ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an R script replacing this now

Comment on lines 23 to 25
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already imported rpy2.robects, so you don't need to do this again. I would also suggest that you don't really need to import pandas2ri separately, as you only use it once, so

robjects.pandas2ri.activate()

should be sufficient.

from rpy2.robjects import pandas2ri
pandas2ri.activate()

# Reading in pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds file to subset (All ependymoma samples are stranded, so ignoring polyA gene expression file in subsetting )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would appreciate if you could split this comment across multiple lines (Strive for lines no longer than ~90 characters, with 100 as an upper bound. It makes code revisions much easier to track, and improves readability)



#Reading in pbta histologies file to subset just the ependymoma samples
pbta_histologies = pd.read_csv("data/pbta-histologies.tsv", sep="\t")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case these filenames change or the files move, it might be a good idea to have file names be passed into the scripts here as arguments, set in the shell.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an R script for this now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably remove this version of the script to avoid confusion.

Comment on lines 37 to 46




# In[158]:





Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to clear the blank lines?



# Looping through EPN sample file with both RNA and DNA BSID's
with open("analyses/molecular-subtyping-EPN/results/EPN_molecular_subtype.tsv", "r") as notebook:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too, I think you may be more efficient to read the whole file in with pandas, rather than going through it line by line; that can also save you a lot of casting strings to ints that you are doing manually now.

CNA=pd.read_csv(zip.open('2019-12-10-gistic-results-cnvkit/broad_values_by_arm.txt'), sep="\t")
CNA = CNA.set_index("Chromosome Arm")

count=0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this count used anywhere?
I see you adding to it later, but nowhere that the final value is used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this now

WGS_dnaseqsamples = WGSPT[["Kids_First_Biospecimen_ID", "Kids_First_Participant_ID", "sample_id"]]

outnotebook.write("Kids_First_Participant_ID\tsample_id\tKids_First_Biospecimen_ID_DNA\tKids_First_Biospecimen_ID_RNA\tsubtype\n")
count =0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this count used anywhere?

else:
line.append("0")
# Adding all the data collected so far for every sample with RNA and DNA BSID to "line"
line.extend((nfkb_gsva_score, C11orf95_RELA, LTBP3_RELA, PTEN_TAS2R1, C11orf95_YAP1, YAP1_MAMLD1, YAP1_FAM118B, C11orf95_MAML2, breaks_density, RELA_zscore, L1CAM_zscore,ARL4D_zscore, CLDN1_zscore, CXorf67_zscore, TKTL1_zscore, GPBP1_zscore, IFT46_zscore))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to break up the contents of the extend here into logical groups that you can separate onto their own lines.
Something like the below, just to keep the line lengths in check and improve readability. If you do the same with the headers it will be easier to figure out if you accidentally change the order and the header and contents get out of sync.

Suggested change
line.extend((nfkb_gsva_score, C11orf95_RELA, LTBP3_RELA, PTEN_TAS2R1, C11orf95_YAP1, YAP1_MAMLD1, YAP1_FAM118B, C11orf95_MAML2, breaks_density, RELA_zscore, L1CAM_zscore,ARL4D_zscore, CLDN1_zscore, CXorf67_zscore, TKTL1_zscore, GPBP1_zscore, IFT46_zscore))
line.extend((nfkb_gsva_score, C11orf95_RELA, LTBP3_RELA, PTEN_TAS2R1,
C11orf95_YAP1, YAP1_MAMLD1, YAP1_FAM118B, C11orf95_MAML2, breaks_density,
RELA_zscore, L1CAM_zscore, ARL4D_zscore, CLDN1_zscore,
CXorf67_zscore, TKTL1_zscore, GPBP1_zscore, IFT46_zscore))

Comment on lines 162 to 164
for i in line:
outfile.write(str(i)+"\t")
outfile.write("\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these lines, you can do the same as I suggested before, but if you have non-string elements in line (which I assume you do with the call to str() you will want to apply the str function to each element as well, which you can do with python's map function.

"\t".join(map(str, line))

@jharenza
Copy link
Collaborator

jharenza commented Feb 2, 2020

Hi @tkoganti! This looks like a great start! Can I suggest creating a notebook with some additional comments/checks as you go through each step/series of steps, similar to those in the embryonal subtyping here? I think this will help a lot with review :). Thanks!

@jashapiro
Copy link
Member

Hi @tkoganti ! I just added the script for this file to CI, so make sure you pull down the latest changes before pushing anything back to the server. That will help us catch errors that might come up with edits or updates to the data files.

@jashapiro
Copy link
Member

Hi again @tkoganti! Your script was failing CI, and after some investigation, it seems like there is a problem with the version of rpy2 that we are using in that it doesn't support some of the features you were using. I tried to fix it by manually converting to a pandas dataframe, and by using pyreadr, but neither of those worked. So I decided to break out the big guns and just the subset step in R and create a gzipped tsv that pandas can read in directly.

While I was changing file names, I also made a few changes that should make things a bit more flexible as files change, moving all the file paths to the shell script. I also added a line to the shell script so that all of the paths can be specified relative the location of the file, no matter where the shell script is invoked from.

That means that I implemented the argparse changes that I had suggested, but I left the rest of your code as similar to the original as I could. If you have questions about the changes that I did make, or about any of my previous review questions, let me know!

@tkoganti
Copy link
Collaborator Author

tkoganti commented Feb 5, 2020

Hi @jashapiro I am trying to run this script with the changes you made (02_ependymoma_generate_all_data.py) and I see that when you changed to argparse, you are reading expression file (.rds file) with pd.read_csv (fpkm_df = pd.read_csv(args.expression, sep = "\t")). I was getting an error at that line since that file should be read using pyreadr?

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Feb 5, 2020

Hi @tkoganti - can I ask if you are using the project Docker image for development? I am wondering if it is a software version issue.

The subset step is not run in CI because we use files in CI that contain a limited amount of samples to save on download time and RAM and those samples may not overlap with this histology very much or at all. The subsetting step will eventually need to be run in the project Docker container. The idea is to run all steps for the project on AWS once we have a final freeze of the data. For that reason, we will want to ensure it runs with the version of the software that is available on that Docker image.

Edit: Sorry, I was responding to the earlier version of the comment I read in my email!

@jashapiro
Copy link
Member

Hi @jashapiro I am trying to run this script with the changes you made (02_ependymoma_generate_all_data.py) and I see that when you changed to argparse, you are reading expression file (.rds file) with pd.read_csv (fpkm_df = pd.read_csv(args.expression, sep = "\t")). I was getting an error at that line since that file should be read using pyreadr?

In the changes I made, I also changed the EPR expression rds file that is generated by the 00 script (now in R) to a tsv file. This was to get around limitations in pyreader and Rpy2 that couldn’t be easily solved with the versions that we have in the docker container. So the shell script now specifies that the expression file is a .tsv.gz file, and that is what is read in by the 02 script. See line 21 of the shell script.

Really we should delete the .rds subset file from the repo, as it isn’t used at all anymore.

Does that all make sense? Let me know if I need to explain anything in more detail.

@tkoganti
Copy link
Collaborator Author

tkoganti commented Feb 6, 2020

@jashapiro I missed there was a .tsv.gz file too in the subsetted data. I removed the .rds file now and the script will use .tsv.gz file. Please look at the new pull request(b5d1a77) and let me know your feedback.

@tkoganti
Copy link
Collaborator Author

tkoganti commented Feb 6, 2020

Hi @jaclyn-taroni I was not using docker for this development. I saw that all the modules I was using are already in the docker file here (https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/Dockerfile) Please let me know if I need to update any versions.

Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jashapiro I only have reviewed the Rscript. The code itself looks fine, there's some documentation things that need updating. Otherwise 👍

c("-i", "--histology"),
type = "character",
default = NULL,
help = "hisology file tsv",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help = "hisology file tsv",
help = "histology file tsv",

disease_type_new == "Ependymoma") %>%
pull(Kids_First_Biospecimen_ID)

# Subsetting expression columns with column names/BSIDs that are in the list of ependymoma samples
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Subsetting expression columns with column names/BSIDs that are in the list of ependymoma samples
# Subsetting expression columns with column names/BSIDs that are in the list of ependymoma samples

c("-o", "--outfile"),
type = "character",
default = NULL,
help = "output file"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help = "output file"
help = "File path and name for output tsv file"

# -o, --output_file : path for output file
#
# example invocation:
# Rscript scripts/bed_to_segfile.R \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example doesn't seem to be updated for this script and options.

@jharenza
Copy link
Collaborator

Hi @tkoganti and @jashapiro. Great job on this so far! Sorry it took me a while to get to this, but I just have a few comments:

  • In EPN_all_data.tsv, for disease group, if None, can we call it undetermined or ambiguous to be a bit more accurate?
  • You could possibly add CDKN2A focal status based on ST-EPN-RELA harboring:

Common CN changes: CDKN2A deletions. Chr9 or Chr9p loss
This gene is on 9p21.3, and if chr9p loss is 0, then it is still possible just the gene is deleted, so having the focal information would be good here.

  • I see 1q loss, but not 1q gain, which is mentioned for PF-EPN-A, so I would add that:

1q gain most frequent CNA

  • Is the EPN_molecular_subtype.tsv table a placeholder for the molecular subtypes? It seems that it is a selection of columns from EPN_all_data.tsv, but does not include subtypes. I think this table as named would consist of the patient ID, biospecimen IDs, and the molecular subtypes: ST-EPN-RELA, ST-EPN-YAP1, PF-EPN-A, and PF-EPN-B. Was this the next step/ another PR, or did you need some guidance here?

@tkoganti
Copy link
Collaborator Author

tkoganti commented Feb 12, 2020

Thanks for the feedback @jharenza!

  1. I will add 1q gain column. I meant to add it earlier but missed it accidentally.
  2. I will use focal_data_by_genes.txt file from GISTIC to add the CDKN2A focal changes
  3. So I wrote another script to categorize these samples to different groups. There were total of 93 samples and after using fusion, chromosomal broad values and expression Z-scores, I was able to categorize some samples but there were still 42 samples that did not belong to any category after I set these rules. Two columns I have not used is NFKB_pathway_GSEAscore and breaks_density-chromosomal_instability. Do you have any suggestions for cut-off values for these columns?

@jaclyn-taroni
Copy link
Member

I will use focal_data_by_genes.txt file from GISTIC to add the CDKN2A focal changes

I wanted to note that we have not used the GISTIC gene data in other subtyping analyses, but have instead used the results of analyses/focal-cn-file-preparation which don't take into account the width of a copy number alteration. GISTIC, based on my cursory understanding, "rewards" recurrence and we've run it on the entire cohort. It seems that EPN samples make up ~10% of the available WGS samples. Now that I've added GISTIC to the container (#531; related to #529), I am curious about comparing results when we limit the cohort to a specific histology. I don't disagree with using focal_data_by_genes.txt here, but wanted to document how I'm thinking about potential limitations.

Comment on lines 106 to 108
# Adding breakpoints density for chromosomal instability to the dataframe
EPN_notebook["breaks_density-chromosomal_instability"] = EPN_notebook.apply(lambda x: breakpoint_density.loc[x["Kids_First_Biospecimen_ID_DNA"], "breaks_density"]
if x["Kids_First_Biospecimen_ID_DNA"] is not np.nan else "NA", axis=1)
Copy link
Member

@jashapiro jashapiro Feb 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file we were using here ../chromosomal-instability/breakpoint-data/union_of_breaks_densities.tsv has been removed from that analysis (which is why CI is currently failing). There are still files for CNV and SV separately at ../chromosomal-instability/breakpoint-data/cnv-breaks-densities.tsv and ../chromosomal-instability/breakpoint-data/sv-breaks-densities.tsv, respectively. Having two data columns for chromosomal instability seems reasonable at this stage, and they can perhaps be useful for determining reliability of this measure.

Note: The file name is set in run-molecular-subtyping-EPN.sh

@jharenza
Copy link
Collaborator

@jaclyn-taroni good point - @tkoganti can you explore both focal files?

@tkoganti
Copy link
Collaborator Author

Hi @jashapiro and @jaclyn-taroni

I tried to implement CNV and SV file with breaks density score and I found that these samples were missing from the file -

BS_0W8AWY10',
'BS_2E81A3FT',
'BS_4PAE19ZC',
'BS_5ZRZC3ZM',
'BS_62XRDTM6',
'BS_6NPSZJ4C',
'BS_6Z4WHYDG',
'BS_99PPRCW4',
'BS_9GJHMA3J',
'BS_9JVGXA2W',
'BS_AGD2ATY1',
'BS_B4DY7ET3',
'BS_BBHFKBE7',
'BS_QSMFVHSB',
'BS_R82MYPKB',
'BS_R8ZKDFWH',
'BS_W5P7SPDH',
'BS_XMNBSBGN

@jashapiro
Copy link
Member

I tried to implement CNV and SV file with breaks density score and I found that these samples were missing from the file -

BS_0W8AWY10',
'BS_2E81A3FT',
'BS_4PAE19ZC',

I think those are likely samples with 0 (or NA) breaks by CNV consensus, which may have been accidentally excluded in the current versions of those files. If so, I think this is fixed by #532, which should be merged into master very soon. Tagging @cansavvy who worked on that analysis.

In the mean time, and in case that is not the sole cause of the missing samples, it probably makes sense to use NA for such missing data.

@tkoganti
Copy link
Collaborator Author

Hi @jashapiro This file also had missing BSID's from DNA samples - analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz
I used "NA" for now and pushed all the changes mentioned above

@jashapiro
Copy link
Member

I just merged the changes in master into your branch. For some reason doing it on github wasn't working, so I had to do it manually. I don't think the changes will fix the missing data problem, as I don't think the update has hit the master branch yet, but lets see if it gets us past the CI failure.

Add line continuation characters to bash script
Remove no longer used --breakpoints option
@jashapiro
Copy link
Member

Hi @jashapiro This file also had missing BSID's from DNA samples - analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz
I used "NA" for now and pushed all the changes mentioned above

I would not have expected the focal-cn file file to have NA in it, but I could be wrong about its contents. I am going to tag @jaclyn-taroni who was more involved in its creation to see if she has thoughts.

@jaclyn-taroni
Copy link
Member

Hi @tkoganti - there will be no neutral calls in the analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz, so if I BSID is present in the consensus SEG file but not this file, a status would not be missing but neutral.

I made some substantial changes here in structure, but the results should be largely unchanged.

I did some transposing when constructing data frames so we can use the same function (fill_df)  to extract data more often, and moved the ID column specification out of the function so that RNA and DNA-derived data are handled the same way.

The function then allows a set of samples to be specified, and if the request is for a a sample that does not fall in there, it is set as NA for that column in the output data, otherwise it is filled in with a default value.
@jashapiro
Copy link
Member

Hi @tkoganti-

I made some more updates/revisions to the code here to deal with the missing data questions, and while I was doing it I ended up doing a bit of extra refactoring.

I did some transposing when constructing data frames so we can use the same function (fill_df()) to extract data more often: in most cases the samples are now the row indexes of the data frame. I also and moved the ID column specification out of the function so that RNA and DNA-derived data are handled the same way by the function.

fill_df() then allows a set of included_samples to be specified, and if the request is for a a sample that does not fall in that set, the returned values is 'NA', otherwise it is filled in with a default value. This covers the case that came up in consensus_seg_annotated_cn_autosomes.tsv.gz where some missing samples should be missing, but others just had no change.

I think all of it is correct, but please do check carefully that it all makes sense to you and that the results are as expected.

@kgaonkar6 kgaonkar6 mentioned this pull request Feb 18, 2020
5 tasks
@jaclyn-taroni jaclyn-taroni merged commit 3b2ae77 into AlexsLemonade:master Feb 18, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
molecular subtyping Related to molecular subtyping of tumors
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants