This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Discussion: Manuscript copy editing #261

Closed
sjspielman opened this issue May 6, 2022 · 7 comments

@sjspielman
Member

sjspielman commented May 6, 2022

I have several overall questions for manuscript organization to solidify as part of copy editing, which I am bringing up here for discussion. Note I imagine future discussion items will come up and will not be limited to this issue!
Tagging @jaclyn-taroni @jharenza @jashapiro for discussion

  • We need to further solidify the LGG vs LGAT naming, and ensure it is used in BOTH text and figures/captions/legends. I am realizing, reading over the manuscript more carefully, that I may not actually be clear on this. Are these "synonyms" or is LGG a grouping within LGAT? I had always thought synonyms, but the new SEGA situation has thrown me so I wonder if I am incorrect.
  • EDIT: We also need to solidify use of acronyms in general. There is a lot of back and forth with "HGG" vs "non-midline high-grade gliomas", as well as "DMG" vs "diffuse midline glioma." We need to pick: For tumors with standard acronyms, do we use the acronym or the full name?
  • Should we include software versions in the main text (methods) or only in Key Resources? I think keeping everything in Key Resources will be cleaner and less error-prone.
  • Do we want to refer to software in code format? What about internal variables like cancer_group etc? We need to pick a cohesive strategy. I suggest it all should be referred to in code format BUT with the caveat that I think we should limit as much as possible referring to internal variable names if the text is not technical.
  • I think we should always explicitly refer to (possibly link to?) the given analysis module when describing results. I think this should be done only in Methods, not in the main text (currently 1 module is referenced in main text methods).
  • TP53 items:
    • When updating and compiling figures, I had been changing "loss" --> "lost" to ensure it is consistent with "activated." Grammatically, loss goes with activation, and lost goes with activated. I am realizing I didn't formally discuss this with anyone! We should pick an overall pairing here to use throughout the text, and I feel pretty strongly it should not be the original activated/loss.
    • After carefully going through the paper, I am confused about the TP53 module. What I think I now understand (and will update phrasing accordingly in the paper if I understand correctly) is that a classifier (but which classifier produced the ROC?) was applied on samples that we had to directly annotate ground truth based on genomic alterations. This suggests to me the true activation status we are comparing inferred classifications to is also error prone, and this caveat is not discussed in the paper. This is just a general discussion item then to make sure I understand the module correctly and that we are appropriately covering the potential biases.
  • There are several tables in Methods where subtyping is discussed, but these tables don't have official titles/labels, captions, etc. They also don't mention cancer groups at all, and cancer groups largely replaced short histology as the plotting variable. Something has to be cleaned up here, perhaps making these actual tables. That said, I found these tables quite hard to follow, but captions may help that confusion.
  • Speaking more generally about cancer groups vs. short histology, I am not sure our current description is still right because there is a strong focus on short histology for plotting. Here is the text currently in there:
The short_histology is an abbreviated version of either the broad_histology or integrated_diagnosis for plotting purposes. 
Except for LGG samples, the integrated_diagnosis field in the pbta-histologies.tsv file was derived to match a standardized 
2016 WHO diagnosis [10] based on
pathology_diagnosis, molecular subtyping, and in some cases, additional pathology review. 
The harmonized_diagnosis is the final integrated_diagnosis, if one exists, or a diagnosis derived from the 
pathology_diagnosis and pathology_free_text_diagnosis in the absence of molecular data. 
The cancer_group is a grouping narrower than broad_histology derived within the 
[molecular subtyping integrate module](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-integrate) 
for plotting and analysis purposes.
  • Similarly, I suggest we add some text in Results to briefly describe broad histology vs cancer group to scaffold the rest of the paper.
@jharenza
Collaborator

jharenza commented May 6, 2022

Hi @sjspielman ! thanks for starting this! Tried to answer below:

We need to further solidify the LGG vs LGAT naming, and ensure it is used in BOTH text and figures/captions/legends. I am realizing, reading over the manuscript more carefully, that I may not actually be clear on this. Are these "synonyms" or is LGG a grouping within LGAT? I had always thought synonyms, but the new SEGA situation has thrown me so I wonder if I am incorrect.

LGAT are low-grade astrocytic tumors. The WHO calls them astrocytic tumors. All gliomas are derived from the astrocyte as cell of origin, and LGG, or low-grade glioma, is what people typically refer to these as, so they are synonyms, yes. We had decided to just use LGG (same with HGAT/HGG to avoid reader confusion), so if some of that was missed, we should update! SEGA is a type of LGG/LGAT.

Should we include software versions in the main text (methods) or only in Key Resources? I think keeping everything in Key Resources will be cleaner and less error-prone.

I am good with KR - had not removed all of those, but go for it!

Do we want to refer to software in code format?

Do you mean commands like bedtools intersect? I think yes

What about internal variables like cancer_group etc? We need to pick a cohesive strategy. I suggest it all should be referred to in code format BUT with the caveat that I think we should limit as much as possible referring to internal variable names if the text is not technical.

I had never done that in the past, and it is probably more readable not to use code format for variables within the text, but perhaps we can use it for Methods?

I think we should always explicitly refer to (possibly link to?) the given analysis module when describing results. I think this should be done only in Methods, not in the main text (currently 1 module is referenced in main text methods).

This is also new for me, adding so many links within a paper. I am not sure how the journal handles it. For ex, I do not believe they will have text hyperlinks to things - that we would have to just print the URL and it may link out, which may get ugly. Perhaps we can refer to the module by name as you suggest, within the methods? I think we do a good job in the intro/methods describing that all of this is in the analysis repo so I don't think we need another link to the exact analysis module - that is probably overkill. I also do not think we need to reference any specific modules in the results text - it will take away from the "story" by adding all of the technical jargon. Everything should be in STAR Methods. I think it is a bit too technical as is right now, and we may need to soften that up. During the first review of the PPTC PDX paper, I had it written as just results and a reviewer said it was "too dry", so was trying to add more biology and story here.

TP53 items: When updating and compiling figures, I had been changing "loss" --> "lost" to ensure it is consistent with "activated." Grammatically, loss goes with activation, and lost goes with activated. I am realizing I didn't formally discuss this with anyone! We should pick an overall pairing here to use throughout the text, and I feel pretty strongly it should not be the original activated/loss.

I have no strong preference for any of this, so we can do whatever you think is best :)

After carefully going through the paper, I am confused about the TP53 module. What I think I now understand (and will update phrasing accordingly in the paper if I understand correctly) is that a classifier (but which classifier produced the ROC?) was applied on samples that we had to directly annotate ground truth based on genomic alterations. This suggests to me the true activation status we are comparing inferred classifications to is also error prone, and this caveat is not discussed in the paper. This is just a general discussion item then to make sure I understand the module correctly and that we are appropriately covering the potential biases.

To be honest, this is not my expertise, as it was written by @gwaygenomics, so maybe he has a good explanation. But what I understand is that the classifier was trained on TCGA, but the ROC was generated using known genomic alterations from the TCGA MAF (so only SNVs), which were categorized as TP by @gwaygenomics. Here, we are assessing which alterations are known from SNV, CNV, and fusion data and using them to generate the ROC, which was separate from running the classifier on the samples. The reason we see poor accuracy for polyA is the small N, and the small N with true alterations. Does that help?

There are several tables in Methods where subtyping is discussed, but these tables don't have official titles/labels, captions, etc. They also don't mention cancer groups at all, and cancer groups largely replaced short histology as the plotting variable. Something has to be cleaned up here, perhaps making these actual tables. That said, I found these tables quite hard to follow, but captions may help that confusion.

This was something tough @jaclyn-taroni and I discussed early on and probably needs trimming! We weren't sure how much detail to add, and maybe we can even summarize each of these with a minimal table in the STAR Methods rather than writing out all of the rules.

Speaking more generally about cancer groups vs. short histology, I am not sure our current description is still right because there is a strong focus on short histology for plotting. Here is the text currently in there:
The short_histology is an abbreviated version of either the broad_histology or integrated_diagnosis for plotting purposes.
Except for LGG samples, the integrated_diagnosis field in the pbta-histologies.tsv file was derived to match a standardized
2016 WHO diagnosis [10] based on
pathology_diagnosis, molecular subtyping, and in some cases, additional pathology review.
The harmonized_diagnosis is the final integrated_diagnosis, if one exists, or a diagnosis derived from the
pathology_diagnosis and pathology_free_text_diagnosis in the absence of molecular data.
The cancer_group is a grouping narrower than broad_histology derived within the
molecular subtyping integrate module
for plotting and analysis purposes.

yes, this wasn't used for plotting anymore, so should be updated

@sjspielman
Member Author

sjspielman commented May 6, 2022

Thanks @jharenza !!

I conclude from this:

  • We'll want to revisit the oncoprint LGG/LGAT legend naming. This plot (in particular this panel!) will need updating anyways to implement SEGA, so that's easy! I also see some little text wonkiness in the plot from the manual compiling, so in general this figure needs a refresh.
    • Edit: That said, does this mean we should also be saying "low grade glioma" instead of "low grade astrocytic tumor"? Not sure. See also Update LGG plots OpenPBTA-analysis#1372
    • Edit 2 (I'm on a roll..!) - I may have re-confused myself because for example, Figure 1B has a panel for LGAT with an LGG category. This suggests NOT synonyms?
  • I'll update "Table S5" --> "Key Resources Table". (for sure you got most already!)
  • I'll change to "lost" instead of "loss" for TP53. The UMAP panels and 4G will need to be updated as well accordingly.
  • I'll update wording to focus on cancer groups and away from short histology

@sjspielman
Member Author

For the modules,

Perhaps we can refer to the module by name as you suggest, within the methods?

This is pretty much what I was thinking! Just open or end each little blurb with "relevant code is in the XYZ analysis module." I agree linking is overkill.

@gwaybio
Contributor

gwaybio commented May 10, 2022

Happy to review any language update PRs for that section @sjspielman

@sjspielman
Member Author

Noting #268 has now been created to actually solidify LGG and HGG terminology.

@jaclyn-taroni
Member

Is this issue still helping us? Should we file smaller "bites" for accomplishing anything we agreed to instead? I'm going to direct these questions to @sjspielman to start!

@sjspielman
Member Author

Most of these bullets have been dealt with, so I will close this issue in favor of the following issues to track remaining and/or pending things to be aware of -

  1. LGG/HGG wording: Discussion: LGG and HGG wording throughout manuscript #268
  2. Code format: Manuscript code format #284
  3. Versions: Cite software versions in KR table only #285
