Template markdown file for tracking data information/descriptions #336
…k the source and description of data files
looks good, made some minor suggestions - see what you think!
Co-Authored-By: Jo Lynne <jharenza@gmail.com>
Changes seem good to me! I figured the specifics for each data file could be added by those who are most familiar with the files, and this markdown would get the ball rolling in that direction. Thanks everyone for the quick feedback!
will approve with the last change!
I think this should live in the Thoughts related to but outside of the scope of this pull request: We can also move the data formats section to doc (+ the reorganization mentioned here: #334 (comment)), but I think that's a separate PR. We might consider including both the notion of origin and associated analysis, but perhaps associated analysis in its own markdown document that isn't included in the download. |
Agree!
@sjspielman I made some changes last night. Are those consistent with your goals for this document? @jharenza I filled in everything that I felt comfortable filling in. Can someone on your side fill in the rest, and can you check what I filled in for accuracy and clarity? Thank you!
Add missing fields
- add workflows
- note: `WGS.hg38.mutect2.unpadded.bed` should be renamed to `WGS.hg38.mutect2.vardict.unpadded.bed` in the next release, but is kept as is for now since this description is for v11 files
I'm going to get this merged because we expect this to get updated as part of the pull request that includes the v12 release.
### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and rerun the [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results for focal CN analyses, e.g. [#244](#244) and [#8](#8)
* Release V12 data
* Update release-notes.md: fix link
* Update data-files-description.md: fix GISTIC table sectioning
* Update data-files-description.md: fix spacing on data description table
* Update data-files-description.md: fix more spacing in data file description file
* Update download-data.sh: add new release date to download script
* Update the TMB file descriptions
* Update TMB file formats section
* Update fusion section of data formats. Also more specific description of the by sample file
* Add GISTIC file to data-formats
* Update download-data.sh
* Update download-data.sh
* data description md is also included in md5sum
* TMB exon -> coding sequence
* Coding TMB CDS, not exon
Purpose/implementation Section
The purpose of this PR is to initiate a framework for tracking the source, usage, and description of all data associated with this project. The goal is NOT (currently) to track all plots, files, etc. in `analyses/` but rather to describe the bulk of data in `data/`.
What scientific question is your analysis addressing?
The goal is to increase the transparency and reproducibility of this project while lowering the cost-of-entry for new contributors.
What was your approach?
A template markdown file was created for the purposes of tracking and describing data.
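As a rough illustration of what such a template could look like, the sketch below renders a markdown table with one row per data file. The column set (file type, origin, workflow, description) is an assumption pieced together from this pull request's title and the release notes, and `DataFileEntry` and `render_table` are hypothetical names, not part of the repository. Cells that nobody has filled in yet are rendered as `TBD`, so gaps remain visible to whoever knows the file best.

```python
from dataclasses import dataclass


@dataclass
class DataFileEntry:
    """One row of the tracking table (hypothetical structure)."""
    name: str
    file_type: str = ""
    origin: str = ""
    workflow: str = ""
    description: str = ""


# Assumed column headers; the actual template may use different ones.
HEADER = ["File name", "File type", "Origin", "Workflow", "Description"]


def render_table(entries):
    """Render entries as a markdown table, marking empty cells as TBD."""
    lines = [
        "| " + " | ".join(HEADER) + " |",
        "|" + "|".join(["---"] * len(HEADER)) + "|",
    ]
    for e in entries:
        cells = [f"`{e.name}`", e.file_type, e.origin, e.workflow, e.description]
        lines.append("| " + " | ".join(c if c else "TBD" for c in cells) + " |")
    return "\n".join(lines)


# File names are taken from this pull request; all other fields are left
# blank on purpose rather than guessed.
entries = [
    DataFileEntry("pbta-histologies.tsv"),
    DataFileEntry("WGS.hg38.mutect2.unpadded.bed"),
]
table = render_table(entries)
print(table)
```

Keeping the rows as structured records rather than hand-edited text is only one option; the table can just as well be maintained directly in the markdown file.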
What GitHub issue does your pull request address?
Issue #334
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Format of the markdown file, name and location of the markdown file, and whether the table is sufficient to describe data (i.e., should there be more or fewer columns).
Is there anything that you want to discuss further?
We should discuss whether the `README.md` or `CONTRIBUTING.md` file should be modified to direct contributors to keep their data well documented in this markdown file.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
N/A
What is your summary of the results?
N/A