Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

CNV consensus (6 of 6): Merge consensus files and name columns #403

Merged
merged 55 commits into from
Jan 8, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
55 commits
Select commit Hold shift + click to select a range
3f55855
add to Snakefile
Dec 18, 2019
9a923a3
resolve conflict
Dec 19, 2019
d38289c
Merge remote-tracking branch 'upstream/master'
Jan 4, 2020
3a20aa0
updating fork
Jan 4, 2020
76ebc97
add final consensus merging step
Jan 4, 2020
5cbad49
add column naming
Jan 4, 2020
4804bac
add column naming
Jan 4, 2020
217232c
add column naming
Jan 4, 2020
797bf53
change so that output files doesn't have extra indentation at the end
Jan 4, 2020
98714fa
change so that output files doesn't have extra indentation at the end
Jan 4, 2020
60a143f
Merge branch 'master' into merge_consensus_files
nhatduongnn Jan 5, 2020
9467ad5
changed output path and name
Jan 6, 2020
b52693f
changed output path and name
Jan 6, 2020
9a3ed65
Update analyses/copy_number_consensus_call/Snakefile
nhatduongnn Jan 6, 2020
7c2a454
Update analyses/copy_number_consensus_call/Snakefile
nhatduongnn Jan 6, 2020
e3ad43e
Update analyses/copy_number_consensus_call/Snakefile
nhatduongnn Jan 6, 2020
8979e04
Update analyses/copy_number_consensus_call/Snakefile
nhatduongnn Jan 6, 2020
dd15b95
Update analyses/copy_number_consensus_call/src/scripts/compare_varian…
nhatduongnn Jan 6, 2020
8a2e7aa
Update analyses/copy_number_consensus_call/src/scripts/compare_varian…
nhatduongnn Jan 6, 2020
be894fe
Update analyses/copy_number_consensus_call/Snakefile
nhatduongnn Jan 6, 2020
96ddea1
changed Snakemake, reduced redundancy
Jan 6, 2020
ab76a58
changed Snakemake, reduced redundancy
Jan 6, 2020
0d05ad7
changed minor error
Jan 6, 2020
d8dc0e4
update Snakefile
Jan 6, 2020
d3fba93
add README.md
Jan 7, 2020
0e57d91
Merge branch 'master' into merge_consensus_files
nhatduongnn Jan 7, 2020
20d5cff
Update README.md
nhatduongnn Jan 7, 2020
3f7aabc
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
4899a5d
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
7c7f99b
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
7ca720c
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
e4525bd
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
f7ae54a
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
77f9b09
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
b88d54d
Update README.md
nhatduongnn Jan 7, 2020
399f910
Update README.md
nhatduongnn Jan 7, 2020
cf979a5
Update README.md
nhatduongnn Jan 7, 2020
93dcb9f
add result consensus file
Jan 7, 2020
cdb818f
make changes to analyses/README.md
Jan 7, 2020
ab1ec46
Update analyses/README.md
nhatduongnn Jan 7, 2020
39d2f26
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
c50da32
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
2b5c5bd
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
4b504f2
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
97ed47f
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
08d610d
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
058be07
Update analyses/copy_number_consensus_call/Snakefile
nhatduongnn Jan 7, 2020
ac1b0e7
Update analyses/copy_number_consensus_call/README.md
nhatduongnn Jan 7, 2020
5c3e70b
cut the last column
Jan 7, 2020
007599d
Merge branch 'merge_consensus_files' of https://github.com/fingerfen/…
Jan 7, 2020
61eb681
add the result file with out the last column
Jan 7, 2020
4e14e9d
Merge branch 'master' into merge_consensus_files
nhatduongnn Jan 7, 2020
10586db
Merge branch 'master' into merge_consensus_files
nhatduongnn Jan 7, 2020
52dc3e2
Merge branch 'master' into merge_consensus_files
nhatduongnn Jan 7, 2020
9e35786
Merge branch 'master' into merge_consensus_files
nhatduongnn Jan 8, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion analyses/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
| [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds`, `pbta-gene-expression-rsem-fpkm.stranded.rds`, `gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds`, `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download)
| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds`, `pbta-gene-expression-rsem-tpm.stranded.rds` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz`, `pbta-cnv-controlfreec.tsv.gz`, `pbta-sv-manta.tsv.gz` | *In progress*; will produce consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) | N/A
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz`, `pbta-cnv-controlfreec.tsv.gz`, `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) | `cnv_consensus.tsv`
| [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
| [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz`, `pbta-cnv-controlfreec.tsv.gz`, `pbta-gene-expression-rsem-fpkm.polya.rds`, `pbta-gene-expression-rsem-fpkm.stranded.rds` | Maps from copy number variant caller segments to gene identifiers; will eventually be updated to use consensus copy number calls ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `cnvkit_annotated_cn_autosomes.tsv.gz`, `cnvkit_annotated_cn_x_and_y.tsv.gz`, `controlfreec_annotated_cn_autosomes.tsv.gz`, `controlfreec_annotated_cn_x_and_y.tsv.gz`
| [`fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | `pbta-fusion-arriba.tsv.gz`, `pbta-fusion-starfusion.tsv.gz` | Standardizes, filters, and prioritizes fusion calls | `pbta-fusion-putative-oncogenic.tsv`, `pbta-fusion-recurrent-fusion-byhistology.tsv`, `pbta-fusion-recurrent-fusion-bysample.tsv` (included in data download)
Expand Down
61 changes: 61 additions & 0 deletions analyses/copy_number_consensus_call/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# Copy Number Consensus Call

## Overview

The PBTA data set contains CNVs called from different callers, ie. Manta, CNVkit, and Freec.
The goal is to use all of these callers to reduce false positives and come up with a final consensus list of CNVs.
This analysis uses information from the following files generated from the 3 callers

* `pbta-cnv-cnvkit.seg.gz`
* `pbta-cnv-controlfreec.tsv.gz`
* `pbta-sv-manta.tsv.gz`

## Running the pipeline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest moving this section above the methods, as it is likely to be needed more often by people coming into the analysis/rerunning it.


To run the entire pipeline, make sure to have the latest release of the three input files mentioned in the Overview section.
Go to OpenPBTA-analysis/analyses/copy_number_consensus_call and run `bash run_consensus_call.sh`

## Methods

This pipeline revolves around the use of Snakemake to run analysis for each patient sample. The overview of the steps are as followed:

1) Parse through the 3 input files and put CNVs of the **same caller and sample** in the same files.
2) Remove any sample/caller combination files with **more than 2500** CNVs called.
We belive these to be noisy/poor quality samples (this came from what GISTIC uses as a cutoff for noisy samples).
3) Create a `config_snakemake.yaml` that contains all of the samples names to run the Snakemake pipeline
4) Run the Snakemake pipeline to perform analysis **per sample**.
5) Filter for any CNVs that are over a certain **SIZE_CUTOFF** (default 3000 bp)
6) Filter for any **significant** CNVs called by Freec (default pval = 0.01)
7) Filter out any CNVs that overlap 50% or more with **IGLL, telomeric, centromeric, seg_dup regions**
8) Merge any CNVs of the same sample and call method if they **overlap or within 10,000 bp** (We consider CNV calls within 10,000 bp the same CNV)
9) Reformat the columns of the files (So the info are easier to read)
10) **Call consensus** by comparing CNVs from 2 call methods at a time.

Since there are 3 callers, there were 3 comparisons: `manta-cnvkit`, `manta-freec`, and `cnvkit-freec`. If a CNV from 1 caller **overlaps 50% or more** with at least 1 CNV from another caller, the common region of the overlapping CNV would be the new CONSENSUS CNV.

11) **Sort and merge** the CNVs from the comparison pairs ,`manta-cnvkit` `manta-freec` `cnvkit-freec`, together into 1 file
12) After every samples' consensus CNVs were called, **combine all merged files** from step 10 and output to `results/cnv_consensus.tsv`

## Example Output File

```
chrom start end manta_CNVs cnvkit_CNVs freec_CNVs CNV_type Biospecimen file_names
chr11 771036 866778 NULL 770516:866778:3 771036:871536:3 DUP BS_007JTNB8 BS_007JTNB8.cnvkit_freec.dup.bed
chr13 99966948 99991872 NULL 99954829:99994557:3 99966948:99991872:3 DUP BS_007JTNB8 BS_007JTNB8.cnvkit_freec.dup.bed
chr14 103515996 103563240 NULL 103515996:103563363:3 103511784:103541532:3,103543140:103563240:3 DUP BS_007JTNB8 BS_007JTNB8.cnvkit_freec.dup.bed
```

* The 1st line of the file is the header which contains the column names. There are 9 columns in total
* The 2nd line is the first CNV of the file.
* Column 1 is the **consensus** CNV chromosome
* Column 2 is the **consensus** CNV start location
* Column 3 is the **consensus** CNV end location
* Columns 4, 5, and 6 contain the calls of Manta, CNVkit, and Freec that make up the **consensus** CNV described in columns 1, 2, and 3.
* ie. If there is info in column 4, that means one or more CNVs called from Manta made up the current **consensus** CNV described in columns 1, 2, and 3.
* Columns 4, 5, and 6 have the following format: `START:END:COPY_NUMBER,START:END:COPY_NUMBER`
* Note that if there is more than one original CNV call corresponding to a given consensus CNV from a given caller, the information for each of the CNV calls will be comma separated.
* In the example output above column 6 of line 4 contains `103511784:103541532:3,103543140:103563240:3` which means 2 CNVs called by FreeC helped to make up the **consensus** CNV on line 4.
One has the start and end coordinates of `103511784:103541532` **on the same chromosome** and has a copy number of `3` and another has the coordinates `103543140:103563240` and has a copy number of `3`.
* Column 7 is the CNVtype. This will be one of DUP or DEL, corresponding to duplications or deletions, respectively. Note that this does not describe the number of copies, only the direction of the copy number change.
* Column 8 is the Sample name
* Column 9 contains the name of of the two-caller consensus files (`manta-cnvkit` `manta-freec` `cnvkit-freec`) that made up the final **consensus** CNV.
44 changes: 39 additions & 5 deletions analyses/copy_number_consensus_call/Snakefile
Original file line number Diff line number Diff line change
@@ -1,8 +1,5 @@
## Define the ending file(s) that we want
OUTPUT= expand("../../scratch/interim/{sample}.{combined_caller}.{dupdel}.bed",
sample=config["samples"],
combined_caller=["manta_cnvkit", "manta_freec", "cnvkit_freec"],
dupdel=["dup", "del"]),
OUTPUT = "results/cnv_consensus.tsv",

## Define the wildcards to use in the file names.
wildcard_constraints:
Expand Down Expand Up @@ -161,4 +158,41 @@ rule compare_cnv_methods:
"--sample {params.sample_name}"



rule combine_merge_paired_cnv:
input:
## Define the location of the input file
manta_cnvkit="../../scratch/interim/{sample}.manta_cnvkit.{dupdel}.bed",
manta_freec="../../scratch/interim/{sample}.manta_freec.{dupdel}.bed",
cnvkit_freec="../../scratch/interim/{sample}.cnvkit_freec.{dupdel}.bed"
output:
## Define the output files' names
merged="../../scratch/endpoints/{sample}.{dupdel}.merged.final.bed"
threads: 1
shell:
## Combine the input file, sort and output to one file
## Columns 4, 5, and 6 hold the original CNV calls from Manta, CNVkit, and Freec, respectively.
## We want to retain info in these columns when merging these files so we use COLLAPSE to keep the information in these columns
## Columns 7 and 8 are the CNVtype (DEL, DUP) and Sample_name, respectively.
## AT THIS POINT, these columns of the input files hold the same values, thus we perform DISTINCT, which is to take the unique of columns 7 and 8.
## As for column 9, this column holds the files that were merged to get a specific CNV. We want to keep all information here so we COLLAPSE it.
"cat {input.manta_cnvkit} {input.manta_freec} {input.cnvkit_freec} "
"| sort -k1,1 -k2,2n "
"| bedtools merge -c 4,5,6,7,8,9 -o collapse,collapse,collapse,distinct,distinct,collapse "
"> {output.merged}"


rule merge_all:
input:
## Take all of the del and dup files of ALL samples as input. If there are 200 samples, there will be 400 files total
bedfiles = expand("../../scratch/endpoints/{sample}.{dupdel}.merged.final.bed",
sample = config["samples"],
dupdel = ["dup", "del"])
output:
## Define the final output file
"results/cnv_consensus.tsv"
shell:
## Add a header to the file and combine all of the files using cat.
"echo -e 'chrom\tstart\tend\tmanta_CNVs\tcnvkit_CNVs\tfreec_CNVs\tCNV_type\tBiospecimen\tfile_names' "
" | cat - {input.bedfiles} "
" | cut -f 1-8 "
" > {output}"
Loading