CNV consensus (6 of 6): Merge consensus files and name columns #403
Conversation
sample=config["samples"],
combined_caller=["manta_cnvkit", "manta_freec", "cnvkit_freec"],
dupdel=["dup", "del"]),
OUTPUT= expand("../../scratch/endpoints/{sample}.del_dup.merged.final.bed",
If these are the final output files, we should probably put them in a results directory within the analysis directory. Perhaps results/cnv_consensus/{sample}.cnv_consensus.csv or similar.
You are right, I forgot to put them in the "results" folder. I have made the changes to reflect this.
Hi @fingerfen, thanks for this update!
I have a number of comments associated with some of the individual steps, but here I want to step back a bit and think about what the final file we want out of this is going to look like.
I don't think we will want hundreds of individual .bed files, which seems to be the current output. Instead, what I think we ultimately want is a single .csv file that includes all samples. The files you are already producing include that information, so I think it should be pretty straightforward to accomplish this.
If I am reading everything correctly, you may be able to reduce this to the following steps:
- combine and merge the paired-caller duplication and deletion files, using the rule at line 162 (which I suggest renaming). Adding the `bedtools merge` at this stage will reduce redundancy in the code.
- `cat` all the files together into a single csv file, adding the header along the way. Making use of snakemake's `expand()` here, you should be able to do this in a short bit of code. Something like:
rule merge_all:
    input:
        bedfiles = expand("../../scratch/endpoints/{sample}.{dupdel}.merged.final.bed",
                          sample = config["samples"],
                          dupdel = ["dup", "del"])
    output:
        "results/cnv_consensus.tsv"
    shell:
        "echo -e 'chrom\tstart\tend\tmanta_CNVs\tcnvkit_CNVs\tfreec_CNVs\tCNV_type\tsample\tfile_names' "
        " | cat - {input.bedfiles} "
        " > {output}"
Let me know if that makes sense, or if I am missing something in your current code.
If that does work out, then it will make sense to include the final results/cnv_consensus.tsv
file in this PR, so you should make sure you have downloaded the current data release and you can run your pipeline on the full data set.
Along with that, I would really like to have a README file included in this PR that summarizes the approach and output of this analysis. You will also probably want to get started with a description of what you did to add to the methods section, which you can submit as a PR to https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md.
Finally, I also want to make sure you are aware of issue #392, where there is some discussion of the final output format that people might be looking for. I think the current output is a good start, but we may want to think about if there are good summaries we could add to the final calls. I would be happy to hear if you have any thoughts you could add to that discussion!
threads: 1
shell:
    ## Name the headers
    "python3 {input.script} --file {input.combined_dup_del} > {output.final_dup_del}"
This seems like a very heavy solution for just adding a single header line to the files. Is there a reason to go through python here, rather than just using `echo` and `cat`? Something like:
"echo -e 'chrom\tstart\tend\tmanta_CNVs\tcnvkit_CNVs\tfreec_CNVs\tCNV_type\tsample\tfile_names' "
" | cat - {input.combined_dup_del} > {output.final_dup_del}"
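Outside of Snakemake, the same echo/cat pattern can be sanity-checked on a throwaway file (the file names and the single data row here are made up for the demo):

```shell
# Hypothetical headerless BED fragment standing in for {input.combined_dup_del}
printf '1\t10000\t20000\tDUP\n' > demo.bed

# Prepend a header line, exactly as the rule's shell command would
echo -e 'chrom\tstart\tend\tCNV_type' | cat - demo.bed > demo.final.bed
```

The `cat -` reads the header from stdin (the pipe) before the file contents, so the header lands on the first line without any temporary files.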
There is no reason. It completely escaped my mind that I could do this using echo and cat. Your suggestion is way better. I will apply this change. Thanks again!
threads: 1
shell:
    ## Merge the consensus files together
    ## Put the dup CNV calls and del CNV calls together into one file
Can you explain in the comments the reasons for the additional bedtools arguments? In particular, the collapse and distinct statements?
I definitely can. I will briefly explain it here to see if it makes sense for you.
Columns 4, 5, and 6 hold the original CNV calls from Manta, CNVkit, and Freec, respectively. When we merge the manta_cnvkit, manta_freec, and cnvkit_freec files together, manta_cnvkit and manta_freec will both have information in column 4 (the manta column). Thus, when merging with bedtools, we want to COLLAPSE this information and keep everything that is in column 4. The same applies to columns 5 and 6.
Columns 7 and 8 are the CNV type (DEL, DUP) and sample name, respectively. Among the manta_cnvkit, manta_freec, and cnvkit_freec files, at this point we should have the same CNV type and sample name, so we use DISTINCT, which keeps only the unique values of columns 7 and 8.
As for column 9, this column holds the files that were merged to get a specific CNV. We want to keep all of this information, so we COLLAPSE it.
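To illustrate the difference, here is a toy Python model of the two bedtools merge `-o` operations (the column values are invented, and this is a sketch of the semantics, not bedtools itself):

```python
# Toy model of bedtools merge's -o operations on a merged column:
# "collapse" keeps every value, including repeats; "distinct" keeps
# only the unique values.
def collapse(values):
    return ",".join(values)

def distinct(values):
    # sorted() gives a deterministic order for the unique values
    return ",".join(sorted(set(values)))

# Hypothetical column-4 (manta) entries from an overlapping segment
# that appears in both the manta_cnvkit and manta_freec files:
manta_col = ["1:10000-20000", "1:10000-20000"]
print(collapse(manta_col))  # 1:10000-20000,1:10000-20000
print(distinct(manta_col))  # 1:10000-20000
```

This also shows why COLLAPSE on columns 4-6 can leave duplicated entries when the same original call reached the merge through two different pair files.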
For columns 4, 5 and 6, will any of this result in duplication? For example, if the same original manta call was used in manta_cnvkit and manta_freec, will it now show up more than once? If so, I think we may want to clean that up in the final file, but that may require a bit of extra logic, so we would put it off to a separate PR.
analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py (resolved)
"cat {input.manta_cnvkit} {input.manta_freec} {input.cnvkit_freec} | sort -k1,1 -k2,2n > {output.dup_del_file}"

rule merge:
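The cat-and-sort step above can be exercised on its own with made-up fragments (the file names are placeholders for the three pair files):

```shell
# Stand-ins for {input.manta_cnvkit} and friends
printf '2\t500\t900\n' > pair_a.bed
printf '1\t2000\t3000\n1\t100\t200\n' > pair_b.bed

# Concatenate, then sort by chromosome (text) and start position (numeric),
# mirroring the rule's shell command
cat pair_a.bed pair_b.bed | sort -k1,1 -k2,2n > dup_del.bed
```

Sorting with `-k1,1 -k2,2n` (rather than a plain `sort`) is what bedtools expects downstream: chromosome lexically, start coordinate numerically.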
With the changes above in the combine_files rule, this rule would change to just the `cat` of dup and del files. As such, we might want to combine this with the column names rule, though see earlier comments about combining the .bed files.
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
@jashapiro, thank you for your review and suggestions. Once again, they were very helpful. I think I have applied all of the changes you suggested. Please have a look. I am currently working on the README.md and downloading the newest data release to run the entire pipeline on it from the beginning, as suggested. I will let you know when the code finishes running on the newly downloaded data, hopefully by tomorrow. I will also be making changes to the methods.md file in a different PR, as you suggested. Thank you!
Thanks @fingerfen, this all looks good from a first look, and it passed CI. Let me know if you run into any trouble with running it on the full data set, and I look forward to seeing the README when you get it done.
@jashapiro I have made the updates. For some reason CI is failing; I checked the error and I believe it doesn't have anything to do with this analysis. I think I will have to wait for the error to be resolved.
I think CI was just getting cancelled as you added each new change. The latest commit passed CI, so everything looks good on the CI front.
This looks great! I have a few clarification questions, and one where I suggest a change to a column name to correspond better with naming conventions of the project.
The other thing I was thinking about is whether to retain the filenames in the final output. My inclination is not to, as those files won't be coming along, so most users won't have them. I think the easiest way to remove them is to use `cut`; adding `cut -f 1-8` to the merge_all rule pipe should handle things, but check that I am right!
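A quick check of that cut suggestion on a fabricated nine-column row (the values are invented; column 9 stands in for file_names):

```shell
# Hypothetical row from the merge_all output, with file_names in column 9
printf 'chr1\t100\t200\tm1\tc1\tf1\tDUP\tS1\ta.bed,b.bed\n' > full.tsv

# Keep only the first eight tab-separated columns, dropping file_names
cut -f 1-8 full.tsv > trimmed.tsv
```

`cut` defaults to tab as its field delimiter, so no extra flags are needed for a .tsv.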
…OpenPBTA-analysis into merge_consensus_files
I think we are really close here, and if CI passes, we can merge this commit. However, before we call this analysis "done" I would like to add one more step to your analysis to clean things up just a bit. I looked at the output file, and it does look like there is some duplicated information in there that could cause confusion down the line, but it should be pretty straightforward to fix at this stage. For example, look at the consensus CNV calls: the manta_CNV column there contains two copies of the same location, followed by NULL. It would be nice to write a script that parses through and removes duplicates and NULL values. If you have time to do that soon, let me know, or I can tackle it. To be clear, this would be a separate pull request (7 of 6, so to speak), as I think this one has enough in it already.
@jashapiro I agree that the NULL should be removed, and I can write a script to address that, no problem. My main concern here is the duplication. Thanks for catching that; it is not supposed to contain duplicated segments. I am curious as to why it has duplicated information and I would like to go back and find out the reason for it. I will take a look at it tonight and hopefully update you about the situation tomorrow.
So I found out the reason why there are duplicates. It happens during the final bedtools merging step, where columns 4, 5, and 6 get COLLAPSED together. In this step, there are 3 files that get merged. When inspecting these files, the same original call can appear in more than one of them, so it shows up twice after collapsing. I think a "solution" to this is to change COLLAPSE to DISTINCT in the merging step. However, I don't think this would get rid of the duplicated segments in more complicated situations. With this, I think it might be useful to make a script to remove duplicates and NULLs from columns 4, 5, and 6. Please let me know what you think! :)
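A sketch of what such a cleanup script could do for one collapsed column (the function name and the NULL marker are assumptions for illustration, not the project's actual code):

```python
def clean_collapsed(field, null="NULL"):
    """Drop NULL placeholders and repeated entries from a collapsed,
    comma-separated column value, preserving first-seen order."""
    seen = set()
    kept = []
    for value in field.split(","):
        if value == null or value in seen:
            continue
        seen.add(value)
        kept.append(value)
    # If everything was NULL, keep a single NULL so the column stays non-empty
    return ",".join(kept) if kept else null

print(clean_collapsed("1:100-200,NULL,1:100-200"))  # 1:100-200
```

The same function would be applied to columns 4, 5, and 6 of each row; columns handled with DISTINCT would pass through unchanged.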
@fingerfen I suspected that was the cause, so I am glad to see you came to the same conclusion. I also agree that using …
There is still a deduplication step needed for columns 4, 5, and 6, but this looks good to merge for now.
Purpose/implementation Section
To add the last steps to the consensus calling pipeline. These steps merge any overlapping consensus segments and name the columns.
What GitHub issue does your pull request address?
#128
analyses/README.md