CNV consensus (6 of 6): Merge consensus files and name columns #403
Conversation
sample=config["samples"],
combined_caller=["manta_cnvkit", "manta_freec", "cnvkit_freec"],
dupdel=["dup", "del"]),
OUTPUT= expand("../../scratch/endpoints/{sample}.del_dup.merged.final.bed",
If these are the final output files, we should probably put them in a results directory within the analysis directory. Perhaps results/cnv_consensus/{sample}.cnv_consensus.csv or similar.
You are right, I forgot to put them in the "results" folder. I have made the changes to reflect this.
Hi @fingerfen, thanks for this update!
I have a number of comments associated with some of the individual steps, but here I want to step back a bit and think about what the final file we want out of this is going to look like.
I don't think we will want hundreds of individual .bed files, which seems to be the current output. Instead, what I think we ultimately want is a single .csv file that includes all samples. The files you are already producing include that information, so I think it should be pretty straightforward to accomplish this.
If I am reading everything correctly, you may be able to reduce this to the following steps:
- combine and merge the paired-caller duplication and deletion files, using the rule at line 162 (which I suggest renaming). Adding the `bedtools merge` at this stage will reduce redundancy in the code.
- `cat` all the files together into a single csv file, adding the header along the way. Making use of snakemake's `expand()` here, you should be able to do this in a short bit of code. Something like:
rule merge_all:
    input:
        bedfiles = expand("../../scratch/endpoints/{sample}.{dupdel}.merged.final.bed",
                          sample = config["samples"],
                          dupdel = ["dup", "del"])
    output:
        "results/cnv_consensus.tsv"
    shell:
        "echo -e 'chrom\tstart\tend\tmanta_CNVs\tcnvkit_CNVs\tfreec_CNVs\tCNV_type\tsample\tfile_names' "
        " | cat - {input.bedfiles} "
        " > {output}"
Let me know if that makes sense, or if I am missing something in your current code.
If that does work out, then it will make sense to include the final results/cnv_consensus.tsv
file in this PR, so you should make sure you have downloaded the current data release and you can run your pipeline on the full data set.
Along with that, I would really like to have a README file included in this PR that summarizes the approach and output of this analysis. You will also probably want to get started with a description of what you did to add to the methods section, which you can submit as a PR to https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md.
Finally, I also want to make sure you are aware of issue #392, where there is some discussion of the final output format that people might be looking for. I think the current output is a good start, but we may want to think about if there are good summaries we could add to the final calls. I would be happy to hear if you have any thoughts you could add to that discussion!
threads: 1
shell:
    ## Name the headers
    "python3 {input.script} --file {input.combined_dup_del} > {output.final_dup_del}"
This seems like a very heavy solution for just adding a single header line to the files. Is there a reason to go through python here, rather than just using `echo` and `cat`? Something like:
"echo -e 'chrom\tstart\tend\tmanta_CNVs\tcnvkit_CNVs\tfreec_CNVs\tCNV_type\tsample\tfile_names' "
" | cat - {input.combined_dup_del} > {output.final_dup_del}"
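Outside of Snakemake, the same echo/cat pattern can be sanity-checked on a throwaway file (the file names and the single data row here are made up for the demo):

```shell
# Hypothetical headerless BED fragment standing in for {input.combined_dup_del}
printf '1\t10000\t20000\tDUP\n' > demo.bed

# Prepend a header line, exactly as the rule's shell command would
echo -e 'chrom\tstart\tend\tCNV_type' | cat - demo.bed > demo.final.bed
```

The `cat -` reads the header from stdin (the pipe) before the file contents, so the header lands on the first line without any temporary files.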
There is no reason. It completely escaped my mind that I could do this using echo and cat. Your suggestion is way better. I will apply this change. Thanks again!
threads: 1
shell:
    ## Merge the consensus files together
    ## Put the dup CNV calls and del CNV calls together into one file
Can you explain in the comments the reasons for the additional bedtools arguments? In particular, the collapse and distinct statements?
I definitely can. I will briefly explain it here to see if it makes sense for you.
Columns 4, 5, and 6 hold the original CNV calls from Manta, CNVkit, and Freec, respectively. When we merge the manta_cnvkit, manta_freec, and cnvkit_freec files together, manta_cnvkit and manta_freec will both have information in column 4 (the manta column). Thus, when merging with bedtools, we want to COLLAPSE this information and keep everything that is in column 4. The same applies to columns 5 and 6.
Columns 7 and 8 are the CNV type (DEL, DUP) and sample name, respectively. Among the manta_cnvkit, manta_freec, and cnvkit_freec files, at this point we should have the same CNV type and sample name, so we use DISTINCT, which keeps only the unique values of columns 7 and 8.
As for column 9, this column holds the files that were merged to get a specific CNV. We want to keep all of this information, so we COLLAPSE it.
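To illustrate the difference, here is a toy Python model of the two bedtools merge `-o` operations (the column values are invented, and this is a sketch of the semantics, not bedtools itself):

```python
# Toy model of bedtools merge's -o operations on a merged column:
# "collapse" keeps every value, including repeats; "distinct" keeps
# only the unique values.
def collapse(values):
    return ",".join(values)

def distinct(values):
    # sorted() gives a deterministic order for the unique values
    return ",".join(sorted(set(values)))

# Hypothetical column-4 (manta) entries from an overlapping segment
# that appears in both the manta_cnvkit and manta_freec files:
manta_col = ["1:10000-20000", "1:10000-20000"]
print(collapse(manta_col))  # 1:10000-20000,1:10000-20000
print(distinct(manta_col))  # 1:10000-20000
```

This also shows why COLLAPSE on columns 4-6 can leave duplicated entries when the same original call reached the merge through two different pair files.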
For columns 4, 5 and 6, will any of this result in duplication? For example, if the same original manta call was used in manta_cnvkit and manta_freec, will it now show up more than once? If so, I think we may want to clean that up in the final file, but that may require a bit of extra logic, so we would put it off to a separate PR.
analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py (resolved)
"cat {input.manta_cnvkit} {input.manta_freec} {input.cnvkit_freec} | sort -k1,1 -k2,2n > {output.dup_del_file}"

rule merge:
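The cat-and-sort step above can be exercised on its own with made-up fragments (the file names are placeholders for the three pair files):

```shell
# Stand-ins for {input.manta_cnvkit} and friends
printf '2\t500\t900\n' > pair_a.bed
printf '1\t2000\t3000\n1\t100\t200\n' > pair_b.bed

# Concatenate, then sort by chromosome (text) and start position (numeric),
# mirroring the rule's shell command
cat pair_a.bed pair_b.bed | sort -k1,1 -k2,2n > dup_del.bed
```

Sorting with `-k1,1 -k2,2n` (rather than a plain `sort`) is what bedtools expects downstream: chromosome lexically, start coordinate numerically.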
With the changes above in the combine_files rule, this rule would change to just the `cat` of dup and del files. As such, we might want to combine this with the column names rule, though see earlier comments about combining the .bed files.
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
@jashapiro, thank you for your review and suggestions. Once again, they were very helpful. I think I have applied all of the changes you suggested. Please have a look. I am currently working on the README.md and downloading the newest data release to run the entire pipeline on it from the beginning, as suggested. I will let you know when the code finishes running on the newly downloaded data, hopefully by tomorrow. I will also be making changes to the methods.md file in a different PR, as you suggested. Thank you!
Thanks @fingerfen, this all looks good from a first look, and it passed CI. Let me know if you run into any trouble with running it on the full data set, and I look forward to seeing the README when you get it done.
@jashapiro I have made the updates. For some reason CI is failing; I checked the error and I believe it doesn't have anything to do with this analysis. I think I will have to wait for the error to be resolved.
I think CI was just getting cancelled as you added each new change. The latest commit passed CI, so everything looks good on the CI front.
This looks great! I have a few clarification questions, and one where I suggest a change to a column name to correspond better with naming conventions of the project.
The other thing I was thinking about is whether to retain the filenames in the final output. My inclination is not to, as those files won't be coming along, so most users won't have them. I think the easiest way to remove them is to use `cut`; adding `cut -f 1-8` to the merge_all rule pipe should handle things, but check that I am right!
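A quick check of that cut suggestion on a fabricated nine-column row (the values are invented; column 9 stands in for file_names):

```shell
# Hypothetical row from the merge_all output, with file_names in column 9
printf 'chr1\t100\t200\tm1\tc1\tf1\tDUP\tS1\ta.bed,b.bed\n' > full.tsv

# Keep only the first eight tab-separated columns, dropping file_names
cut -f 1-8 full.tsv > trimmed.tsv
```

`cut` defaults to tab as its field delimiter, so no extra flags are needed for a .tsv.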
…OpenPBTA-analysis into merge_consensus_files
I think we are really close here, and if CI passes, we can merge this commit. However, before we call this analysis "done" I would like to add one more step to your analysis to clean things up just a bit. I looked at the output file, and it does look like there is some duplicated information in there that could cause confusion down the line, but it should be pretty straightforward to fix at this stage. For example, look at the consensus CNV calls: the manta_CNV column there contains two copies of the same location, followed by NULL. It would be nice to write a script that parses through and removes duplicates and NULL values. If you have time to do that soon, let me know, or I can tackle it. To be clear, this would be a separate pull request (7 of 6, so to speak), as I think this one has enough in it already.
@jashapiro I agree that the NULL should be removed, and I can write a script to address that, no problem. My main concern here is the duplication. Thanks for catching that; it is not supposed to contain duplicated segments. I am curious as to why it has duplicated information and I would like to go back and find out the reason for it. I will take a look at it tonight and hopefully update you about the situation tomorrow.
So I found out the reason why there are duplicates. It happens during the final bedtools merging step, where columns 4, 5, and 6 get COLLAPSED together. In this step, there are 3 files that get merged. When inspecting these files, the same original call can appear in more than one of them, so it shows up twice after collapsing. I think a "solution" to this is to change COLLAPSE to DISTINCT in the merging step. However, I don't think this would get rid of the duplicated segments in more complicated situations. With this, I think it might be useful to make a script to remove duplicates and NULLs from columns 4, 5, and 6. Please let me know what you think! :)
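A sketch of what such a cleanup script could do for one collapsed column (the function name and the NULL marker are assumptions for illustration, not the project's actual code):

```python
def clean_collapsed(field, null="NULL"):
    """Drop NULL placeholders and repeated entries from a collapsed,
    comma-separated column value, preserving first-seen order."""
    seen = set()
    kept = []
    for value in field.split(","):
        if value == null or value in seen:
            continue
        seen.add(value)
        kept.append(value)
    # If everything was NULL, keep a single NULL so the column stays non-empty
    return ",".join(kept) if kept else null

print(clean_collapsed("1:100-200,NULL,1:100-200"))  # 1:100-200
```

The same function would be applied to columns 4, 5, and 6 of each row; columns handled with DISTINCT would pass through unchanged.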
@fingerfen I suspected that was the cause, so I am glad to see you came to the same conclusion. I also agree that using …
There is still a deduplication step needed for columns 4, 5, and 6, but this looks good to merge for now.
Purpose/implementation Section
To add the last steps to the consensus calling pipeline. These steps merge any overlapping consensus segments and name the columns.
What GitHub issue does your pull request address?
#128
analyses/README.md