CNV consensus (5 of 6):Consensus call #357

nhatduongnn · 2019-12-19T02:40:51Z

Purpose/implementation Section

Implement the consensus calling part of the pipeline

What GitHub issue does your pull request address?

issue #128

Which areas should receive a particularly close look?

The compare_variant_calling_updated.py script.

Is there anything that you want to discuss further?

I think this script is the bottleneck of the pipeline. As discussed before, I could go back and update it to use a dictionary after the pipeline is in. A single call only takes 1-2 seconds to run, but it takes a while with 940 samples.

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.
This analysis is recorded in the table in analyses/README.md.

jashapiro

The body of this code looks good, though I have a couple of questions and suggestions for clarity.

There are some other comments where I am suggesting some structural changes around the outside to make it a bit clearer what is going on, and hopefully to improve maintainability. (Note that I am writing this after some of the individual comments, so there may be some discrepancies.)
To be clear, this is partly about style. As such, you might choose not to implement this now, but we could circle back to it if you have time later.

The main thing is that the python script requires three specific inputs and outputs, but that specific information is obscured in the body when we start to refer to things by index numbers in lists. My suggestions are designed to make that a bit more transparent.

The first thing is to split the code into three functions: input, consensus and output.

The input function will take a file name and return a list of data.
The consensus function will take two lists and output a consensus list
The output function will take a consensus list and output file name and write the output file.

Then, rather than using lists of lists, I would suggest using a dict to store the three caller inputs, or just storing them each in their own variables, since there are only three. Then you can use keys and values (or variable names and strings) to keep track of what's what within the functions.
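A minimal sketch of the suggested three-function layout, under stated assumptions: the name `read_input_file` appears later in this thread, while `find_consensus` and `write_output_file` are hypothetical names, and the overlap rule inside `find_consensus` is only a stand-in, not the pipeline's actual consensus logic.

```python
import csv


def read_input_file(path):
    """Read a tab-separated BED-like file into a list of rows (lists of strings)."""
    with open(path, newline="") as handle:
        return [row for row in csv.reader(handle, delimiter="\t") if row]


def find_consensus(calls1, calls2):
    """Stand-in consensus: intersect intervals that overlap on the same chromosome.

    This is NOT the real pipeline rule; it only makes the structure runnable.
    """
    consensus = []
    for chrom1, start1, end1, *_ in calls1:
        for chrom2, start2, end2, *_ in calls2:
            start = max(int(start1), int(start2))
            end = min(int(end1), int(end2))
            if chrom1 == chrom2 and start < end:
                consensus.append([chrom1, str(start), str(end)])
    return consensus


def write_output_file(consensus, path):
    """Write the consensus rows to a tab-separated file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(consensus)


# Caller inputs keyed by name instead of by list position (toy data).
calls = {
    "manta": [["chr1", "100", "500", "DEL"]],
    "freec": [["chr1", "300", "900", "DEL"]],
}
manta_freec_consensus = find_consensus(calls["manta"], calls["freec"])
print(manta_freec_consensus)
```

With the inputs keyed by caller name, each call site says which pair is being compared, which is the transparency the suggestion is after.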

As I said, this is a set of suggestions and not required for the analysis. If you do not plan to implement this, or you want to come back around to it later, feel free to reply with that, and we will go ahead and work on getting this merged with minimal changes.

Note that CI may fail if you make any new changes today, as we are updating a bunch of other parts of the repo at the moment. Make sure you update your branch, but if you see errors from other modules, just wait a bit and we can rerun once other parts are fixed.

jashapiro · 2019-12-19T16:18:44Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+
+## Put the input and output file paths into their own lists that is to be iterated over
+## This order is important
+list_of_files = [args.manta, args.cnvkit, args.freec]


Just for clarity and parallelism, you may want to call these something like input_list and output_list.

jashapiro · 2019-12-19T16:23:00Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+            ## If the 1st column has '1' instead of 'chr1' for the chromosome number, add in 'chr'
+            if fin_input_content[0][0].find('chr') == -1:
+                for c,chromo in enumerate(fin_input_content):
+                    fin_input_content[c][0] = 'chr' + fin_input_content[c][0]


This should not occur here, since the changes in #328 mean we are getting consistent bed files.

jashapiro · 2019-12-19T16:26:03Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+list_of_output_files = [args.manta_cnvkit, args.manta_freec, args.cnvkit_freec]
+
+## Create a list to store the content of the 3 caller CNV files
+list_of_list = []


Can we name this something more descriptive?

jashapiro · 2019-12-19T16:34:12Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+## Loop through list_index
+for j, jval in enumerate(list_index):
+    for k,kval in enumerate(list_index[j+1:]):


While this way of doing things is flexible, I feel like it adds some abstraction that might be confusing. Since we are only doing three comparisons, and we know what they are from the inputs, I might suggest the following:

Turn the body of this double loop into a function that takes two lists as input, and outputs the consensus list for that pair.

Call that function three times, once for each pair of inputs, storing the outputs not in an unnamed list, but in variables that describe their content, e.g. manta_freec_consensus.

jashapiro · 2019-12-19T16:37:34Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+                if jval == 0 and kval == 1:
+                    overlap_chrom = [m[0],str(chrom_start),str(chrom_end),str(cnv_list1).strip(','),list2_chr_str_end.strip(','),'NULL',m[-1]]
+
+                ## if jval == 0 (manta) and kval == 2 (freec), put info in column 4 and 6, 5th column is null
+                elif jval == 0 and kval == 2:
+                    overlap_chrom = [m[0],str(chrom_start),str(chrom_end),str(cnv_list1).strip(','),'NULL',list2_chr_str_end.strip(','),m[-1]]
+
+                ## if jval == 1 (cnvkit) and kval == 2 (freec), put info in column 5 and 6, 4th column is null
+                elif jval == 1 and kval == 2:
+                    overlap_chrom = [m[0],str(chrom_start),str(chrom_end),'NULL',str(cnv_list1).strip(','),list2_chr_str_end.strip(','),m[-1]]


If you make a function as I suggest, this would need to change: perhaps the function could simply take as an argument the output name, such as manta_freec and then use that as the variable for these cases.
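One way this could look, as a sketch rather than the script's actual code: the row builder receives the caller details keyed by name and fills `NULL` for whichever caller is not part of the pair. The function name `build_overlap_row`, the `CALLER_COLUMNS` constant, the coordinate strings, and the reading of `m[-1]` as a CNV type are all assumptions for illustration.

```python
# Order of the per-caller detail columns in the output row. Assumed from the
# diff above, where manta, cnvkit and freec occupy columns 4, 5 and 6.
CALLER_COLUMNS = ["manta", "cnvkit", "freec"]


def build_overlap_row(chrom, start, end, caller_details, cnv_type):
    """Build one output row, writing 'NULL' for callers absent from this pair.

    caller_details maps a caller name to its comma-joined CNV coordinates.
    """
    details = [caller_details.get(caller, "NULL") for caller in CALLER_COLUMNS]
    return [chrom, str(start), str(end)] + details + [cnv_type]


# A manta/freec overlap: the cnvkit column (5th) is filled with NULL.
row = build_overlap_row(
    "chr1", 300, 500,
    {"manta": "100:500", "freec": "300:900"},
    "DEL",
)
print("\t".join(row))
```

This replaces the three `if`/`elif` branches with a single lookup, so adding a caller would only mean extending `CALLER_COLUMNS`.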

jashapiro · 2019-12-19T16:39:00Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+        with open( list_of_output_files[i] , 'x') as file:
+            sys.stderr.write('$$$ Created new file succesfully\n')


Rather than only throwing an error, can we have an option to overwrite as needed? I am not sure whether snakemake deletes files before rerunning a step when the input files change.

I just made a few small files and did a test. It seems that Snakemake DOES delete the output file, then regenerate that file if the input is at a later time stamp than the output.

I put this step in so that the script doesn't overwrite any info that the user might not want deleted. But I think in the context of this pipeline, we could just add a step to overwrite the output file. What do you think?

In that case I don't think it matters much. I would personally have an option for overwriting, but I don't feel strongly.
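For illustration, an overwrite option could be a small wrapper around the file mode. This is a sketch only: the `--overwrite` flag and the `open_output` helper are hypothetical and not part of the script under review.

```python
import argparse
import os
import tempfile


def open_output(path, overwrite=False):
    """Open path for writing; refuse to clobber an existing file unless asked."""
    # Mode 'x' raises FileExistsError if the file exists; 'w' truncates it.
    return open(path, "w" if overwrite else "x")


parser = argparse.ArgumentParser()
parser.add_argument("--overwrite", action="store_true",
                    help="replace existing output files instead of failing")
args = parser.parse_args(["--overwrite"])  # simulated command line

with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "manta_freec.bed")
    with open_output(out_path) as handle:      # first exclusive write succeeds
        handle.write("first\n")
    try:
        open_output(out_path)                  # second exclusive open is refused
        refused = False
    except FileExistsError:
        refused = True
    with open_output(out_path, overwrite=args.overwrite) as handle:
        handle.write("second\n")
    with open(out_path) as handle:
        final_content = handle.read()

print(refused, repr(final_content))
```

Keeping `'x'` as the default preserves the protective behaviour for manual runs, while a workflow manager can pass the flag.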

jashapiro · 2019-12-19T16:44:38Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+                single_name = list_of_output_files[i].split('/')[-1]
+                sample_name = args.sample


Why are we adding the file name to each line? Could we simplify this to just the consensus pair name?

My thought is that people might be using the final consensus file for different purposes. Maybe people put CNVs from different files together. So with this, even if different CNVs from different files are put together, we still know which CNVs are from which files/samples. It is just an extra column IN CASE anybody needs it. I can definitely take it out if you don't think it is necessary.

Makes sense. I just thought the combination of sample id and callers would probably be sufficient. But I am fine with leaving it as is.

nhatduongnn · 2019-12-23T08:18:42Z

I am implementing the suggested changes and will make an update shortly. Sorry for the delay.

jashapiro

Hi Nhat-

Since you started to take my suggestions, I decided to make a few more minor ones to make things "more pythonic". Overall, the preference is to do things as directly as possible, without too many levels of indirection. The best example of this is the final output loop that I modified below, but I am sure there are other places where the same principle could be applied.

Again, this is mostly a style thing, but can also make the code more compact and readable.

A note: I will be away for the next week and a half, so I won't be able to review changes after today until the new year. I think there will be other people around though, so hopefully things will keep moving!

jashapiro · 2019-12-24T12:11:21Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+        for k in output_file_content:
+
+            ## Add the sample name and file name to each line
+            single_name = output_path.split('/')[-1]


Move this line outside the loop, as the path never changes.
You can also replace it with the safer os.path.basename(output_path) to avoid separator variation.
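A small sketch of both points, using a hypothetical path and toy rows: the file name is computed once before the loop, and `os.path.basename` takes the place of the manual split.

```python
import os

output_path = "results/sample1.manta_freec.bed"  # hypothetical path
output_file_content = [["chr1", "300", "500"], ["chr2", "10", "90"]]  # toy rows
sample_name = "sample1"

# Computed once, since the path does not change between rows.
single_name = os.path.basename(output_path)

lines = []
for k in output_file_content:
    lines.append('\t'.join(k) + '\t' + sample_name + '\t' + single_name + '\n')

print(single_name)
```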

jashapiro · 2019-12-24T12:16:02Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+
+            ## Add the sample name and file name to each line
+            single_name = output_path.split('/')[-1]
+            file.write('\t'.join(k) + '\t'+ sample_name + '\t' + single_name +'\n')


Minor style point: I would probably write this as follows, but your version is perfectly clear.

file.write('\t'.join(k + [sample_name, single_name]) + '\n')
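One detail worth keeping in mind for a one-line version: `list.extend` modifies the list in place and returns `None`, so the joined row has to be built with list concatenation instead. A quick check with hypothetical values:

```python
k = ["chr1", "300", "500"]
sample_name = "sample1"
single_name = "sample1.manta_freec.bed"

# extend() mutates its list and returns None, so it cannot be passed to join().
result = list(k).extend([sample_name, single_name])
print(result)

# Concatenation builds a new list and leaves k untouched.
line = '\t'.join(k + [sample_name, single_name]) + '\n'
print(repr(line))
```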

jashapiro · 2019-12-24T12:26:23Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+## Put the output file paths into their own lists that is to be iterated over
+## This order is important
+output_list = [args.manta_cnvkit, args.manta_freec, args.cnvkit_freec]
+
+## Make a list of iteraby index for the MAIN for-loop below
+## This give a list as followed: [0,1,2]
+list_index = list(range(0,len(dict_of_input_content)))


Rather than building these lists and indexes, we could simplify this to something like:

caller_pairs = [('manta', 'cnvkit'), ('manta', 'freec'), ('cnvkit', 'freec')]

Then you can use a single loop below, something like:

for caller1, caller2 in caller_pairs:
    list1 = dict_of_input_content[caller1]
    list2 = dict_of_input_content[caller2]
    ....

Thank you for the suggestion @jashapiro. The reason I did it my way was that I was trying to avoid hardcoding as much as possible. My intention was that when someone wants to change the code, they can go in and change it more easily.

I guess what I am struggling with is when to hardcode this information and when not to. However, I do agree that your method is simpler and definitely more clear. I will change the script to reflect that.

In this case it might be even better to try itertools.combinations. Does itertools.combinations(['manta', 'cnvkit', 'freec'], 2) do what you want in terms of getting caller pairs that you can iterate through?

I also wonder if you could use itertools.combinations over dict_of_input_content's keys to get something that's both clear and more general. I don't quite understand what jval and kval are so I can't provide feedback that's highly specific to this code.

Link for convenience to itertools.combinations: https://docs.python.org/3/library/itertools.html#itertools.combinations
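For reference, `itertools.combinations` emits pairs in the order of the input sequence, so the result depends on how the callers are listed. A short check, where `pair_names` is a hypothetical helper list for matching output argument names:

```python
from itertools import combinations

input_callers = ['manta', 'cnvkit', 'freec']

# Pairs follow the input order: first element fixed, then later elements.
caller_pairs = list(combinations(input_callers, 2))
print(caller_pairs)
# [('manta', 'cnvkit'), ('manta', 'freec'), ('cnvkit', 'freec')]

# Names such as 'manta_cnvkit' could then be derived from each pair.
pair_names = ['_'.join(pair) for pair in caller_pairs]
print(pair_names)
```

Listing the callers in a different order would produce differently named pairs (for example `freec_manta`), which matters if other scripts expect fixed file names.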

The only potential issue I see with this is if the order is not as expected: some of the surrounding code (in other scripts/snakefile) does depend on filenames being as expected (with the expected content!).

Thank you for the suggestion @jashapiro. The reason I did it my way was that I was trying to avoid hardcoding as much as possible. My intention was that when someone wants to change the code, they can go in and change it more easily.

I think this is a good goal, but given that there is already some hard-coding in here (notably the arguments), it is going to be hard to get rid of it all.

If removing hard coding were the full goal, I would probably simplify this script to take in two files and output the merged file, then move the combinations logic to snakemake. That would lose some efficiency because you would have to read in data more than once, but it is easier to extend. Having already moved the main merge logic to a function, this would be easy to do in the future if desired.

nhatduongnn · 2019-12-24T19:53:58Z

Thanks @jashapiro and @cgreene for the helpful suggestions as always. Since it's winter break, I have more time to work on this so I want to get it to a point where the scripts are clean, straightforward, and that everyone is happy with. Thus I am happy to implement any changes that you think will make the code better. I am also learning a lot by doing it.

With this new commit, I believe I have addressed all of your comments. Please have a look.

And thanks @jashapiro for letting me know about your availability. I only have one more step to PR and then probably another one to add a header (column names) for these bed files. After that, this should be done! I should be readily available to address any problems over this winter break.

Happy holidays to everyone and safe travels! :)

jashapiro · 2019-12-24T21:34:29Z

analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

+manta_dict = read_input_file(args.manta)
+cnvkit_dict = read_input_file(args.cnvkit)
+freec_dict = read_input_file(args.freec)
+
+## Put the input dictionaries into a bigger dictionary
+dict_of_input_content = {'manta':manta_dict, 'cnvkit':cnvkit_dict, 'freec':freec_dict}


Thinking a bit more on the idea of easy extensibility, how about something like this:

input_callers = ['manta', 'cnvkit', 'freec']
input_content = dict()
for caller in input_callers:
    input_content[caller] = read_input_file(getattr(args, caller))

Then you don't need to separately define input_file_names below (delete current line 276)
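To see the `getattr` lookup in isolation, here is a self-contained sketch. The argument names mirror `args.manta`, `args.cnvkit` and `args.freec` from the diff, but the parser setup, the file names, and the body of `read_input_file` are stand-ins for illustration only.

```python
import argparse

input_callers = ['manta', 'cnvkit', 'freec']

parser = argparse.ArgumentParser()
for caller in input_callers:
    parser.add_argument('--' + caller)
# Simulated command line with hypothetical file names.
args = parser.parse_args(['--manta', 'm.bed', '--cnvkit', 'c.bed', '--freec', 'f.bed'])


def read_input_file(path):
    """Stand-in for the real reader: just records which path was requested."""
    return {'path': path}


# getattr(args, 'manta') is equivalent to args.manta, so one loop covers all callers.
input_content = {caller: read_input_file(getattr(args, caller))
                 for caller in input_callers}
print(input_content)
```

Adding a caller then only requires extending `input_callers`, provided a matching command-line argument exists.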

jashapiro

I think this looks good. Sorry for the long delay in approving it! I'm mostly back from vacation now, so the next round should be faster!

Duong and others added 7 commits December 18, 2019 03:51

add to Snakefile

3f55855

resolve conflict

9a923a3

add step 5

24a051a

change Snakemake

6441e79

change Snakemake

83e0c88

change Snakemake

e6e18f8

add files

218c5a0

nhatduongnn changed the title ~~Consensus call~~ CNV consensus (5 of 6):Consensus call Dec 19, 2019

jashapiro reviewed Dec 19, 2019

nhatduongnn and others added 5 commits December 20, 2019 14:55

Update analyses/copy_number_consensus_call/src/scripts/compare_varian…

36a2a01

…t_calling_updated.py Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

Update analyses/copy_number_consensus_call/src/scripts/compare_varian…

b39f742

…t_calling_updated.py Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

Update analyses/copy_number_consensus_call/src/scripts/compare_varian…

41604f2

…t_calling_updated.py Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

Merge branch 'master' into consensus_call

e693c7d

Merge remote-tracking branch 'upstream/master' into consensus_call

1355266

Duong added 2 commits December 23, 2019 19:51

implemented dict

430c622

implemented dict

4b5ae8c

jashapiro reviewed Dec 24, 2019

nhatduongnn and others added 4 commits December 24, 2019 10:11

Update analyses/copy_number_consensus_call/src/scripts/compare_varian…

929a7e4

…t_calling_updated.py Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

Update analyses/copy_number_consensus_call/src/scripts/compare_varian…

17a5f2b

…t_calling_updated.py Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

minor changes

09976b2

clean up code and add itertools

cc96d79

jashapiro reviewed Dec 24, 2019

jashapiro reviewed Dec 24, 2019

nhatduongnn and others added 5 commits December 27, 2019 18:59

Update analyses/copy_number_consensus_call/src/scripts/compare_varian…

c1cc92e

…t_calling_updated.py Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

applying suggested changes

1bbb3c9

applying suggested changes

cb10268

applying suggested changes

76bd06d

Merge branch 'master' into consensus_call

711c475

Duong and others added 3 commits January 2, 2020 17:16

Merge remote-tracking branch 'upstream/master' into consensus_call

a634444

Merge branch 'consensus_call' of https://github.com/fingerfen/OpenPBT…

c580653

…A-analysis into consensus_call

Merge branch 'master' into consensus_call

6340dab

jashapiro approved these changes Jan 4, 2020

cgreene merged commit 1d6cf2c into AlexsLemonade:master Jan 4, 2020

jharenza mentioned this pull request Jan 13, 2020

Proposed Analysis: Copy number consensus calls #128

Closed
