Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

CNV consensus (5 of 6):Consensus call #357

Merged
merged 26 commits into from
Jan 4, 2020

Conversation

nhatduongnn
Copy link
Contributor

Purpose/implementation Section

Implement the consensus calling part of the pipeline

What GitHub issue does your pull request address?

issue #128

Which areas should receive a particularly close look?

The compare_variant_calling_updated.py script.

Is there anything that you want to discuss further?

I think this script is the bottle neck of the pipeline. As discussed before, I could go back to update this to use Dictionary after the pipeline is in. A singular call only take 1-2 second to run, but it takes a while with 940 samples

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.
  • This analysis is recorded in the table in analyses/README.md.

@nhatduongnn nhatduongnn changed the title Consensus call CNV consensus (5 of 6):Consensus call Dec 19, 2019
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The body of this code looks good, though I have a couple of questions and suggestions for clarity.

There are some other comments where I am suggesting some structural changes around the outside to make it a bit clearer what is going on, and hopefully to improve maintainability. (Note that I am writing this after some of the individual comments, so there may be some discrepancies.)
To be clear, this is partly about style. As such, you might choose not to implement this now, but we could circle back to it if you have time later.

The main thing is that the python script requires three specific inputs and outputs, but that specific information is obscured in the body when we start to refer to things by index numbers in lists. My suggestions are designed to make that a bit more transparent.

The first thing is to split the code into three functions: input, consensus and output.

  • The input function will take a file name and return a list of data.
  • The consensus function will take two lists and output a consensus list
  • The output function will take a consensus list and output file name and write the output file.

Then, rather than using lists of lists, I would suggest using a dict to store the three caller inputs, or just storing them each in their own variables, since there are only three. Then you can use keys and values (or variable names and strings) to keep track of what's what within the functions.

As I said, this is a set of suggestions and not required for the analysis. If you do not plan to implement this, or you want to come back around to it later, feel free to reply with that, and we will go ahead and work on getting this merged with minimal changes.

Note that CI may fail if you make any new changes today, as we are updating a bunch of other parts of the repo at the moment. Make sure you update your branch, but if you see errors from other modules, just wait a bit and we can rerun once other parts are fixed.


## Put the input and output file paths into their own lists that is to be iterated over
## This order is important
list_of_files = [args.manta, args.cnvkit, args.freec]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for clarity and parallelism, you may want to call these something like input_list and output_list

Comment on lines 91 to 94
## If the 1st column has '1' instead of 'chr1' for the chromosome number, add in 'chr'
if fin_input_content[0][0].find('chr') == -1:
for c,chromo in enumerate(fin_input_content):
fin_input_content[c][0] = 'chr' + fin_input_content[c][0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not occur in these, since we changed #328 so we are getting consistent bed files.

list_of_output_files = [args.manta_cnvkit, args.manta_freec, args.cnvkit_freec]

## Create a list to store the content of the 3 caller CNV files
list_of_list = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we name this something more descriptive?

Comment on lines 109 to 111
## Loop through list_index
for j, jval in enumerate(list_index):
for k,kval in enumerate(list_index[j+1:]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this way of doing things is flexible, I feel like it adds some abstraction that might be confusing. Since we are only doing three comparisons, and we know what they are from the inputs, I might suggest the following:

  • Turn the body of this double loop into a function that takes two lists as input, and outputs the consensus list for that pair.
  • Call that function three times, once for each pair of inputs, storing the outputs not in a an unnamed list, but in variables that describe their content: i.e. manta_freec_consensus.

Comment on lines 220 to 229
if jval == 0 and kval == 1:
overlap_chrom = [m[0],str(chrom_start),str(chrom_end),str(cnv_list1).strip(','),list2_chr_str_end.strip(','),'NULL',m[-1]]

## if jval == 0 (manta) and kval == 2 (freec), put info in column 4 and 6, 5th column is null
elif jval == 0 and kval == 2:
overlap_chrom = [m[0],str(chrom_start),str(chrom_end),str(cnv_list1).strip(','),'NULL',list2_chr_str_end.strip(','),m[-1]]

## if jval == 1 (cnvkit) and kval == 2 (freec), put info in column 5 and 6, 4th column is null
elif jval == 1 and kval == 2:
overlap_chrom = [m[0],str(chrom_start),str(chrom_end),'NULL',str(cnv_list1).strip(','),list2_chr_str_end.strip(','),m[-1]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make a function as I suggest, this would need to change: perhaps the function could simply take as an argument the output name, such as manta_freec and then use that as the variable for these cases.

Comment on lines 242 to 243
with open( list_of_output_files[i] , 'x') as file:
sys.stderr.write('$$$ Created new file succesfully\n')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than only throwing an error, can we have an option to overwrite as needed? I am not sure whether snakemake deletes files before rerunning a step when the input files change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made a few small files and did a test. It seems that Snakemake DOES delete the output file, then regenerate that file if the input is at a later time stamp than the output.

I put this step in so that the script doesn't override any info that the user might not want to be deleted. But I think in the context of this pipeline, we could just add a step to override the output file. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I don't think it matters much. I would personally have an option for overwriting, but I don't feel strongly.

Comment on lines 251 to 252
single_name = list_of_output_files[i].split('/')[-1]
sample_name = args.sample
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we adding the file name to each line? Could we simplify this to just the consensus pair name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought is that people might be using the final consensus file for different purposes. Maybe people put CNVs from different files together. So with this, even if different CNVs from different files are put together, we still know which CNVs are from which files/samples. It is just an extra column IN CASE anybody needs it. I can definitely take it out if you don't think it is necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I just thought the combination of sample id and callers would probably be sufficient. But I am fine with leaving it as is.

nhatduongnn and others added 5 commits December 20, 2019 14:55
…t_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
…t_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
…t_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
@nhatduongnn
Copy link
Contributor Author

I am implementing the suggested changes and will make an update shortly. Sorry for the delay.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Nhat-

Since you started to take my suggestions, I decided to make a few more minor ones to make things "more pythonic" Overall, the preference is to do things as directly as possible, without too many levels of indirection. The best example of this is the final output loop that I modified below, but I am sure there are other places that the same principal could be applied.

Again, this is mostly a style thing, but can also make the code more compact and readable.

A note: I will be away for the next week and a half, so I won't be able to review changes after today until the new year. I think there will be other people around though, so hopefully things will keep moving!

for k in output_file_content:

## Add the sample name and file name to each line
single_name = output_path.split('/')[-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this line outside the loop, as the path never changes.
you can also replace it with the safer os.path.basename(output_path) to avoid separator variation.


## Add the sample name and file name to each line
single_name = output_path.split('/')[-1]
file.write('\t'.join(k) + '\t'+ sample_name + '\t' + single_name +'\n')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style point.. I would probably write this as, but your version is perfectly clear.

file.write('\t'.join(k.extend([sample_name, single_name]) + '\n')

Comment on lines 251 to 257
## Put the output file paths into their own lists that is to be iterated over
## This order is important
output_list = [args.manta_cnvkit, args.manta_freec, args.cnvkit_freec]

## Make a list of iteraby index for the MAIN for-loop below
## This give a list as followed: [0,1,2]
list_index = list(range(0,len(dict_of_input_content)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than doing these lists and indexes, we could simplify this to something like this to keep it simpler

caller_pairs = [('manta', 'cnvkit'), ('manta', 'freec'), ('cnvkit', 'freec')]

Then you can use a single loop below, something like:

for caller1, caller2 in caller_pairs:
    list1 = dict_of_input_content[caller1]
    list2 = dict_of_input_content[caller2]
    ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion @jashapiro. The reason why I did it my way was because I was trying to avoid hardcoding as much as possible. My intention was that when someone want to change the code, they can go in and be able to change the code more easily.

I guess what I am struggling with is when to hardcode this information and when not to. However, I do agree that your method is simpler and definitely more clear. I will change the script to reflect that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it might be even better to try itertools.combinations. Does itertools.combinations(['manta', 'cnvkit', 'freec'], 2) do what you want in terms of getting caller pairs that you can iterate though?

I also wonder if you could use itertools.combinations over dict_of_input_content's keys to get something that's both clear and more general. I don't quite understand what jval and kval are so I can't provide feedback that's highly specific to this code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link for convenience to itertools.combinations: https://docs.python.org/3/library/itertools.html#itertools.combinations

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only potential issue I see with this is if the order is not as expected: some of the surrounding code (in other scripts/snakefile) does depend on filenames being as expected (with the expected content!).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion @jashapiro. The reason why I did it my way was because I was trying to avoid hardcoding as much as possible. My intention was that when someone want to change the code, they can go in and be able to change the code more easily.

I think this is a good goal, but given that there is already some hard-coding in here (notably the arguments), it is going to be hard to get rid of it all.

If removing hard coding were the full goal, I would probably simplify this script to take in two files and output the merged file, then move the combinations logic to snakemake. That would lose some efficiency because you would have to read in data more than once, but it is easier to extend. Having already moved the main merge logic to a function, this would be easy to do in the future if desired.

nhatduongnn and others added 4 commits December 24, 2019 10:11
…t_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
…t_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
@nhatduongnn
Copy link
Contributor Author

Thanks @jashapiro and @cgreene for the helpful suggestions as always. Since it's winter break, I have more time to work on this so I want to get it to a point where the scripts are clean, straightforward, and that everyone is happy with. Thus I am happy to implement any changes that you think will make the code better. I am also learning a lot by doing it.

With this new commit, I believe I have addressed all of your comments. Please have a look.

And thanks @jashapiro for letting me know about your availability. I only have one more step to PR and then probably another one to add in a header(column names) for these bed files. After that, this should be done! I should be readily available to address any problems over this winter break.

Happy holidays to everyone and safe travels! :)

Comment on lines 269 to 274
manta_dict = read_input_file(args.manta)
cnvkit_dict = read_input_file(args.cnvkit)
freec_dict = read_input_file(args.freec)

## Put the input dictionaries into a bigger dictionary
dict_of_input_content = {'manta':manta_dict, 'cnvkit':cnvkit_dict, 'freec':freec_dict}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a bit more on the idea of easy extensibility, how about something like this:

input_callers = ['manta', 'cnvkit', 'freec']
input_content = dict()
for caller in input_callers:
    input_content[caller] = read_input_file(getattr(args, caller))

Then you don't need to separately define input_file_names below (delete current line 276)

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. Sorry for the long delay in approving it! I'm mostly back from vacation now, so the next round should be faster!

@cgreene cgreene merged commit 1d6cf2c into AlexsLemonade:master Jan 4, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants