Updated analysis: SNV consensus calling (for things other than substitutions) #990
Comments
I hadn't thought about all the possible representations of indels, but this is a great point of interest for consensus calling! I looked at what people have done before and found this normalization code from bcbio:
I think we already do MNV to SNV decomposition in the consensus calling. In addition, they decompose biallelic substitutions and finally run vt normalize to normalize the indel calls in the vcf from all callers. I also found some documentation and the pseudo-code here. Would this be something that could be of interest to run upstream, before we process the variant calls in the repo, to fix this issue of indel representation (and possibly also fix MNVs upstream)? Or could we potentially use their pseudo-code to implement it for maf inputs? Similar to the above bcbio normalization, normalization is also being developed as a step in our internal D3b consensus calling by @migbro, as follows (let me know if I missed something):
He did share an output for the normalization; here's an example of normalizing a couple of deletions below 🤔. Raw mutect2 maf:
Normalized mutect2 vcf in maf format
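For illustration, the MNV-to-SNV decomposition mentioned above can be sketched in a few lines (a toy sketch, not the repo's actual implementation; the function name and tuple layout are invented here):

```python
def decompose_mnv(chrom, pos, ref, alt):
    """Split a multi-nucleotide substitution into per-base SNVs.

    Assumes ref and alt have equal length (a pure substitution);
    bases that do not change are dropped from the output.
    """
    if len(ref) != len(alt):
        raise ValueError("not a substitution; indels need normalization instead")
    return [
        (chrom, pos + i, r, a)
        for i, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]

# A dinucleotide change AG->TC at position 100 becomes two SNVs:
print(decompose_mnv("chr1", 100, "AG", "TC"))
# -> [('chr1', 100, 'A', 'T'), ('chr1', 101, 'G', 'C')]
```

Indels are exactly the case this function refuses to handle, which is why representation normalization matters for them.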
Thanks @kgaonkar6! I think you see some of the issues I was thinking about. I think the MNV issue should be fine; we were sort of saved by the fact that strelka2 does not call any MNVs, so we had to decompose the others right off. But INDELs remain an issue that I guess we never really resolved. It does seem to me that applying a normalization algorithm to all of the vcf files would be worth doing for these data, if it was not done already. That said, I'm definitely confused by some of the output from normalization you are presenting from @migbro.
If the answers to my confusion involve the original VCF files, then it seems like we will have to have you do a normalization step, which would potentially improve some of the indel consensus calls. Once that is done we would probably want to do some quick analysis of the hotspot calls to see what effects requiring 2 or 3 callers to agree might have. I remain concerned about including variants even in the "scavenged" set that are only called by …
@migbro and @yuankunzhu - do we have this performed pre-d3b consensus?
Not yet. I currently have in review a workflow to bring our vcfs to a new spec. This means the following will happen:
Ah! I knew that estimate was a little too good; multiply that by 4. So more like 4 hours is the rosy estimate to generate outputs start-to-finish.
Thanks for your work on this @migbro! I noted elsewhere, but apparently not in this issue, that we will also need to update the TCGA set that we are using for comparison here. So the time estimate is going to be even longer, unfortunately.
Sure thing. The PR is approved, and @zhangb1 and @yuankunzhu are discussing today how to organize and run this request. As for adding TCGA to that request, I think we'd have to meet to see if we can make that happen, and we'd have to loop in someone like @jharenza or someone else who was familiar with what inputs Teja used when she worked with these files to see if it's doable.
Ok, a quick update on this. TCGA is done in terms of normalizing vcfs and creating per-caller maf files. There are two sets of outputs for each caller, one considered "protected", the other "public". This is because of a recent effort to attempt to employ "germline masking." For us, that entails dropping calls that have a norm depth < 8 and/or gnomad AF > 0.001 for the public version (except hotspots). Hotspots using v2 cancer hotspots are also marked; you can of course add more or just ignore them. The protected version will just have a soft filter, meaning calls meeting those criteria are flagged rather than dropped.

As for this protected vs public thing: D3b, for our public cavatica projects and provisional pedcbioportal projects, will be using those public outputs. I am not sure if openPBTA is expected/encouraged to follow suit, but just putting it out there in case someone is wondering why there are two sets of outputs, etc. No one seems to have thought of the ramifications for groups outside of D3b yet, or given me any guidance on that 😬 Maybe we ask Adam R. about it?
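The public/protected split described above amounts to a per-call predicate; roughly something like this (a hedged sketch of my reading of the criteria — the field and function names are invented for illustration, not D3b's actual code):

```python
def passes_public_filter(call, hotspots):
    """Decide whether a call is kept in the 'public' output.

    call: dict with 'chrom', 'pos', 'norm_depth' (normal-sample depth),
          and 'gnomad_af' (gnomAD population allele frequency).
    hotspots: set of (chrom, pos) known cancer hotspot sites, always kept.
    """
    if (call["chrom"], call["pos"]) in hotspots:
        return True  # hotspots are exempt from germline masking
    return call["norm_depth"] >= 8 and call["gnomad_af"] <= 0.001

hotspots = {("chr7", 140453136)}  # e.g. a BRAF V600 position (illustrative)
call = {"chrom": "chr1", "pos": 1000, "norm_depth": 5, "gnomad_af": 0.0}
print(passes_public_filter(call, hotspots))  # low normal depth -> False
```

In the protected outputs the same predicate would set a soft FILTER flag rather than drop the record.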
Ok, these files:
have been uploaded here: These mafs are the public ones that I was talking about, as @jharenza has recommended that I merge those and add them, which is nice to have that kind of synergy (-2 points for business jargon, I know!).
thank you both so much!!
@migbro Can you clarify a couple of the added columns in the new MAF files? Specifically, what are the new ones? There are also a few things that were removed, most notably Entrez IDs. I don't know if any packages use the Entrez IDs, but that is something we will have to look out for. (I note that validation & ExAC info is now removed as well, but those seemed to all be NA in the old data, so that seems totally reasonable!) Noting also that we will need to update https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md with the new file format.
I assumed that the column names were consistent, but that doesn't seem to be the case! Should the columns be in the MAF files at all, or can they be harmonized?
Hi @jashapiro, sure! So the added columns were columns that I subjectively thought might be helpful: output that each caller provides that others might not. As far as I know, the MAF spec does allow users to add additional columns on top of the "core" set. Descriptions, from the vcf headers, are below.
Thanks @migbro! Yeah, there are some small updates to the consensus scripts that I needed to make, but nothing major. I am dropping those columns for the consensus, but I don't think there is any problem leaving them in the individual caller files, as long as they are documented.
@migbro One followup: it does seem that a previously present column was dropped along the way. As I said, I don't expect to use it, but if the column is expected by others it could definitely cause some trouble at times. Was this removal intentional?
Hi @jashapiro, yeah, that's an interesting point. So, it was intentional, because we haven't been filling it in. I felt comfortable dropping it following what cBioPortal deemed as minimal for a maf (https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format), which is, admittedly, not much! I can ping Kyle Hernandez from U Chicago, who was heavily involved in creating the GDC spec, to get his opinion on it.
@jashapiro , yes, you are right that is a required column (one of 30), and Kyle has pointed me to a definitive source for that. I will do the following today:
@migbro Thanks for looking into it! Yes, it should be fine to overwrite the files. Just let me know when the new ones are up!
@jashapiro it's up!
@jharenza The consensus files based on the updated MAFs are now uploaded to https://open-pbta.s3.us-east-1.amazonaws.com/data/snv-consensus/snv-consensus-20210427.zip I have not checked them in detail, but they seem to be about the right size: a bit smaller than previously, but based on some of the shrinkage in the input files, that seems reasonable? I hadn't fully processed how much the VarDict results shrank! From 4.1 GB to ~270 MB!!!

@migbro Am I understanding correctly that this is the result of filtering on sequencing depth and population frequency? I'm kind of in shock that the change was that dramatic. I wouldn't have expected it to fall below all other callers from where it started, since I would expect that those filters should be the same across callers by site, and any true VarDict calls would be in at least one other caller. If we weren't so far down the line in analysis, I might consider adding it back to the consensus... I'm running the caller comparison notebook now though, so that may give some insight. I don't actually think the change would be hard, so we might think about it for v20, depending on results.
@jashapiro I know, right?! I took a peek at the CBTN portion and summarized the filter counts:
The high gnomad (> 0.001) filter is definitely the big culler of variants.
Yeah, but what I don't understand is why so many got culled from VarDict that didn't get culled in other places. Here is an upset plot from the previous data (other comparisons can be seen here: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1033/files?short_path=726dc81#diff-726dc818248c6115d51dc51d7f200a1a19b23ecceaf77365ddb2c993057ad0e9).

If we focus on Strelka2, mutect2 and VarDict, we see that before they shared ~3.7 million calls. After, they share only 460K. So it seems like calls are being removed from VarDict that are surviving in mutect2 & Strelka2. This seems really odd to me, because those should all share the same gnomad frequencies. By contrast, the mutect2 + Strelka2 set goes up from 500K to 3.4 million, again suggesting that there are things lost only from VarDict. Notably, the mutect2 + Strelka2 + lancet set (our current consensus) stays about the same, so this won't have a big downstream effect at the moment. This makes me less inclined to revisit the consensus criteria.
Thanks for looking into this! In looking at the pre-pass file from vardict on a hunch, it seems that VarDict has its own pre-existing annotation.
So in short, I'd say your current consensus method is probably safe, since you are excluding VarDict. In the meantime, I will modify how the annotation is done for VarDict first, but generally will likely strip any existing INFO annotations.
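The "pre-strip INFO" idea could look roughly like this when operating on raw VCF body lines (a sketch only; the real workflow presumably uses bcftools or similar, and the kept-key list here is invented):

```python
def strip_info(vcf_record, keep=()):
    """Drop pre-existing INFO annotations from one tab-delimited VCF body line,
    keeping only keys listed in `keep`, so a fresh annotation pass cannot
    collide with a caller's own fields (e.g. VarDict's)."""
    fields = vcf_record.rstrip("\n").split("\t")
    # INFO is column 8 (index 7): semicolon-separated key=value pairs or flags.
    kept = [kv for kv in fields[7].split(";") if kv.split("=", 1)[0] in keep]
    fields[7] = ";".join(kept) if kept else "."  # '.' means 'missing' in VCF
    return "\t".join(fields)

rec = "chr1\t1000\t.\tA\tT\t50\tPASS\tDP=30;GNOMAD_AF=0.2;SOMATIC"
print(strip_info(rec, keep=("DP",)))  # INFO column reduced to DP=30
```

Flags without a value (like `SOMATIC` above) are handled too, since the key is just the whole token.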
A quick update: after adding in a pre-strip INFO step, the numbers look more "sane." A single-sample example:
Post strip:
For comparison, strelka2:
I'll re-run vardict with this new wf and update the vardict maf when completed.
Ok, vardict has been updated
thanks @migbro! I'll create the CI files for vardict. @jashapiro, just wanted to check: should these tmb files in the link https://open-pbta.s3.us-east-1.amazonaws.com/data/snv-consensus/snv-consensus-20210427.zip be renamed to the file naming convention used in v18 (the filenames at the current link differ from the filenames in v18)? It seems to me the CI script will still work with the current names, though, since it looks for all pbta-snv* files.
Yes, we should keep the file names the same as before! Sorry about that: I didn't confirm that the script output file names matched.
With the changes to normalized vcfs and updated files associated with the release of v19 #1026, I believe this can be closed. While it is possible that more improvements could be made, I think that would require substantial time and effort to fully assess and implement. |
What analysis module should be updated and why?
snv-callers
, and in particular the generation of the consensus MAF file.
What changes need to be made? Please provide enough detail for another participant to make the update.
As noted in another review (#961 (review)), the current SNV consensus script may not handle combining indel variants correctly. Matching of individual calls across callers is currently performed only on the basis of start location, on the erroneous assumption that all calls are a single nucleotide in length. This is true for SNVs (and MNVs, all of which are reduced to SNVs before merging), but does not apply to the small indels that might occur.
INDEL variants should probably be treated entirely separately from substitutions, but the logic of how to do that is something that requires some investigation, as there are many possible edge cases.
The simple case is that start and end positions match across all three callers. Those we are almost certainly capturing, but they are probably being mixed up a bit with cases where...
Start matches, but callers choose different end positions (and vice versa). These should be rare, but are certainly possible if one caller decides a deletion is 5 bp and another 6 bp, for example. I guess I would go with majority rule among the callers, if possible, in this case?
Overlapping deletion calls with no shared start and end. I expect this to happen in cases where the first base(s) of a deletion is repeated at the end. For example, a `CATATAT -> CAT` deletion could start at position 2 or position 4, and different callers might make different, equally correct, decisions. The nice thing in this case is that they should match as to the reference and alt alleles (both are `ATAT -> _`), but this might not always be the case in situations like this.
"Overlapping" insertions could be quite difficult to detect! A change of `CAT -> CAAAT` might put the inserted bases before or after the original `A`, and I don't know that we would be able to easily identify that...
This is not the complete list of edge cases.
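This repeat-context ambiguity is exactly what left normalization resolves. A toy version of the standard trim-and-left-align algorithm, in the spirit of vt normalize but not its actual code (assumes 1-based positions and an in-memory reference string):

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one variant against a reference sequence.

    ref_seq: the full reference string; pos is 1-based.
    Returns the canonical (pos, ref, alt) representation.
    """
    # Right-trim the shared suffix; when an allele would become empty,
    # extend both alleles leftward with the preceding reference base.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
        if not ref or not alt:
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
    # Left-trim the shared prefix, keeping at least one anchor base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Two different-but-equivalent ATAT deletions in CATATAT converge:
print(normalize("CATATAT", 3, "TATAT", "T"))  # -> (1, 'CATAT', 'C')
print(normalize("CATATAT", 1, "CATAT", "C"))  # -> (1, 'CATAT', 'C')
```

After this step, the two callers' representations compare equal, so start/end disagreements from repeat context disappear before any consensus logic runs.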
We should start by pulling out all indel calls from strelka, mutect and lancet so we can investigate the properties a bit more carefully, then decide on a set of heuristics to guide the generation of an INDEL consensus.
I will also note that one option is to simply "trust" one caller for INDELs and use its calls exclusively, avoiding many of these potential issues. This has the advantage of being easy, but has obvious downsides as well!
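Once representations are normalized, matching across callers can key on the full (chrom, pos, ref, alt) tuple rather than start position alone. A minimal sketch of that idea (function and data names invented; not the snv-callers module's actual code):

```python
from collections import Counter

def consensus_calls(callsets, min_callers=2):
    """callsets: dict mapping caller name -> iterable of (chrom, pos, ref, alt).
    Returns the set of variants reported by at least `min_callers` callers."""
    counts = Counter()
    for calls in callsets.values():
        counts.update(set(calls))  # de-duplicate within a single caller
    return {variant for variant, n in counts.items() if n >= min_callers}

callsets = {
    "strelka2": [("chr1", 100, "A", "T"), ("chr1", 200, "CAA", "C")],
    "mutect2":  [("chr1", 100, "A", "T"), ("chr1", 200, "CAA", "C")],
    "lancet":   [("chr1", 100, "A", "T")],
}
print(sorted(consensus_calls(callsets)))
# -> [('chr1', 100, 'A', 'T'), ('chr1', 200, 'CAA', 'C')]
```

Without normalization, the `("chr1", 200, "CAA", "C")` deletion might appear under two different keys and silently fall out of the consensus, which is the failure mode described above.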
What input data should be used? Which data were used in the version being updated?
MAF files from strelka, mutect, and lancet.
When do you expect the revised analysis will be completed?
1-2 weeks? Depends on results of initial investigations, and discussions about methods.
Who will complete the updated analysis?
@jashapiro, hopefully with input/discussion from others! (Tagging @kgaonkar6 in particular, as we will want to implement similar logic in #961 )