Reproducing copy number excluded regions #438

jashapiro · 2020-01-15T16:02:19Z

What data file(s) does this issue pertain to?

https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/copy_number_consensus_call/src/scripts/IGLL_telo_centromeric_region.txt

Put your question or report your issue here.

I am trying to reproduce the generation of the IGLL_telo_centromeric_region.txt file that is used for Copy number consensus generation #128. This file lists regions that are excluded from analysis because of high error rates. The telomeres and centromeres can be reproduced from UCSC data files, but I am confused by the immunoglobulin regions. The documentation points to http://penncnv.openbioinformatics.org/en/latest/misc/faq/, but it is not clear how the regions listed there were defined. Moreover, the regions defined there are for hg18:

chr22:20715572-21595082
chr14:105065301-106352275
chr2:88937989-89411302
chr14:21159897-22090937

However, the regions in the IGLL_telo_centromeric_region.txt file do not seem to correspond to a liftOver of those regions to hg38.

Applying liftOver hg18->hg38, I get the following regions:

chr22:22031174-22922910
chr14:105527919-106873021
chr2:88857361-89330430
chr14:21621904-22552154
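For anyone repeating this step with the command-line liftOver tool, the hg18 regions first need to be written as BED. Below is a minimal sketch, assuming the coordinates above are 1-based and inclusive (UCSC position style) while BED is 0-based and half-open. The function name and file names are hypothetical, and this is not necessarily how the coordinates above were produced.

```python
import re

# Matches UCSC-style positions such as "chr22:20715572-21595082".
REGION_RE = re.compile(r"^(chr\w+):(\d+)-(\d+)$")


def region_to_bed(region):
    """Convert 'chrN:start-end' (assumed 1-based, inclusive) to a BED line."""
    match = REGION_RE.match(region.strip())
    if match is None:
        raise ValueError(f"Unrecognized region: {region!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"Start is after end in region: {region!r}")
    # BED starts are 0-based, so shift the start by one; the end is unchanged.
    return f"{chrom}\t{start - 1}\t{end}"


# The hg18 regions quoted in this issue.
hg18_regions = [
    "chr22:20715572-21595082",
    "chr14:105065301-106352275",
    "chr2:88937989-89411302",
    "chr14:21159897-22090937",
]

bed_lines = [region_to_bed(region) for region in hg18_regions]
print("\n".join(bed_lines))

# With the BED lines saved to a file and an hg18 -> hg38 chain file downloaded
# from UCSC (file names here are assumptions), the conversion would be:
#   liftOver hg18_regions.bed hg18ToHg38.over.chain.gz hg38_regions.bed unmapped.bed
```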

The nearest equivalent regions in IGLL_telo_centromeric_region.txt seem to be these:

chr22:21990603-22947816
chr2:88854372-89330679
chr14:21676543-22556766

(There are only three here, presumably because the other chr14 region falls near a telomere and is excluded that way?)
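To make the mismatch concrete, here is a small sketch that compares each lifted region with its nearest counterpart from the file. The pairing is the one described above; the helper names are made up for illustration.

```python
def parse_region(region):
    """Split 'chrN:start-end' into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = (int(value) for value in span.split("-"))
    return chrom, start, end


def compare_regions(lifted, in_file):
    """Report how the region in the file differs from the lifted region."""
    l_chrom, l_start, l_end = parse_region(lifted)
    f_chrom, f_start, f_end = parse_region(in_file)
    if l_chrom != f_chrom:
        raise ValueError("Regions are on different chromosomes")
    return {
        # Negative start_shift: the file region starts earlier than the lifted one.
        "start_shift": f_start - l_start,
        # Positive end_shift: the file region ends later than the lifted one.
        "end_shift": f_end - l_end,
        "file_contains_lifted": f_start <= l_start and l_end <= f_end,
    }


# (lifted hg38 region, nearest region in IGLL_telo_centromeric_region.txt)
pairs = [
    ("chr22:22031174-22922910", "chr22:21990603-22947816"),
    ("chr2:88857361-89330430", "chr2:88854372-89330679"),
    ("chr14:21621904-22552154", "chr14:21676543-22556766"),
]

results = [compare_regions(lifted, in_file) for lifted, in_file in pairs]
for (lifted, in_file), result in zip(pairs, results):
    print(lifted, in_file, result)
```

For the chr22 and chr2 pairs the file region fully contains the lifted region, while the chr14 file region starts after the lifted one, so the file is not simply a padded version of the lifted PennCNV regions.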

In addition, IGLL_telo_centromeric_region.txt includes the region chr21:3100000-7000000, which is listed as a stalk (acrocentric arm) in the UCSC cytoband file. No other stalk regions are excluded, so I am not sure why this one is.
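One way to list every stalk region for comparison is to filter the cytoband file on its stain column. The sketch below assumes the usual five tab-separated columns of the UCSC cytoBand table (chrom, chromStart, chromEnd, name, gieStain). Apart from the chr21 coordinates quoted above, the rows are toy placeholders, not real cytoband data, and the band names are invented.

```python
def regions_with_stain(lines, stain="stalk"):
    """Return (chrom, start, end) for cytoband rows whose gieStain matches."""
    regions = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, start, end, _name, gie_stain = line.split("\t")
        if gie_stain == stain:
            regions.append((chrom, int(start), int(end)))
    return regions


# Toy input: only the chr21 coordinates come from this issue; the band names
# and all "chrToy" rows are placeholders for illustration.
toy_cytoband = [
    "chr21\t3100000\t7000000\tbandX\tstalk",
    "chrToy\t0\t1000\tbandA\tgneg",
    "chrToy\t1000\t2000\tbandB\tacen",
    "chrToy\t2000\t3000\tbandC\tstalk",
]

stalks = regions_with_stain(toy_cytoband)
print(stalks)
```

Running the same filter over the real cytoband file would show which stalk regions are absent from the exclusion list.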

Can @fingerfen or @xiehongbo provide some additional information?


jashapiro · 2020-01-16T18:54:54Z

From @xiehongbo via email:

Kai used transcription start and end sites as his coordinates. I actually inspected the repeat elements and extended the regions to cover more low-complexity regions. I also have more regions that are known to have complex genomic features. It is all about estimation. Also, those regions matter with SNP arrays.

In the hg18 build, here are the 6 regions we used:

chr2:88935000-89418000 IgKappa
chr6:29775000-33225000 HLA*
chr7:141636000-142225000 TCRbeta
chr14:21214600-22095500 TCRalpha
chr14:105046000-106368585 IgHeavy
chr22:20675000-21620000 IgLambda

Using hg18->hg38 liftover, these correspond to:

chr2:88854372-89330679
chr6:29699244-33149245
chr14:21676543-22556766
chr14:105508618-106881350
chr22:21990603-22947816

These are the regions found in the provided file.
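As a quick consistency check on the two lists above, the sketch below tallies regions by chromosome and confirms that the three regions I had identified in the file appear in the lifted list. It assumes liftOver keeps each region on its original chromosome; the variable names are only for illustration.

```python
from collections import Counter

# hg18 regions and labels as given in the email.
hg18_regions = {
    "chr2:88935000-89418000": "IgKappa",
    "chr6:29775000-33225000": "HLA*",
    "chr7:141636000-142225000": "TCRbeta",
    "chr14:21214600-22095500": "TCRalpha",
    "chr14:105046000-106368585": "IgHeavy",
    "chr22:20675000-21620000": "IgLambda",
}

# hg38 regions listed after liftOver.
hg38_regions = [
    "chr2:88854372-89330679",
    "chr6:29699244-33149245",
    "chr14:21676543-22556766",
    "chr14:105508618-106881350",
    "chr22:21990603-22947816",
]

# Regions previously identified in IGLL_telo_centromeric_region.txt.
file_regions = [
    "chr22:21990603-22947816",
    "chr2:88854372-89330679",
    "chr14:21676543-22556766",
]


def chrom_of(region):
    """Return the chromosome part of 'chrN:start-end'."""
    return region.split(":")[0]


# Counter subtraction keeps only chromosomes with more hg18 than hg38 regions.
hg18_counts = Counter(chrom_of(region) for region in hg18_regions)
hg38_counts = Counter(chrom_of(region) for region in hg38_regions)
unmatched = hg18_counts - hg38_counts

unmatched_labels = [
    label for region, label in hg18_regions.items() if chrom_of(region) in unmatched
]
all_file_regions_found = all(region in hg38_regions for region in file_regions)

print(dict(unmatched), unmatched_labels, all_file_regions_found)
```

The tally shows six hg18 regions but five hg38 regions, with no chr7 (TCRbeta) counterpart in the lifted list; the reason is not stated in the email.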

I will update the analysis to include this as a starting point.

* add to Snakefile
* updating fork
* changed output path and name
* implement segmean
* implement segmean
* add result file
* add result files
* add trailing line
* fix .py
* change Snakefile comment
* change README.md
* change README.md
* Updates to file organization
  Removing `src` directory to unnest `scripts` and adding `ref` directory for genomic info files.
* add alternative segdup generation
  Link and script to process downloaded file for segmental duplications.
* Updates to blacklist generation
* Add IG regions
  These regions are the ones defined by @hongboxie here: #438 (comment) Converted from hg18 to hg38
* Add step to potentially fix overlapping dup del segments.
* Notebook to look at consensus calls for overlaps
* Add overlap pruning
* Update output files
  Note that ordering has changed, but the actual differences between these files should be relatively small other than that. There are changes to the cnv_consensus.tsv file where segments that are not contained within the defined CNV are discarded but might have been retained before.
* update readme
* Add telomere definition file
* Update blacklist generation script
* Remove accidentally included notebook
* Tried to clarify complicated bedtools step.
* Update analyses/copy_number_consensus_call/scripts/remove_dup_NULL_overlap_entries.py
  Co-Authored-By: Candace Savonen <cansav09@gmail.com>
* Update analyses/copy_number_consensus_call/scripts/remove_dup_NULL_overlap_entries.py
  Co-Authored-By: Candace Savonen <cansav09@gmail.com>
* Add more clarifying comments
* Add full exclusion list and remove outdated files
* Update readmes
* Updated output files.
* Re-add previous blacklist
* Add chromosome lengths file
* Create file of neutral regions
* Use hg.38.chrom.sizes
* More descriptive excluded file name
* Update filename
* Sort chromosomes and remove alt from callable.
* Fix sed command
* Finish the rule to combine neutral regions.
* Add output of bad callers
* Bad caller summary notebook
* Add output of neutral segments to the seg file
  Neutral segments (copy number 2) are included if they fall within a "callable region" which is one not covered by a large excluded region. When we add these back, we still exclude specimens where more than two callers 'failed' with high numbers of segments
* remove working notebooks
* Bug fixes
* Unset X and Y copy number calls
* Update README
* Add callable regions to analyses/README.md
* Simplify output file description in readme
* Simplify file reading
  we don't need data types here, so keeping everything as strings simplifies, and removes potential errors from unexpected conversions from int to float
* comment out status message
* Move segfile step into snakemake
* Fix filename in snakemake
* Update results.
* Update scratch dir handling
  Put all intermediate files in a defined scratch sub directory.
* Update analyses/copy_number_consensus_call/scripts/bed_to_segfile.R
  Co-Authored-By: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
* remove unused option.

Co-authored-by: Candace Savonen <cansav09@gmail.com>
Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>

jashapiro added the data label Jan 15, 2020

jaclyn-taroni added the snv, cnv, updated analysis, and in progress labels and removed the snv label Jan 18, 2020

jaclyn-taroni assigned jashapiro Jan 18, 2020

jaclyn-taroni mentioned this issue Jan 18, 2020

Proposed Analysis: chromosomal instability burden, recurrently altered genes #394

Closed

jashapiro added a commit to jashapiro/OpenPBTA-analysis that referenced this issue Jan 22, 2020

Add IG regions

253ff4b

These regions are the ones defined by @hongboxie here: AlexsLemonade#438 (comment) Converted from hg18 to hg38

jashapiro mentioned this issue Jan 22, 2020

Generate CNV exclusion list #467

Merged

3 tasks

jaclyn-taroni closed this as completed in #467 Jan 25, 2020
