Version3 (#9)

-add new callers and regions -improve existing callers -add grch37/hg19 support
PacificBiosciences · Oct 31, 2023 · 63cc3ff · 63cc3ff
1 parent 2fe2cdc
commit 63cc3ff
Show file tree

Hide file tree

Showing 99 changed files with 28,623 additions and 60,484 deletions.
diff --git a/README.md b/README.md
@@ -1,8 +1,11 @@
 # Paraphase: HiFi-based caller for highly homologous genes
 
-Many medically relevant genes fall into 'dark' regions where variat calling is limited due to high sequence homology with paralogs or pseudogenes. Paraphase is a Python tool that takes HiFi BAMs as input (whole-genome or enrichment), phases complete haplotypes for genes of the same family, determines copy numbers and makes phased variant calls. 
+Many medically relevant genes fall into 'dark' regions where variant calling is limited due to high sequence homology with paralogs or pseudogenes. Paraphase is a Python tool that takes HiFi aligned BAMs as input (whole-genome or enrichment), phases haplotypes for genes of the same family, determines copy numbers and makes phased variant calls. 
 
-Paraphase supports the following genes:
+![Paraphase diagram](docs/figures/paraphase_diagram.png)
+Paraphase takes all reads from a gene family, realigns to just the gene of interest and then phases them into haplotypes. This solves the problem of alignment difficulty due to sequence homology and allows us to examine all copies of genes in a gene family and call copy number changes and other variants.
+
+Paraphase supports 161 segmental duplication regions in GRCh38, listed in the [config](paraphase/data/38/config.yaml) file. Among these, there are 11 medically relevant regions that are also supported in GRCh37/hg19:
 - SMN1/SMN2 (spinal muscular atrophy)
 - RCCX module
   - CYP21A2 (21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia)
@@ -15,17 +18,17 @@ Paraphase supports the following genes:
 - NEB (Nemaline myopathy)
 - F8 (intron 22 inversion, Hemophilia A)
 - CFC1 (heterotaxy syndrome)
+- OPN1LW/OPN1MW (color vision deficiencies)
+- HBA1/HBA2 (Alpha-Thalassemia)
 
-Please check out our paper on its application to the gene SMN1 for more details about Paraphase.   
+Please check out our [paper](https://www.cell.com/ajhg/fulltext/S0002-9297(23)00001-0) on its application to the gene SMN1 for more details about Paraphase.   
 Chen X, Harting J, Farrow E, et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. The American Journal of Human Genetics. 2023;0(0). doi:10.1016/j.ajhg.2023.01.001
 
-For whole-genome sequencing (WGS) data, we recommend >20X, ideally 30X, genome coverage. Low coverage or short read length could result in less accurate phasing, especially when haplotypes are highly similar to each other in Exons 1-6. For hybrid capture-based enrichment data, a higher read depth (>50X) is recommended as the read length is generally shorter than WGS.
-
-Currently Paraphase only works on GRCh38. Support for GRCh37 will be adde in the future.
+For whole-genome sequencing (WGS) data, we recommend >20X, ideally 30X, genome coverage. Low coverage or short read length could result in less accurate phasing, especially when gene copies are highly similar to each other. For hybrid capture-based enrichment data, a higher read depth (>50X) is recommended as the read length is generally shorter than WGS.
 
 ## Contact
 
-There is a need for building consensus on how to report variants in segmental duplication regions, which could be complicated due to the frequent presence of copy number changes. If you have suggestions or need assistance, please don't hesitate to reach out by email or open a GitHub issue.
+If you have suggestions or need assistance, please don't hesitate to reach out by email or open a GitHub issue.
 
 Xiao Chen: xchen@pacificbiosciences.com
 
@@ -62,69 +65,50 @@ paraphase -l list.txt -o output_directory -r genome_fasta
 ```
 
 Required parameters:
-- `-b`: Input BAM file or `-l`: text file listing BAM files one per line
+- `-b`: Input BAM file or `-l`: text file listing BAM files one per line (a BAI file needs to exist in the same directory)
 - `-o`: Output directory
 - `-r`: Path to the reference genome fasta file
 
+Please note that the input BAM should be one that's aligned to the ENTIRE reference genome (either GRCh38 or GRCh37/hg19), and this reference should NOT include ALT contigs. The fasta file of this reference genome should be provided to Paraphase with `-r`. 
+
 Optional parameters:
-- `-g`: Gene(s) to analyze. All supported genes will be analyzed if not specified.
+- `-g`: Gene(s) to analyze, separated by comma. All supported genes will be analyzed if not specified.
 - `-t`: Number of threads.
-- `-d`: File listing average genome depth per sample, with two columns, sample ID and depth values, separated by tab or space. This saves run time by skipping the step to calculate genome depth.
-- `--novcf`: no vcf output if specified.
-- `--samtools`: path to samtools
+- `--genome`: Genome reference build. Default is `38`. If `37` or `19` is specified, Paraphase will run the analysis for GRCh37 or hg19, respectively (note that only 11 medically relevant [regions](paraphase/data/19/config.yaml) are supported now for GRCh37/hg19).
+- `gene1only`: If specified, variants calls will be made against the main gene only for SMN1, PMS2, STRC, NCF1 and IKBKG, see [below](#interpreting-the-output).
+- `--novcf`: If specified, no VCF files will be produced.
+- `--samtools`: path to samtools. If the paths to samtools or minimap2 are not already in the PATH environment variable, they can be provided through the `--samtools` and `--minimap2` parameters.
 - `--minimap2`: path to minimap2
 
-The paths to samtools and minimap2 can be provided through the `--samtools` and `--minimap2` parameters.
-
 ## Interpreting the output
 
 Paraphase produces a few output files in the directory specified by `-o`, with the sample ID as the prefix.
-- `_realigned_tagged.bam`: This BAM file can be loaded into IGV for visualization of haplotypes, see [haplotype visualization](docs/visualization.md).  
-- `.vcf`: A VCF file is written for each haplotype, and there is also a `_variants.vcf` file containing merged variants from all haplotypes.
-- `.json`: Main output file, summerizes haplotypes and variant calls for each sample. Details of the fields are explained below for each gene.
-
-### SMN1
-
-- `smn1_cn`: copy number of SMN1, a `null` call indicates that Paraphase finds only one haplotype but depth does not unambiguously support a copy number of one or two.
-- `smn2_cn`: copy number of SMN2, a `null` call indicates that Paraphase finds only one haplotype but depth does not unambiguously support a copy number of one or two.
-- `smn2_del78_cn`: copy number of SMN2Δ7–8 (SMN2 with a deletion of Exon7-8)
-- `smn1_read_number`: number of reads containing c.840C
-- `smn2_read_number`: number of reads containing c.840T
-- `smn2_del78_read_number`: number of reads containing the known deletion of Exon7-8 on SMN2
-- `smn1_haplotypes`: phased SMN1 haplotypes
-- `smn2_haplotypes`: phased SMN2 haplotypes
-- `smn2_del78_haplotypes`: phased SMN2Δ7–8 haplotypes
-- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
-- `haplotype_details`: lists information about each haplotype 
-  - `variants`: The variants contained in the haplotype, excluding those in homopolymer regions.
-  - `boundary`: The boundary of the region that is resolved on the haplotype. This is useful when a haplotype is only partially phased.
-  - `haplogroup`: The haplogroup that the haplotype is assigned to
-
-### RCCX, PMS2, NCF1, CFC1, STRC, IKBKG, NEB & F8
-
-- `total_cn`: total copy number of the family (sum of gene and paralog/pseudogene)
-- `gene_cn`: copy number of the gene of interest, when the gene and pseudogene can be easily distinguished with known sequence differences, as in PMS2/NCF1/STRC/IKBKG
-- `final_haplotypes`: phased haplotypes
-- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
 
-Multiple copies of the repeat are phased inito alleles with read-based phasing in the case of RCCX/IKBKG/NEB. Additional output entries include:
-- `alleles_final`: haplotypes phased into alleles
-
-### RCCX
-
-More info fields on phasing haplotypes into alleles and annotation of CYP21A2:
-- `annotated_alleles`: allele annotation for the CYP21A2 gene. This is only based on common gene-pseudogene (CYP21A2-CYP21A1P) conversions (P31L, IVS2-13A/C>G, G111Vfs, I173N, I237N, V238E, M240K, V282L, Q319X and R357W). Please refer to the vcfs for most thorough variant calling and annotation.
-- `ending_hap`: the last copy of RCCX on each allele. Only these copies contain parts of TNXB (while the other copies contain TNXA)
-
-### IKBKG
-
-- `deletion_haplotypes`: haplotypes carrying the 11.8kb deletion
-
-### F8
-
-Additional output is included to report SVs that occur between the repeat regions:
-- `sv_called`: reports deletions/duplications between int22h-1 and int22h-2, or inversions between int22h-1 and int22h-3
+1. `.vcf` in `sampleID_vcfs` folder. A VCF file is written for each haplotype per gene family. There is also a `_variants.vcf` file containing merged variants from all haplotypes for each gene family. Note that this is not a diploid vcf as there are usually more than two copies of genes in a gene family in a sample.
 
+As genes of the same family can be highly similar to each other in sequence and not easy to differentiate (at the sequence level or even at the functional level), variant calls are made against one selected "main" gene from the gene family (e.g. the functional gene is selected when the family has a gene and a pseudogene). In this way, all copies of the gene family can be evaluated for pathogenic variants and one can calculate the copy number of the functional genes in the family and hence infer the disease/carrier status.
 
+Exceptions are SMN1 (paralog SMN2), PMS2 (pseudogene PMS2CL), STRC (pseudogene STRCP1), NCF1 (pseudogenes NCF1B and NCF1C) and IKBKG (pseudogene IKBKGP1), where gene differentiation is possible. In these families, haplotypes are assigned to each gene in the family, i.e. gene or paralog/pseudogene, and variants are called against the gene (or paralog/pseudogene) for the gene (or paralog/pseudogene) haplotypes, respectively. Variants calls can be made against the main gene only for these five families if `--gene1only` is specified. 
 
+2. `_realigned_tagged.bam`: This BAM file can be loaded into IGV for visualization of haplotypes (group reads by `HP` tag and color alignments by `YC` tag). All haplotypes are aligned against the main gene of interest. Tutorials/Examples are provided for medically relevant genes (See below).  
 
+3. `.json`: Output file summarizing haplotypes and variant calls for each gene family in each sample. In brief, a few generally used fields are explained below.
+- `final_haplotypes`: phased haplotypes for all gene copies in a gene family
+- `total_cn`: total copy number of the family (sum of gene and paralog/pseudogene)
+- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
+- `haplotype_details`: lists information about each haplotype 
+  - `boundary`: the boundary of the region that is resolved on the haplotype. This is useful when a haplotype is only partially phased.
+- `alleles_final`: haplotypes phased into alleles. This is possible when the segmental duplication is in tandem.
+- `region_depth`: median depth of the gene family (include all copies of gene and paralog/pseudogene) 
+
+Tutorials/Examples are provided for interpreting the `json` output and visualizing haplotypes for medically relevant genes listed below: 
+- [SMN1/SMN2](docs/SMN1_SMN2.md)
+- [RCCX module (CYP21A2)](docs/RCCX.md)
+- [PMS2](docs/PMS2.md)
+- [STRC](docs/STRC.md)
+- [OPN1LW/OPN1MW](docs/OPN1LW_OPN1MW.md)
+- [HBA1/HBA2](docs/HBA1_HBA2.md)
+- [IKBKG](docs/IKBKG.md)
+- [F8](docs/F8.md)
+- [NEB](docs/NEB.md)
+- [NCF1](docs/NCF1.md)
diff --git a/docs/F8.md b/docs/F8.md
@@ -0,0 +1,9 @@
+# F8
+
+In F8, our goal is to detect Intron 22 inversion involved in [Hemophilia A](https://www.ncbi.nlm.nih.gov/books/NBK1404/). Paraphase does this by phasing haplotypes for the homology region where inversion breakpoints happen (a region that encodes F8A1, F8A2 and F8A3), and then checking the sequences flanking each haplotype for signals suggesting inversion. Another possible structural variant (SV) in F8, deletion of Exon1-22 (whose breakpoint also falls into the same homology region), is also called by Paraphase, although this variant is relatively easy to call with a standard depth based CNV caller. 
+
+## Fields in the `json` file
+
+- `sv_called`: reports deletion between int22h-1 and int22h-2 (which suggests Exon1-22 deletion), or inversion between int22h-1 and int22h-3 (which suggests Intron 22 inversion)
+
+Note that the inversion and the deletion are also reported in the VCF as SVs. 
diff --git a/docs/HBA1_HBA2.md b/docs/HBA1_HBA2.md
@@ -0,0 +1,19 @@
+# HBA1/HBA2
+
+For this [region](https://www.ncbi.nlm.nih.gov/books/NBK1435/), Paraphase calls the total copy number of HBA1 and HBA2. Variants are called in the VCF, against HBA2 reference sequence.
+
+## Fields in the `json` file
+
+- `genotype`: reports the genotype of this family. Possible alleles include `aa`, `aaa` (duplication), `-a` (deletion) or `--` (double deletion).
+- `alleles_final`: when possible, different copies of HBA are phased into alleles with read based phasing. 
+
+## Visualizing haplotypes
+
+To visualize phased haplotypes, load the output bam file in IGV, group reads by the `HP` tag and color alignments by `YC` tag. Green and purple represent two alleles, i.e. all haplotypes in green are on one one allele and all haplotypes in purple are on the other allele. 
+
+Reads in gray are either unassigned or consistent with more than one possible haplotype. When two haplotypes are identical over a region, there can be more than one haplotype consistent with a read, and the read is randomly assigned to a haplotype and colored in gray. 
+
+![HBA example](figures/HBA.png)
+
+- The top panel shows a sample with two copies of HBA1 and two copies of HBA2, one on each allele. 
+- The bottom panel shows a sample with a `-a` allele, where there is a deletion, leaving only one copy of HBA (`hba_del_hap1`).
diff --git a/docs/IKBKG.md b/docs/IKBKG.md
@@ -0,0 +1,21 @@
+# IKBKG
+
+In this region, Paraphase calls small variants in [IKBKG](https://www.ncbi.nlm.nih.gov/books/NBK1472/). In addition, there is a known 11.7kb deletion that can occur in either IKBKG or the pseudogene. Paraphase calls this deletion and locates it to IKBKG or the pseudogene.
+
+## Fields in the `json` file
+
+- `deletion_haplotypes`: haplotypes carrying the 11.7kb deletion
+
+Note that this deletion is also reported in the VCF as a structural variant (SV). 
+
+## Visualizing haplotypes
+
+To visualize phased haplotypes, load the output bam file in IGV, group reads by the `HP` tag and color alignments by `YC` tag. Reads are realigned to IKBKG. Green represents phased copies on one allele if there is duplication of the 11.7kb region. 
+
+Reads in gray are either unassigned or consistent with more than one possible haplotype. When two haplotypes are identical over a region, there can be more than one haplotype consistent with a read, and the read is randomly assigned to a haplotype and colored in gray. 
+
+![IKBKG examples](figures/IKBKG.png)
+
+- In this set of examples, the top panel shows a female sample without structural variants, i.e. two copies of IKBKG and two copies of IKBKGP1.
+- The middle panel shows a female sample with a copy of IKBKGP1 that carries the 11.7kb deletion.
+- The bottom panel shows a female sample where there is a duplication (duplicated three times) of the 11.7kb region on a copy of IKBKGP1 (in green).
diff --git a/docs/NCF1.md b/docs/NCF1.md
@@ -0,0 +1,17 @@
+# NCF1
+
+NCF1 is differentiated from its pseudogenes NCF1B and NCF1C by the presence of GT at the begining of Exon 2 ([c.75_76del or p.Tyr26fs](https://www.ncbi.nlm.nih.gov/clinvar/variation/2249/)).
+
+## Fields in the `json` file
+
+- `total_cn`: total copy number of the family
+- `gene_cn`: copy number of the gene of interest, i.e. NCF1
+- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
+
+## Visualizing haplotypes
+
+To visualize phased haplotypes, load the output bam file in IGV, group reads by the `HP` tag and color alignments by `YC` tag. Reads are realigned to the main gene, NCF1. 
+
+Reads in blue are confidently consistent with a single haplotype. Reads in gray are either unassigned or consistent with more than one possible haplotype. When two haplotypes are identical over a region, there can be more than one haplotype consistent with a read, and the read is randomly assigned to a haplotype and colored in gray. 
+
+![NCF1 example](figures/NCF1.png)
diff --git a/docs/NEB.md b/docs/NEB.md
@@ -0,0 +1,19 @@
+# NEB
+
+Paraphase resolves the triplicate (TRI) repeat region in NEB, where copy number variants are common. 
+
+## Fields in the `json` file
+
+- `total_cn`: total copy number of the triplicate repeat
+- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
+- `alleles_final`: when possible, different copies of TRI are phased into alleles with read based phasing. 
+
+## Visualizing haplotypes
+
+To visualize phased haplotypes, load the output bam file in IGV, group reads by the `HP` tag and color alignments by `YC` tag. Reads are realigned to the first copy of TRI in the reference genome.
+
+Green and purple represent two alleles, i.e. all haplotypes in green are on one one allele and all haplotypes in purple are on the other allele. Reads in gray are either unassigned or consistent with more than one possible haplotype. When two haplotypes are identical over a region, there can be more than one haplotype consistent with a read, and the read is randomly assigned to a haplotype and colored in gray. 
+
+![NEB example](figures/NEB.png)
+
+This example has three copies of TRI on one allele and another three copies of TRI on the other allele.