Malaria infections often contain multiple genotypes, and when sequenced together these produce a complex signal that is a mixture of the individual genotypes. Building on the framework of earlier programs like DEploid and DEploidIBD, Tapestry attempts to pull these individual genotypes apart by exploiting allele frequency imbalances within a sample, while simultaneously estimating segments of identity by descent (IBD) between sequences. Unlike previous programs, Tapestry uses advanced MCMC methods to ensure that results are robust even for high complexity of infection (COI).
Step 1: Install the HTSlib dependency, e.g.:
git clone https://github.com/samtools/htslib
cd htslib
autoreconf -i # Build the configure script and install files it uses
./configure # Optional but recommended, for choosing extra functionality
make
make install
Step 2: Clone Tapestry.
git clone https://github.com/mrc-ide/Tapestry.git
cd Tapestry
Step 3: Compile Tapestry using CMake.
For a (slower) debugging version:
cd build
cmake ..
make
For a (faster) release version:
mkdir release
cd release
cmake .. -DCMAKE_BUILD_TYPE="Release"
make
Optional: By default the tests are not compiled. To compile them, open CMakeLists.txt in a text editor, and change line 16:
set(COMPILE_TESTS OFF)
...to...
set(COMPILE_TESTS ON)
Repeat Step 3 and then run the tests with the executable `./test_tapestry`. The testing framework is GoogleTest.
The executable `tapestry` will be in your `/build` or `/release` directory.
$ ./release/tapestry infer --help
Run inference from an filtered VCF.
Usage: ./release/tapestry infer [OPTIONS]
Options:
-h,--help Print this help message and exit
Input and output:
-i,--input_vcf TEXT:FILE REQUIRED
Path to input VCF file.
-s,--target_sample TEXT REQUIRED
Target sample in VCF.
-o,--output_dir TEXT Output directory.
Model Hyperparameters:
-K,--COI INT:INT in [1 - 6] Complexity of infection.
-e,--error_ref FLOAT:INT in [0 - 1]
Probability of REF->ALT error.
-E,--error_alt FLOAT:INT in [0 - 1]
Probability of ALT->REF error.
-v,--var_wsaf FLOAT:POSITIVE
Controls dispersion in WSAF. Larger is less dispersed.
-r,--recomb_rate FLOAT:POSITIVE
Recombination rate in kbp/cM.
-b,--n_wsaf_bins INT:INT in [100 - 10000]
Number of WSAF bins in Betabin lookup table.
MCMC Parameters:
-w,--w_proposal FLOAT:POSITIVE
Controls variance in proportion proposals.
This repository includes a small Python package under `/python` for plotting the outputs from Tapestry. Install it like so:
cd python
conda update conda
conda env create -f environment.yml
conda activate unravel
pip install -e .
You should now have access to `unravel` on the command line:
$ unravel sample --help
Usage: unravel sample [OPTIONS]
Plot Tapestry outputs for an individual sample
Options:
-i, --input_dir PATH Directory containing Tapestry outputs, for an
individual sample. [required]
--help Show this message and exit.
- Copying a Particle inside ProposalEngine could become slow when there is a large number of parameters. Can we use pointers?
- Some objects (e.g. Model) could get very large. We should consider allocating them on the heap to avoid stack overflow.
- ProposalEngine does not make any allowance for asymmetric proposal distributions.
You can see all minor TODOs with:
grep -nC5 "TODO" src/*
We imagine a scenario where our sequencing data is generated by a fixed number of genetically distinct strains, which may or may not share regions of identity by descent. Each strain is imagined to comprise a fixed proportion of the infection. The fraction of reads that are derived from each strain is influenced by this proportion.
The likelihood is formulated as a Hidden Markov Model:
$$ P(\vec{X}, \vec{S} | \Theta) = P(S_1) P(X_1 | S_1) \prod_{i=2}^{L} P(S_i|S_{i-1}) P(X_i | S_i) $$
As such, we can compute it by defining initiation, transition, and emission probabilities.
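The equation above gives the joint probability for one path of hidden states. As a minimal sketch of how the three ingredients combine, the standard forward recursion below sums over all hidden-state paths to give the marginal likelihood of the data. The function name `hmm_log_likelihood` and the list-based layout of `init`, `trans`, and `emit` are assumptions made for illustration and do not reflect Tapestry's internals. A separate transition matrix per pair of adjacent sites is assumed, so that transitions may depend on the distance between SNPs.

```python
import math


def hmm_log_likelihood(init, trans, emit):
    """Log marginal likelihood of an HMM via the scaled forward recursion.

    init[s]        : P(S_1 = s)
    trans[i][r][s] : P(S_{i+2} = s | S_{i+1} = r), one matrix per adjacent site pair
    emit[i][s]     : P(X_{i+1} | S_{i+1} = s)
    (indices are 0-based in code, 1-based in the equation)
    """
    n_states = len(init)
    # Initiation: alpha_1(s) = P(S_1 = s) P(X_1 | S_1 = s)
    alpha = [init[s] * emit[0][s] for s in range(n_states)]
    log_lik = 0.0
    for i in range(1, len(emit)):
        # Rescale to avoid underflow over many sites, tracking the log of the scale
        total = sum(alpha)
        log_lik += math.log(total)
        alpha = [a / total for a in alpha]
        # Transition then emission for site i
        t = trans[i - 1]
        alpha = [
            emit[i][s] * sum(alpha[r] * t[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return log_lik + math.log(sum(alpha))


# Toy example: two hidden states, two sites
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.2, 0.8]]]
emit = [[0.9, 0.2], [0.3, 0.7]]
print(math.exp(hmm_log_likelihood(init, trans, emit)))  # approximately 0.215
```

Working in log space with rescaling keeps the recursion stable when the number of SNPs is large.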
Given a set of proportions and haplotype states, we can compute the proportion of the sample comprised of the alternative allele (or rather the expected WSAF without accounting for error), as:
$$ q_i = \sum_{j=1}^{K} w_jh_{ij} $$
We then adjust for sequencing error by assuming two fixed error rates, $e_0$ and $e_1$:
$$ \pi_i = q_i(1-e_1) + (1 - q_i)e_0$$
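The two formulas above can be sketched for a single site as follows. The function names are hypothetical and chosen for illustration. The default error rates are the values listed in the fixed-parameter table, and the example proportions and haplotype states are made up.

```python
def expected_wsaf(proportions, haplotype):
    """q_i = sum_j w_j * h_ij for one site (0 = REF, 1 = ALT per strain)."""
    if len(proportions) != len(haplotype):
        raise ValueError("need one haplotype state per strain")
    return sum(w * h for w, h in zip(proportions, haplotype))


def error_adjusted_wsaf(q, e_0=0.01, e_1=0.05):
    """pi_i = q_i (1 - e_1) + (1 - q_i) e_0.

    e_0 is the REF->ALT error rate and e_1 the ALT->REF error rate.
    """
    return q * (1 - e_1) + (1 - q) * e_0


# Illustrative values: three strains, the first and third carry the ALT allele
w = [0.6, 0.3, 0.1]
h = [1, 0, 1]
q = expected_wsaf(w, h)       # 0.6 + 0.1 = 0.7
pi = error_adjusted_wsaf(q)   # 0.7 * 0.95 + 0.3 * 0.01 = 0.668
print(q, pi)
```

Note that with non-zero error rates the adjusted value never reaches exactly 0 or 1, even when every strain carries the same allele.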
If we sequenced every parasite genome in the host, our resultant observed WSAF,
We re-parameterise the distribution to have better control over its mean and variance:
With this parameterisation, the expected error-adjusted WSAF is:
as we sought. Additionally we have:
which gives us good control over the variance.
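As a sketch, one common mean-and-dispersion parameterisation of the beta distribution is consistent with this description. The symbols $\alpha_i$, $\beta_i$, and $\nu$, and the exact form, are assumptions for illustration and are not confirmed against the Tapestry source. Writing the shape parameters as

$$ \alpha_i = \pi_i \nu, \qquad \beta_i = (1 - \pi_i) \nu $$

gives a mean of

$$ \frac{\alpha_i}{\alpha_i + \beta_i} = \pi_i $$

and a variance of

$$ \frac{\alpha_i \beta_i}{(\alpha_i + \beta_i)^2 (\alpha_i + \beta_i + 1)} = \frac{\pi_i (1 - \pi_i)}{\nu + 1} $$

Under this form a single term $\nu$ sets the dispersion, and a larger $\nu$ gives a less dispersed WSAF, which matches the behaviour described for the `--var_wsaf` option.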
If reads were sampled independently, randomly, and with replacement from the underlying strain proportions, we would expect the observed WSAF to be binomially distributed. However, we would like to allow additional dispersion across SNPs. For example, genomic context can influence sequencing performance, and that context will be different for each SNP; SNPs with identical haplotype configurations should have more than just binomial variance in their observed WSAF. The
Rather than explicitly inferring haplotypes at each site, we marginalise over all possible haplotype configurations by making our final emission probability a finite mixture of beta-binomial distributions. Each haplotype configuration,
Our emission probability at each site becomes:
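A minimal sketch of such a mixture is given below, under two loudly stated assumptions that are not taken from the Tapestry source. First, the beta-binomial is parameterised as $\alpha = \pi\nu$, $\beta = (1-\pi)\nu$. Second, each haplotype configuration is weighted by an independence prior in which every strain carries the ALT allele with probability equal to the population-level allele frequency, which ignores IBD between strains. The function names are hypothetical.

```python
import itertools
import math


def log_betabinom_pmf(k, n, alpha, beta):
    """Log beta-binomial probability of k ALT reads out of n."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_num = (math.lgamma(k + alpha) + math.lgamma(n - k + beta)
               - math.lgamma(n + alpha + beta))
    log_den = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return log_choose + log_num - log_den


def emission_probability(alt, depth, proportions, plaf,
                         nu=500.0, e_0=0.01, e_1=0.05):
    """Mixture of beta-binomials over all 2^K haplotype configurations.

    Requires e_0 > 0 and e_1 > 0 so that both shape parameters are positive.
    """
    total = 0.0
    for h in itertools.product((0, 1), repeat=len(proportions)):
        # ASSUMED prior: strains are independent draws at frequency plaf (no IBD)
        n_alt = sum(h)
        prior = plaf ** n_alt * (1 - plaf) ** (len(h) - n_alt)
        # Expected WSAF for this configuration, then error adjustment
        q = sum(w * a for w, a in zip(proportions, h))
        pi = q * (1 - e_1) + (1 - q) * e_0
        # ASSUMED parameterisation: alpha = pi * nu, beta = (1 - pi) * nu
        total += prior * math.exp(
            log_betabinom_pmf(alt, depth, pi * nu, (1 - pi) * nu)
        )
    return total


# Illustrative call: 12 ALT reads out of 20, two strains at 70% and 30%
print(emission_probability(12, 20, [0.7, 0.3], plaf=0.4))
```

Because the number of configurations grows as $2^K$, enumerating them directly is only practical for small COI.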
The IBD configuration
TODO
TODO
| Parameter | Description |
|---|---|
| $L$ | Number of SNPs. |
| $i$ | SNP index. |
| | Reference (REF) allele count. |
| | Alternative (ALT) allele count. |
| | Observed within-sample alternative allele frequency (WSAF). |
| | Estimated population-level alternative allele frequency. |
| | Physical distance between SNPs, in basepairs. |
| Parameter | Description |
|---|---|
| $K$ | Number of strains in sample, i.e. complexity of infection (COI). |
| $j$ | Strain index. |
| $w_j$ | Abundance of strain $j$. |
| $h_{ij}$ | Haplotype state of strain $j$ at SNP $i$. |
| $q_i$ | The expected WSAF, without error adjustment. |
| $\pi_i$ | The error-adjusted expected WSAF. |
| | Index for the IBD state. |
| Parameter | Description | Value | Reference |
|---|---|---|---|
| | Recombination rate. | 13.5 kbp per centiMorgan | Miles et al. (2016) |
| $e_0$ | REF to ALT read count error rate. | 0.01 | Calibrated from Pf3k |
| $e_1$ | ALT to REF read count error rate. | 0.05 | Calibrated from Pf3k |
| | Term setting dispersion in WSAF. | 500 | Calibrated from Pf3k |