-
Notifications
You must be signed in to change notification settings - Fork 11
Tutorial: Analyzing AAV Data
Last Updated: 11/10/2024
NOTE: this repo is being deprecated. Please go to FormBio's laava for the latest maintained open source code and wiki. Thanks!
Portions of this wiki are still being kept here for historical purpose, however they could be deleted at any time
Example Datasets, scAAV and ssAAV
NOTE 1: if you are using SMRTLink v11 and later, please use the "AAV" mode in Run Design which will run recalladapter --> CCS with heteroduplex detection on the instrument, which will save you tons of computing time!
NOTE 2: If you want to understand more how the AAV bfx + report in this GitHub works, check out the AAV bioinformatics webinar I did! Note however the nomenclature has slightly changed and is described in the section AAV classification.
(optional) 0A. Recall + CCS from CLR (subread) data, if using SMRTLink prior to v11
(optional) 0B. Demultiplexing samples
1. Preparing the genome and annotation file
2. Mapping reads to AAV Genome
3. Summarizing and Plotting Alignment Results
4. AAV Classification: ssAAV vs scAAV, full vs partials vs backbone, and more
recalladapters | Recalling adapter from CLR data |
---|---|
Required for | scAAV (or if there's a mixture of scAAV with ssAAV) |
Input | CLR data (<movie>.subreads.bam,.bam.pbi and <movie>.scraps.bam,.bam.pbi and <movie>.subreadset.xml ) |
Output | Recalled Subreads (<movie>.recall_subreads.bam ) |
recalladapters is necessary for self-complementary AAV (scAAV) constructs because only one side of the molecule has a SMRT Adapter. Without recalling adapter, multiplexed scAAV data may show poor demultiplexing results (as only one side of the barcode can be found). With proper recall adapter, the new CCS reads should double in read length, and mapping to the AAV construct should show a primary and supplementary alignment that is reverse-complement to each other and covering the whole ITR-to-ITR region.
Make sure the input XML file is straight from the sequencer. A newly generated subreadset might break recalladapters
.
Use the following command to recall adapters from a CLR (subreads) BAM file:
recalladapters \
-j <threads> \
--minSnr=2.0 \
--disableAdapterCorrection \
--subreadset <movie>.subreadset.xml \
-o <movie>.recalled
For the example scAAV dataset, see the run.sh
file in step 1-recall adapters on the exact commands used to generate the recalled BAM files.
Note - customers using recalladapters
using the bc2000 series adapters with SMRT link versions prior to SL v9 would additionally need to supply --adapter loop.fasta
where loop.fasta has:
>left_adapter
CAACAACAACAACGGAGGAGGAGGAAAAG
To upload the recalled data to SMRTlink, the subreadset.xml
file will need to be regenerated (recalladapters
produces a broken subreadset). We'll use dataset create
on the recalled subreads file to make an importable subreadset. dataset
is available through pbbioconda.
dataset create \
--type SubreadSet \
--generateIndices \
--name <movie>.recalled.fixed \
<movie>.recalled.fixed.subreadset.xml
<movie>.recalled.subreads.bam
The new <movie>.recalled.fixed.subreadset.xml
should now be importable into SMRTlink for CCS.
ccs | Re-run CCS after recalladapters (recommend v6.2.0+) |
---|---|
Required for | scAAV |
Input | Recalled Subreads (<movie>.recall_subreads.bam ) |
Output | Recalled CCS (<movie>.hifi_reads.bam ) |
After recalladapters
, CCS must be re-run.
In the example dataset, we show how this is run via SMRTLink script that also does de-multiplexing at the same time.
You can use SMRTLink (or Cromwell workflows) to run CCS (with or without demultiplexing, depending on your sample), since CCS can be time-consuming without proper chunking that can be handled by SMRTLink/Cromwell. See the SMRTLink Parameter set used for this CCS With Demultiplexing Run.
For datasets with mixed populations of AAV molecule types (ss and sc), ccs
should be run using the --by-strand
option.
If your sample is multiplexed, use SMRTLink Demultiplexing or CCS With Demultiplexing as a GUI option to demultiplex samples.
If using command line, use the lima tool.
An example is below:
lima -s --peek-guess --split-named \
hifi_reads.bam \
barcodes.fasta \
demux.bam
All the remaining analyses are shown as performed for a single (demultiplexed) sample.
The following fasta files (if available) should be combined into a single "genome" fasta file:
- (required) vector (including the AAV vector + plasmid backbone as a single sequence)
- helper plasmid
- repcap plasmid
- host genome (ex: hg38)
NOTE the sequence IDs should be free of blank spaces and symbols. Stick with numbers, alphabet letters, and _
and -
. If necessary, rename the sequence IDs in the combined fasta file.
Create a annotation.txt
file according to the following format:
NAME=<sequence id>;TYPE={vector|helper|repcap|host|lambda};REGION=<start>-<end>;
Only the vector
annotation is required and must be marked with REGION=
(the position from ITR to ITR) as well. All other types are optional.
For example:
NAME=myAAV_plasmid;TYPE=vector;REGION=100-2000;
NAME=myRepCap_plasmid;TYPE=repcap;REGION=500-1500;
NAME=myHelper_plasmid;TYPE=helper;
NAME=chr1;TYPE=host;
NAME=chr2;TYPE=host;
NAME=chr3;TYPE=host;
IMPORTANT!!! you must have exactly the same number of chromosomes in the reference fasta file as annotations file. This is especially common if you are including human genome (hg38) which has a lot of alternative chromosomes. It is recommended that you use a version of hg38 that only lists the major chromosomes.
minimap2 | Map CCS to Genome |
---|---|
Required for | all AAV samples |
Input | CCS (<movie>.hifi_reads.fasta ) |
Output | Mapped, sorted CCS SAM/BAM file (`.sorted_mapped.sam |
If you have multiplexed samples, map this per demultiplexed CCS BAM file.
To map the HiFi/CCS sequences to the genome, run the following:
minimap2 --eqx -a --secondary=no \
-t <threads> \
genome.fasta <movie>.hifi_reads.fasta > mapped.sam
samtools sort -n -O SAM mapped.sam > mapped.sort_by_read.sam
You can directly clone the AAV repo. You will need to make sure you have the Python and R dependencies. Read the AAV install section to see how to get the dependencies.
usage: summarize_AAV_alignment.py [-h] [--max_allowed_missing_flanking MAX_ALLOWED_MISSING_FLANKING] [--cpus CPUS]
[--debug]
sam_filename annotation_txt output_prefix
positional arguments:
sam_filename Sorted by read name SAM file
annotation_txt Annotation file
output_prefix Output prefix
optional arguments:
-h, --help show this help message and exit
--max_allowed_missing_flanking MAX_ALLOWED_MISSING_FLANKING
Maximum allowed missing flanking bp to be still considered 'full'
(default: 100)
--cpus CPUS Number of CPUs (default: 1)
--debug
For example:
python summarize_AAV_alignment.py \
mapped.sort_by_read.sam annotation.txt \
myAAV
--cpus 12
We can now plot the summary into a PDF:
Rscript plotAAVreport.R myAAV annotation.txt
Refer to laava's wiki for the latest definitions.