Skip to content

Tutorial: Analyzing AAV Data

Elizabeth Tseng edited this page Nov 11, 2024 · 66 revisions

Last Updated: 11/10/2024

NOTE: this repo is being deprecated. Please go to FormBio's laava for the latest maintained open source code and wiki. Thanks!

Portions of this wiki are still being kept here for historical purpose, however they could be deleted at any time

Example Datasets, scAAV and ssAAV

NOTE 1: if you are using SMRTLink v11 and later, please use the "AAV" mode in Run Design which will run recalladapter --> CCS with heteroduplex detection on the instrument, which will save you tons of computing time!

NOTE 2: If you want to understand more how the AAV bfx + report in this GitHub works, check out the AAV bioinformatics webinar I did! Note however the nomenclature has slightly changed and is described in the section AAV classification.


(optional) 0A. Recall + CCS from CLR (subread) data, if using SMRTLink prior to v11
(optional) 0B. Demultiplexing samples
1. Preparing the genome and annotation file
2. Mapping reads to AAV Genome
3. Summarizing and Plotting Alignment Results
4. AAV Classification: ssAAV vs scAAV, full vs partials vs backbone, and more


0A. Recall Adapter and CCS re-run if using CLR (subread) data

recalladapters Recalling adapter from CLR data
Required for scAAV (or if there's a mixture of scAAV with ssAAV)
Input CLR data (<movie>.subreads.bam,.bam.pbi and <movie>.scraps.bam,.bam.pbi and <movie>.subreadset.xml)
Output Recalled Subreads (<movie>.recall_subreads.bam)

recalladapters is necessary for self-complementary AAV (scAAV) constructs because only one side of the molecule has a SMRT Adapter. Without recalling adapter, multiplexed scAAV data may show poor demultiplexing results (as only one side of the barcode can be found). With proper recall adapter, the new CCS reads should double in read length, and mapping to the AAV construct should show a primary and supplementary alignment that is reverse-complement to each other and covering the whole ITR-to-ITR region.

Make sure the input XML file is straight from the sequencer. A newly generated subreadset might break recalladapters.

Use the following command to recall adapters from a CLR (subreads) BAM file:

recalladapters \
    -j <threads> \
    --minSnr=2.0 \
    --disableAdapterCorrection \
    --subreadset <movie>.subreadset.xml \
    -o <movie>.recalled

For the example scAAV dataset, see the run.sh file in step 1-recall adapters on the exact commands used to generate the recalled BAM files.

Note - customers using recalladapters using the bc2000 series adapters with SMRT link versions prior to SL v9 would additionally need to supply --adapter loop.fasta where loop.fasta has:

>left_adapter
CAACAACAACAACGGAGGAGGAGGAAAAG

0.1 Fix the recalladapters output

To upload the recalled data to SMRTlink, the subreadset.xml file will need to be regenerated (recalladapters produces a broken subreadset). We'll use dataset create on the recalled subreads file to make an importable subreadset. dataset is available through pbbioconda.

dataset create \
    --type SubreadSet \
    --generateIndices \
    --name <movie>.recalled.fixed \
    <movie>.recalled.fixed.subreadset.xml
    <movie>.recalled.subreads.bam

The new <movie>.recalled.fixed.subreadset.xml should now be importable into SMRTlink for CCS.

0.2 Re-run CCS after Recall Adapter

ccs Re-run CCS after recalladapters (recommend v6.2.0+)
Required for scAAV
Input Recalled Subreads (<movie>.recall_subreads.bam)
Output Recalled CCS (<movie>.hifi_reads.bam)

After recalladapters, CCS must be re-run.

In the example dataset, we show how this is run via SMRTLink script that also does de-multiplexing at the same time.

You can use SMRTLink (or Cromwell workflows) to run CCS (with or without demultiplexing, depending on your sample), since CCS can be time-consuming without proper chunking that can be handled by SMRTLink/Cromwell. See the SMRTLink Parameter set used for this CCS With Demultiplexing Run.

For datasets with mixed populations of AAV molecule types (ss and sc), ccs should be run using the --by-strand option.

0B. (optional) Demultiplexing samples

If your sample is multiplexed, use SMRTLink Demultiplexing or CCS With Demultiplexing as a GUI option to demultiplex samples.

If using command line, use the lima tool.

An example is below:

lima -s --peek-guess --split-named \
        hifi_reads.bam \
        barcodes.fasta \
        demux.bam

All the remaining analyses are shown as performed for a single (demultiplexed) sample.

1. Preparing the genome and annotation file

The following fasta files (if available) should be combined into a single "genome" fasta file:

  • (required) vector (including the AAV vector + plasmid backbone as a single sequence)
  • helper plasmid
  • repcap plasmid
  • host genome (ex: hg38)

NOTE the sequence IDs should be free of blank spaces and symbols. Stick with numbers, alphabet letters, and _ and -. If necessary, rename the sequence IDs in the combined fasta file.

Create a annotation.txt file according to the following format:

NAME=<sequence id>;TYPE={vector|helper|repcap|host|lambda};REGION=<start>-<end>;

Only the vector annotation is required and must be marked with REGION= (the position from ITR to ITR) as well. All other types are optional.

For example:

NAME=myAAV_plasmid;TYPE=vector;REGION=100-2000;
NAME=myRepCap_plasmid;TYPE=repcap;REGION=500-1500;
NAME=myHelper_plasmid;TYPE=helper;
NAME=chr1;TYPE=host;
NAME=chr2;TYPE=host;
NAME=chr3;TYPE=host;

IMPORTANT!!! you must have exactly the same number of chromosomes in the reference fasta file as annotations file. This is especially common if you are including human genome (hg38) which has a lot of alternative chromosomes. It is recommended that you use a version of hg38 that only lists the major chromosomes.

2. Mapping reads to AAV Genome

minimap2 Map CCS to Genome
Required for all AAV samples
Input CCS (<movie>.hifi_reads.fasta)
Output Mapped, sorted CCS SAM/BAM file (`.sorted_mapped.sam

If you have multiplexed samples, map this per demultiplexed CCS BAM file.

To map the HiFi/CCS sequences to the genome, run the following:

minimap2 --eqx -a --secondary=no \
   -t <threads> \
   genome.fasta <movie>.hifi_reads.fasta > mapped.sam

samtools sort -n -O SAM mapped.sam > mapped.sort_by_read.sam

3. Summarizing and Plotting Alignment Results

You can directly clone the AAV repo. You will need to make sure you have the Python and R dependencies. Read the AAV install section to see how to get the dependencies.

usage: summarize_AAV_alignment.py [-h] [--max_allowed_missing_flanking MAX_ALLOWED_MISSING_FLANKING] [--cpus CPUS]
                                  [--debug]
                                  sam_filename annotation_txt output_prefix

positional arguments:
  sam_filename          Sorted by read name SAM file
  annotation_txt        Annotation file
  output_prefix         Output prefix

optional arguments:
  -h, --help            show this help message and exit
  --max_allowed_missing_flanking MAX_ALLOWED_MISSING_FLANKING
                        Maximum allowed missing flanking bp to be still considered 'full'
                        (default: 100)
  --cpus CPUS           Number of CPUs (default: 1)
  --debug

For example:

python summarize_AAV_alignment.py \
         mapped.sort_by_read.sam annotation.txt \
         myAAV 
         --cpus 12

We can now plot the summary into a PDF:

Rscript plotAAVreport.R myAAV annotation.txt

4. AAV Classification

Refer to laava's wiki for the latest definitions.