Overview

Altai (Allele-specific Transcript Assembly Instrument) is a reference-based, allele-specific transcript assembler. It incorporates variants (for example, SNPs from a VCF file) into transcript assembly and assigns each variant to its correct phase, i.e. allele, in order to assemble full-length transcript sequences. Its output sequences are therefore expected to be the actual allele-specific sequences rather than subsequences of the reference genome.

Parts of Altai are based on Scallop (license).

Installation

Altai depends on two additional libraries, Boost and htslib. If they are not installed on your system, download and install them first.

Download Boost

If Boost has not been downloaded or installed, download Boost (license) from www.boost.org and uncompress it somewhere. Compiling and installing it are not necessary.

Install htslib

If htslib has not been installed, download htslib (license), version 1.5 or higher, from www.htslib.org. Note that htslib relies on zlib, so if zlib is not installed on your system, install it first.

Use the following commands to build htslib:

./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no
make
make install

By default, htslib installs under /usr/local (with the library in /usr/local/lib). To install it to a different location, replace the configure line above with the following, which adds --prefix=/path/to/your/htslib to the end:

./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no --prefix=/path/to/your/htslib

In this case, you also need to export the runtime library path (note the additional lib directory after the installation path):

export LD_LIBRARY_PATH=/path/to/your/htslib/lib:$LD_LIBRARY_PATH

Compile Altai

Use the following to compile Altai:

./configure --with-htslib=/path/to/your/htslib --with-boost=/path/to/your/boost
make

If some of the dependencies are installed in the default system directory (for example, /usr/lib), then the corresponding --with- option might not be necessary. The executable file altai will appear at src/altai.

Usage

The usage of altai is:

./altai -i <input.bam> -j <variants.vcf> [--chr_exclude <comma,separated,chr,without,space>] [-G <genome.fa>] -o <output-prefix> [options]

The --chr_exclude option takes a comma-separated list of chromosome names, with no spaces. For example, you may want to exclude at least chrY and chrX from the assembly for a male sample (--chr_exclude chrY,chrX), and chrX for a female sample.
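As a small sketch of how the exclusion list might be put together in a script, the hypothetical helper `join_chr` below joins chromosome names with commas and no spaces. It then prints, rather than runs, an altai command that uses only flags from the usage line above. The file names are placeholders.

```shell
# join_chr: hypothetical helper that joins its arguments with commas (no spaces).
# The subshell body keeps the IFS change local to the function.
join_chr() (
  IFS=,
  printf '%s\n' "$*"
)

# Chromosomes to exclude for a male sample, as in the example above.
exclude=$(join_chr chrY chrX)

# Print the command instead of executing it; file names are placeholders.
echo "./altai -i input.bam -j variants.vcf --chr_exclude $exclude -o output-prefix"
```

Building the list from separate names avoids accidentally introducing a space, which the option does not accept.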

If you would like to output the transcript sequences in fasta format, -G <genome.fa> is necessary. Otherwise, it's optional.

The input.bam is the read alignment file generated by an RNA-seq aligner (for example, TopHat2, STAR, or HISAT2). We recommend using STAR with a personalized VCF: when given a VCF, STAR can annotate each read with the variants it overlaps.

# STAR with vcf is recommended. Other aligners are accepted.
STAR --runThreadN 8 \
 --outSAMstrandField intronMotif \
 --outSAMtype BAM SortedByCoordinate \
 --twopassMode Basic \
 --waspOutputMode SAMtag \
 --outSAMattributes NH HI AS nM NM MD jM jI XS MC ch vA vG vW \
 --genomeDir your_genome_dir \
 --outFileNamePrefix your_output_prefix \
 --varVCFfile your_sample_specific_vcf \
 --readFilesIn your_readfile_R1 your_readfile_R2

Make sure that the bam file is sorted; otherwise run samtools to sort it:

samtools sort input.bam > input.sort.bam

The reconstructed allele-specific transcripts are written in gtf format to output-prefix.gtf. Their sequences are written in fasta format to output-prefix.fa.
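For a quick summary of an assembly, one could count the transcript records in the GTF. This sketch assumes that Altai, like Scallop, writes one line per assembled transcript with `transcript` in the feature column (column 3); that layout is an assumption and is not stated in this README. The helper name `count_transcripts` and the two demo records are made up for illustration.

```shell
# count_transcripts FILE: count GTF lines whose feature column is "transcript".
# Assumption: one such line per assembled transcript.
count_transcripts() {
  awk -F'\t' '!/^#/ && $3 == "transcript" { n++ } END { print n + 0 }' "$1"
}

# Demo on a made-up two-line GTF (one transcript with one exon).
gtf=$(mktemp)
printf 'chr1\taltai\ttranscript\t100\t500\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";\n' > "$gtf"
printf 'chr1\taltai\texon\t100\t500\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";\n' >> "$gtf"
count_transcripts "$gtf"
rm -f "$gtf"
```

On real output, the call would be `count_transcripts output-prefix.gtf`.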

The variants.vcf is a Variant Call Format file generated by a variant caller from DNA or RNA data, or downloaded from a database. Variants should be phased. In the standard VCF layout, the FORMAT column (9th) must include the GT field and the sample column (10th) carries the sample-specific genotype, e.g. 1|0 means that allele 1 carries the alternative allele and allele 2 carries the reference allele.
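Before running the assembler, it may help to check that the genotypes really are phased. The sketch below assumes the layout defined by the VCF specification (FORMAT in column 9, first sample in column 10) and counts records whose GT value contains no `|`. The helper name `count_unphased` is hypothetical and the demo records are made up. Note that this simple test also counts haploid calls and genotypes written with `/`, including homozygous ones.

```shell
# count_unphased FILE: count records whose first-sample GT has no "|" separator.
# Assumption: FORMAT is column 9 and the first sample is column 10.
count_unphased() {
  awk -F'\t' '
    /^#/ { next }                      # skip meta and header lines
    {
      n = split($9, keys, ":")         # FORMAT keys, e.g. GT:DP
      split($10, vals, ":")            # matching values for the first sample
      gt = ""
      for (i = 1; i <= n; i++)
        if (keys[i] == "GT") gt = vals[i]
      if (index(gt, "|") == 0) bad++   # missing or unphased genotype
    }
    END { print bad + 0 }
  ' "$1"
}

# Demo on a made-up VCF with one phased and one unphased record.
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n' >> "$vcf"
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t1|0\n' >> "$vcf"
printf 'chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0/1:12\n' >> "$vcf"
count_unphased "$vcf"   # prints 1
rm -f "$vcf"
```

A result of 0 on your own file suggests that every record carries a phased genotype for the first sample.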

Make sure the vcf file is also sorted (with contigs in the same order as the bam file); otherwise run bcftools to sort it:

bcftools sort input.vcf > output.vcf
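To compare contig order between the two files, one option is to list the contigs of each in order of first appearance and inspect the two lists side by side. The helper `contig_order` below is a hypothetical sketch that does this for any tab-separated input with the contig name in column 1, which covers VCF records. The commented commands show how it might be applied to real files; they assume samtools is installed and that SN is the first tag on each @SQ header line, which is conventional but not guaranteed.

```shell
# contig_order: read tab-separated records on stdin and print each contig name
# (column 1) once, in order of first appearance. Header lines (#...) are skipped.
contig_order() {
  awk -F'\t' '!/^#/ && !seen[$1]++ { print $1 }'
}

# Demo on made-up records: prints chr1, then chr2.
printf 'chr1\t100\nchr1\t200\nchr2\t50\n' | contig_order

# On real data (not run here):
#   contig_order < variants.vcf
#   samtools view -H input.bam | awk '$1 == "@SQ" { sub(/^SN:/, "", $2); print $2 }'
```

The contigs listed for the VCF should appear in the same relative order as those listed for the BAM header.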

The reconstructed allele-specific transcripts are also written in gvf format to output-prefix.gvf.