This guide outlines the process for simulating high-fidelity (HiFi) reads using PBSIM3, converting the output to BAM format, sorting and merging the files, and generating FASTQ files for further analysis.
This process simulates long-read sequencing using the pbsim3
tool, processes SAM/BAM files, and generates HiFi reads for downstream applications.
Ensure the following tools are installed and available:
-
PBSIM3: A tool for simulating PacBio HiFi reads. PBSIM3 GitHub Repository
-
Samtools: A suite of tools for handling SAM/BAM files. Samtools GitHub Repository
-
CCS (Circular Consensus Sequencing): Tool for generating consensus reads from BAM files. Installation instructions are provided below. CCS Tool
-
pigz: A parallel implementation of gzip, used for compressing FASTQ files. pigz GitHub Repository
Make sure all paths are correctly set, and the required reference files are in place.
CCS can be installed directly using conda:
-
Install Miniconda or Anaconda if it is not already installed. Anaconda Installation Guide
-
Install CCS via conda without creating a new environment:
conda install bioconda::pbccs
- Verify the installation by checking the CCS version:
ccs --version
First, generate HiFi reads with pbsim3
using the specified genome file and parameters.
pbsim \
--strategy wgs \
--method qshmm \
--qshmm pbsim3/data/QSHMM-RSII.model \
--depth 15 \
--genome hg38.fa \
--pass-num 7
Parameters:
--strategy wgs
: Specifies the Whole Genome Sequencing (WGS) strategy for simulating the reads.--method qshmm
: Uses the QSHMM method for read simulation.--qshmm
: Path to the QSHMM model file (should be inside the pbsim3 directory).--depth 15
: The sequencing depth for simulating the reads.--genome
: Path to the reference genome file (hg38.fa in this case).--pass-num 7
: Specifies the number of passes for the HiFi reads.
After generating the reads, convert the generated SAM files to BAM format using samtools. The following loop processes all SAM files (sd_0001.sam
, sd_0002.sam
, ..., sd_0046.sam
):
for i in $(seq -f "%04g" 1 46); do
input_file="sd_$i.sam"
output_file="sd_$i.bam"
if [ -f "$input_file" ]; then
echo "Processing $input_file to $output_file"
samtools view -b -o "$output_file" "$input_file"
fi
done
Explanation:
- Loops through all
sd_XXXX.sam
files. - Uses samtools view to convert each SAM file to its corresponding BAM file.
Merge all the BAM files into a single file:
samtools merge sd_merged.bam sd_*.bam
Next, sort the BAM file to prepare them for downstream processing:
samtools sort -o sd_merged_sorted.bam sd_merged.bam
Ensure CCS is installed as described above. Then generate consensus sequences from the merged BAM file:
ccs sd_merged_sorted.bam sd_merged_sorted.ccs.bam
Convert the CCS BAM file into a FASTQ file using samtools:
samtools bam2fq sd_merged_sorted.ccs.bam > hg38.fastq
Finally, compress the generated FASTQ file using pigz
to speed up compression with multiple threads:
pigz -p 8 hg38.fastq
prefix=hg38; ref=${prefix}.fa; fq=${prefix}.fq; count=46; pbsim --strategy wgs --method qshmm --qshmm pbsim3/data/QSHMM-RSII.model --depth 15 --genome ${ref} --pass-num 7 --prefix ${prefix}; rm -f *.ref; for i in $(seq -f "%04g" 1 ${count}); do if [ -f "${prefix}_${i}.sam" ]; then echo "Processing ${prefix}_${i}.sam"; samtools view -b -o ${prefix}_${i}.bam ${prefix}_${i}.sam; rm -f ${prefix}_${i}.sam; fi done; samtools merge ${prefix}_merged.bam ${prefix}_*.bam; for i in $(seq -f "%04g" 1 ${count}); do rm -f "${prefix}_${i}.bam"; done; samtools sort -o ${prefix}_merged_sorted.bam ${prefix}_merged.bam; rm -f ${prefix}_merged.bam; ccs ${prefix}_merged_sorted.bam ${prefix}_merged_sorted.ccs.bam; rm -f ${prefix}_merged_sorted.bam; samtools bam2fq ${prefix}_merged_sorted.ccs.bam > ${fq}; pigz -p 8 ${fq}; rm -f ${prefix}_merged_sorted.ccs.bam ${prefix}_merged_sorted.ccs.bam.pbi;
This project follows the licensing policies of the data sources and tools used.