Workshop: SARS-CoV-2 genome reconstruction from Illumina and ONT data

In general

Here we describe basic tools and commands to process either Illumina or Nanopore sequencing data using SARS-CoV-2 amplification data as an example. The goal is to construct a genome sequence using a reference-based approach. Althought the methodological steps are quite similar, both Illumina and Nanopore need different tools and parameters.

Example data can be found here: https://osf.io/9qkz5/ and is also referenced in the used commands again.

Basic setup

In this part, we will go through the different steps to generate from a FASTQ raw sequencing data file a SARS-CoV-2 consensus genome and lineage annotation. We will do this using the Linux system and command line interfance.

The tools we will use are listed here and will be discussed in detail below:

QC

FastQC (Illumina)
fastp (Illumina)
NanoPlot (Nanopore)
Filtlong (Nanopore)
BAMclipper (Illumina & Nanopore)

Mapping

minimap2 (Illumina & Nanopore)
SAMtools (Illumina & Nanopore)
IGV (Illumina & Nanopore)

Variant calling & Consensus

Medaka (Nanopore)
freebayes (Illumina)
BCFtools (Illumina & Nanopore)

Lineage annotation

Pangolin (Illumina & Nanopore)

In addition, we install but will not directly use:

PycoQC
SNPeff
president

These tools are handy if you want to perform quality-control directly on the raw signal FAST5 files (PycoQC), of you want to annotate variant calls for example to know which amino acid is changed (SNPeff), and to perform a basic quality-control of your final consensus sequence (president).

In general, you can Google all of these tools and find further information in publications, GitHub code repositories and on (Bio)conda.

Install all tools Because this is a quite heavy environment we will use mamba as an updated version of conda to faster resolve the environment and install all tools.

# config some channels, this might be already done.
# basically helps to not explicitly type the channels
conda config --add channels default
conda config --add channels bioconda
conda config --add channels conda-forge

# now we create a new environment and install all tools
mamba create -n workshop fastqc fastp nanoplot pycoqc filtlong minimap2 samtools bcftools igv pangolin president snpeff bamclipper freebayes
conda activate workshop

Note: We skip medaka here because the tools has some conflicting dependencies with other tools. Thus, we will use medaka later in a separate mamba environment.

Note2: It might be also more convenient to have separate environments for each tool or to summarize tools per task (e.g. mapping) in one environment. Like it would be also best practice using a Workflow Management System. You can also do that if you want and then switch between environments.

Example input data

You can obtain example read files for Illumina and Nanopore here or use the commands below:

# Illumina
wget --no-check-certificate https://osf.io/yxep6/download -O SARSCoV2-illumina.R1.fastq.gz
wget --no-check-certificate https://osf.io/qvzsh/download -O SARSCoV2-illumina.R2.fastq.gz

# Nanopore
wget --no-check-certificate https://osf.io/k9px6/download -O SARSCoV2-nanopore.fastq.gz

ATTENTION: here we set now variables to point to the example data sets so we can just use that in the following commands. Adjust accordingly for your system and folder structure.

ILLUMINA_SAMPLE1='SARSCoV2-illumina.R1.fastq.gz'
ILLUMINA_SAMPLE2='SARSCoV2-illumina.R2.fastq.gz'
NANOPORE_SAMPLE='SARSCoV2-nanopore.fastq.gz'

FASTQ quality control

Illumina

FastQC

Quality assessment

# activate the conda environment
conda activate workshop

# always remember that almost all tools have a help page
fastqc --help

fastqc -t 4 $ILLUMINA_SAMPLE1 $ILLUMINA_SAMPLE2

Check the output HTML file. How does the quality look like? Anything strange?

fastp

Now, we want to quality-trim the Illumina reads to get rid of low-quality base calls especially at the end of reads. Then we check again the quality via fastqc.

Quality trimming

fastp -i $ILLUMINA_SAMPLE1 -I $ILLUMINA_SAMPLE2 -o clean_reads.R1.fastq.gz -O clean_reads.R2.fastq.gz --thread 4 --qualified_quality_phred 20 --length_required 50

fastqc -t 2 clean_reads.R{1,2}.fastq.gz

How did the quality change?

ATTENTION: From now on we work with the clean_reads.R{1,2}.fastq.gz data files!

Publication | Code

Nanopore

NanoPlot

Quality assessment

# activate the conda environment
conda activate workshop

# always remember that almost all tools have a help page
NanoPlot --help

# run NanoPlot on your FASTQ file
NanoPlot -t 4 --fastq $NANOPORE_SAMPLE -o nanoplot/raw 
    
# run NanoPlot on your FASTQ file with some more parameters
NanoPlot -t 4 --fastq $NANOPORE_SAMPLE --title "Raw reads" \
    --color darkslategrey --N50 --loglength -o nanoplot/raw

Publication | Code

Length filtering

filtlong

# Attention! min and max length of course depent on your sequencing protocol! 
# the example nanopore reads were sequenced with the ARTIC V1200 kit and thus
# yield ~1.2kbp reads
filtlong --min_length 800 --max_length 1400 $NANOPORE_SAMPLE | gzip - > clean_reads_nanopore.fastq.gz

NanoPlot -t 4 --fastq clean_reads_nanopore.fastq.gz --title "Filtered reads" \
    --color darkslategrey --N50 --loglength -o nanoplot/clean

Code

Note: PycoQC is another neat tool for QC and can be also used on raw signal FAST5 data, thus, providing even more detailed insights in the quality of your data. Please feel free to install and run the tool on your own using raw FAST5 data.

Mapping & visualization

Illumina & Nanopore

minimap2
SAMtools
IGV

First, we need a reference genome (in FASTA format) to map against. We will use the index reference sequence for SARS-CoV-2. You can download it manually from

NCBI, SARS-CoV-2 Index genome, MN908947.3

or for example via

wget --no-check-certificate https://osf.io/kt4dq/download -O nCoV-2019.reference.fasta
# we also download an index for later usage
wget --no-check-certificate https://osf.io/5eunp/download -O nCoV-2019.reference.fasta.fai

ATTENTION: In our example we will use minimap2 for both Illumina and Nanopore data. However, minimap2 was initially developed with long-read alignment in mind. There is a short-read mode, but it might be beneficial to also use specialized short-read-mapper such as BWA-MEM.

Mapping

# map the filtered reads to the reference genome

# Illumina
minimap2 -x sr -t 4 -a -o minimap2-illumina.sam nCoV-2019.reference.fasta clean_reads.R1.fastq.gz clean_reads.R2.fastq.gz

# Nanopore
minimap2 -x map-ont -t 4 -a -o minimap2-nanopore.sam nCoV-2019.reference.fasta clean_reads_nanopore.fastq.gz

Publication | Code

Please notice the different parameter settings used for Illumina and Nanopore data! Check the GitHub page of minimap2 for more information.

Process mapping results

# convert mapping results (SAM format) into a binary format 
# the binary format can be faster read by a machine but not by a human
# we will use the binary format (called BAM) for visualization of the mapping

# we will use a FOR loop to process both, Illumina and Nanopore data

for SAM in minimap2-illumina.sam minimap2-nanopore.sam; do

    # get the basename of the input SAM
    BN=$(basename $SAM .sam)

    # 1) convert SAM to BAM --> check the file size after convert!
    samtools view -bS $SAM > $BN.bam

    # 2) sort the BAM
    samtools sort $BN.bam > $BN.sorted.bam

    # 3) index the BAM
    samtools index $BN.sorted.bam
done

Publication | Code

Look at the mapped reads

# start the Integrative Genomics Viewer (IGV)
igv &
# hint: if you still want to use your terminal although a tool with a 
# graphical interface is running, you can add a '&' to your command
# to move the running process in the background. By that, you can still
# use your terminal.

Now, we will load the reference FASTA (that we used for the mapping) and the sorted BAM file (with the mapped reads) into IGV to look at the results. The samtools index command is important so that IGV can actually load the sorted BAM file! As so often, a certain index structure is needed.

First, load the nCoV-2019.reference.fasta reference FASTA via "Genomes" > "Load Genome from File"
Second, load the sorted BAM file via "File" > "Load from File"

Hint: You can load both the Illumina and Nanopore BAM file and compare them! How do they differ?

Additional resources

SAM format specifications

Primer clipping

BAMclipper

It is important to remove primer sequences from your amplicon reads. The primer sequences are used to amplify parts of the SARS-CoV-2 genome. But that also means that they are not actually part of your sequenced SARS-CoV-2 sample and thus might mask important mutations located in the primer binding region. Therefore, they should be removed, which you can for example do via BAMclipper.

To accurately clip the primer sequences, we need to know where they are located. This is usually stored in a so-called BED file and can be found for various primer schemes online.

Illumina

# First, we download the primer BED scheme for Cleanplex scheme that was used
# Change to another BED file if needed!
wget --no-check-certificate https://osf.io/4nztj/download -O cleanplex.amplicons.bedpe

# It's important that the FASTA header of the reference genome 
# and the IDs in the BED file match, let's check:
head nCoV-2019.reference.fasta
head cleanplex.amplicons.bedpe

# we can see: they dont match! 
# In the reference FASTA: 'MN908947.3'
# In the BED file: 'NC_045512.2'
# So we need to replace the ID in the BED file, e.g. via
sed 's/NC_045512.2/MN908947.3/g' cleanplex.amplicons.bedpe > cleanplex-corrected.amplicons.bedpe

# check again
head nCoV-2019.reference.fasta
head cleanplex-corrected.amplicons.bedpe

bamclipper.sh -b minimap2-illumina.sorted.bam -p cleanplex-corrected.amplicons.bedpe -n 4

Nanopore

# First, we download the primer BED scheme for the ARTIC V1200 scheme
# Change to another BED file if needed!
wget --no-check-certificate https://osf.io/3ks9b/download -O nCoV-2019.bed

# It's important that the FASTA header of the reference genome 
# and the IDs in the BED file match, let's check:
head nCoV-2019.reference.fasta
head nCoV-2019.bed

# now we convert this BED file into a BEDPE file needed by BAMclipper.
# The Illumina BED file we used above was already in the correct BEDPE format.
# we download a python script to do so:
wget --no-check-certificate https://osf.io/3295h/download -O primerbed2bedpe.py

# and run it
python primerbed2bedpe.py nCoV-2019.bed --forward_identifier _LEFT --reverse_identifier _RIGHT -o nCoV-2019.bedpe

# now we can use BAMclipper - finally
bamclipper.sh -b minimap2-nanopore.sorted.bam -p nCoV-2019.bedpe -n 4

Publication | Code

Task: Compare the unclipped BAM file with the clipped BAM file in IGV. Do you see differences? Hint: you can load multiple BAM files into IGV.

Code snippets how the official ARTIC pipeline removes primers with a custom script

Variant calling

Illumina

For Illumina data we use freebayes to call variants. We installed the tool already in the Conda workshop environment. We also use some default parameter.

Call variants with freebayes

# first re-calulate the index for the reference FASTA (sometimes issues occur bc/ of different samtool versions used)
samtools faidx nCoV-2019.reference.fasta

# now variant calling
freebayes -f nCoV-2019.reference.fasta --min-alternate-count 10 \
--min-alternate-fraction 0.1 --min-coverage 20 --pooled-continuous \
--haplotype-length -1 minimap2-illumina.sorted.primerclipped.bam > freebayes-illumina.vcf

Code

The result is a VCF file, short for variant call format. A tab-separated file with information on the detected variants/ mutations.

Nanopore

We want to use Medaka for variant calling. Medaka is not in your current workshop environment because it was conflicting with the other tools. That's why we need a separate Mamba environment for Medaka:

Make a new environment for medaka
- medaka might have many dependencies that conflict
an alternative to conda is mamba
- mamba can be much faster in solving your environment, e.g. here for the tool medaka
- thus, let us install mamba via conda and then install medaka

mamba create -y -p envs/medaka "medaka>=1.8.0"
conda activate envs/medaka

Code

Call variants with Medaka

# first generate a file with information about potential variants
# considering the used basecalling model. You should use the matching
# model from your Guppy basecalling settings!
medaka consensus --model r941_min_hac_g507 --threads 4 --chunk_len 800 --chunk_ovlp 400 minimap2-nanopore.sorted.primerclipped.bam medaka-nanopore.consensus.hdf

# actually call the variants
medaka variant nCoV-2019.reference.fasta medaka-nanopore.consensus.hdf medaka-nanopore.vcf

# annotate VCF with read depth info etc. so we can filter it
medaka tools annotate medaka-nanopore.vcf nCoV-2019.reference.fasta minimap2-nanopore.sorted.primerclipped.bam medaka-nanopore.annotate.vcf

Important: Always use the matching medaka model based on how you or others did the guppy basecalling! You can check which medaka models are available via:

medaka tools list_models | grep -v Default

Task: Compare the results from the VCF file with what you can observe via the IGV browser. Can you find the variants medaka called also in IGV?

Consensus generation

Illumina & Nanopore

Check and filter the VCF file

ATTENTION: We just do this here as an example using the Nanopore VCF. Just exchange the Nanopore annotated VCF input file with the Freebayes VCF file.

# switch to the workshop env if you are not already on it
conda activate workshop

# compress the annotated VCF file (needed for the next steps)
bgzip -f medaka-nanopore.annotate.vcf
 
# index a TAB-delimited genome position file in bgz format 
# and create an index file
tabix -f -p vcf medaka-nanopore.annotate.vcf.gz

# generate the consensus
bcftools consensus -f nCoV-2019.reference.fasta medaka-nanopore.annotate.vcf.gz -o consensus-nanopore.fasta

# rename the consensus FASTA, right now the FASTA ID is still the reference
sed -i 's/MN908947.3/Consensus-Nanopore/g' consensus-nanopore.fasta

Code

Attention: The above commands yield a simple consensus sequence without all necessary quality checks! What did we not do so far?

we did not properly filter the VCF file, e.g. for low-coverage variants
we should also mask low coverage regions in the consensus genome
- for example the 5' and 3' end of the genome usually has low coverage
- but amplicon drop outs can also lead to low (or even no) coverage in the genome
we did also not distinguish the two primer pools. It improves your accuracy to perform variant calling for each primer pool separately and then merge the resulting VCF files

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hands-on.md

hands-on.md

Workshop: SARS-CoV-2 genome reconstruction from Illumina and ONT data

In general

Basic setup

Example input data

FASTQ quality control

Illumina

Nanopore

Mapping & visualization

Illumina & Nanopore

Primer clipping

Illumina

Nanopore

Variant calling

Illumina

Nanopore

Consensus generation

Illumina & Nanopore

FURTHER READING BEYOND CONSENSUS

Lineage annotation

Illumina & Nanopore

Consensus QC

Illumina & Nanopore

Files

hands-on.md

Latest commit

History

hands-on.md

File metadata and controls

Workshop: SARS-CoV-2 genome reconstruction from Illumina and ONT data

In general

Basic setup

Example input data

FASTQ quality control

Illumina

Nanopore

Mapping & visualization

Illumina & Nanopore

Primer clipping

Illumina

Nanopore

Variant calling

Illumina

Nanopore

Consensus generation

Illumina & Nanopore

FURTHER READING BEYOND CONSENSUS

Lineage annotation

Illumina & Nanopore

Consensus QC

Illumina & Nanopore