The WES-Variant-Calling workflow is designed to process human Whole Exome Sequencing (WES) data following GATK4 best practices for germline variant calling. This pipeline encompasses data downloading, quality control, alignment, duplicate marking, base quality score recalibration (BQSR), variant calling, filtering, and VCF validation.
Before running the WES-Variant-Calling pipeline, ensure you have the following:
-
Reference Files :
- Human Reference Genome (downloaded during execution): GRCh38 (
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
) - dbSNP VCF (downloaded during execution):
Homo_sapiens_assembly38.dbsnp138.vcf
- Sample Metadata TSV: Contains URLs for FASTQ files and sample information
- Human Reference Genome (downloaded during execution): GRCh38 (
-
Software Dependencies:
- Anaconda: For environment management
- Bioinformatics Tools (downloaded during execution): Installed via Conda (e.g., GATK4, BWA-MEM, Samtools, FastQC, Picard, BCFtools, VCF-validator)
-
Compute Resources:
- Sufficient memory and CPU cores as required by the pipeline steps
git clone https://github.com/KavyaBanerj/WES-Variant-Calling.git
cd WES-Variant-Calling
Create a TSV file (e.g., igsr_HG00479.tsv
) containing the URLs for your FASTQ files and sample information.
Ensure that the wes_pipeline.sh
script has the correct paths and configurations based on your directory structure and sample metadata.
Execute the pipeline script:
bash ./wes_pipeline.sh
Note: Ensure that the script has execute permissions. If not, set them using:
chmod +x ./wes_pipeline.sh
-
Data Downloading:
- Downloads FASTQ files based on URLs provided in the metadata TSV.
- Downloads and prepares the reference genome and dbSNP VCF.
-
Quality Control (QC):
- Runs FastQC on raw FASTQ files to assess quality metrics.
-
Alignment:
- Aligns reads to the reference genome using BWA-MEM.
- Includes read group (
@RG
) information for downstream analyses.
-
Duplicate Marking:
- Marks PCR duplicates using GATK's
MarkDuplicatesSpark
.
- Marks PCR duplicates using GATK's
-
Base Quality Score Recalibration (BQSR):
- Performs BQSR to correct systematic errors in base quality scores.
-
Variant Calling:
- Calls variants using GATK's
HaplotypeCaller
. - Generates raw VCF files with variant annotations.
- Calls variants using GATK's
-
Variant Extraction:
- Extracts SNPs and Indels into separate VCF files using
SelectVariants
.
- Extracts SNPs and Indels into separate VCF files using
-
Variant Filtering:
- Applies filters based on quality metrics (
QD
,FS
,MQ
,SOR
,MQRankSum
,ReadPosRankSum
) usingVariantFiltration
. - Applies genotype-level filters (
DP
,GQ
).
- Applies filters based on quality metrics (
-
VCF Processing:
- Further filters variants using
grep
to remove genotype filter flags. - Sorts, compresses, indexes, and concatenates SNP and Indel VCFs using
bcftools
.
- Further filters variants using
-
VCF Validation:
- Validates the final VCF file using GATK's
ValidateVariants
.
- Validates the final VCF file using GATK's
The pipeline generates several output files within the results/
directory:
-
Raw Variants:
raw_variants.vcf
: Unfiltered variant calls.
-
Filtered Variants:
filtered_snps.vcf
: SNPs after applying variant filters.filtered_indels.vcf
: Indels after applying variant filters.
-
Analysis-Ready Variants:
analysis_ready_snps.vcf
: SNPs passing all filters.analysis_ready_indels.vcf
: Indels passing all filters.
-
Final Combined VCF:
combined_analysis_ready_{sample_name}.vcf.gz
: Concatenated and compressed VCF file ready for downstream tools like wANNOVAR.
- Pipeline inspired by WES-analysis and variant_calling