Reproducing analysis from Hartmann et al., 2017 "A fungal wheat pathogen evolved host specialization by extensive chromosomal rearrangements".
The purpose of this repository is to demonstate capability of executing population, comparative genomics analysis, ultimately determining SNP variation between 54 Z. tritici strains. This is a subset of the 106 genomes analysed in the original paper.
Whole-genome sequencing and variant calling procedure
Raw sequencing data were deposited on the NCBI Short Read Archive under the BioProject PRJNA327615.
The sequencing reads were trimmed for remaining adapters and sequencing quality using the software Trimmomatic v0.32 (Bolger et al., 2014) with the following settings: illuminaclip=TruSeq3-PE.fa:2:30:10, leading=10, trailing=10, slidingwindow= 5:10, minlen=50. Sequence data from all isolates was aligned to the reference genome IPO323 (Goodwin et al., 2011), which is fully assembled into chromosomes, contains no gaps and includes complete telomere sequences except for one small chromosome. We used the short read aligner Bowtie 2 version 2.2.3 (Langmead et al., 2009) with the –very-sensitive-local option. PCR duplicates were identified using Picard tools version 1.118 (http://broadinstitute.github.io/ picard).
Single nucleotide polymorphism (SNP) calling was performed using the Genome Analysis Toolkit (GATK) version 3.3-0 (McKenna et al., 2010). HaplotypeCaller was run on each isolate individually with the following settings: -emitRefConfidence GVCF; –variant_index_type LINEAR; –variant_index_parameter 128000; –sample_ploidy 1. Joint variant calls were performed using GenotypeGVCFs on a merged gvcf variant file using the option -maxAltAlleles 2. For details on SNP call filtering, validation and genetic structure analyses see Supplementary Methods. The variant call format (VCF) file was deposited in the European Nucleotide Archive (ENA) under the accession numbers PRJEB15502/ERP017268 and the analysis number ERZ330467.
Model selection for genome-wide association study
Population structure and varying degrees of relatedness between individuals may be a source for p-value inflations and false positives due to non-random phenotype-genotype associations (Bergelson and Roux, 2010; Korte and Farlow, 2013). To account for this, we analyzed the population structure and genetic relatedness of all 106 isolates by performing a principal component analysis (PCA) and computing a kinship matrix, respectively (for full details see Supplementary Note S1). Then, we performed GWAS with three different statistical models commonly used for population structure and kinship corrections. We used a general linear model with the first three principal components (GLM Q), a mixed linear model with kinship effect only (MLM K), and a mixed linear model with kinship effect and principal components (MLM Q+K). Significance thresholds used for GWAS analyses were the following: Bonferroni threshold Po6.42e − 08 (α = 0.05), false discovery rate (FDR) thresholds of 5% and 10% calculated using the R package q-value (Storey and Tibshirani, 2003). For additional details on GWAS model selection see Supplementary Methods.
SNP variant calling and population structure analyses
The joint variant calls were subset to include only SNPs. SNPs were hard filtered using the GATK VariantFiltration and SelectVariants tools. We chose quality cutoffs following the GATK Best Practices recommendations (DePristo et al., 2011; Van der Auwera et al., 2013) and using empirical evaluations of filter value distributions in our own joint variant file. The final filtering was performed using the filtering cutoffs: QUAL < 250; QD < 20.0; MQ < 30.0; -2 > BaseQRankSum > 2; -2 > MQRankSum > 2; -2 > ReadPosRankSum > 2; FS > 0.1. After hard-filtering, we excluded SNPs with a genotyping rate <90% and a minor allele frequency (MAF) <5%. We also excluded SNPs on accessory chromosomes (i.e. chromosomes not found in all isolates of the species). These filtration steps were performed with vcftools version 0.1.12b (Danecek et al., 2011) and plink 1.9 version (https://www.coggenomics.org/plink2; (Chang et al., 2015). Finally, we converted tri-allelic SNPs to bi-allelic SNPs by recoding the least frequent allele as a missing genotype. Conversion of tri-allelic SNPs was performed to satisfy requirements of association mapping software. We used a distance matrix on the 779’178 retained SNPs to construct a tree using the neighborjoining method (Saitou and Nei, 1987) implemented in TASSEL version 5.2.14 (Bradbury et al., 2007). FigTree v1.4.2 (http:// tree.bio.ed.ac.uk/software/figtree) was used for visualization. Furthermore, we performed a principal component analysis (PCA) in TASSEL to quantify genetic differentiation among the included samples. We used the first three principal components (PCs) to correct for population structure in GWAS analyses. To estimate the level of differentiation between populations, we calculated the fixation index FST (Wright, 1951; Nei, 1987) and pairwise FST among all pairwise combinations of the five populations. We used the R package {hierfstats} (Goudet, 2005) that implemented Yang’s algorithm (Yang, 1998). For FST calculation, we selected 2’047 SNPs at intervals of 15 kb along the chromosomes. We estimated relatedness among genotypes by computing a kinship matrix using the scaled identity by state (IBS) algorithm (Endelman and Jannink, 2012) implemented in TASSEL. We included the kinship matrix as a random effect in the mixed linear models for GWAS. For illustration, a heatmap of the kinship matrix was drawn using the heatmap2 function of the {gplots} package of the open-source R software (R Core Team, 2014).
SNP validation of significant genome-wide associations
To confirm significantly associated SNPs, we used the SNP caller Freebayes v.0.9 (Garrison and Marth, 2012) as an alternative method to GATK. We set Freebayes to filter out poorly mapped reads and poor base quality prior to SNP calling using the following stringent parameters: --min-mapping-quality 20; --min-base-quality 20; -use-best-n-alleles 2; --minalternate-count 1; --ploidy 1; --use-mapping-quality. Variants were filtered for genotyping rate (>90%) and MAF (>5%) identical to the procedure for SNPs called by GATK. We found a high overlap between the two SNP calling approaches. For GWAS performed for virulence on the cultivar Toronit, 97% of significantly associated SNPs (identified by GATK) in the GWAS were confirmed by Freebayes. Similarly, for cultivar Greina 88% of significantly associated SNPs (identified by GATK) were confirmed by Freebayes. We excluded any association reported for a SNP that was not confirmed by both SNP callers.