Neural Networks (NNMM)
We proposed a new approach, NN-LMM, that extends linear mixed models (LMM) to nonlinear neural networks (NN). This model is able to incorporate intermediate omics data such as gene expression levels. More details can be found in our papers:
- Tianjing Zhao, Jian Zeng, and Hao Cheng. Extend Mixed Models to Multi-layer Neural Networks for Genomic Prediction Including Intermediate Omics Data, bioRxiv 2021.12.10.472186; https://doi.org/10.1101/2021.12.10.472186.
- Tianjing Zhao, Rohan Fernando, and Hao Cheng. Interpretable artificial neural networks incorporating Bayesian alphabet models for genome-wide prediction and association studies, G3 Genes|Genomes|Genetics, 2021; jkab228, https://doi.org/10.1093/g3journal/jkab228
The flexibility of NN-LMM is demonstrated below. NN-LMM can fit fully-connected neural networks ((a),(b)) or partial-connected neural networks ((c),(d)). Also, the relationship between hidden nodes and the observed trait can be based on activation functions ((a),(c)) or defined by a user-defined function ((b),(d)).
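As a rough illustration of the fully- vs partial-connected distinction (a hypothetical sketch in plain Julia, not JWAS internals), a partial-connected middle layer can be viewed as a fully-connected one with a block-structured connectivity mask:

```julia
# Toy illustration (hypothetical, not JWAS code): in a fully-connected
# network every SNP may affect every hidden node; in a partial-connected
# network each SNP group affects only its own hidden node.
n_snps, n_hidden = 6, 2
W_full = ones(n_snps, n_hidden)      # fully-connected: all weights free

# partial-connected: SNPs 1-3 -> hidden node 1, SNPs 4-6 -> hidden node 2
mask = zeros(n_snps, n_hidden)
mask[1:3, 1] .= 1.0
mask[4:6, 2] .= 1.0
W_partial = W_full .* mask           # zero out forbidden connections

x = collect(1.0:6.0)                 # genotype vector for one individual
hidden_full    = tanh.(W_full' * x)    # both nodes see all SNPs
hidden_partial = tanh.(W_partial' * x) # each node sees its own SNP group
```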
The code to perform different neural networks is summarized below:
NN-LMM can also incorporate intermediate omics features (e.g., gene expression levels). In the example below, for one individual, the gene expression levels of the first two genes are 0.9 and 0.1, respectively, and the gene expression level of the last gene is missing and will be sampled. The missing patterns of gene expression levels may differ across individuals.
- nonlinear function (to define the relationship between hidden nodes and the observed trait): tanh (other supported activation functions: "sigmoid", "relu", "leakyrelu", "linear")
- number of hidden nodes: 3
- Bayesian model: multiple independent single-trait BayesC (to sample marker effects). Note: to use multi-trait Bayesian Alphabet models to sample marker effects, please set `mega_trait=false` in the `runMCMC()` function.
- sampler: Hamiltonian Monte Carlo (to sample hidden nodes)
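For reference, the supported activation functions correspond to the following standard definitions (a plain-Julia sketch; JWAS applies the chosen function internally when you pass its name as a string, and the leaky-ReLU slope used here is an assumption, not necessarily the value JWAS uses):

```julia
# Standard definitions of the supported activation functions.
sigmoid(x) = 1 / (1 + exp(-x))
relu(x)    = max(0.0, x)
# slope `a` for x < 0 is an assumption for illustration
leakyrelu(x; a=0.01) = x > 0 ? x : a * x
linear(x)  = x
# tanh is a Julia built-in

sigmoid(0.0)     # 0.5
relu(-2.0)       # 0.0
leakyrelu(-2.0)  # -0.02
```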
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile  = dataset("phenotypes.csv")
genofile   = dataset("genotypes.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
genotypes  = get_genotypes(genofile,separator=',',method="BayesC");
# Step 3: Build Model Equations
model_equation ="y1 = intercept + genotypes"
model = build_model(model_equation,
                    num_hidden_nodes=3,
                    nonlinear_function="tanh");
# Step 4: Run Analysis
out=runMCMC(model,phenotypes,chain_length=5000);
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
In NNBayes, the i-th hidden node is named "trait name" + "i". In our example, the observed trait is named "y1" and there are 3 hidden nodes, so the hidden nodes are named "y11", "y12", and "y13", respectively.
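This naming rule can be reproduced directly in plain Julia:

```julia
# Hidden node names: the observed trait name with the node index appended.
trait = "y1"
num_hidden_nodes = 3
hidden_names = [trait * string(i) for i in 1:num_hidden_nodes]
# hidden_names == ["y11", "y12", "y13"]
```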
Below is a list of files containing estimates and standard deviations for variables of interest.

| file name | description |
|---|---|
| EBV_NonLinear.txt | estimated breeding values for the observed trait |
| EBV_y11.txt | estimated breeding values for hidden node 1 |
| EBV_y12.txt | estimated breeding values for hidden node 2 |
| EBV_y13.txt | estimated breeding values for hidden node 3 |
| genetic_variance.txt | estimated genetic variance-covariance of all hidden nodes |
| heritability.txt | estimated heritability of all hidden nodes |
| location_parameters.txt | estimated bias of all hidden nodes |
| neural_networks_bias_and_weights.txt | estimated bias of the observed trait and weights between hidden nodes and observed trait |
| pi_genotypes.txt | estimated pi of all hidden nodes |
| marker_effects_genotypes.txt | estimated marker effects of all hidden nodes |
| residual_variance.txt | estimated residual variance-covariance for all hidden nodes |
Below is a list of files containing MCMC samples for variables of interest.

| file name | description |
|---|---|
| MCMC_samples_EBV_NonLinear.txt | MCMC samples from the posterior distribution of breeding values for the observed trait |
| MCMC_samples_EBV_y11.txt | MCMC samples from the posterior distribution of breeding values for hidden node 1 |
| MCMC_samples_EBV_y12.txt | MCMC samples from the posterior distribution of breeding values for hidden node 2 |
| MCMC_samples_EBV_y13.txt | MCMC samples from the posterior distribution of breeding values for hidden node 3 |
| MCMC_samples_genetic_variance.txt | MCMC samples from the posterior distribution of genetic variance-covariance for all hidden nodes |
| MCMC_samples_heritability.txt | MCMC samples from the posterior distribution of heritability for all hidden nodes |
| MCMC_samples_marker_effects_genotypes_y11.txt | MCMC samples from the posterior distribution of marker effects for hidden node 1 |
| MCMC_samples_marker_effects_genotypes_y12.txt | MCMC samples from the posterior distribution of marker effects for hidden node 2 |
| MCMC_samples_marker_effects_genotypes_y13.txt | MCMC samples from the posterior distribution of marker effects for hidden node 3 |
| MCMC_samples_marker_effects_variances_genotypes.txt | MCMC samples from the posterior distribution of marker effect variance for all hidden nodes |
| MCMC_samples_neural_networks_bias_and_weights.txt | MCMC samples from the posterior distribution of bias of the observed trait and weights between hidden nodes and observed trait |
| MCMC_samples_pi_genotypes.txt | MCMC samples from the posterior distribution of pi for all hidden nodes |
| MCMC_samples_residual_variance.txt | MCMC samples from the posterior distribution of residual variance-covariance for all hidden nodes |
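The MCMC sample files are plain delimited text, so posterior summaries can be computed with the Julia standard library. A minimal sketch (the file name follows the table above; the comma delimiter and header row are assumptions about the output format, so here a small sample matrix is faked instead of reading a real file):

```julia
using DelimitedFiles, Statistics

# Hypothetical example: summarize MCMC samples of heritability for each
# hidden node. With a real run you would read the file, e.g.
#   samples = readdlm("MCMC_samples_heritability.txt", ',', skipstart=1)
samples = [0.30 0.40 0.50;
           0.32 0.38 0.52;
           0.28 0.42 0.48]   # rows = MCMC iterations, columns = hidden nodes

posterior_mean = vec(mean(samples, dims=1))
posterior_sd   = vec(std(samples, dims=1))
```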
Bayesian Neural Networks with Intermediate Omics Data: Genotype -> Omics -> Phenotype (complete/incomplete omics data)
- The intermediate omics features should be in the same file as the phenotypes. In this example, the "phenotypes.csv" file contains one column for the phenotype, named "y1", and two columns for the omics features, named "y2" and "y3".
- Simply list the names of the omics features in the `build_model()` function. In this example, we have `latent_traits=["y2","y3"]`.
- If there are many omics features (e.g., 1000), you can avoid printing the model information in the `runMCMC()` function by setting `printout_model_info=false`.
- Missing omics data are allowed. Just make sure the missing values are recognized in Julia. For example, if the missing values are "NA" in your raw data, set `missingstrings=["NA"]` in the `CSV.read()` function; those "NA" values will then be converted to `missing` elements in Julia.
- You may want to set missing values manually, for example, to set the phenotypes of individuals in the testing dataset as missing. In Julia, you should first change the type of that column to allow missing values, e.g., `phenotypes[!,:y1] = convert(Vector{Union{Missing,Float64}}, phenotypes[!,:y1])`. Then you can set missing values manually, e.g., `phenotypes[1:2,:y1] .= missing` sets the values in the first two rows of the column named "y1" to missing.
- To include a residual polygenic component (i.e., one not mediated by the other omics features), you can (1) add an additional hidden node in the middle layer (see example (o2)); or (2) use a more flexible partial-connected neural network (see example (o3)).
- For testing individuals (i.e., individuals without phenotypes): if a testing individual has omics data, including that individual in the analysis will help to estimate marker effects. But if testing individuals only have genotype data, they cannot be included in the analysis. Instead, their EBVs can be calculated once the marker effects and neural network weights have been estimated.
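The missing-value manipulations described above can be checked in isolation with a plain Julia vector (no DataFrame needed):

```julia
# Setting values to missing requires an element type that allows missing.
y1 = [1.2, 3.4, 5.6]            # Vector{Float64}: cannot hold missing
y1 = convert(Vector{Union{Missing,Float64}}, y1)
y1[1:2] .= missing              # mask the first two individuals

count(ismissing, y1)            # 2
ismissing(y1[3])                # false
```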
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile  = dataset("phenotypes.csv")
genofile   = dataset("genotypes.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"]) #should include omics data!
genotypes  = get_genotypes(genofile,separator=',',method="BayesC")
# Step 3: Build Model Equations
model_equation ="y1 = intercept + genotypes" #y1 is the observed phenotype
model = build_model(model_equation,
                    num_hidden_nodes=2,
                    latent_traits=["y2","y3"], #y2 and y3 are two omics features
                    nonlinear_function="tanh")
# Step 4: Run Analysis
out = runMCMC(model,phenotypes,chain_length=5000,printout_model_info=false)
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
This can be done by adding an extra hidden node. For all individuals, this extra hidden node is treated as unknown and sampled.
Examples for a fully-connected and a partial-connected neural network are shown below.
Example code for fully-connected neural network with residual:
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile  = dataset("phenotypes.csv")
genofile   = dataset("genotypes.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
insertcols!(phenotypes, 5, :residual => missing) #add a column named "residual" with missing values as the 5th column in phenotypes
phenotypes[!,:residual] = convert(Vector{Union{Missing,Float64}}, phenotypes[!,:residual]) #converting the datatype is required in Julia
genotypes = get_genotypes(genofile,separator=',',method="BayesC")
# Step 3: Build Model Equations
model_equation ="y1 = intercept + genotypes" #y1 is the observed phenotype
model = build_model(model_equation,
                    num_hidden_nodes=3,
                    latent_traits=["y2","y3","residual"], #y2 and y3 are two omics features
                    nonlinear_function="tanh")
# Step 4: Run Analysis
out = runMCMC(model,phenotypes,chain_length=5000,printout_model_info=false)
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
Example code for partial-connected neural network with residual:
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile = dataset("phenotypes.csv")
genofile1 = dataset("genotypes_group1.csv")
genofile2 = dataset("genotypes_group2.csv")
genofile3 = dataset("GRM.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
insertcols!(phenotypes, 5, :residual => missing) #add a column named "residual" with missing values as the 5th column in phenotypes
phenotypes[!,:residual] = convert(Vector{Union{Missing,Float64}}, phenotypes[!,:residual]) #converting the datatype is required in Julia
geno1 = get_genotypes(genofile1,separator=',',method="BayesA");
geno2 = get_genotypes(genofile2,separator=',',method="BayesC");
geno3 = get_genotypes(genofile3,separator=',',header=false,method="GBLUP");
# Step 3: Build Model Equations
model_equation = "y1 = intercept + geno1 + geno2 + geno3";
model = build_model(model_equation,
                    num_hidden_nodes=3,
                    nonlinear_function="tanh",
                    latent_traits=["y3","y2","residual"])
# Step 4: Run Analysis
out = runMCMC(model, phenotypes, chain_length=5000,printout_model_info=false);
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
example(b): fully-connected neural networks with user-defined relationship between hidden nodes and observed trait
- number of hidden nodes: 2
- nonlinear function (to define relationship between hidden nodes and observed trait): y = sqrt(x1^2 / (x1^2 + x2^2))
- sampler: Metropolis-Hastings (to sample hidden nodes)

The code is the same as example (a), except:
```julia
# Step 3: Build Model Equations
pig_growth(x1,x2) = sqrt(x1^2 / (x1^2 + x2^2))
model_equation ="y1 = intercept + genotypes"
model = build_model(model_equation,
                    nonlinear_function=pig_growth);
```
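The user-defined function is an ordinary Julia function, so it can be sanity-checked on its own before being passed to `build_model`:

```julia
# User-defined relationship between the two hidden nodes and the trait.
pig_growth(x1, x2) = sqrt(x1^2 / (x1^2 + x2^2))

pig_growth(3.0, 4.0)   # 0.6  (sqrt(9 / 25))
pig_growth(1.0, 0.0)   # 1.0
```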
example(c): partial-connected neural networks with activation function between hidden nodes and observed trait
- number of hidden nodes: 3
- nonlinear function (to define relationship between hidden nodes and observed trait): tanh
- Bayesian model (to sample marker effects):
- genotype group 1: single-trait BayesA
- genotype group 2: single-trait BayesB
- genotype group 3: single-trait BayesC
- sampler: Hamiltonian Monte Carlo (to sample hidden nodes)
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile = dataset("phenotypes.csv")
genofile1 = dataset("genotypes_group1.csv")
genofile2 = dataset("genotypes_group2.csv")
genofile3 = dataset("genotypes_group3.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
geno1 = get_genotypes(genofile1,separator=',',method="BayesA");
geno2 = get_genotypes(genofile2,separator=',',method="BayesB");
geno3 = get_genotypes(genofile3,separator=',',method="BayesC");
# Step 3: Build Model Equations
model_equation = "y1 = intercept + geno1 + geno2 + geno3";
model = build_model(model_equation,
                    nonlinear_function="tanh")
# Step 4: Run Analysis
out = runMCMC(model, phenotypes);
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
For example, SNP group 1 only affects omics feature 1, and SNP group 2 only affects omics feature 2.
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile = dataset("phenotypes.csv")
genofile1 = dataset("genotypes_group1.csv")
genofile2 = dataset("genotypes_group2.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
geno1 = get_genotypes(genofile1,separator=',',method="BayesA");
geno2 = get_genotypes(genofile2,separator=',',method="BayesC");
# Step 3: Build Model Equations
model_equation = "y1 = intercept + geno1 + geno2"; #y1 is the phenotype
model = build_model(model_equation,
                    num_hidden_nodes=2,
                    nonlinear_function="tanh",
                    latent_traits=["y2","y3"]) #y2 and y3 are two omics features
# Step 4: Run Analysis
out = runMCMC(model, phenotypes, chain_length=100, printout_model_info=false);
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
example(d): partial-connected neural networks with user-defined relationship between hidden nodes and observed trait
- number of hidden nodes: 3
- nonlinear function (to define relationship between hidden nodes and observed trait): y = sqrt(x1^2 / (x1^2 + x2^2 + x3^2))
- sampler: Metropolis-Hastings (to sample hidden nodes)

The code is the same as example (c), except:
```julia
# Step 3: Build Model Equations
pig_growth(x1,x2,x3) = sqrt(x1^2 / (x1^2 + x2^2 + x3^2))
model_equation = "y1 = intercept + geno1 + geno2 + geno3";
model = build_model(model_equation,
                    nonlinear_function=pig_growth)
```
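As in example (b), the three-argument user-defined function can be evaluated directly as a sanity check:

```julia
# User-defined relationship between the three hidden nodes and the trait.
pig_growth(x1, x2, x3) = sqrt(x1^2 / (x1^2 + x2^2 + x3^2))

pig_growth(2.0, 1.0, 2.0)   # sqrt(4/9) = 2/3
```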
In NN-Bayes, by default, multiple single-trait models are used with multi-threaded parallelism. Usually, this is about 3 times faster than running on a single thread.
The number of execution threads is controlled by the -t/--threads command-line argument (requires at least Julia 1.5).
For example, to start Julia with 4 threads:
```
julia --threads 4
```
If you're using the Juno IDE (e.g., in Atom), all threads will be loaded automatically. You can check the number of threads with `Threads.nthreads()`.
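You can verify from within Julia that threading is active (`Threads.nthreads()` and `Threads.@threads` are Base Julia, independent of JWAS):

```julia
# Number of threads this Julia session was started with;
# with `julia --threads 4` this returns 4.
nthreads = Threads.nthreads()

# A simple multi-threaded loop: iterations are split across threads.
results = zeros(Int, 8)
Threads.@threads for i in 1:8
    results[i] = i^2
end
```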