Neural Networks (NNMM)
We proposed a new approach, NN-LMM, that extends linear mixed models (LMM) to nonlinear neural networks (NN). This model is able to incorporate intermediate omics data such as gene expression levels. More details can be found in our papers:
- Tianjing Zhao, Jian Zeng, and Hao Cheng. Extend Mixed Models to Multi-layer Neural Networks for Genomic Prediction Including Intermediate Omics Data, bioRxiv 2021.12.10.472186; https://doi.org/10.1101/2021.12.10.472186.
- Tianjing Zhao, Rohan Fernando, and Hao Cheng. Interpretable artificial neural networks incorporating Bayesian alphabet models for genome-wide prediction and association studies, G3 Genes|Genomes|Genetics, 2021; jkab228, https://doi.org/10.1093/g3journal/jkab228
The flexibility of NN-LMM is demonstrated below. NN-LMM can fit fully-connected neural networks ((a),(b)) or partial-connected neural networks ((c),(d)). Also, the relationship between hidden nodes and the observed trait can be based on activation functions ((a),(c)) or defined by a user-defined function ((b),(d)).
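As a rough illustration of the fully- vs partial-connected distinction (a hypothetical sketch in plain Julia, not JWAS internals), a partial-connected middle layer can be viewed as a fully-connected one with a block-structured connectivity mask:

```julia
# Toy illustration (hypothetical, not JWAS code): in a fully-connected
# network every SNP may affect every hidden node; in a partial-connected
# network each SNP group affects only its own hidden node.
n_snps, n_hidden = 6, 2
W_full = ones(n_snps, n_hidden)      # fully-connected: all weights free

# partial-connected: SNPs 1-3 -> hidden node 1, SNPs 4-6 -> hidden node 2
mask = zeros(n_snps, n_hidden)
mask[1:3, 1] .= 1.0
mask[4:6, 2] .= 1.0
W_partial = W_full .* mask           # zero out forbidden connections

x = collect(1.0:6.0)                 # genotype vector for one individual
hidden_full    = tanh.(W_full' * x)    # both nodes see all SNPs
hidden_partial = tanh.(W_partial' * x) # each node sees its own SNP group
```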
The code to perform different neural networks is summarized below:
NN-LMM can also incorporate intermediate omics features (e.g., gene expression levels). In the example below, for one individual, the gene expression levels of the first two genes are 0.9 and 0.1, respectively, and the gene expression level of the last gene is missing and will be sampled. The missing patterns of gene expression levels may differ across individuals.
- nonlinear function (to define the relationship between hidden nodes and the observed trait): tanh (other supported activation functions: "sigmoid", "relu", "leakyrelu", "linear")
- number of hidden nodes: 3
- Bayesian model: multiple independent single-trait BayesC (to sample marker effects). Note: to use multi-trait Bayesian Alphabet models to sample marker effects, please set `mega_trait=false` in the `runMCMC()` function.
- sampler: Hamiltonian Monte Carlo (to sample hidden nodes)
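For reference, the supported activation functions correspond to the following standard definitions (a plain-Julia sketch; JWAS applies the chosen function internally when you pass its name as a string, and the leaky-ReLU slope used here is an assumption, not necessarily the value JWAS uses):

```julia
# Standard definitions of the supported activation functions.
sigmoid(x) = 1 / (1 + exp(-x))
relu(x)    = max(0.0, x)
# slope `a` for x < 0 is an assumption for illustration
leakyrelu(x; a=0.01) = x > 0 ? x : a * x
linear(x)  = x
# tanh is a Julia built-in

sigmoid(0.0)     # 0.5
relu(-2.0)       # 0.0
leakyrelu(-2.0)  # -0.02
```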
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile  = dataset("phenotypes.csv")
genofile   = dataset("genotypes.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
genotypes  = get_genotypes(genofile,separator=',',method="BayesC");
# Step 3: Build Model Equations
model_equation ="y1 = intercept + genotypes"
model = build_model(model_equation,
                    num_hidden_nodes=3,
                    nonlinear_function="tanh");
# Step 4: Run Analysis
out=runMCMC(model,phenotypes,chain_length=5000);
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
In NNBayes, the i-th hidden node is named "trait name" + "i". In our example, the observed trait is named "y1" and there are 3 hidden nodes, so the hidden nodes are named "y11", "y12", and "y13", respectively.
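This naming rule can be reproduced directly in plain Julia:

```julia
# Hidden node names: the observed trait name with the node index appended.
trait = "y1"
num_hidden_nodes = 3
hidden_names = [trait * string(i) for i in 1:num_hidden_nodes]
# hidden_names == ["y11", "y12", "y13"]
```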
Below is a list of files containing estimates and standard deviations for variables of interest.

| file name | description |
|---|---|
| EBV_NonLinear.txt | estimated breeding values for the observed trait |
| EBV_y11.txt | estimated breeding values for hidden node 1 |
| EBV_y12.txt | estimated breeding values for hidden node 2 |
| EBV_y13.txt | estimated breeding values for hidden node 3 |
| genetic_variance.txt | estimated genetic variance-covariance of all hidden nodes |
| heritability.txt | estimated heritability of all hidden nodes |
| location_parameters.txt | estimated bias of all hidden nodes |
| neural_networks_bias_and_weights.txt | estimated bias of the observed trait and weights between hidden nodes and observed trait |
| pi_genotypes.txt | estimated pi of all hidden nodes |
| marker_effects_genotypes.txt | estimated marker effects of all hidden nodes |
| residual_variance.txt | estimated residual variance-covariance for all hidden nodes |
Below is a list of files containing MCMC samples for variables of interest.

| file name | description |
|---|---|
| MCMC_samples_EBV_NonLinear.txt | MCMC samples from the posterior distribution of breeding values for the observed trait |
| MCMC_samples_EBV_y11.txt | MCMC samples from the posterior distribution of breeding values for hidden node 1 |
| MCMC_samples_EBV_y12.txt | MCMC samples from the posterior distribution of breeding values for hidden node 2 |
| MCMC_samples_EBV_y13.txt | MCMC samples from the posterior distribution of breeding values for hidden node 3 |
| MCMC_samples_genetic_variance.txt | MCMC samples from the posterior distribution of genetic variance-covariance for all hidden nodes |
| MCMC_samples_heritability.txt | MCMC samples from the posterior distribution of heritability for all hidden nodes |
| MCMC_samples_marker_effects_genotypes_y11.txt | MCMC samples from the posterior distribution of marker effects for hidden node 1 |
| MCMC_samples_marker_effects_genotypes_y12.txt | MCMC samples from the posterior distribution of marker effects for hidden node 2 |
| MCMC_samples_marker_effects_genotypes_y13.txt | MCMC samples from the posterior distribution of marker effects for hidden node 3 |
| MCMC_samples_marker_effects_variances_genotypes.txt | MCMC samples from the posterior distribution of marker effect variance for all hidden nodes |
| MCMC_samples_neural_networks_bias_and_weights.txt | MCMC samples from the posterior distribution of bias of the observed trait and weights between hidden nodes and observed trait |
| MCMC_samples_pi_genotypes.txt | MCMC samples from the posterior distribution of pi for all hidden nodes |
| MCMC_samples_residual_variance.txt | MCMC samples from the posterior distribution of residual variance-covariance for all hidden nodes |
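The MCMC sample files are plain delimited text, so posterior summaries can be computed with the Julia standard library. A minimal sketch (the file name follows the table above; the comma delimiter and header row are assumptions about the output format, so here a small sample matrix is faked instead of reading a real file):

```julia
using DelimitedFiles, Statistics

# Hypothetical example: summarize MCMC samples of heritability for each
# hidden node. With a real run you would read the file, e.g.
#   samples = readdlm("MCMC_samples_heritability.txt", ',', skipstart=1)
samples = [0.30 0.40 0.50;
           0.32 0.38 0.52;
           0.28 0.42 0.48]   # rows = MCMC iterations, columns = hidden nodes

posterior_mean = vec(mean(samples, dims=1))
posterior_sd   = vec(std(samples, dims=1))
```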
Bayesian Neural Networks with Intermediate Omics Data: Genotype -> Omics -> Phenotype (complete/incomplete omics data)
- The intermediate omics features should be in the same file as the phenotypes. In this example, the "phenotypes.csv" file contains one column for the phenotype, named "y1", and two columns for the omics features, named "y2" and "y3".
- Simply list the names of the omics features in the `build_model()` function. In this example, we have `latent_traits=["y2","y3"]`.
- If there are many omics features (e.g., 1000), you can avoid printing the model information in the `runMCMC()` function by setting `printout_model_info=false`.
- Missing omics data are allowed. Just make sure the missing values are recognized in Julia. For example, if the missing values are "NA" in your raw data, set `missingstrings=["NA"]` in the `CSV.read()` function; those "NA" values will then be converted to `missing` elements in Julia.
- You may want to set missing values manually, for example, to set the phenotypes of individuals in the testing dataset as missing. In Julia, you should first change the type of that column to allow missing values, e.g., `phenotypes[!,:y1] = convert(Vector{Union{Missing,Float64}}, phenotypes[!,:y1])`. Then you can set missing values manually, e.g., `phenotypes[1:2,:y1] .= missing` sets the values in the first two rows of the column named "y1" to missing.
- To include a residual polygenic component (i.e., one not mediated by the other omics features), you can (1) add an additional hidden node in the middle layer (see example (o2)); or (2) use a more flexible partial-connected neural network (see example (o3)).
- For testing individuals (i.e., individuals without phenotypes): if a testing individual has omics data, including that individual in the analysis will help to estimate marker effects. But if testing individuals only have genotype data, they cannot be included in the analysis. Instead, their EBVs can be calculated once the marker effects and neural network weights have been estimated.
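The missing-value manipulations described above can be checked in isolation with a plain Julia vector (no DataFrame needed):

```julia
# Setting values to missing requires an element type that allows missing.
y1 = [1.2, 3.4, 5.6]            # Vector{Float64}: cannot hold missing
y1 = convert(Vector{Union{Missing,Float64}}, y1)
y1[1:2] .= missing              # mask the first two individuals

count(ismissing, y1)            # 2
ismissing(y1[3])                # false
```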
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile  = dataset("phenotypes.csv")
genofile   = dataset("genotypes.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"]) #should include omics data!
genotypes  = get_genotypes(genofile,separator=',',method="BayesC")
# Step 3: Build Model Equations
model_equation ="y1 = intercept + genotypes" #y1 is the observed phenotype
model = build_model(model_equation,
                    num_hidden_nodes=2,
                    latent_traits=["y2","y3"], #y2 and y3 are two omics features
                    nonlinear_function="tanh")
# Step 4: Run Analysis
out = runMCMC(model,phenotypes,chain_length=5000,printout_model_info=false)
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
This can be done by adding an extra hidden node. For all individuals, this extra hidden node is treated as unknown and sampled.
Examples for a fully-connected and a partial-connected neural network are shown below.
Example code for fully-connected neural network with residual:
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile  = dataset("phenotypes.csv")
genofile   = dataset("genotypes.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
insertcols!(phenotypes, 5, :residual => missing) #add a column named "residual" with missing values as the 5th column in phenotypes
phenotypes[!,:residual] = convert(Vector{Union{Missing,Float64}}, phenotypes[!,:residual]) #converting the datatype is required in Julia
genotypes = get_genotypes(genofile,separator=',',method="BayesC")
# Step 3: Build Model Equations
model_equation ="y1 = intercept + genotypes" #y1 is the observed phenotype
model = build_model(model_equation,
                    num_hidden_nodes=3,
                    latent_traits=["y2","y3","residual"], #y2 and y3 are two omics features
                    nonlinear_function="tanh")
# Step 4: Run Analysis
out = runMCMC(model,phenotypes,chain_length=5000,printout_model_info=false)
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
Example code for partial-connected neural network with residual:
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile = dataset("phenotypes.csv")
genofile1 = dataset("genotypes_group1.csv")
genofile2 = dataset("genotypes_group2.csv")
genofile3 = dataset("GRM.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
insertcols!(phenotypes, 5, :residual => missing) #add a column named "residual" with missing values as the 5th column in phenotypes
phenotypes[!,:residual] = convert(Vector{Union{Missing,Float64}}, phenotypes[!,:residual]) #converting the datatype is required in Julia
geno1 = get_genotypes(genofile1,separator=',',method="BayesA");
geno2 = get_genotypes(genofile2,separator=',',method="BayesC");
geno3 = get_genotypes(genofile3,separator=',',header=false,method="GBLUP");
# Step 3: Build Model Equations
model_equation = "y1 = intercept + geno1 + geno2 + geno3";
model = build_model(model_equation,
                    num_hidden_nodes=3,
                    nonlinear_function="tanh",
                    latent_traits=["y3","y2","residual"])
# Step 4: Run Analysis
out = runMCMC(model, phenotypes, chain_length=5000,printout_model_info=false);
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
example(b): fully-connected neural networks with user-defined relationship between hidden nodes and observed trait
- number of hidden nodes: 2
- nonlinear function (to define relationship between hidden nodes and observed trait): y = sqrt(x1^2 / (x1^2 + x2^2))
- sampler: Metropolis-Hastings (to sample hidden nodes)

The code is the same as example (a), except:
```julia
# Step 3: Build Model Equations
pig_growth(x1,x2) = sqrt(x1^2 / (x1^2 + x2^2))
model_equation ="y1 = intercept + genotypes"
model = build_model(model_equation,
                    nonlinear_function=pig_growth);
```
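The user-defined function is an ordinary Julia function, so it can be sanity-checked on its own before being passed to `build_model`:

```julia
# User-defined relationship between the two hidden nodes and the trait.
pig_growth(x1, x2) = sqrt(x1^2 / (x1^2 + x2^2))

pig_growth(3.0, 4.0)   # 0.6  (sqrt(9 / 25))
pig_growth(1.0, 0.0)   # 1.0
```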
example(c): partial-connected neural networks with activation function between hidden nodes and observed trait
- number of hidden nodes: 3
- nonlinear function (to define relationship between hidden nodes and observed trait): tanh
- Bayesian model (to sample marker effects):
- genotype group 1: single-trait BayesA
- genotype group 2: single-trait BayesB
- genotype group 3: single-trait BayesC
- sampler: Hamiltonian Monte Carlo (to sample hidden nodes)
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile = dataset("phenotypes.csv")
genofile1 = dataset("genotypes_group1.csv")
genofile2 = dataset("genotypes_group2.csv")
genofile3 = dataset("genotypes_group3.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
geno1 = get_genotypes(genofile1,separator=',',method="BayesA");
geno2 = get_genotypes(genofile2,separator=',',method="BayesB");
geno3 = get_genotypes(genofile3,separator=',',method="BayesC");
# Step 3: Build Model Equations
model_equation = "y1 = intercept + geno1 + geno2 + geno3";
model = build_model(model_equation,
                    nonlinear_function="tanh")
# Step 4: Run Analysis
out = runMCMC(model, phenotypes);
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
For example, SNP group 1 only affects omics feature 1, and SNP group 2 only affects omics feature 2.
```julia
# Step 1: Load packages
using JWAS,DataFrames,CSV,Statistics,JWAS.Datasets
# Step 2: Read data
phenofile = dataset("phenotypes.csv")
genofile1 = dataset("genotypes_group1.csv")
genofile2 = dataset("genotypes_group2.csv")
phenotypes = CSV.read(phenofile,DataFrame,delim = ',',header=true,missingstrings=["NA"])
geno1 = get_genotypes(genofile1,separator=',',method="BayesA");
geno2 = get_genotypes(genofile2,separator=',',method="BayesC");
# Step 3: Build Model Equations
model_equation = "y1 = intercept + geno1 + geno2"; #y1 is the phenotype
model = build_model(model_equation,
                    num_hidden_nodes=2,
                    nonlinear_function="tanh",
                    latent_traits=["y2","y3"]) #y2 and y3 are two omics features
# Step 4: Run Analysis
out = runMCMC(model, phenotypes, chain_length=100, printout_model_info=false);
# Step 5: Check Accuracy
results  = innerjoin(out["EBV_NonLinear"], phenotypes, on = :ID)
accuracy = cor(results[!,:EBV],results[!,:bv1])
```
example(d): partial-connected neural networks with user-defined relationship between hidden nodes and observed trait
- number of hidden nodes: 3
- nonlinear function (to define relationship between hidden nodes and observed trait): y = sqrt(x1^2 / (x1^2 + x2^2 + x3^2))
- sampler: Metropolis-Hastings (to sample hidden nodes)

The code is the same as example (c), except:
```julia
# Step 3: Build Model Equations
pig_growth(x1,x2,x3) = sqrt(x1^2 / (x1^2 + x2^2 + x3^2))
model_equation = "y1 = intercept + geno1 + geno2 + geno3";
model = build_model(model_equation,
                    nonlinear_function=pig_growth)
```
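As in example (b), the three-argument user-defined function can be evaluated directly as a sanity check:

```julia
# User-defined relationship between the three hidden nodes and the trait.
pig_growth(x1, x2, x3) = sqrt(x1^2 / (x1^2 + x2^2 + x3^2))

pig_growth(2.0, 1.0, 2.0)   # sqrt(4/9) = 2/3
```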
In NN-Bayes, by default, multiple single-trait models are used with multi-threaded parallelism. Usually, this is about 3 times faster than running on a single thread.
The number of execution threads is controlled by the -t/--threads command-line argument (requires at least Julia 1.5).
For example, to start Julia with 4 threads:
```
julia --threads 4
```
If you're using the Juno IDE (e.g., in Atom), all threads will be loaded automatically. You can check the number of threads with `Threads.nthreads()`.
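You can verify from within Julia that threading is active (`Threads.nthreads()` and `Threads.@threads` are Base Julia, independent of JWAS):

```julia
# Number of threads this Julia session was started with;
# with `julia --threads 4` this returns 4.
nthreads = Threads.nthreads()

# A simple multi-threaded loop: iterations are split across threads.
results = zeros(Int, 8)
Threads.@threads for i in 1:8
    results[i] = i^2
end
```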