Hao Cheng edited this page Feb 20, 2020 · 2 revisions

write to a file

  • Writing a Float64 row vector of length 50_000 to a text file 50_000 times takes 10 to 20 minutes (iMac). In JWAS, this is sped up by saving only 1000 MCMC samples.
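
The idea of saving only a subset of samples can be sketched as follows. This is a minimal illustration, not the JWAS implementation: the sizes, the file name, and the variable names (`nsamples`, `nsaved`, `save_every`) are hypothetical, and the vector is kept short so the sketch runs quickly.

```julia
using DelimitedFiles  # standard library; provides writedlm

# Hypothetical sizes (assumptions for illustration only).
nsamples   = 10_000              # total number of MCMC iterations
nsaved     = 1_000               # number of samples actually written to disk
veclen     = 50                  # length of each row vector (50_000 in the note above)
save_every = nsamples ÷ nsaved   # write one sample out of every save_every

outfile = joinpath(mktempdir(), "samples.txt")
open(outfile, "w") do io
    for iter in 1:nsamples
        sample = randn(1, veclen)      # stand-in for one MCMC sample (1 x veclen row)
        if iter % save_every == 0      # skip the write for all other iterations
            writedlm(io, sample, ',')  # appends one comma-separated line
        end
    end
end

println(countlines(outfile), " rows written")
```

Keeping the file open for the whole loop and writing one row out of every `save_every` reduces the number of write calls from `nsamples` to `nsaved`.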

read large genotype files

Julia v1.2.0
DataFrames v0.19.2
CSV v0.5.11
Data: no header; the 1st column is 0,1,2 or "0","1","2"; all other columns are 0,1,2

| Method | n=21316 (#row), p=45613 (#column) | 2n, p | 3n, p |
|---|---|---|---|
| readtable(header=false) | 582s (3.83 G allocations: 107.298 GiB, 5.32% gc time) | 1221s (7.72 G allocations: 215.153 GiB, 3.86% gc time) | 1879s (11.61 G allocations: 331.119 GiB, 3.80% gc time) |
| CSV.read(header=false) | 69s (18.39 M allocations: 903.549 MiB, 0.72% gc time) | 128s (18.40 M allocations: 905.178 MiB, 0.39% gc time) | 206s (18.40 M allocations: 905.178 MiB, 0.24% gc time) |
| CSV.read(header=false, skipto=2) | 75s (18.49 M allocations: 909.132 MiB, 0.67% gc time) | 131s (18.49 M allocations: 909.132 MiB, 0.40% gc time) | 197s (18.49 M allocations: 909.132 MiB, 0.25% gc time) |
| CSV.read(header=true) | 3763s (2.09 G allocations: 94.273 GiB, 0.24% gc time) | 3829s (2.09 G allocations: 94.273 GiB, 0.24% gc time) | 3892s (2.09 G allocations: 94.273 GiB, 0.23% gc time) |
| CSV.read(types=etv, header=false) (1st col::String, other col::Float64) | 171s (18.84 M allocations: 924.486 MiB, 0.29% gc time) | 357s (18.84 M allocations: 924.486 MiB, 0.14% gc time) | 569s (18.84 M allocations: 924.486 MiB, 0.09% gc time) |
| CSV.read(types=etv, header=false) (all col::Float64) | 191s (18.83 M allocations: 923.961 MiB, 0.26% gc time) | 359s (18.83 M allocations: 923.961 MiB, 0.14% gc time) | 565s (18.83 M allocations: 923.961 MiB, 0.09% gc time) |
| CSV.read(header=false), 1st col in raw data is String | 66s (18.50 M allocations: 909.606 MiB, 0.73% gc time) | 135s (18.50 M allocations: 909.606 MiB, 0.40% gc time) | 192s (18.50 M allocations: 909.606 MiB, 0.27% gc time) |
| CSV.read(types=etv, header=false) (all col::Int64) | 86.826469 seconds (18.83 M allocations: 923.959 MiB, 0.65% gc time) | 121.635426 seconds (18.83 M allocations: 923.959 MiB, 0.42% gc time) | 195.203569 seconds (18.83 M allocations: 923.959 MiB, 0.28% gc time) |
| CSV.read(types=etv, header=false) (1st col::String, other col::Int64) | 86.903623 seconds (18.84 M allocations: 924.484 MiB, 0.66% gc time) | 122.701367 seconds (18.84 M allocations: 924.484 MiB, 0.41% gc time) | 203.113346 seconds (18.84 M allocations: 924.484 MiB, 0.25% gc time) |

  • When header=true, CSV.read is very slow.
  • Code to read a large genotype file with a header (iMac, n=21316, p=45613):
df        = CSV.read("data.txt",header=false,skipto=2)                #35s
obsID     = map(string,df[!,1])                                       #-
genotypes = map(Float64,convert(Matrix,df[!,2:end]))                  #8s
myfile    = open("data.txt")                                          #-
markerID  = split(readline(myfile),[',','\n'],keepempty=false)[2:end] #0.015s
close(myfile)                                                         #-
  • Columns predefined as Int32/Float32 (via CSV.read(types=etv)) will still have type Int64/Float64 after reading.
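
If a narrower element type is wanted despite the behavior noted above, one option is an explicit conversion after reading. The following is a minimal sketch using only Base Julia; `genotypes64` is a hypothetical stand-in for the matrix obtained from the data frame, and this workaround is an assumption, not something the benchmarks above measured.

```julia
# Stand-in for the genotype matrix obtained after reading (Float64 by default).
genotypes64 = Float64[0 1 2; 2 1 0; 1 1 2]

# Explicit conversion to narrower element types.
# The values 0, 1, 2 are exactly representable, so nothing is lost.
genotypes32 = convert(Matrix{Float32}, genotypes64)
genotypes8  = convert(Matrix{Int8}, genotypes64)

# sizeof on an array reports the bytes used by its elements.
println(eltype(genotypes64), ": ", sizeof(genotypes64), " bytes")
println(eltype(genotypes32), ": ", sizeof(genotypes32), " bytes")
println(eltype(genotypes8), ": ", sizeof(genotypes8), " bytes")
```

The conversion creates a new matrix, so the Float64 copy still occupies memory until it is garbage collected.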

run JWAS

(iMac)

using DataFrames
n=50_000
p=50_000
phenotypes = DataFrame(ID=1:n,y=randn(n));
genotypes = rand([0.0,1.0,2.0],n,p);
using JWAS
model_equation1  ="y = intercept";
R      = 1.0
model1 = build_model(model_equation1,R);

G3 =1.0
add_genotypes(model1,genotypes,G3);

@time out1=runMCMC(model1,phenotypes,methods="BayesC",estimatePi=true,chain_length=10000);

10110.098186 seconds (283.69 M allocations: 85.364 GiB, 0.09% gc time)

10110*5/60/60

14 #hours

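using LinearAlgebra
# assumed: phenotypes are regenerated with three traits (this step is not shown in the original notes)
phenotypes = DataFrame(ID=1:n,y1=randn(n),y2=randn(n),y3=randn(n));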
model_equation1  ="y1 = intercept
                   y2 = intercept
                   y3 = intercept";
R      = Matrix{Float64}(I, 3, 3)  
model1 = build_model(model_equation1,R);

G3 =  Matrix{Float64}(I, 3, 3)  
add_genotypes(model1,genotypes,G3);
@time out1=runMCMC(model1,phenotypes,methods="BayesC",estimatePi=true,chain_length=10000);

25378.676536 seconds (33.85 G allocations: 3.205 TiB, 2.18% gc time)

25378*5/3600

35 #hours

on iMac:
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Memory efficiency

  • number of markers = 100, number of individuals = 204000 (number of genotyped individuals = 1000)
@time Mn = -Aing*Mg         #154Mb  #size of Mn is about 161Mb
@time Mn = Ainn\(-Aing*Mg)  #870Mb  #size of Mn is about 161Mb #8X more allocation
                                    #(Here the allocation seems to be a one-time memory use, which may cause
                                    #a memory problem (crash). This is different from for loops. In for loops,
                                    #you may see total allocation larger than the maximum computer memory,
                                    #but it is ok because of in-time garbage collection.)
@time M2 = [Mn;Mg];         #~155Mb
#@time output for the three statements above, in order:
#0.193736 seconds (313.37 k allocations: 170.114 MiB, 14.16% gc time)
#2.456383 seconds (4.86 M allocations: 1.096 GiB, 20.73% gc time)
#0.191316 seconds (153.87 k allocations: 163.735 MiB, 3.10% gc time)

This memory problem is fixed by solving in blocks of columns, i.e., Ainn\(-Aing*Mg[:,blocki]) for each block blocki.
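
The blockwise solve can be sketched as below. This is a small, hedged illustration rather than the JWAS code: the matrices are random stand-ins with hypothetical sizes, `blocksize` is an assumed tuning parameter, and only the names `Ainn`, `Aing`, `Mg`, `Mn`, and `blocki` come from the notes above.

```julia
using LinearAlgebra, SparseArrays

# Small hypothetical stand-ins (the notes use 204000 individuals and 100 markers).
nn, ng, nmarkers = 200, 20, 10   # non-genotyped, genotyped, markers

# Sparse symmetric positive definite stand-in for Ainn.
Ainn = spdiagm(-1 => fill(-1.0, nn - 1), 0 => fill(4.0, nn), 1 => fill(-1.0, nn - 1))
Aing = sprandn(nn, ng, 0.1)                     # sparse stand-in for Aing
Mg   = rand([0.0, 1.0, 2.0], ng, nmarkers)      # genotypes of genotyped individuals

# Solve for a few marker columns at a time instead of all columns at once,
# so each solve only allocates temporaries for one block.
Mn        = zeros(nn, nmarkers)
blocksize = 3                                   # assumed block width
for first_col in 1:blocksize:nmarkers
    blocki = first_col:min(first_col + blocksize - 1, nmarkers)
    Mn[:, blocki] = Ainn \ (-Aing * Mg[:, blocki])
end

# The blockwise result agrees with the one-shot solve.
println(Mn ≈ Ainn \ (-Aing * Mg))
```

A smaller `blocksize` lowers the peak memory of each solve at the cost of more solves; the best value depends on the machine and was not reported in the notes.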
