Example in documentation doesn't seem to work as advertised #102
Hi,

julia> using SimpleChains
julia> mlpd = SimpleChain(
static(4),
TurboDense(tanh, 32),
TurboDense(tanh, 16),
TurboDense(identity, 4)
)
SimpleChain with the following layers:
TurboDense static(32) with bias.
Activation layer applying: tanh
TurboDense static(16) with bias.
Activation layer applying: tanh
TurboDense static(4) with bias.
julia> function f(x)
N = Base.isqrt(length(x))
A = reshape(view(x, 1:N*N), (N,N))
expA = exp(A)
vec(expA)
end
f (generic function with 1 method)
julia> T = Float32;
julia> X = randn(T, 2*2, 10_000);
julia> Y = reduce(hcat, map(f, eachcol(X)));
julia> Xtest = randn(T, 2*2, 10_000);
julia> Ytest = reduce(hcat, map(f, eachcol(Xtest)));
julia> @time p = SimpleChains.init_params(mlpd);
8.216584 seconds (3.67 M allocations: 317.728 MiB, 1.15% gc time, 100.00% compilation time)
julia> G = SimpleChains.alloc_threaded_grad(mlpd);
julia> mlpdloss = SimpleChains.add_loss(mlpd, SquaredLoss(Y));
julia> mlpdtest = SimpleChains.add_loss(mlpd, SquaredLoss(Ytest));
julia> report = let mtrain = mlpdloss, X=X, Xtest=Xtest, mtest = mlpdtest
p -> begin
let train = mlpdloss(X, p), test = mlpdtest(Xtest, p)
@info "Loss:" train test
end
end
end
#3 (generic function with 1 method)
julia> report(p)
┌ Info: Loss:
│ train = 133158.62f0
└ test = 130800.52f0
julia> for _ in 1:3
@time SimpleChains.train_unbatched!(
G, p, mlpdloss, X, SimpleChains.ADAM(), 10_000
);
report(p)
end
4.784989 seconds (8.28 M allocations: 565.218 MiB, 6.79% gc time, 89.51% compilation time)
┌ Info: Loss:
│ train = 192.6596f0
└ test = 523.15155f0
0.409552 seconds
┌ Info: Loss:
│ train = 29.760109f0
└ test = 309.85358f0
0.413950 seconds
┌ Info: Loss:
│ train = 19.802664f0
└ test = 275.5991f0
julia> versioninfo()
Julia Version 1.9.0-DEV.1189
Commit 293031b4a5* (2022-08-26 20:24 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.5 (ORCJIT, cascadelake)
Threads: 36 on 36 virtual cores
Environment:
JULIA_NUM_THREADS = 36

I should have included the
The 10980XE, compared to the E5-2650 v3, has 1.8x more cores (18 vs 10), 4x more L2 cache per core (1 MiB vs 256 KiB), and 4x higher throughput per core (AVX512 vs AVX without FMA). Clock speeds are also higher on the 10980XE.

julia> 1.8*4 * 0.4 # 1.8x fewer cores * 4x fewer FLOPS/clock * 0.4 seconds
2.8800000000000003

A bit over 2.8 seconds is roughly in line with what one would expect from the Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz. However, this computer is not representative of what most people will have access to, so the docs should probably emphasize the architecture. (Although PumasAI's customers are likely to use SimpleChains through JuliaHub, which will run on AVX512-capable servers and allow high core counts.) This means that, unfortunately, we probably can't expect much better than the times you reported on your computer.
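Since the expected timings depend so heavily on the architecture, one quick way to see what Julia is actually targeting is sketched below; the outputs are simply what the versioninfo() above implies for the 10980XE session, and "cascadelake" is an AVX512-capable target.

julia> Sys.CPU_NAME          # LLVM CPU target Julia selected (also shown in versioninfo())
"cascadelake"

julia> Sys.CPU_THREADS       # hardware threads visible to Julia
36

julia> Threads.nthreads()    # threads this session was started with
36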
SquaredLoss went from reporting half the mean squared loss to half the sum of the squared loss, which is why the reported loss values look so much larger than those in the docs. I'll make a PR to update the docs (also adding |
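To see the size of that difference, here is a minimal sketch (not SimpleChains' internals; the arrays are made up but have the same 4 × 10_000 shape as Y in the example):

julia> using Statistics

julia> Ŷ = rand(Float32, 4, 10_000); Y = rand(Float32, 4, 10_000);  # hypothetical predictions and targets

julia> half_mean = 0.5f0 * mean(abs2, Ŷ .- Y);  # "half mean squared loss" (the old reporting)

julia> half_sum = 0.5f0 * sum(abs2, Ŷ .- Y);    # "half the sum of the squared loss" (the new reporting)

julia> half_sum / half_mean ≈ length(Y)         # the two differ by a factor of 40_000 at this problem size
true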
My Dell XPS 13 with a 4-core Tiger Lake chip took 2.1 seconds (making it faster than the moderately beefy server without FMA instructions), which is also within the realm of what's expected, because it does have AVX512 but can only perform a single FMA per clock cycle. Memory operations and
Mind sharing |
Great answers! I will link here from Slack. On my Mac:
On the first Linux machine:
On the larger Linux machine:
|
For completeness' sake, here is the result on an M1 Mac
|
That is unexpectedly poor performance. I see
My Mac Mini (only 4 big cores) is about 2x faster:

6.450894 seconds (10.19 M allocations: 694.866 MiB, 2.52% gc time, 47.80% compilation time)
┌ Info: Loss:
│ train = 139.80142f0
└ test = 647.0813f0
3.359473 seconds
┌ Info: Loss:
│ train = 31.356768f0
└ test = 324.7406f0
3.367158 seconds
┌ Info: Loss:
│ train = 19.30026f0
└ test = 262.24854f0
julia> versioninfo()
Julia Version 1.9.0-DEV.1189
Commit 293031b4a5* (2022-08-26 20:24 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.6.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.5 (ORCJIT, apple-m1)
Threads: 4 on 4 virtual cores

If you do have an 8 big-core system, perhaps the problem is communication overhead between the separate 4-core clusters? Not sure. Perhaps also in that case, going down to 4 cores can help? |
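A minimal sketch of pinning the thread count, assuming Julia is launched from a shell; the best count depends on the machine (here, the 4 performance cores):

# Start Julia with an explicit thread count, e.g.:
#   julia --threads=4
# or via the environment variable shown in the versioninfo() outputs above:
#   JULIA_NUM_THREADS=4 julia

julia> Threads.nthreads()   # confirm inside the session
4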
Too many threads is the answer. When I run with 4 threads, performance matches yours:
|
Upgrading packages (in particular, SLEEFPirates to 0.6.34) should help performance on this example; a sketch of the upgrade commands follows the timings below. It improves the

5.584716 seconds (9.68 M allocations: 664.697 MiB, 6.34% gc time, 57.13% compilation time)
┌ Info: Loss:
│ train = 113.73494f0
└ test = 3894.2808f0
2.391776 seconds
┌ Info: Loss:
│ train = 26.246262f0
└ test = 3112.5964f0
2.402684 seconds
┌ Info: Loss:
│ train = 17.083645f0
└ test = 2907.2073f0

On the 10980XE:

4.357419 seconds (7.64 M allocations: 527.256 MiB, 5.13% gc time, 91.42% compilation time)
┌ Info: Loss:
│ train = 173.95671f0
└ test = 381.87326f0
0.314578 seconds
┌ Info: Loss:
│ train = 20.398003f0
└ test = 199.80536f0
0.318370 seconds
┌ Info: Loss:
│ train = 12.305162f0
└ test = 159.66428f0 |
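A sketch of the upgrade suggested above, assuming a standard Pkg environment (the exact version you end up with depends on your project's compat bounds):

julia> using Pkg

julia> Pkg.status("SLEEFPirates")    # check the currently installed version

julia> Pkg.update("SLEEFPirates")    # upgrade it; 0.6.34 or newer is what helped here

# Then restart Julia so the upgraded package is the one that actually gets loaded.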
Yes, that made quite the difference (after restarting Julia). That version isn't what a base install gives you, however, so that is another thing that could help your docs. I think that this issue can be closed (I appreciate the tutorial, however), subject to the tiny doc updates (which you may have already done). |
I am excited and intrigued about the promise of SimpleChains, but I am having some problems in reproducing even the simplest examples.
So, for my first experiment, I was looking at https://pumasai.github.io/SimpleChains.jl/stable/examples/smallmlp/
That page says that they get these results:
But when I run this code, I get
(I have increased the number of iterations here)
This is running on a moderately beefy server
and I started Julia with 20 threads in case it would help (it makes quite a difference)
Interestingly, the speed is only a bit faster than on my old Mac (2.8s per step versus 4.2s). Faster, yes, but not devastatingly so.
Also, threading options on the Mac made very little difference. On the server-scale machine, the difference was devastating (27s vs 2.8s).
But, in any case, the speeds are all massively slower than what the web page describes and the loss achieved is much worse.
Am I doing something obviously, patently wrong?
Or is the web page out of date?
Or is there a don't-go-slow switch that defaults to off?