
Example in documentation doesn't seem to work as advertised #102

Open
tdunning opened this issue Aug 26, 2022 · 8 comments
@tdunning

I am excited and intrigued by the promise of SimpleChains, but I am having some problems reproducing even the simplest examples.

So, for my first experiment, I was looking at https://pumasai.github.io/SimpleChains.jl/stable/examples/smallmlp/

That page says that they get these results:

┌ Info: Loss:
│   train = 0.012996411f0
└   test = 0.021395735f0
  0.488138 seconds
┌ Info: Loss:
│   train = 0.0027068993f0
└   test = 0.009439239f0
  0.481226 seconds
┌ Info: Loss:
│   train = 0.0016358295f0
└   test = 0.0074498975f0

But when I run this code, I get

julia> include("fit.jl")
 14.593964 seconds (9.99 M allocations: 743.941 MiB, 2.50% gc time, 100.00% compilation time)
┌ Info: Loss:
│   train = 117384.07f0
└   test = 120648.69f0
 10.075712 seconds (10.75 M allocations: 653.625 MiB, 2.71% gc time, 64.78% compilation time)
┌ Info: Loss:
│   train = 281.1819f0
└   test = 2137.8977f0
  3.006191 seconds
┌ Info: Loss:
│   train = 36.69812f0
└   test = 1375.113f0
  3.068424 seconds
┌ Info: Loss:
│   train = 22.220306f0
└   test = 1274.7125f0
  2.768954 seconds
┌ Info: Loss:
│   train = 17.59526f0
└   test = 1231.006f0
  2.838665 seconds
┌ Info: Loss:
│   train = 14.732641f0
└   test = 1203.2766f0
  3.007222 seconds
┌ Info: Loss:
│   train = 12.916879f0
└   test = 1181.4078f0
  2.835089 seconds
┌ Info: Loss:
│   train = 11.717702f0
└   test = 1163.2349f0
  2.826076 seconds
┌ Info: Loss:
│   train = 10.467139f0
└   test = 1146.4462f0
  2.841819 seconds
┌ Info: Loss:
│   train = 9.798438f0
└   test = 1134.0793f0
  2.773523 seconds
┌ Info: Loss:
│   train = 9.052085f0
└   test = 1122.8378f0

(I have increased the number of iterations here)

This is running on a moderately beefy server:

model name	: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
stepping	: 2
microcode	: 0x46
cpu MHz		: 1198.829
cache size	: 25600 KB

I started Julia with 20 threads in case it would help (it makes quite a difference).
Interestingly, the speed is only a bit faster than on my old Mac (2.8s per step versus 4.2s). Faster, yes, but not devastatingly so.

Also, threading options on the Mac made very little difference. On the server-scale machine, the difference was dramatic (27s vs 2.8s).

But, in any case, the speeds are all massively slower than what the web page describes and the loss achieved is much worse.
Am I doing something obviously, patently wrong?

Or is the web page out of date?

Or is there a don't-go-slow switch that defaults to off?

@chriselrod
Contributor

Hi,
I just re-ran the example.

julia> using SimpleChains

julia> mlpd = SimpleChain(
         static(4),
         TurboDense(tanh, 32),
         TurboDense(tanh, 16),
         TurboDense(identity, 4)
       )
SimpleChain with the following layers:
TurboDense static(32) with bias.
Activation layer applying: tanh
TurboDense static(16) with bias.
Activation layer applying: tanh
TurboDense static(4) with bias.

julia> function f(x)
         N = Base.isqrt(length(x))
         A = reshape(view(x, 1:N*N), (N,N))
         expA = exp(A)
         vec(expA)
       end
f (generic function with 1 method)

julia> T = Float32;

julia> X = randn(T, 2*2, 10_000);

julia> Y = reduce(hcat, map(f, eachcol(X)));

julia> Xtest = randn(T, 2*2, 10_000);

julia> Ytest = reduce(hcat, map(f, eachcol(Xtest)));

julia> @time p = SimpleChains.init_params(mlpd);
  8.216584 seconds (3.67 M allocations: 317.728 MiB, 1.15% gc time, 100.00% compilation time)

julia> G = SimpleChains.alloc_threaded_grad(mlpd);

julia> mlpdloss = SimpleChains.add_loss(mlpd, SquaredLoss(Y));

julia> mlpdtest = SimpleChains.add_loss(mlpd, SquaredLoss(Ytest));

julia> report = let mtrain = mlpdloss, X=X, Xtest=Xtest, mtest = mlpdtest
         p -> begin
           let train = mlpdloss(X, p), test = mlpdtest(Xtest, p)
             @info "Loss:" train test
           end
         end
       end
#3 (generic function with 1 method)

julia> report(p)
┌ Info: Loss:
│   train = 133158.62f0
└   test = 130800.52f0

julia> for _ in 1:3
         @time SimpleChains.train_unbatched!(
           G, p, mlpdloss, X, SimpleChains.ADAM(), 10_000
         );
         report(p)
       end
  4.784989 seconds (8.28 M allocations: 565.218 MiB, 6.79% gc time, 89.51% compilation time)
┌ Info: Loss:
│   train = 192.6596f0
└   test = 523.15155f0
  0.409552 seconds
┌ Info: Loss:
│   train = 29.760109f0
└   test = 309.85358f0
  0.413950 seconds
┌ Info: Loss:
│   train = 19.802664f0
└   test = 275.5991f0

julia> versioninfo()
Julia Version 1.9.0-DEV.1189
Commit 293031b4a5* (2022-08-26 20:24 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, cascadelake)
  Threads: 36 on 36 virtual cores
Environment:
  JULIA_NUM_THREADS = 36

I should have included versioninfo() in the documentation to avoid potential confusion.
My timings now are better than those in the docs, which is either because performance has improved since then, or because the docs were run on a slower computer (e.g., they could have been run on a 7980XE, which I also have).

The 10980XE, compared to the E5-2650V3, has 1.8x more cores (18 vs 10), 4x more L2 cache per core (1 MiB vs 256 KiB), and 4x higher throughput per core (AVX512 vs AVX without FMA). Clock speeds are also higher on the 10980XE.
That said,

julia> 1.8*4 * 0.4 # 1.8x fewer cores * 4x fewer FLOPS/clock * 0.4 seconds
2.8800000000000003

A bit over 2.8 seconds is roughly in line with what one would expect from the Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz.

However, this computer is not representative of what most people will have access to, so the documentation should probably emphasize the architecture. (Although PumasAI's customers are likely to use SimpleChains through JuliaHub, which runs on AVX512-capable servers and allows high core counts.)

This means that unfortunately, we probably can't expect much better than the times you reported on your computer.

> But, in any case, the speeds are all massively slower than what the web page describes and the loss achieved is much worse.

SquaredLoss went from reporting half the mean squared loss to half the sum of the squared errors; with N = 10,000 training samples, the reported values are therefore about 10,000× larger, which accounts for most of the apparent regression in the loss numbers. I'll make a PR to update the docs (also adding versioninfo()).

@chriselrod
Contributor

chriselrod commented Aug 27, 2022

My Dell XPS 13, with a 4-core Tiger Lake chip, took 2.1 seconds, making it faster than the moderately beefy server without FMA instructions. That is also within the realm of what's expected: the laptop does have AVX512, but can only perform a single FMA per clock cycle, while memory operations and tanh still receive all or most of the benefit from AVX512.

> Interestingly, the speed is only a bit faster than on my old Mac (2.8s per step versus 4.2s). Faster, yes, but not devastatingly so.

> Also, threading options on the Mac made very little difference.

Mind sharing versioninfo() on the Mac?

@tdunning
Author

tdunning commented Aug 28, 2022

Great answers! I will link here from Slack.

On my Mac:

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

On the first Linux machine:

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

On the larger Linux machine:

Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
  Threads: 1 on 40 virtual cores

@tdunning
Author

For completeness' sake, here is the result on an M1 Mac:

julia> include("fit.jl")
  5.351259 seconds (10.33 M allocations: 813.431 MiB, 2.26% gc time, 100.00% compilation time)
┌ Info: Loss:
│   train = 122779.22f0
└   test = 129720.38f0
 10.984314 seconds (13.95 M allocations: 819.360 MiB, 2.04% gc time, 29.61% compilation time)
┌ Info: Loss:
│   train = 206.32681f0
└   test = 1412.7253f0
  7.759999 seconds
┌ Info: Loss:
│   train = 36.585293f0
└   test = 880.9235f0
  7.786647 seconds
┌ Info: Loss:
│   train = 21.568672f0
└   test = 753.61115f0
  7.731717 seconds
┌ Info: Loss:
│   train = 16.15998f0
└   test = 678.20703f0
  7.823953 seconds
┌ Info: Loss:
│   train = 13.404591f0
└   test = 629.08093f0
^CERROR: LoadError: InterruptException:

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 6 on 8 virtual cores

@chriselrod
Contributor

That is unexpectedly poor performance.
Is this a 4 big/4 small or 8 big/2 small system?

I see versioninfo() reports "8 virtual cores", but I don't recall if it was updated to report only big cores in Julia 1.8, or only on master.
If it is trying to run code on little cores, that is likely the problem and starting Julia with only 4 threads should help.
It'd be harder to explain if it's 8 big/2 small, as naively I'd think 6 big cores would be close to 50% faster than 4.
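For anyone who wants to try this, the thread count is fixed at startup, so the quickest experiment is to relaunch Julia pinned to the big cores. A sketch using Julia's standard `--threads` flag and the `JULIA_NUM_THREADS` environment variable; `fit.jl` here stands for the benchmark script from this thread:

```shell
# Relaunch Julia with exactly 4 threads (one per big core on a 4+4 M1),
# then run the benchmark script:
julia --threads=4 fit.jl

# Equivalent, using the environment variable that versioninfo() reports:
JULIA_NUM_THREADS=4 julia fit.jl
```

Inside the session, `Threads.nthreads()` confirms the count actually in effect.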

My Mac Mini (only 4 big cores) is about 2x faster:

  6.450894 seconds (10.19 M allocations: 694.866 MiB, 2.52% gc time, 47.80% compilation time)      
┌ Info: Loss:                                                                                      
│   train = 139.80142f0                                                                            
└   test = 647.0813f0
  3.359473 seconds
┌ Info: Loss:
│   train = 31.356768f0
└   test = 324.7406f0
  3.367158 seconds
┌ Info: Loss:
│   train = 19.30026f0
└   test = 262.24854f0

julia> versioninfo()
Julia Version 1.9.0-DEV.1189
Commit 293031b4a5* (2022-08-26 20:24 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, apple-m1)
  Threads: 4 on 4 virtual cores

If you do have an 8-big-core system, perhaps the problem is communication overhead between the separate 4-core clusters? I'm not sure. In that case, going down to 4 threads might also help.

@tdunning
Author

Too many threads is the answer. When I run with 4 threads, performance matches yours:

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 4 on 8 virtual cores

julia> include("fit.jl")
  5.487713 seconds (10.33 M allocations: 813.347 MiB, 3.46% gc time, 100.00% compilation time)
┌ Info: Loss:
│   train = 130214.086f0
└   test = 120674.42f0
  6.729007 seconds (13.74 M allocations: 800.713 MiB, 3.93% gc time, 48.73% compilation time)
┌ Info: Loss:
│   train = 426.51428f0
└   test = 311.41187f0
  3.508436 seconds
┌ Info: Loss:
│   train = 60.985065f0
└   test = 347.3205f0
  3.452805 seconds
┌ Info: Loss:
│   train = 31.754074f0
└   test = 283.42307f0
  3.484129 seconds
┌ Info: Loss:
│   train = 22.97762f0
└   test = 209.78539f0
  3.535808 seconds
┌ Info: Loss:
│   train = 18.768747f0
└   test = 165.22177f0

@chriselrod
Contributor

Upgrading packages (in particular, SLEEFPirates to 0.6.34) should help performance on this example; it improves the tanh implementation.
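For anyone else hitting this, the upgrade is a one-liner with the standard Pkg API (a sketch; the exact versions resolved depend on your environment's compat bounds):

```shell
# Update all packages in the active environment, then confirm the
# SLEEFPirates version that was resolved (0.6.34+ has the faster tanh):
julia -e 'using Pkg; Pkg.update(); Pkg.status("SLEEFPirates")'
```

As noted below, a restart of any running Julia session is needed before the upgraded code takes effect.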
On the M1:

  5.584716 seconds (9.68 M allocations: 664.697 MiB, 6.34% gc time, 57.13% compilation time)       
┌ Info: Loss:                                                                                      
│   train = 113.73494f0                                                                            
└   test = 3894.2808f0
  2.391776 seconds
┌ Info: Loss:
│   train = 26.246262f0
└   test = 3112.5964f0
  2.402684 seconds
┌ Info: Loss:
│   train = 17.083645f0
└   test = 2907.2073f0

On the 10980XE:

  4.357419 seconds (7.64 M allocations: 527.256 MiB, 5.13% gc time, 91.42% compilation time)
┌ Info: Loss:
│   train = 173.95671f0
└   test = 381.87326f0
  0.314578 seconds
┌ Info: Loss:
│   train = 20.398003f0
└   test = 199.80536f0
  0.318370 seconds
┌ Info: Loss:
│   train = 12.305162f0
└   test = 159.66428f0

@tdunning
Author

Yes, that made quite a difference (after restarting Julia).

That version isn't what a base install gives you, however, so that is another thing that could be noted in your docs.

I think this issue can be closed (I appreciate the tutorial, however), subject to the tiny doc updates (which you may have already done).
