
Example in documentation doesn't seem to work as advertised #102

Open
tdunning opened this issue Aug 26, 2022 · 8 comments
@tdunning

I am excited and intrigued by the promise of SimpleChains, but I am having some problems reproducing even the simplest examples.

So, for my first experiment, I was looking at https://pumasai.github.io/SimpleChains.jl/stable/examples/smallmlp/

That page says that they get these results:

┌ Info: Loss:
│   train = 0.012996411f0
└   test = 0.021395735f0
  0.488138 seconds
┌ Info: Loss:
│   train = 0.0027068993f0
└   test = 0.009439239f0
  0.481226 seconds
┌ Info: Loss:
│   train = 0.0016358295f0
└   test = 0.0074498975f0

But when I run this code, I get

julia> include("fit.jl")
 14.593964 seconds (9.99 M allocations: 743.941 MiB, 2.50% gc time, 100.00% compilation time)
┌ Info: Loss:
│   train = 117384.07f0
└   test = 120648.69f0
 10.075712 seconds (10.75 M allocations: 653.625 MiB, 2.71% gc time, 64.78% compilation time)
┌ Info: Loss:
│   train = 281.1819f0
└   test = 2137.8977f0
  3.006191 seconds
┌ Info: Loss:
│   train = 36.69812f0
└   test = 1375.113f0
  3.068424 seconds
┌ Info: Loss:
│   train = 22.220306f0
└   test = 1274.7125f0
  2.768954 seconds
┌ Info: Loss:
│   train = 17.59526f0
└   test = 1231.006f0
  2.838665 seconds
┌ Info: Loss:
│   train = 14.732641f0
└   test = 1203.2766f0
  3.007222 seconds
┌ Info: Loss:
│   train = 12.916879f0
└   test = 1181.4078f0
  2.835089 seconds
┌ Info: Loss:
│   train = 11.717702f0
└   test = 1163.2349f0
  2.826076 seconds
┌ Info: Loss:
│   train = 10.467139f0
└   test = 1146.4462f0
  2.841819 seconds
┌ Info: Loss:
│   train = 9.798438f0
└   test = 1134.0793f0
  2.773523 seconds
┌ Info: Loss:
│   train = 9.052085f0
└   test = 1122.8378f0

(I have increased the number of iterations here)

This is running on a moderately beefy server:

model name	: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
stepping	: 2
microcode	: 0x46
cpu MHz		: 1198.829
cache size	: 25600 KB

I started Julia with 20 threads in case it would help (it makes quite a difference).
Interestingly, the speed is only a bit faster than on my old Mac (2.8s per step versus 4.2s). Faster, yes, but not devastatingly so.

Also, threading options on the Mac made very little difference. On the server-scale machine, the difference was dramatic (27s vs 2.8s).

But, in any case, the speeds are all massively slower than what the web page describes and the loss achieved is much worse.
Am I doing something obviously, patently wrong?

Or is the web page out of date?

Or is there a don't-go-slow switch that defaults to off?

@chriselrod
Contributor

Hi,
I just re-ran the example.

julia> using SimpleChains

julia> mlpd = SimpleChain(
         static(4),
         TurboDense(tanh, 32),
         TurboDense(tanh, 16),
         TurboDense(identity, 4)
       )
SimpleChain with the following layers:
TurboDense static(32) with bias.
Activation layer applying: tanh
TurboDense static(16) with bias.
Activation layer applying: tanh
TurboDense static(4) with bias.

julia> function f(x)
         N = Base.isqrt(length(x))
         A = reshape(view(x, 1:N*N), (N,N))
         expA = exp(A)
         vec(expA)
       end
f (generic function with 1 method)

julia> T = Float32;

julia> X = randn(T, 2*2, 10_000);

julia> Y = reduce(hcat, map(f, eachcol(X)));

julia> Xtest = randn(T, 2*2, 10_000);

julia> Ytest = reduce(hcat, map(f, eachcol(Xtest)));

julia> @time p = SimpleChains.init_params(mlpd);
  8.216584 seconds (3.67 M allocations: 317.728 MiB, 1.15% gc time, 100.00% compilation time)

julia> G = SimpleChains.alloc_threaded_grad(mlpd);

julia> mlpdloss = SimpleChains.add_loss(mlpd, SquaredLoss(Y));

julia> mlpdtest = SimpleChains.add_loss(mlpd, SquaredLoss(Ytest));

julia> report = let mtrain = mlpdloss, X=X, Xtest=Xtest, mtest = mlpdtest
         p -> begin
           let train = mlpdloss(X, p), test = mlpdtest(Xtest, p)
             @info "Loss:" train test
           end
         end
       end
#3 (generic function with 1 method)

julia> report(p)
┌ Info: Loss:
│   train = 133158.62f0
└   test = 130800.52f0

julia> for _ in 1:3
         @time SimpleChains.train_unbatched!(
           G, p, mlpdloss, X, SimpleChains.ADAM(), 10_000
         );
         report(p)
       end
  4.784989 seconds (8.28 M allocations: 565.218 MiB, 6.79% gc time, 89.51% compilation time)
┌ Info: Loss:
│   train = 192.6596f0
└   test = 523.15155f0
  0.409552 seconds
┌ Info: Loss:
│   train = 29.760109f0
└   test = 309.85358f0
  0.413950 seconds
┌ Info: Loss:
│   train = 19.802664f0
└   test = 275.5991f0

julia> versioninfo()
Julia Version 1.9.0-DEV.1189
Commit 293031b4a5* (2022-08-26 20:24 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, cascadelake)
  Threads: 36 on 36 virtual cores
Environment:
  JULIA_NUM_THREADS = 36

I should have included versioninfo() in the documentation to avoid potential confusion.
My timings now are better than those in the docs, which is either because performance has improved since then, or because the docs were run on a slower computer (e.g., they could have been run on a 7980XE, which I also have).

The 10980XE, compared to the E5-2650V3, has 1.8x more cores (18 vs 10), 4x more L2 cache per core (1 MiB vs 256 KiB), and 4x higher throughput per core (AVX512 vs AVX without FMA). Clock speeds are also higher on the 10980XE.
That said,

julia> 1.8*4 * 0.4 # 1.8x fewer cores * 4x fewer FLOPS/clock * 0.4 seconds
2.8800000000000003

A bit over 2.8 seconds is roughly in line with what one would expect from the Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz.

However, this computer is not representative of what most people will have access to, so the documentation should probably emphasize the architecture. (Although PumasAI's customers are likely to use SimpleChains through JuliaHub, which runs on AVX512-capable servers and allows high core counts.)

This means that unfortunately, we probably can't expect much better than the times you reported on your computer.

> But, in any case, the speeds are all massively slower than what the web page describes and the loss achieved is much worse.

SquaredLoss went from reporting half the mean squared loss to half the sum of the squared errors; with N = 10,000 training samples, the reported values are therefore about 10,000× larger, which accounts for most of the apparent regression in the loss numbers. I'll make a PR to update the docs (also adding versioninfo()).

@chriselrod
Contributor

chriselrod commented Aug 27, 2022

My Dell XPS 13, with a 4-core Tiger Lake chip, took 2.1 seconds, making it faster than the moderately beefy server without FMA instructions. That is also within the realm of what's expected: the laptop does have AVX512, but can only perform a single FMA per clock cycle, while memory operations and tanh still receive all or most of the benefit from AVX512.

> Interestingly, the speed is only a bit faster than on my old Mac (2.8s per step versus 4.2s). Faster, yes, but not devastatingly so.

> Also, threading options on the Mac made very little difference.

Mind sharing versioninfo() on the Mac?

@tdunning
Author

tdunning commented Aug 28, 2022

Great answers! I will link here from Slack.

On my Mac:

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

On the first Linux machine:

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

On the larger Linux machine:

Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
  Threads: 1 on 40 virtual cores

@tdunning
Author

For completeness' sake, here is the result on an M1 Mac:

julia> include("fit.jl")
  5.351259 seconds (10.33 M allocations: 813.431 MiB, 2.26% gc time, 100.00% compilation time)
┌ Info: Loss:
│   train = 122779.22f0
└   test = 129720.38f0
 10.984314 seconds (13.95 M allocations: 819.360 MiB, 2.04% gc time, 29.61% compilation time)
┌ Info: Loss:
│   train = 206.32681f0
└   test = 1412.7253f0
  7.759999 seconds
┌ Info: Loss:
│   train = 36.585293f0
└   test = 880.9235f0
  7.786647 seconds
┌ Info: Loss:
│   train = 21.568672f0
└   test = 753.61115f0
  7.731717 seconds
┌ Info: Loss:
│   train = 16.15998f0
└   test = 678.20703f0
  7.823953 seconds
┌ Info: Loss:
│   train = 13.404591f0
└   test = 629.08093f0
^CERROR: LoadError: InterruptException:

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 6 on 8 virtual cores

@chriselrod
Contributor

That is unexpectedly poor performance.
Is this a 4 big/4 small or 8 big/2 small system?

I see versioninfo() reports "8 virtual cores", but I don't recall if it was updated to report only big cores in Julia 1.8, or only on master.
If it is trying to run code on little cores, that is likely the problem and starting Julia with only 4 threads should help.
It'd be harder to explain if it's 8 big/2 small, as naively I'd think 6 big cores would be close to 50% faster than 4.
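For anyone who wants to try this, the thread count is fixed at startup, so the quickest experiment is to relaunch Julia pinned to the big cores. A sketch using Julia's standard `--threads` flag and the `JULIA_NUM_THREADS` environment variable; `fit.jl` here stands for the benchmark script from this thread:

```shell
# Relaunch Julia with exactly 4 threads (one per big core on a 4+4 M1),
# then run the benchmark script:
julia --threads=4 fit.jl

# Equivalent, using the environment variable that versioninfo() reports:
JULIA_NUM_THREADS=4 julia fit.jl
```

Inside the session, `Threads.nthreads()` confirms the count actually in effect.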

My Mac Mini (only 4 big cores) is about 2x faster:

  6.450894 seconds (10.19 M allocations: 694.866 MiB, 2.52% gc time, 47.80% compilation time)      
┌ Info: Loss:                                                                                      
│   train = 139.80142f0                                                                            
└   test = 647.0813f0
  3.359473 seconds
┌ Info: Loss:
│   train = 31.356768f0
└   test = 324.7406f0
  3.367158 seconds
┌ Info: Loss:
│   train = 19.30026f0
└   test = 262.24854f0

julia> versioninfo()
Julia Version 1.9.0-DEV.1189
Commit 293031b4a5* (2022-08-26 20:24 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, apple-m1)
  Threads: 4 on 4 virtual cores

If you do have an 8-big-core system, perhaps the problem is communication overhead between the separate 4-core clusters? I'm not sure. In that case, going down to 4 threads might also help.

@tdunning
Author

Too many threads is the answer. When I run with 4 threads, performance matches yours:

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 4 on 8 virtual cores

julia> include("fit.jl")
  5.487713 seconds (10.33 M allocations: 813.347 MiB, 3.46% gc time, 100.00% compilation time)
┌ Info: Loss:
│   train = 130214.086f0
└   test = 120674.42f0
  6.729007 seconds (13.74 M allocations: 800.713 MiB, 3.93% gc time, 48.73% compilation time)
┌ Info: Loss:
│   train = 426.51428f0
└   test = 311.41187f0
  3.508436 seconds
┌ Info: Loss:
│   train = 60.985065f0
└   test = 347.3205f0
  3.452805 seconds
┌ Info: Loss:
│   train = 31.754074f0
└   test = 283.42307f0
  3.484129 seconds
┌ Info: Loss:
│   train = 22.97762f0
└   test = 209.78539f0
  3.535808 seconds
┌ Info: Loss:
│   train = 18.768747f0
└   test = 165.22177f0

@chriselrod
Contributor

Upgrading packages (in particular, SLEEFPirates to 0.6.34) should help performance on this example; it improves the tanh implementation.
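For anyone else hitting this, the upgrade is a one-liner with the standard Pkg API (a sketch; the exact versions resolved depend on your environment's compat bounds):

```shell
# Update all packages in the active environment, then confirm the
# SLEEFPirates version that was resolved (0.6.34+ has the faster tanh):
julia -e 'using Pkg; Pkg.update(); Pkg.status("SLEEFPirates")'
```

As noted below, a restart of any running Julia session is needed before the upgraded code takes effect.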
On the M1:

  5.584716 seconds (9.68 M allocations: 664.697 MiB, 6.34% gc time, 57.13% compilation time)       
┌ Info: Loss:                                                                                      
│   train = 113.73494f0                                                                            
└   test = 3894.2808f0
  2.391776 seconds
┌ Info: Loss:
│   train = 26.246262f0
└   test = 3112.5964f0
  2.402684 seconds
┌ Info: Loss:
│   train = 17.083645f0
└   test = 2907.2073f0

On the 10980XE:

  4.357419 seconds (7.64 M allocations: 527.256 MiB, 5.13% gc time, 91.42% compilation time)
┌ Info: Loss:
│   train = 173.95671f0
└   test = 381.87326f0
  0.314578 seconds
┌ Info: Loss:
│   train = 20.398003f0
└   test = 199.80536f0
  0.318370 seconds
┌ Info: Loss:
│   train = 12.305162f0
└   test = 159.66428f0

@tdunning
Author

Yes, that made quite a difference (after restarting Julia).

That version isn't what a base install gives you, however, so that is another thing that could be noted in your docs.

I think this issue can be closed (I appreciate the tutorial, however), subject to the tiny doc updates (which you may have already done).
