Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks with the same workloads sometimes finish in much longer time than others #53269

Closed
PetrKryslUCSD opened this issue Feb 10, 2024 · 12 comments
Labels
multithreading Base.Threads and related functionality

Comments

@PetrKryslUCSD
Copy link

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 16 virtual cores

This batch file is run on an unloaded machine (single user):

if !isdir("FinEtoolsHeatDiff.jl")
    run(`git clone https://github.com/PetrKryslUCSD/FinEtoolsHeatDiff.jl.git`)
end
using Pkg
cd("FinEtoolsHeatDiff.jl/examples")
println("Current folder: $(pwd())")
Pkg.activate(".")
Pkg.instantiate()
Pkg.update()
@show Threads.nthreads()
include(joinpath(pwd(), "steady_state/3-d/Poisson_examples.jl"))
using .Main.Poisson_examples; Main.Poisson_examples.Poisson_FE_H20_parass_tasks_example(100)
using .Main.Poisson_examples; Main.Poisson_examples.Poisson_FE_H20_parass_tasks_example(100)

As shown in the batch file, a computation is run twice with the same parameters:

pkrysl@samadira exp % ~/julia-1.10.0/bin/julia -t 17 batch.jl
Current folder: /Users/pkrysl/exp/FinEtoolsHeatDiff.jl/examples
  Activating project at `~/exp/FinEtoolsHeatDiff.jl/examples`
    Updating registry at `~/.julia/registries/General.toml`
    Updating git-repo `https://github.com/PetrKryslUCSD/Targe2.jl.git`
  No Changes to `~/exp/FinEtoolsHeatDiff.jl/examples/Project.toml`
  No Changes to `~/exp/FinEtoolsHeatDiff.jl/examples/Manifest.toml`
Threads.nthreads() = 17
[ Info: All examples may be executed with
using .Main.Poisson_examples; Main.Poisson_examples.allrun()
[ Info: Starting                                  <<<<<<<<<<<<<<<<<< FIRST RUN
[ Info: 1000000 elements
[ Info: Number of free degrees of freedom: 3910599
[ Info: Conductivity: serial
[ Info: Conductivity done serial 19.16116690635681
[ Info: All done serial 21.8343768119812
[ Info: Conductivity: parallel
[ Info: 1: Started 0.03969597816467285
[ Info: 1: Spawned 0.09760499000549316
[ Info: 2: Started 0.09940600395202637
[ Info: 3: Started 0.09951305389404297
[ Info: 2: Spawned 0.0995330810546875
[ Info: 4: Started 0.09955406188964844
[ Info: 5: Started 0.09959006309509277
[ Info: 3: Spawned 0.09958100318908691
[ Info: 6: Started 0.09961915016174316
[ Info: 5: Spawned 0.09964394569396973
[ Info: 4: Spawned 0.09963512420654297
[ Info: 6: Spawned 0.09971213340759277
[ Info: 7: Started 0.09971809387207031
[ Info: 8: Started 0.09978413581848145
[ Info: 8: Spawned 0.09986209869384766
[ Info: 7: Spawned 0.09983396530151367
[ Info: 9: Started 0.0999000072479248
[ Info: 10: Started 0.09997797012329102
[ Info: 9: Spawned 0.10000205039978027
[ Info: 10: Spawned 0.10006904602050781
[ Info: 11: Started 0.10009407997131348
[ Info: 12: Started 0.1001901626586914
[ Info: 11: Spawned 0.10021495819091797
[ Info: 12: Spawned 0.1002650260925293
[ Info: 13: Started 0.10025405883789062
[ Info: 14: Started 0.10038900375366211
[ Info: 13: Spawned 0.10040903091430664
[ Info: 14: Spawned 0.10048103332519531
[ Info: 15: Started 0.10048699378967285
[ Info: 16: Started 0.10061907768249512
[ Info: 16: Spawned 0.10070610046386719
[ Info: 15: Spawned 0.10070610046386719
[ Info: 1: Started conductivity 0.29084300994873047
[ Info: 6: Started conductivity 0.30785512924194336
[ Info: 2: Started conductivity 0.31269001960754395
[ Info: 8: Started conductivity 0.31369996070861816
[ Info: 10: Started conductivity 0.3148770332336426
[ Info: 11: Started conductivity 0.3154640197753906
[ Info: 3: Started conductivity 0.3155040740966797
[ Info: 4: Started conductivity 0.31554698944091797
[ Info: 7: Started conductivity 0.3156399726867676
[ Info: 14: Started conductivity 0.3165130615234375
[ Info: 5: Started conductivity 0.3165619373321533
[ Info: 9: Started conductivity 0.3167459964752197
[ Info: 12: Started conductivity 0.31691908836364746
[ Info: 13: Started conductivity 0.3172330856323242
[ Info: 16: Started conductivity 0.3172590732574463
[ Info: 15: Started conductivity 0.3174870014190674
[ Info: 6: Finished 1.7215111255645752           <<<<<<<<<<<< ALL TASKS FINISH
[ Info: 2: Finished 1.7214860916137695
[ Info: 7: Finished 1.7238011360168457
[ Info: 3: Finished 1.7238130569458008
[ Info: 1: Finished 1.723978042602539
[ Info: 16: Finished 1.7248311042785645
[ Info: 13: Finished 1.7270441055297852
[ Info: 8: Finished 1.7272679805755615
[ Info: 14: Finished 1.729992151260376
[ Info: 4: Finished 1.7307720184326172
[ Info: 5: Finished 1.7324531078338623
[ Info: 9: Finished 1.7324731349945068
[ Info: 10: Finished 1.7330989837646484
[ Info: 11: Finished 1.733449935913086
[ Info: 15: Finished 1.7334620952606201
[ Info: 12: Finished 1.7362711429595947            <<<<<<<<<<< IN ROUGHLY THE SAME TIME
[ Info: Started make-matrix 1.7365190982818604
[ Info: All done 1.743549108505249
[ Info: Short-circuited exit
[ Info: Starting                                   <<<<<<<<<<<<<<<<< SECOND RUN
[ Info: 1000000 elements
[ Info: Number of free degrees of freedom: 3910599
[ Info: Conductivity: serial
[ Info: Conductivity done serial 18.674499034881592
[ Info: All done serial 21.335928916931152
[ Info: Conductivity: parallel
[ Info: 1: Started 0.02824687957763672
[ Info: 2: Started 0.02838897705078125
[ Info: 1: Spawned 0.02837681770324707
[ Info: 2: Spawned 0.028560876846313477
[ Info: 3: Started 0.028569936752319336
[ Info: 3: Spawned 0.028743982315063477
[ Info: 4: Started 0.028760910034179688
[ Info: 5: Started 0.028944015502929688
[ Info: 4: Spawned 0.028958797454833984
[ Info: 6: Started 0.029123783111572266
[ Info: 5: Spawned 0.029111862182617188
[ Info: 7: Started 0.029152870178222656
[ Info: 6: Spawned 0.029179811477661133
[ Info: 8: Started 0.02931690216064453
[ Info: 7: Spawned 0.02931690216064453
[ Info: 8: Spawned 0.029472827911376953
[ Info: 9: Started 0.029507875442504883
[ Info: 10: Started 0.029678821563720703
[ Info: 9: Spawned 0.029706954956054688
[ Info: 11: Started 0.029832839965820312
[ Info: 12: Started 0.029864788055419922
[ Info: 10: Spawned 0.029883861541748047
[ Info: 11: Spawned 0.029930830001831055
[ Info: 13: Started 0.030006885528564453
[ Info: 14: Started 0.030041933059692383
[ Info: 13: Spawned 0.030104875564575195
[ Info: 12: Spawned 0.030069828033447266
[ Info: 15: Started 0.030176877975463867
[ Info: 16: Started 0.030210018157958984
[ Info: 16: Spawned 0.03033900260925293
[ Info: 14: Spawned 0.030319929122924805
[ Info: 15: Spawned 0.030336856842041016
[ Info: 13: Started conductivity 0.09562182426452637
[ Info: 9: Started conductivity 0.09810400009155273
[ Info: 2: Started conductivity 0.1115868091583252
[ Info: 1: Started conductivity 0.12673091888427734
[ Info: 3: Started conductivity 0.12754082679748535
[ Info: 7: Started conductivity 0.12888884544372559
[ Info: 10: Started conductivity 0.12903380393981934
[ Info: 11: Started conductivity 0.1291518211364746
[ Info: 5: Started conductivity 0.12967586517333984
[ Info: 6: Started conductivity 0.1297318935394287
[ Info: 4: Started conductivity 0.12999987602233887
[ Info: 12: Started conductivity 0.1302487850189209
[ Info: 16: Started conductivity 0.13028287887573242
[ Info: 8: Started conductivity 0.13030695915222168
[ Info: 15: Started conductivity 0.15201091766357422
[ Info: 14: Started conductivity 0.162139892578125
[ Info: 12: Finished 1.3724877834320068
[ Info: 1: Finished 1.372588872909546
[ Info: 3: Finished 1.4589488506317139
[ Info: 5: Finished 1.4598939418792725
[ Info: 2: Finished 1.4610848426818848
[ Info: 11: Finished 1.4626657962799072
[ Info: 7: Finished 1.4626657962799072
[ Info: 13: Finished 1.4634709358215332
[ Info: 10: Finished 1.4653029441833496
[ Info: 16: Finished 1.4664249420166016
[ Info: 6: Finished 1.4673268795013428
[ Info: 9: Finished 1.468505859375
[ Info: 4: Finished 1.4687118530273438
[ Info: 14: Finished 2.7745578289031982  <<<<<<<<<<<<<<<< THESE FINISH in double the time as the others
[ Info: 15: Finished 2.776059865951538   <<<<<<<<<<<<<<<<
[ Info: 8: Finished 2.776533842086792    <<<<<<<<<<<<<<<<
[ Info: Started make-matrix 2.776829957962036
[ Info: All done 2.7807087898254395
[ Info: Short-circuited exit
pkrysl@samadira exp %

So, some tasks take practically double the time of the others. The number of tasks that takes longer varies from run to run.

@jariji
Copy link
Contributor

jariji commented Feb 10, 2024

This seems like a better fit for https://discourse.julialang.org/.

@PetrKryslUCSD
Copy link
Author

PetrKryslUCSD commented Feb 10, 2024

Sorry, why? Doesn't this point at a problem with the task library? I can think of at least one reason: scheduling of tasks may be at fault (?)...

@carstenbauer carstenbauer added the multithreading Base.Threads and related functionality label Feb 10, 2024
@carstenbauer
Copy link
Member

Irrespective of whether this belongs here or not, can you try to provide a more minimal example?

@PetrKryslUCSD
Copy link
Author

PetrKryslUCSD commented Feb 10, 2024

@carstenbauer That is hard. I tried, but for some reason much simpler setups did not yield reproducible behaviour like this. Even this example shows random variations: sometimes the tasks run equally well. But, start the sim one more time and some tasks will again lag behind the other group (it is always two groups: fast tasks and slow tasks). It is almost as if some tasks had to wait for a thread to run. Even though there should be enough for all to not to have to wait.

@carstenbauer
Copy link
Member

carstenbauer commented Feb 10, 2024

@PetrKryslUCSD
Copy link
Author

PetrKryslUCSD commented Feb 10, 2024

But, start the sim one more time and some tasks will again lag behind the other group
(it is always two groups: fast tasks and slow tasks). It is almost as if some tasks
had to wait for a thread to run. Even though there should be enough for all to not to have to wait.

It seems I answered my own question:

Strategy A. When the computation is started with N threads and the number of tasks is N-1, it seems that at some point one or more tasks cannot find a thread to run on, wait for one, and finish consequently late.

Strategy B. The following approach works much better: start as many threads as you have on the machine, and use however many tasks you wish (leaving a substantial number of threads over). For instance, spin up julia with 24 threads, and use up to 16 tasks. Then all tasks will finish at the same time.

I wonder why some of the tasks cannot find a thread to run on in Strategy A? What else does julia run on those threads?

@vtjnash
Copy link
Member

vtjnash commented Feb 10, 2024

Sounds like this is better suited to continue discussion on the discourse thread then. Github does not handle threads well.

@PetrKryslUCSD
Copy link
Author

The problem appears to be real, wouldn't you say?

@PetrKryslUCSD
Copy link
Author

PetrKryslUCSD commented Feb 12, 2024

Edit: Updated MWE:

module mwe_tasks_2
using Base.Threads
function work(r)
    s = 0.0
    for j in r
        s = s + exp(-(j - minimum(r))^2/(maximum(r)-minimum(r))^2)
    end
    s
end
function test(nchunks = 2)
    # @info "nchunks = $(nchunks)"
    # @info "nthreads.((:default, :interactive)) = $(Threads.nthreads.((:default, :interactive)))"
    # @info "maxthreadid = $(Threads.maxthreadid())"
    # @info "threadpool.(1:Threads.maxthreadid()) = $(threadpool.(1:Threads.maxthreadid()))"
    # @info "Threads.threadpoolsize.((:default, :interactive)) = $(Threads.threadpoolsize.((:default, :interactive)))"
    N = 200000000
    chincr = N / nchunks
    @assert nchunks * chincr == N
    chunks = [(((i-1)*chincr+1:i*chincr), i) for i = 1:nchunks]

    s = Float64[]
    start = time()
    Threads.@sync begin
        for ch in chunks
            Threads.@spawn let r = $ch[1], i = $ch[2]
                # @info "Chunk $(i), thread $(threadid()), $(Threads.threadpool(threadid())): Spawned $(time() - start)"                
                push!(s, work(r))
                # @info "$(i): Finished $(time() - start)"
            end
        end
    end
    # @info "Finished $(time() - start)"
    # @show s
end
end
using Main.mwe_tasks_2;
mwe_tasks_2.test()
ts = []
for n = 1:10
    push!(ts, @elapsed mwe_tasks_2.test())
end
@show extrema(ts)

Many runs repeated. Often the max run time is twice the min, indicating scheduling problems.

@vtjnash
Copy link
Member

vtjnash commented Feb 12, 2024

It does not sound real based on the current description. I think the usual approach is to divide the problem up into about 4nthreads() chunks, to ensure adequate balancing possible

@PetrKryslUCSD
Copy link
Author

@vtjnash I'm afraid I'm not following you. The workload in the MWE is perfectly distributed. I start enough threads so that each task can run on a thread, on a machine that has enough threads not to be oversubscribed. Yet apparently some tasks have to wait on a thread to run.
Would one consider this normal?

Creating more chunks means creating more tasks means spending more time in the setup of the parallel loop. Is that a good strategy?

@PetrKryslUCSD
Copy link
Author

I think the @info prints may have messed with the timing. I will close this issue now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
Development

No branches or pull requests

4 participants