
ConcurrentSimulationRunner.RunConcurrently: execution time does not scale with the number of cores as expected #1455

Open
Yuri05 opened this issue Jan 22, 2022 · 65 comments

@Yuri05
Member

Yuri05 commented Jan 22, 2022

Attached is a small example in R, which loads N copies of the same simulation and executes runSimulations (which calls ConcurrentSimulationRunner.RunConcurrently in OSPSuite.Core).
testRunSimulations.zip

I have executed it on a machine with 8 cores (so by default, 7 cores are used).
The expectation would be that for any number of simulations between 1 and 7, the execution time should be approximately the same. However, the time increases with every additional simulation:

| Number of simulations | Time (runSimulations) [sec] | Expected speedup factor vs. 1 core | Real speedup factor |
| --- | --- | --- | --- |
| 1 | 18.27 | 1 | 1 |
| 2 | 22.03 | 2 | 1.66 |
| 3 | 28.88 | 3 | 1.90 |
| 4 | 36.48 | 4 | 2.00 |
| 5 | 44.17 | 5 | 2.07 |
| 6 | 51.43 | 6 | 2.13 |
| 7 | 59.22 | 7 | 2.16 |
| ... | ... | ... | ... |
| 21 | 182.61 | 7 | 2.10 |
[screenshot of the timing results]
@Yuri05
Member Author

Yuri05 commented Jan 22, 2022

@abdullahhamadeh @pchelle FYI

@msevestre
Member

We should try to reproduce this in .NET only, to remove any intermediate layers.

@Yuri05
Member Author

Yuri05 commented Jan 22, 2022

Can you reproduce it in R first?
I just tried something different: instead of using parallelism in .NET, I tried parallelism in R (parallel, doParallel, foreach):
testRunSimulations_parallel_2.zip
And the result was very similar to the previous one: the speedup was much less than expected.

| Number of simulations | Load simulations [s] | Load and run simulations [s] | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- | --- | --- |
| 1 | 10.3 | 31.4 | 21.1 | 1 |
| 4 | 12.8 | 51.8 | 39.0 | 2.2 |
| 7 | 17.4 | 75.8 | 58.4 | 2.5 |

@Yuri05
Member Author

Yuri05 commented Jan 23, 2022

The previous tests were on a machine with 8 logical cores, but only 4 physical cores.
Now I tried it (still in R) on another machine with 16 logical cores and 16 physical cores (theoretically; it's a virtual machine in the cloud, so the real underlying hardware is unknown).

Using R parallelism, I get a speedup factor of 7 on 8 cores and a speedup factor of 12 on 15 cores:

| Number of simulations | Load simulations [s] | Load and run simulations [s] | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- | --- | --- |
| 1 | 5.9 | 27.8 | 21.9 | 1 |
| 8 | 6.3 | 31.0 | 24.6 | 7.1 |
| 15 | 7.1 | 34.5 | 27.4 | 12.0 |

However, using the .NET parallelism, I get speedup factors of only between 4 and 5 in both cases:

| Number of simulations | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- |
| 1 | 22.6 | 1 |
| 8 | 43.1 | 4.2 |
| 15 | 73.9 | 4.6 |

@msevestre
Member

I am not sure what to make of this. Are only the physical cores actually performing work?

@msevestre
Member

It feels to me that we should be able to reproduce this in .NET with the same simulation.

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

OK, I tried it with ospsuite-R 10.0.72.
It now looks a bit better in .NET compared to ospsuite-R 11, but it is still slower than parallelizing in R.

.NET parallelism

| Number of simulations | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- |
| 1 | 20.4 | 1 |
| 4 | 24.2 | 3.4 |
| 8 | 31.3 | 5.2 |
| 15 | 46.6 | 6.6 |

R parallelism

| Number of simulations | Load simulations [s] | Load and run simulations [s] | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- | --- | --- |
| 1 | 5.8 | 25.6 | 19.8 | 1 |
| 4 | 5.8 | 26.4 | 20.6 | 3.9 |
| 8 | 6.5 | 27.9 | 21.4 | 7.4 |
| 15 | 6.5 | 32.4 | 26.0 | 11.4 |

@msevestre
Member

So I tried with a .NET console application. I am getting similar results in .NET so this is not an R issue
(@Yuri05 please create your table :))

Loading 1 simulation in 00m:02s:716ms
Running 1 in 00m:11s:991ms
Memory usage at the end: 160MB

Loading 4 simulation in 00m:10s:748ms
Running 4 in 00m:21s:822ms
Memory usage at the end: 460MB

Loading 4 simulation in 00m:10s:553ms
Running 4 in 00m:17s:636ms
Memory usage at the end: 460MB

Loading 8 simulation in 00m:23s:930ms
Running 8 in 00m:28s:214ms
Memory usage at the end: 880MB

Loading 16 simulation in 00m:46s:125ms
Running 16 in 01m:03s:592ms
Memory usage at the end: 1.6GB

The leak can also be seen in .NET, it seems. So this should be easy to spot using the profiler.

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

@msevestre Nice! This should make the profiling much easier I hope :)

| Number of simulations | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- |
| 1 | 12.0 | 1 |
| 4 | 19.7 | 2.4 |
| 8 | 28.2 | 3.4 |
| 16 | 63.6 | 3.0 |

@msevestre
Member

Investigating

[memory profiler screenshot]

The gray area is C++. It is completely freed

@msevestre
Member

However, a forced GC removed everything...
[memory profiler screenshot after forced GC]

We might not have a leak...
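
For anyone reproducing this, here is a minimal sketch of that kind of check in plain .NET (an assumed harness, not the profiler session used above): force a full blocking collection and see whether the managed memory drops back to the baseline.

```csharp
// Minimal sketch (assumed harness): distinguish a real leak from memory that simply
// has not been collected yet by forcing a full, blocking GC.
long before = GC.GetTotalMemory(forceFullCollection: false);

GC.Collect();
GC.WaitForPendingFinalizers(); // let finalizers release native (C++/SimModel) resources
GC.Collect();                  // collect anything freed by those finalizers

long after = GC.GetTotalMemory(forceFullCollection: true);
Console.WriteLine($"Managed memory: {before / (1024 * 1024)} MB -> {after / (1024 * 1024)} MB");
// If "after" returns to the baseline, the memory was merely uncollected, not leaked.
```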

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

So maybe the objects are not disposed only when calling from R, e.g. due to references held in rClr?

@msevestre
Member

[memory profiler screenshot]

@msevestre
Member

I am wondering if we are clearing the ConcurrentSimulationRunner... I need to check this.
This is right before GC... After GC it's all gone.

@msevestre
Member

Yep. Dispose is not called when calling runSimulations...
That does not explain the speed issue, but potentially some of the RAM issue.

@msevestre
Member

With the fix, the leak is gone:
[memory profiler screenshot after the fix]

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

Yesss!
[memory usage screenshot]

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

P.S. To achieve this I still need to perform the cleanup at the end and make sure that all R variables which hold .NET pointers are removed as well.
If I deactivate e.g. the R GC or the .NET GC, the memory remains occupied (it will probably be released at some point).

  rm(loadedSimulations)                       # drop the R references to the .NET simulation objects
  resetSimulationCache()                      # clear the ospsuite-internal simulation cache
  gc(verbose = FALSE)                         # let R finalize the rClr external pointers
  rClr::clrCallStatic("System.GC", "Collect") # force a .NET garbage collection

@msevestre
Member

The time profile is a weirder issue. It looks like functions just get slower as the number of parallel tasks executing increases, which does not make sense.

Here I am starting 12 simulations. I can see that 12 threads are started (12 + 1 for the main thread).
[profiler thread view screenshot]

The time spent on each thread is almost the same, and one can see that, for instance, on a given thread LoadFromXmlString takes 10 sec! (SimModel code)

The same code takes 5 sec when running only 2 simulations.
[profiler screenshot]

With 8 simulations, 7.7 sec.
[profiler screenshot]

With only 1 simulation, 4.6 sec.
[profiler screenshot]

This does not make sense to me at all. We are creating the right number of threads, and yet the execution time on each thread gets slower... WHAT?

This can be observed for ALL SimModel calls. The other tasks are insignificant in comparison.

@msevestre
Member

I tried to remove the parallel foreach (Parallel.ForEach) and use

 var tasks = _simulations.Select(x => Task.Run(()=> runSimulation(x, _cancellationTokenSource.Token)));

instead.
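
For reference, here is a standalone sketch of the two scheduling strategies being compared; the delegate-based representation of a simulation run is a placeholder, not the actual ConcurrencyManager code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class SchedulingSketch
{
   // Strategy 1: Parallel.ForEach with a capped degree of parallelism.
   public static void RunWithParallelForEach(IEnumerable<Action<CancellationToken>> runs, int cores, CancellationToken token)
   {
      var options = new ParallelOptions { MaxDegreeOfParallelism = cores, CancellationToken = token };
      Parallel.ForEach(runs, options, run => run(token));
   }

   // Strategy 2: one Task.Run per simulation; the thread pool decides how many execute at once.
   public static Task RunWithTasksAsync(IEnumerable<Action<CancellationToken>> runs, CancellationToken token)
   {
      // ToList() materializes the query so all tasks are actually started before awaiting.
      var tasks = runs.Select(run => Task.Run(() => run(token), token)).ToList();
      return Task.WhenAll(tasks);
   }
}
```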

Here is the outcome for 12:
[profiler screenshot]

@abdelr @Yuri05 Can we have a chat tomorrow to discuss this?

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

Hmm, that's really weird. No explanation at the moment.

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

> Can we have a chat tomorrow to discuss this?

sure :)

@abdelr
Member

abdelr commented Jan 24, 2022

Sure ;) We'll talk, but it is indeed strange. Are the threads locking each other somehow?

@msevestre
Member

> Are the threads locking each other somehow?

Nope, absolutely not. At least not in theory.

@msevestre
Member

Can you try to specify a simulationRunOptions explicitly?

@PavelBal
Member

PavelBal commented May 17, 2022

> Can you try to specify a simulationRunOptions explicitly?

OK, wow, yes. So if we do not provide a simulationRunOptions to the rClr call, .NET probably assumes only one core? I can create a PR in R for this.

Now, the runtime of simulation batches is more or less the same as running multiple simulations in parallel. So there is actually no gain from initializing only once?

| Number of simulations | Run simulations [s] | Run simulation batch [s] |
| --- | --- | --- |
| 1 | 16.27 | 16.98 |
| 2 | 18.02 | 19.5 |
| 4 | 21.23 | 23.62 |
| 16 | 43.29 | 48.93 |
| 32 | 93.11 | 94.64 |

@msevestre
Member

It should be much faster, as you are only initializing your simulation once.

@abdelr
Member

abdelr commented May 17, 2022

> OK, wow, yes. So if we do not provide a simulationRunOptions to the rClr call, .NET probably assumes only one core? I can create a PR in R for this.

It does: `private int numberOfCoresToUse() => SimulationRunOptions?.NumberOfCoresToUse ?? 1;`
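
A rough sketch of what passing the options explicitly could look like on the .NET side; apart from SimulationRunOptions, NumberOfCoresToUse and RunConcurrently, which appear in this thread, the member names below are assumptions and would need to be checked against the actual OSPSuite.Core API:

```csharp
// Sketch only: AddSimulation and the exact shape of RunConcurrently are assumed here.
var runOptions = new SimulationRunOptions
{
   NumberOfCoresToUse = Environment.ProcessorCount - 1 // leave one core for the caller
};

var runner = new ConcurrentSimulationRunner { SimulationRunOptions = runOptions };
foreach (var simulation in simulations)
   runner.AddSimulation(simulation);

// Without an explicit SimulationRunOptions, numberOfCoresToUse() falls back to 1 (see above).
var results = runner.RunConcurrently();
```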

@abdelr
Member

abdelr commented May 17, 2022

> Now, the runtime of simulation batches is more or less the same as running multiple simulations in parallel. So there is actually no gain from initializing only once?

Did we not discuss this before, when @msevestre found out that the batch was not properly used? I remember there was an issue from some external user about it.

@msevestre
Member

I am pretty sure we solved all of this. @abdelr we updated the code of ConcurrencyManager in Core a million times. Now it's using Parallel.ForEach and we assumed (maybe wrongly) that it would be better than your initial implementation.

Can you try to create a version of the DLL that we could try locally using your old code? We should be able to simply update the ConcurrencyManager, as the API was more or less consistent.

@msevestre
Member

No, forget about this. This is the same user issue again.
We allocate a simulation slot for each simulation run. Next time you run the variation again, it will be super fast.

So in your code, Pavel, do:

  tic()
  for (i in 1:numberOfSimulationToRun) {
    simBatch$addRunValues(70)
  }
  toc()

  print("Running simulations")
  tic()
  runSimulationBatches(simulationBatches = simBatch)
  toc()

  tic()
  for (i in 1:numberOfSimulationToRun) {
    simBatch$addRunValues(70)
  }
  toc()

  print("Running simulations again")
  tic()
  runSimulationBatches(simulationBatches = simBatch)
  toc()

The second run will be much, much faster.

@msevestre
Member

But effectively, if you are not batching anything, you are just creating simulations for each run and not reusing them again, which is not efficient at all. If you want to do this, you should use the old SimulationBatch (do we still have it in R?).

@msevestre
Member

Correction: it's gone.

@msevestre
Member

On my machine:
[1] "Running simulations"
3.7 sec elapsed
[1] "Running simulations again"
0.08 sec elapsed

@PavelBal
Member

Ok, then I just do not understand how to use it.

I have one simulation and I want to run it with 10 different parameter sets in parallel, so basically like a population simulation. How can I do it? The way you proposed it, it will run sequentially.

@abdelr
Member

abdelr commented May 17, 2022

> Ok, then I just do not understand how to use it.
>
> I have one simulation and I want to run it with 10 different parameter sets in parallel, so basically like a population simulation. How can I do it? The way you proposed it, it will run sequentially.

There is no magic solution. You do have a gain from initializing only once, but still, you need to initialize. So, if you want to run 10 simulations in parallel you still need to initialize them. Now, the gain from initializing only once is that you keep them, and the next time you need to run them again you just pass the params and that's it. So, if you are optimizing something, most likely you will need to run something, check results and run again with adjusted params... and so on... In this case it will work as @msevestre suggested. If you want to run ten simulations only once, well, you cannot use the initialize-only-once feature, right?

@msevestre
Member

Batch is not intended for this. It is meant to be used in algorithms where you want to vary the same set of parameters multiple times (think PI: set values, calculate, set new values, calculate, etc.). With this approach, it will be much faster.

What you are after is the version that I had originally implemented...

Probably, for running the same simulation with a variation of a parameter in parallel, you can create a population with only this one parameter? That would work, I believe.

@PavelBal
Member

Then adding multiple run values to a batch does not really make sense...

Well, actually I thought SimulationBatch is exactly what is behind the population simulation, and we would get the same benefit: initialize once, run in parallel.

@msevestre
Member

> Well, actually I thought SimulationBatch is exactly what is behind the population simulation, and we would get the same benefit: initialize once, run in parallel.

That's what it used to be.

@msevestre
Member

> Then adding multiple run values to a batch does not really make sense...

Yes, because it allocates the slots and does not release them. Maybe your algorithm wants to try a bunch of scenarios at once, e.g. (x, y), (x + value, y + value), etc., and decide what direction to move in based on the evaluation of multiple simulations. In this case, this makes sense.

But it is true that in most cases you will have one run. So we have lost the ability to do what you are after, I believe (at least easily).

@msevestre
Member

@PavelBal Actually, I am full of shit. This did not work like this.

Looking at the example, you could hold a batch, set values, run, set values, run, etc.:
https://github.com/Open-Systems-Pharmacology/OSPSuite-R/blob/main/tests/dev/script-simulation-batch.R

So you could not specify a list of values to run.
I am wondering now why we implemented it like that. We could have created one slot per batch and then tried to parallelize as we do for the population for each run. @abdelr do you know why we did it otherwise? I can't remember.

@PavelBal
Member

> I am wondering now why we implemented it like that. We could have created one slot per batch and then tried to parallelize as we do for the population for each run. @abdelr do you know why we did it otherwise? I can't remember.

Yes, this would be awesome, and actually what I always wanted it to do ;)

@abdelr
Member

abdelr commented May 17, 2022

> So you could not specify a list of values to run.
> I am wondering now why we implemented it like that. We could have created one slot per batch and then tried to parallelize as we do for the population for each run. @abdelr do you know why we did it otherwise? I can't remember.

So, in the original implementation, if you define one batch with different run values, the algorithm will initialize, in parallel, as many batches as needed for the parameters. Later you can run them in parallel and you can reuse them. I see no way to do it differently. So, again, you need to initialize the batch at least once. Once it is initialized you can reuse it, but the initialization needs to occur for the runner to be able to use it. So unless I am missing something, you can set different parameters on one batch and it will try to parallelize as much as possible, but it will need to initialize all the batches. Later you reuse them, but the first time there is nothing you can do about it, right?

@msevestre
Member

@abdelr For one simulation, we can use the population way of doing this, where we initialize only once. This is why the population is so fast. Theoretically this could also be done like this here.

@abdelr
Member

abdelr commented May 18, 2022

@msevestre @PavelBal: I think PopulationRunner works the same as the ConcurrentSimulationRunner. PopulationRunner has RunPopulationAsync. This method creates x simulations out of the original simulation, stores these simulations and reuses them for all individuals. The number x is decided based on the number of individuals and the number of cores. Now, the simulations are created in runSimulation using createAndFinalizeSimulation. So, all in all, the method calculates in advance "I will need 10 simulations", creates those 10 simulations (and initializes them) and stores them for reuse. Later all individuals are split across these 10 simulations, so if you had 100 individuals it will assign 10 individuals per simulation.

Now, the original implementation of the ConcurrentSimulationRunner did something similar. It did not use Parallel.ForEach; instead it calculated how many worker threads were needed and created one simulation per thread. Later it used these threads to take from a pool of tasks and run the simulations. So, the same idea. The implementation was rejected in favor of Parallel.ForEach, passing the control to the caller. This means you can reuse the simulations you create, but you are also responsible for them. Now, to mimic the same behavior we have just described with the current implementation, we would have to calculate from the caller: "I need to run 100 simulations and I have 10 cores, so I will add 10 batches and split the runs into 10, passing 10 values at a time." Alternatively we could try to revert to the original code if we manage to find it in the history of commits (maybe not easy since we squash) and lose the ability to reuse from one call of RunAsync to the next. In any case we need to agree on the use case before adding any further change.

This could also be of interest to @Yuri05, so I am tagging him so he can reflect a bit on it.
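
To make the splitting idea above concrete, here is a small self-contained sketch (not OSPSuite code; the simulation is just a placeholder) of creating one initialized instance per core and distributing all run values across those instances:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class SplitRunsSketch
{
    // Placeholder for an initialized simulation instance; initialization is the expensive part.
    public sealed class SimulationInstance
    {
        public double Run(double parameterValue) => parameterValue * 2; // stand-in for a SimModel run
    }

    public static double[] RunAll(double[] parameterValues, int numberOfCores)
    {
        // Initialize one instance per core (not per run), roughly mirroring the PopulationRunner idea.
        var instances = Enumerable.Range(0, numberOfCores).Select(_ => new SimulationInstance()).ToArray();
        var results = new double[parameterValues.Length];

        // Each worker repeatedly takes the next pending run and executes it on "its" instance.
        var pending = new ConcurrentQueue<int>(Enumerable.Range(0, parameterValues.Length));
        Parallel.For(0, numberOfCores, core =>
        {
            while (pending.TryDequeue(out var i))
                results[i] = instances[core].Run(parameterValues[i]);
        });

        return results;
    }
}
```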

@msevestre
Member

> This method creates x simulations out of the original simulation, stores these simulations and reuses them for all individuals

No, it creates one per core allocated and then splits the individuals across those cores.

But at the end of the day, running 10 simulations in parallel will cost more than the time of running one sequentially. The gain comes when you are running far more simulations than the number of cores allocated.

@msevestre
Member

I don't think there will be any gain in switching back to the more naive implementation that does not use Parallel.ForEach. The problem is that in order to parallelize a simulation, you need one instance per core. There is no way around this.

I think the problem may be more in the API that we are offering than in the implementation.

@abdelr
Member

abdelr commented May 18, 2022

Maybe we need to have a short meeting for this. I think we are talking about the same thing here, just phrasing it differently. If you try to run 10 simulations on 10 cores you gain almost nothing (if anything). The gain is indeed when you try to run 100 sims on 10 cores, and I agree there will not be a big gain from reverting back. I am just stressing the fact that the PopulationRunner also does the initialization step.

@abdelr
Member

abdelr commented May 18, 2022

I think we need to rethink what the real use case is here. Is it clear or not that we want to reuse the batches?

@msevestre
Member

In fact, you need to allocate fewer cores than simulations.
So in your example, Pavel, you need to allocate 2-3 cores to run 10 simulations.

@msevestre
Member

Yes, I agree with you, reuse of a batch is even more advanced. But if you just want to run 100 variations of the same simulation, you can create one batch with 100 runs. On 10 cores, you will see a serious speed increase.

@msevestre
Member

At any rate, we should move this to a discussion, because we are polluting this entry, which is still valid, for nothing. Sorry 😔
