
ConcurrentSimulationRunner.RunConcurrently: execution time does not scale with the number of cores as expected #1455

Open
Yuri05 opened this issue Jan 22, 2022 · 65 comments

@Yuri05
Member

Yuri05 commented Jan 22, 2022

Attached is a small example in R, which loads N copies of the same simulation and executes runSimulations (which calls ConcurrentSimulationRunner.RunConcurrently in OSPSuite.Core).
testRunSimulations.zip

I have executed it on a machine with 8 cores (so by default, 7 cores are used).
The expectation would be that for any number of simulations between 1 and 7, the execution time should be approximately the same. However, the time increases with every additional simulation:

| Number of simulations | Time (runSimulations) [sec] | Expected speedup factor vs. 1 core | Real speedup factor |
| --- | --- | --- | --- |
| 1 | 18.27 | 1 | 1 |
| 2 | 22.03 | 2 | 1.66 |
| 3 | 28.88 | 3 | 1.90 |
| 4 | 36.48 | 4 | 2.00 |
| 5 | 44.17 | 5 | 2.07 |
| 6 | 51.43 | 6 | 2.13 |
| 7 | 59.22 | 7 | 2.16 |
| ... | ... | ... | ... |
| 21 | 182.61 | 7 | 2.10 |
[screenshot of the timing results]
@Yuri05
Member Author

Yuri05 commented Jan 22, 2022

@abdullahhamadeh @pchelle FYI

@msevestre
Member

We should try to reproduce this in .NET only, to remove any intermediate layers.

@Yuri05
Member Author

Yuri05 commented Jan 22, 2022

Can you reproduce it in R first?
I just tried something different: instead of using parallelism in .NET, I tried parallelism in R (parallel, doParallel, foreach):
testRunSimulations_parallel_2.zip
And the result was very similar to the previous one: the speedup was much less than expected.

| Number of simulations | Load simulations [s] | Load and run simulations [s] | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- | --- | --- |
| 1 | 10.3 | 31.4 | 21.1 | 1 |
| 4 | 12.8 | 51.8 | 39.0 | 2.2 |
| 7 | 17.4 | 75.8 | 58.4 | 2.5 |

@Yuri05
Member Author

Yuri05 commented Jan 23, 2022

The previous tests were on a machine with 8 logical cores, but only 4 physical cores.
Now I tried it (still in R) on another machine with 16 logical cores and 16 physical cores (theoretically; it's a virtual machine in the cloud, so the real underlying hardware is unknown).

Using R parallelism, I get a speedup factor of 7 on 8 cores and a speedup factor of 12 on 15 cores:

| Number of simulations | Load simulations [s] | Load and run simulations [s] | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- | --- | --- |
| 1 | 5.9 | 27.8 | 21.9 | 1 |
| 8 | 6.3 | 31.0 | 24.6 | 7.1 |
| 15 | 7.1 | 34.5 | 27.4 | 12.0 |

However, using the .NET parallelism, I get speedup factors of only between 4 and 5 in both cases:

| Number of simulations | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- |
| 1 | 22.6 | 1 |
| 8 | 43.1 | 4.2 |
| 15 | 73.9 | 4.6 |

@msevestre
Member

I am not sure what to make of this. Are only the physical cores actually performing work?

@msevestre
Member

It feels to me that we should be able to reproduce this in .NET with the same simulation.

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

OK, I tried it with ospsuite-R 10.0.72.
It now looks a bit better in .NET compared to ospsuite-R 11, but it is still slower than parallelizing in R.

.NET parallelism

| Number of simulations | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- |
| 1 | 20.4 | 1 |
| 4 | 24.2 | 3.4 |
| 8 | 31.3 | 5.2 |
| 15 | 46.6 | 6.6 |

R parallelism

| Number of simulations | Load simulations [s] | Load and run simulations [s] | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- | --- | --- |
| 1 | 5.8 | 25.6 | 19.8 | 1 |
| 4 | 5.8 | 26.4 | 20.6 | 3.9 |
| 8 | 6.5 | 27.9 | 21.4 | 7.4 |
| 15 | 6.5 | 32.4 | 26.0 | 11.4 |

@msevestre
Member

So I tried with a .NET console application. I am getting similar results in .NET so this is not an R issue
(@Yuri05 please create your table :))

Loading 1 simulation in 00m:02s:716ms
Running 1 in 00m:11s:991ms
Memory usage at the end: 160MB

Loading 4 simulation in 00m:10s:748ms
Running 4 in 00m:21s:822ms
Memory usage at the end: 460MB

Loading 4 simulation in 00m:10s:553ms
Running 4 in 00m:17s:636ms
Memory usage at the end: 460MB

Loading 8 simulation in 00m:23s:930ms
Running 8 in 00m:28s:214ms
Memory usage at the end: 880MB

Loading 16 simulation in 00m:46s:125ms
Running 16 in 01m:03s:592ms
Memory usage at the end: 1.6GB

The leak can also be seen in .NET, it seems. So this should be easy to spot using the profiler.

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

@msevestre Nice! This should make the profiling much easier I hope :)

| Number of simulations | Run simulations [s] | Speedup (run sim) |
| --- | --- | --- |
| 1 | 12.0 | 1 |
| 4 | 19.7 | 2.4 |
| 8 | 28.2 | 3.4 |
| 16 | 63.6 | 3.0 |

@msevestre
Member

Investigating

[memory profiler screenshot]

The gray area is C++. It is completely freed

@msevestre
Member

However, a forced GC removed everything...
[memory profiler screenshot after forced GC]

We might not have a leak...
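
For anyone reproducing this, here is a minimal sketch of that kind of check in plain .NET (an assumed harness, not the profiler session used above): force a full blocking collection and see whether the managed memory drops back to the baseline.

```csharp
// Minimal sketch (assumed harness): distinguish a real leak from memory that simply
// has not been collected yet by forcing a full, blocking GC.
long before = GC.GetTotalMemory(forceFullCollection: false);

GC.Collect();
GC.WaitForPendingFinalizers(); // let finalizers release native (C++/SimModel) resources
GC.Collect();                  // collect anything freed by those finalizers

long after = GC.GetTotalMemory(forceFullCollection: true);
Console.WriteLine($"Managed memory: {before / (1024 * 1024)} MB -> {after / (1024 * 1024)} MB");
// If "after" returns to the baseline, the memory was merely uncollected, not leaked.
```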

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

So maybe the objects are not disposed only when calling from R, e.g. due to references held in rClr?

@msevestre
Member

[memory profiler screenshot]

@msevestre
Member

I am wondering if we are clearing the ConcurrentSimulationRunner... I need to check this.
This is right before GC... After GC it's all gone.

@msevestre
Member

Yep. Dispose is not called when calling runSimulations...
That does not explain the speed issue, but potentially some of the RAM issue.

@msevestre
Member

With the fix, the leak is gone:
[memory profiler screenshot after the fix]

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

Yesss!
[memory usage screenshot]

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

P.S. To achieve this I still need to perform the cleanup at the end and make sure that all R variables which hold .NET pointers are removed as well.
If I deactivate e.g. the R GC or the .NET GC, the memory remains occupied (it will probably be released at some point).

  rm(loadedSimulations)                       # drop the R references to the .NET simulation objects
  resetSimulationCache()                      # clear the ospsuite-internal simulation cache
  gc(verbose = FALSE)                         # let R finalize the rClr external pointers
  rClr::clrCallStatic("System.GC", "Collect") # force a .NET garbage collection

@msevestre
Member

The time profile is a weirder issue. It looks like functions just get slower as the number of parallel tasks executing increases, which does not make sense.

Here I am starting 12 simulations. I can see that 12 threads are started (12 + 1 for the main thread).
[profiler thread view screenshot]

The time spent on each thread is almost the same, and one can see that, for instance, on a given thread LoadFromXmlString takes 10 sec! (SimModel code)

The same code takes 5 sec when running only 2 simulations.
[profiler screenshot]

With 8 simulations, 7.7 sec.
[profiler screenshot]

With only 1 simulation, 4.6 sec.
[profiler screenshot]

This does not make sense to me at all. We are creating the right number of threads, and yet the execution time on each thread gets slower... WHAT?

This can be observed for ALL SimModel calls. The other tasks are insignificant in comparison.

@msevestre
Member

I tried to remove the parallel foreach (Parallel.ForEach) and use

 var tasks = _simulations.Select(x => Task.Run(()=> runSimulation(x, _cancellationTokenSource.Token)));

instead.
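
For reference, here is a standalone sketch of the two scheduling strategies being compared; the delegate-based representation of a simulation run is a placeholder, not the actual ConcurrencyManager code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class SchedulingSketch
{
   // Strategy 1: Parallel.ForEach with a capped degree of parallelism.
   public static void RunWithParallelForEach(IEnumerable<Action<CancellationToken>> runs, int cores, CancellationToken token)
   {
      var options = new ParallelOptions { MaxDegreeOfParallelism = cores, CancellationToken = token };
      Parallel.ForEach(runs, options, run => run(token));
   }

   // Strategy 2: one Task.Run per simulation; the thread pool decides how many execute at once.
   public static Task RunWithTasksAsync(IEnumerable<Action<CancellationToken>> runs, CancellationToken token)
   {
      // ToList() materializes the query so all tasks are actually started before awaiting.
      var tasks = runs.Select(run => Task.Run(() => run(token), token)).ToList();
      return Task.WhenAll(tasks);
   }
}
```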

Here is the outcome for 12:
[profiler screenshot]

@abdelr @Yuri05 Can we have a chat tomorrow to discuss this?

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

Hmm, that's really weird. No explanation at the moment.

@Yuri05
Member Author

Yuri05 commented Jan 24, 2022

> Can we have a chat tomorrow to discuss this?

sure :)

@abdelr
Member

abdelr commented Jan 24, 2022

Sure ;) We'll talk, but it is indeed strange. Are the threads locking each other somehow?

@msevestre
Member

> Are the threads locking each other somehow?

Nope, absolutely not. At least not in theory.

@msevestre
Member

Can you try to specify a simulationRunOptions explicitly?

@PavelBal
Member

PavelBal commented May 17, 2022

> Can you try to specify a simulationRunOptions explicitly?

OK, wow, yes. So if we do not provide a simulationRunOptions to the rClr call, .NET probably assumes only one core? I can create a PR in R for this.

Now, the runtime of simulation batches is more or less the same as running multiple simulations in parallel. So there is actually no gain from initializing only once?

| Number of simulations | Run simulations [s] | Run simulation batch [s] |
| --- | --- | --- |
| 1 | 16.27 | 16.98 |
| 2 | 18.02 | 19.5 |
| 4 | 21.23 | 23.62 |
| 16 | 43.29 | 48.93 |
| 32 | 93.11 | 94.64 |

@msevestre
Member

It should be much faster, as you are only initializing your simulation once.

@abdelr
Member

abdelr commented May 17, 2022

> OK, wow, yes. So if we do not provide a simulationRunOptions to the rClr call, .NET probably assumes only one core? I can create a PR in R for this.

It does: `private int numberOfCoresToUse() => SimulationRunOptions?.NumberOfCoresToUse ?? 1;`
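
A rough sketch of what passing the options explicitly could look like on the .NET side; apart from SimulationRunOptions, NumberOfCoresToUse and RunConcurrently, which appear in this thread, the member names below are assumptions and would need to be checked against the actual OSPSuite.Core API:

```csharp
// Sketch only: AddSimulation and the exact shape of RunConcurrently are assumed here.
var runOptions = new SimulationRunOptions
{
   NumberOfCoresToUse = Environment.ProcessorCount - 1 // leave one core for the caller
};

var runner = new ConcurrentSimulationRunner { SimulationRunOptions = runOptions };
foreach (var simulation in simulations)
   runner.AddSimulation(simulation);

// Without an explicit SimulationRunOptions, numberOfCoresToUse() falls back to 1 (see above).
var results = runner.RunConcurrently();
```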

@abdelr
Member

abdelr commented May 17, 2022

> Now, the runtime of simulation batches is more or less the same as running multiple simulations in parallel. So there is actually no gain from initializing only once?

Did we not discuss this before, when @msevestre found out that the batch was not properly used? I remember there was an issue from some external user about it.

@msevestre
Member

I am pretty sure we solved all of this. @abdelr we updated the code of ConcurrencyManager in Core a million times. Now it's using Parallel.ForEach and we assumed (maybe wrongly) that it would be better than your initial implementation.

Can you try to create a version of the DLL that we could try locally using your old code? We should be able to simply update the ConcurrencyManager, as the API was more or less consistent.

@msevestre
Member

No, forget about this. This is the same user issue again.
We allocate a simulation slot for each simulation run. Next time you run the variation again, it will be super fast.

So in your code, Pavel, do:

  tic()
  for (i in 1:numberOfSimulationToRun) {
    simBatch$addRunValues(70)
  }
  toc()

  print("Running simulations")
  tic()
  runSimulationBatches(simulationBatches = simBatch)
  toc()

  tic()
  for (i in 1:numberOfSimulationToRun) {
    simBatch$addRunValues(70)
  }
  toc()

  print("Running simulations again")
  tic()
  runSimulationBatches(simulationBatches = simBatch)
  toc()

The second run will be much, much faster.

@msevestre
Member

But effectively, if you are not batching anything, you are just creating simulations for each run and not reusing them again, which is not efficient at all. If you want to do this, you should use the old SimulationBatch (do we still have it in R?).

@msevestre
Member

Correction: it's gone.

@msevestre
Member

On my machine:
[1] "Running simulations"
3.7 sec elapsed
[1] "Running simulations again"
0.08 sec elapsed

@PavelBal
Member

Ok, then I just do not understand how to use it.

I have one simulation and I want to run it with 10 different parameter sets in parallel, so basically like a population simulation. How can I do it? The way you proposed it, it will run sequentially.

@abdelr
Member

abdelr commented May 17, 2022

> Ok, then I just do not understand how to use it.
>
> I have one simulation and I want to run it with 10 different parameter sets in parallel, so basically like a population simulation. How can I do it? The way you proposed it, it will run sequentially.

There is no magic solution. You do have a gain from initializing only once, but still, you need to initialize. So, if you want to run 10 simulations in parallel you still need to initialize them. Now, the gain from initializing only once is that you keep them, and the next time you need to run them again you just pass the params and that's it. So, if you are optimizing something, most likely you will need to run something, check results and run again with adjusted params... and so on... In this case it will work as @msevestre suggested. If you want to run ten simulations only once, well, you cannot use the initialize-only-once feature, right?

@msevestre
Member

Batch is not intended for this. It is meant to be used in algorithms where you want to vary the same set of parameters multiple times (think PI: set values, calculate, set new values, calculate, etc.). With this approach, it will be much faster.

What you are after is the version that I had originally implemented...

Probably, for running the same simulation with a variation of a parameter in parallel, you can create a population with only this one parameter? That would work, I believe.

@PavelBal
Member

Then adding multiple run values to a batch does not really make sense...

Well, actually I thought SimulationBatch is exactly what is behind the population simulation, and we would get the same benefit: initialize once, run in parallel.

@msevestre
Member

> Well, actually I thought SimulationBatch is exactly what is behind the population simulation, and we would get the same benefit: initialize once, run in parallel.

That's what it used to be.

@msevestre
Member

> Then adding multiple run values to a batch does not really make sense...

Yes, because it allocates the slots and does not release them. Maybe your algorithm wants to try a bunch of scenarios at once, e.g. (x, y), (x + value, y + value), etc., and decide what direction to move in based on the evaluation of multiple simulations. In this case, this makes sense.

But it is true that in most cases you will have one run. So we have lost the ability to do what you are after, I believe (at least easily).

@msevestre
Member

@PavelBal Actually, I am full of shit. This did not work like this.

Looking at the example, you could hold a batch, set values, run, set values, run, etc.:
https://github.com/Open-Systems-Pharmacology/OSPSuite-R/blob/main/tests/dev/script-simulation-batch.R

So you could not specify a list of values to run.
I am wondering now why we implemented it like that. We could have created one slot per batch and then tried to parallelize as we do for the population for each run. @abdelr do you know why we did it otherwise? I can't remember.

@PavelBal
Member

> I am wondering now why we implemented it like that. We could have created one slot per batch and then tried to parallelize as we do for the population for each run. @abdelr do you know why we did it otherwise? I can't remember.

Yes, this would be awesome, and actually what I always wanted it to do ;)

@abdelr
Member

abdelr commented May 17, 2022

> So you could not specify a list of values to run.
> I am wondering now why we implemented it like that. We could have created one slot per batch and then tried to parallelize as we do for the population for each run. @abdelr do you know why we did it otherwise? I can't remember.

So, in the original implementation, if you define one batch with different run values, the algorithm will initialize, in parallel, as many batches as needed for the parameters. Later you can run them in parallel and you can reuse them. I see no way to do it differently. So, again, you need to initialize the batch at least once. Once it is initialized you can reuse it, but the initialization needs to occur for the runner to be able to use it. So unless I am missing something, you can set different parameters on one batch and it will try to parallelize as much as possible, but it will need to initialize all the batches. Later you reuse them, but the first time there is nothing you can do about it, right?

@msevestre
Member

@abdelr For one simulation, we can use the population way of doing this, where we initialize only once. This is why the population is so fast. Theoretically this could also be done like this here.

@abdelr
Member

abdelr commented May 18, 2022

@msevestre @PavelBal: I think PopulationRunner works the same as the ConcurrentSimulationRunner. PopulationRunner has RunPopulationAsync. This method creates x simulations out of the original simulation, stores these simulations and reuses them for all individuals. The number x is decided based on the number of individuals and the number of cores. Now, the simulations are created in runSimulation using createAndFinalizeSimulation. So, all in all, the method calculates in advance "I will need 10 simulations", creates those 10 simulations (and initializes them) and stores them for reuse. Later all individuals are split across these 10 simulations, so if you had 100 individuals it will assign 10 individuals per simulation.

Now, the original implementation of the ConcurrentSimulationRunner did something similar. It did not use Parallel.ForEach; instead it calculated how many worker threads were needed and created one simulation per thread. Later it used these threads to take from a pool of tasks and run the simulations. So, the same idea. The implementation was rejected in favor of Parallel.ForEach, passing the control to the caller. This means you can reuse the simulations you create, but you are also responsible for them. Now, to mimic the same behavior we have just described with the current implementation, we would have to calculate from the caller: "I need to run 100 simulations and I have 10 cores, so I will add 10 batches and split the runs into 10, passing 10 values at a time." Alternatively we could try to revert to the original code if we manage to find it in the history of commits (maybe not easy since we squash) and lose the ability to reuse from one call of RunAsync to the next. In any case we need to agree on the use case before adding any further change.

This could also be of interest to @Yuri05, so I am tagging him so he can reflect a bit on it.
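
To make the splitting idea above concrete, here is a small self-contained sketch (not OSPSuite code; the simulation is just a placeholder) of creating one initialized instance per core and distributing all run values across those instances:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class SplitRunsSketch
{
    // Placeholder for an initialized simulation instance; initialization is the expensive part.
    public sealed class SimulationInstance
    {
        public double Run(double parameterValue) => parameterValue * 2; // stand-in for a SimModel run
    }

    public static double[] RunAll(double[] parameterValues, int numberOfCores)
    {
        // Initialize one instance per core (not per run), roughly mirroring the PopulationRunner idea.
        var instances = Enumerable.Range(0, numberOfCores).Select(_ => new SimulationInstance()).ToArray();
        var results = new double[parameterValues.Length];

        // Each worker repeatedly takes the next pending run and executes it on "its" instance.
        var pending = new ConcurrentQueue<int>(Enumerable.Range(0, parameterValues.Length));
        Parallel.For(0, numberOfCores, core =>
        {
            while (pending.TryDequeue(out var i))
                results[i] = instances[core].Run(parameterValues[i]);
        });

        return results;
    }
}
```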

@msevestre
Member

> This method creates x simulations out of the original simulation, stores these simulations and reuses them for all individuals

No, it creates one per core allocated and then splits the individuals across those cores.

But at the end of the day, running 10 simulations in parallel will cost more than the time of running one sequentially. The gain comes when you are running far more simulations than the number of cores allocated.

@msevestre
Member

I don't think there will be any gain in switching back to the more naive implementation that does not use Parallel.ForEach. The problem is that in order to parallelize a simulation, you need one instance per core. There is no way around this.

I think the problem may be more in the API that we are offering than in the implementation.

@abdelr
Member

abdelr commented May 18, 2022

Maybe we need to have a short meeting for this. I think we are talking about the same thing here, just phrasing it differently. If you try to run 10 simulations on 10 cores you gain almost nothing (if anything). The gain is indeed when you try to run 100 sims on 10 cores, and I agree there will not be a big gain from reverting back. I am just stressing the fact that the PopulationRunner also does the initialization step.

@abdelr
Member

abdelr commented May 18, 2022

I think we need to rethink what the real use case is here. Is it clear or not that we want to reuse the batches?

@msevestre
Member

In fact, you need to allocate fewer cores than simulations.
So in your example, Pavel, you need to allocate 2-3 cores to run 10 simulations.

@msevestre
Member

Yes, I agree with you, reuse of a batch is even more advanced. But if you just want to run 100 variations of the same simulation, you can create one batch with 100 runs. On 10 cores, you will see a serious speed increase.

@msevestre
Member

At any rate, we should move this to a discussion, because we are polluting this entry, which is still valid, for nothing. Sorry 😔
