ConcurrentSimulationRunner.RunConcurrently: execution time does not scale with the number of cores as expected #1455
We should try to reproduce in .NET only, to remove any intermediate layers.
Can you reproduce it in R first?
The previous tests were on a machine with 8 logical cores but only 4 physical cores. Using the R parallelism, I get a speedup factor of 7 on 8 cores and a speedup factor of 12 on 15 cores.
However, using the .NET parallelism, I get in both cases only speedup factors between 4 and 5.
I am not sure what to make of this. Are only the physical cores actually performing work?
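For reference, the speedup factors quoted in this thread are simply sequential wall time divided by parallel wall time. A small Python sketch with made-up illustrative timings (the times are not measurements from this issue, only the resulting factors match what was reported):

```python
# Hypothetical sketch: how the quoted speedup factors relate to core counts.
# The timings below are invented for illustration; only the resulting
# factors (~7 for R parallelism, ~4.5 for .NET) mirror the observations.

def speedup(sequential_s: float, parallel_s: float) -> float:
    """Speedup factor = sequential wall time / parallel wall time."""
    return sequential_s / parallel_s

# With 8 logical cores, near-linear scaling gives a factor close to 8;
# if only the 4 physical cores do real work, it saturates around 4-5.
near_linear = speedup(80.0, 11.4)
saturated = speedup(80.0, 17.8)

print(round(near_linear, 1))  # 7.0
print(round(saturated, 1))    # 4.5
```

A factor of ~7 on 8 logical cores is consistent with hyper-threaded cores contributing, while saturation at 4–5 would point to only the physical cores (or some shared resource) doing the work.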
It feels to me that we should be able to reproduce this in .NET with the same simulation.
OK, tried it with ospsuite-R 10.0.72.
.NET parallelism:
R parallelism:
So I tried with a .NET console application. I am getting similar results in .NET, so this is not an R issue:
Loading 1 simulation in 00m:02s:716ms
Loading 4 simulation in 00m:10s:748ms
Loading 4 simulation in 00m:10s:553ms
Loading 8 simulation in 00m:23s:930ms
Loading 16 simulation in 00m:46s:125ms
The leak can also be seen in .NET, it seems. So this should be easy to spot using the profiler.
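If the loads really ran in parallel across the available cores, the wall time should grow roughly with ceil(n / workers) rather than linearly with n. A minimal Python sketch of that expectation (illustrative only; the ~2.7 s per load is taken from the first measurement above):

```python
import math

def expected_wall_time(n_tasks: int, n_workers: int, seconds_per_task: float) -> float:
    """Ideal wall time when n_tasks of equal cost run on n_workers in parallel."""
    return math.ceil(n_tasks / n_workers) * seconds_per_task

# With 7 workers and ~2.7 s per load, 1..7 loads should all take about one load:
print(expected_wall_time(1, 7, 2.7))  # 2.7
print(expected_wall_time(4, 7, 2.7))  # 2.7
print(expected_wall_time(8, 7, 2.7))  # 5.4
```

The observed times instead grow linearly with n (roughly n × 2.7 s), which suggests the work is effectively serialized, or that each extra simulation makes some shared resource (or leak) more expensive.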
@msevestre Nice! This should make the profiling much easier I hope :)
So maybe the objects are not disposed only when calling from R, e.g. due to references in rClr?
I am wondering if we are clearing the ConcurrentSimulationRunner... I need to check this.
Yep. Dispose is not called when calling runSimulations...
P.S. To achieve this, I still need to perform the cleanup at the end and make sure that all R variables which hold .NET pointers are removed as well.
Hmm, that's really weird. No explanation at the moment.
Sure :)
Sure ;) we can talk, but it is indeed strange. Are the threads locking each other somehow?
Nope, absolutely not. At least not in theory.
Can you try to specify a simulationRunOptions explicitly?
Ok wow, yes. So if we do not provide a simulationRunOptions... Now, the runtime of simulation batches is more or less the same as running multiple simulations in parallel. So actually no gain from initializing only once?
It should be much faster, as you are only initializing your simulation once.
It does.
Did we not discuss this before, when @msevestre found out that the batch was not properly used? I remember there was an issue from some external user about it.
I am pretty sure we solved all of this. @abdelr, we updated the code. Can you try to create a version of the DLL that we could try locally using your old code? We should be able to simply update the ConcurrencyManager, as the API was more or less consistent.
No, forget about this. This is the same. So in your code, Pavel: run it twice, and
the second run will be much, much faster.
But effectively, if you are not batching anything, you are just creating simulations for each run and not reusing them again, which is not efficient at all. If you want to do this, you should use the old SimulationBatch (do we still have it in R?)
Correction: it's gone.
On my machine.
Ok, then I just do not understand how to use it. I have one simulation, and I want to run it with 10 different parameter sets in parallel, so basically like a population simulation. How can I do it? The way you proposed, it will run sequentially.
There is no magic solution. You do have a gain from initializing only once, but still, you need to initialize. So, if you want to run 10 simulations in parallel, you still need to initialize them. Now, the gain in initializing only once is that you keep them, and next time you need to run them again you just pass the params and that's it. So, if you are optimizing something, most likely you will need to run something, check results and run again with adjusted params... and so on... In this case it will work as @msevestre suggested. If you want to run ten simulations only once, well, you cannot use the initialize-only-once feature, right?
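The trade-off described here (pay initialization once, then pass only new parameter values on subsequent runs) can be sketched in Python. The class and method names are hypothetical stand-ins, not the ospsuite API:

```python
class SimulationBatchSketch:
    """Hypothetical sketch of the initialize-once / run-many pattern."""
    init_count = 0  # class-wide counter, just to make the setup cost visible

    def __init__(self, model: str):
        # Expensive one-time setup (in the real library: loading and
        # finalizing the simulation). Happens exactly once per batch.
        SimulationBatchSketch.init_count += 1
        self.model = model

    def run(self, params: dict) -> dict:
        # Cheap per-run step: only the parameter values change.
        return {"model": self.model, **params}

batch = SimulationBatchSketch("Aciclovir")
results = [batch.run({"Dose": d}) for d in (100, 200, 400)]
print(SimulationBatchSketch.init_count)  # 1 -> three runs, one initialization
```

This is exactly why the pattern pays off in optimization loops (set values, run, inspect, repeat) but buys nothing when each simulation is run only once.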
Batch is not intended for this. It is meant to be used in algorithms where you want to vary the same set of parameters multiple times (think parameter identification: set values, calculate, set new values, calculate, etc.). With this approach, it will be much faster. What you are after is the version that I had originally implemented. Probably, for running the same simulation with a variation of one parameter in parallel, you can create a population with only this one parameter? That would work, I believe.
Then adding multiple run values to a batch does not really make sense... Well, actually I thought SimulationBatch is exactly what is behind population simulation, and we would get the same benefit: initialize once, run in parallel.
That's what it used to be.
Yes, because it allocates the slots and does not release them. Maybe your algorithm wants to try a bunch of scenarios at once, (x, y), (x+value, y+value), etc., and decide what direction to move in based on the evaluation of multiple simulations. In this case, this makes sense. But it is true that in most cases you will have one run. So we have lost the ability to do what you are after, I believe (at least easily).
@PavelBal Actually, I was wrong. This did not work like this. Looking at the example, you could hold a batch, set a value, run, set a value, run, etc. So you could not specify a list of values to run.
Yes, this would be awesome, and actually what I always wanted it to do ;)
So, in the original implementation, if you define one batch with different run values, the algorithm will, in parallel, initialize as many batches as needed for the parameters. Later you can run them in parallel and you can reuse them. I see no way to do it differently. So, again, you need to initialize the batch at least once. Once it is initialized you can reuse it, but the initialization needs to occur for the runner to be able to use it. So unless I am missing something, you can set different parameters on one batch and it will try to parallelize as much as possible, but it will need to initialize all the batches. Later you reuse them, but the first time there is nothing to do about it, right?
@abdelr For one simulation, we can use the population way of doing this, where we initialize only once. This is why the population run is so fast. Theoretically this could also be done like this.
@msevestre @PavelBal: I think PopulationRunner works the same as the ConcurrentSimulationRunner. PopulationRunner has RunPopulationAsync. This method creates x simulations out of the original simulation, stores these simulations and reuses them for all individuals. The number x is decided based on the number of individuals and the number of cores. The simulations are created in runSimulation using createAndFinalizeSimulation. So, all in all, the method calculates in advance "I will need 10 simulations", creates those 10 simulations (and initializes them) and stores them for reuse. Then all individuals are split across these 10 simulations, so if you had 100 individuals it would assign 10 individuals per simulation.

Now, the original implementation of the ConcurrentSimulationRunner did something similar. It did not use Parallel.ForEach; instead it calculated how many worker threads were needed and created one simulation per thread. Those threads then took work from a pool of tasks and ran the simulations. So, the same idea. That implementation was rejected in favor of Parallel.ForEach, passing control to the caller. This means you can reuse the simulations you create, but you are also responsible for doing so.

To mimic the behavior we have just described with the current implementation, we would have to calculate from the caller: "I need to run 100 simulations and I have 10 cores, so I will add 10 batches and split the runs in 10, passing 10 values at a time." Alternatively, we could try to revert to the original code, if we manage to find it in the commit history (maybe not easy since we squash), and lose the ability to reuse from one call of RunAsync to the next. In any case, we need to agree on the use case before making any further change. This could also be of interest to @Yuri05, so I am tagging him so he can reflect a bit on it.
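The splitting strategy described for PopulationRunner (create min(cores, n) initialized simulations, then distribute the individuals across them) can be sketched as follows; names are illustrative only, not the OSPSuite.Core API:

```python
def split_individuals(n_individuals: int, n_cores: int) -> list[list[int]]:
    """Assign individual indices round-robin to min(n_cores, n_individuals) slots.

    Each slot corresponds to one pre-initialized simulation instance that
    is reused for every individual assigned to it.
    """
    n_slots = min(n_cores, n_individuals)
    slots = [[] for _ in range(n_slots)]
    for i in range(n_individuals):
        slots[i % n_slots].append(i)
    return slots

# 100 individuals on 10 cores -> 10 pre-initialized simulations, 10 runs each:
assignment = split_individuals(100, 10)
print(len(assignment))     # 10
print(len(assignment[0]))  # 10
```

Note that only min(cores, n) simulations are ever initialized, which is why the per-simulation initialization cost is amortized as soon as the number of individuals exceeds the core count.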
No, it creates one per allocated core and then splits the individuals across those cores. But at the end of the day, running 10 simulations in parallel will cost more than the running time of one simulation. The gain comes when you are running many more simulations than the number of cores allocated.
I don't think there will be any gain in switching back to the more naive implementation that does not use Parallel.ForEach. The problem is that in order to parallelize a simulation, you need one instance per core. No way around this. I think the problem may be more in the API that we are offering, as opposed to the implementation.
Maybe we need a short meeting for this. I think we are talking about the same thing here, just phrasing it differently. If you try to run 10 simulations on 10 cores, you gain almost nothing (if anything). The gain is indeed when you try to run 100 sims on 10 cores, and I agree there would not be a big gain in reverting back. I am just stressing the fact that the PopulationRunner also does the initialization step.
I think we need to rethink what the real use case is here. Is it clear whether or not we want to reuse the batches?
In fact, you need to allocate fewer cores than simulations.
Yes, I agree with you; reuse of a batch is even more advanced. But if you just want to run 100 variations of the same simulation, you can create one batch with 100 runs. On 10 cores, you will see a serious speed increase.
At any rate, we should move this to a discussion, because we are polluting this issue, which is still valid, for nothing. Sorry 😔
Attached is a small example in R, which loads N copies of the same simulation and executes runSimulations (which calls ConcurrentSimulationRunner.RunConcurrently in OSPSuite.Core): testRunSimulations.zip
I have executed it on a machine with 8 cores (thus, as per default, 7 cores are used).
The expectation would be: when executing any number of simulations between 1 and 7, the execution time should be approximately the same. However, the time increases with every additional simulation:
(Screenshot: execution times for increasing numbers of simulations.)