Multiprocess issue #468
@darrencl I think you're right that the issue is with the PBS cluster scheduler. Some processes were interrupted by the cluster (maybe because other users were using it). I am not familiar with PBS clusters, but you could check out the ClusterManagers package for more info.
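For example, something along these lines might let the queue itself manage the workers. This is only a rough sketch: I'm taking `addprocs_pbs` and its `qsub_flags` keyword from the ClusterManagers README, so check them against your installed version.

```julia
# Rough sketch only: launching Julia workers through a PBS queue with
# ClusterManagers.jl, so the scheduler itself allocates them.
# `addprocs_pbs` and the `qsub_flags` keyword are assumed from the
# ClusterManagers.jl README; verify against your installed version.
using Distributed, ClusterManagers

addprocs_pbs(8; qsub_flags="-l walltime=24:00:00")  # ask PBS for 8 worker processes
@everywhere using MLJ                                # load MLJ on every worker
println("workers available: ", nworkers())
```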
@OkonSamuel Thanks! My pipeline takes a long time to train because the pre-processing happens inside the cross-validation, in order to tune a pre-processing hyper-parameter. This is obviously even slower when the model itself also has hyper-parameters to tune. It can take a few days to train the pipeline (maybe partly due to old hardware and the shared, scheduled system). In deep-net frameworks like Keras we can stop and resume training, so similarly I'm wondering: if my script breaks in the middle, is there any workaround to let the pipeline resume gracefully?
The plan, which is yet to be implemented, is to provide a model wrapper
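In the meantime, a crude workaround (just a sketch, not an MLJ feature) is to snapshot the fitted machine yourself with the standard Serialization library, so a restarted script can skip retraining. The helper name and file path below are made up for illustration, and a deserialized machine is only as portable as the structs it contains, so treat this as best-effort.

```julia
# Best-effort checkpoint sketch using only the Serialization stdlib.
# `fit_or_restore!` and the file path are illustrative, not MLJ API.
using MLJ, Serialization

function fit_or_restore!(mach, path)
    if isfile(path)
        return deserialize(path)   # resume from the last snapshot
    end
    fit!(mach)                     # train from scratch
    serialize(path, mach)          # checkpoint the fitted machine
    return mach
end
```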
Closing as the original issue seems not to be MLJ related. The other discussion could continue at #139.
Hi @ablaom @OkonSamuel, I have a question about multiprocessed logs when using
Thanks!
@darrencl I think what's happening here is that the number of models to be evaluated during tuning is far greater than the number of processes available to Julia (which is
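If you have spare cores, you can give Julia more worker processes before fitting; a minimal sketch:

```julia
# Minimal sketch: add worker processes so tuning has more to distribute over.
# Equivalently, start Julia with `julia -p N`.
using Distributed

println("workers before: ", nworkers())
addprocs(4)               # spawn 4 extra worker processes
@everywhere using MLJ     # make MLJ available on every worker
println("workers after: ", nworkers())
```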
@OkonSamuel Ahh, I see, fair enough. So for evaluating a lot of models (because there are a lot of hyper-parameters to tune), are there any other alternatives to speed the process up? Anyway, I am on Ubuntu 18.04 and tried to use only
Also, looking at the logs produced during resampling (5-fold with 5 CPUThreads), it seems the folds are still processed serially. If it weren't serial, the 3-seconds log should have come first, then 6 s, and so on. So I am not sure how this helps in terms of optimizing the performance?
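For reference, this is roughly how I'm wiring up the parallelism. It's only a sketch: `model`, `ranges`, `X`, and `y` stand in for my actual pipeline, hyper-parameter ranges, and data, and I'm assuming the `acceleration` / `acceleration_resampling` keywords of `TunedModel`.

```julia
# Sketch of where the two acceleration settings go; `model`, `ranges`,
# `X`, and `y` are placeholders for my actual pipeline, ranges, and data.
using MLJ

tuned = TunedModel(model=model,
                   tuning=Grid(resolution=10),
                   resampling=CV(nfolds=5),
                   range=ranges,
                   acceleration=MLJ.CPUProcesses(),            # distribute candidate models over workers
                   acceleration_resampling=MLJ.CPUThreads())   # thread the CV folds for each candidate
mach = machine(tuned, X, y)
fit!(mach)
```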
Yes, you're right, this is done serially. To use
Both multithreading and multicore are forms of parallelism. Multithreading is shared-memory parallelism (the same Julia process making use of multiple CPUs and sharing memory), while multicore does not share memory (each distinct Julia process runs on a different CPU). There is no single best choice; both have their use cases.
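To make the distinction concrete, a rough, non-MLJ illustration:

```julia
# Rough, non-MLJ illustration of the two kinds of parallelism.
using Distributed

# Shared-memory threading: one Julia process, several threads writing into
# the same array (requires starting Julia with, e.g., `julia -t 4`).
results = zeros(8)
Threads.@threads for i in 1:8
    results[i] = i^2
end

# Multiprocessing: separate worker processes with no shared memory; inputs
# are shipped to the workers and results are copied back.
addprocs(2)
pmap(i -> i^2, 1:8)
```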
I don't think there is an alternative other than having more cores. But using
@OkonSamuel If I get this correctly, BLAS threads are used for computational operations such as matrix algebra, while Julia threads are general-purpose? If I were to set this on a cluster, is there any way I can do it in code instead of going into the worker nodes and setting up
I see, so spawning 5 processes with 4 threads (i.e.
YES.
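For example, the two settings live in different places:

```julia
# BLAS threads (used inside linear-algebra kernels) can be changed at runtime;
# Julia's own threads are fixed when the process starts (JULIA_NUM_THREADS / -t).
using LinearAlgebra

BLAS.set_num_threads(1)         # threads used by BLAS for matrix operations
println(Threads.nthreads())     # Julia threads available to this process
```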
No. For now it has to be set up using the
Yes. This spawns 4 threads per process, making a total of 20 threads. Therefore having more CPUs would be more effective.
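If you want to confirm what each worker actually got (assuming `JULIA_NUM_THREADS` was exported before the workers were launched), something like this works from the master process:

```julia
# Report the thread count on every worker process.
using Distributed

for w in workers()
    n = fetch(@spawnat w Threads.nthreads())
    println("worker $w has $n threads")
end
```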
Hi, I'm not sure if I should open this in Distributed.jl or here. Sometimes when I train my pipeline with
```julia
acceleration=MLJ.CPUProcesses(), acceleration_resampling=MLJ.CPUThreads()
```
I got the following error (the logs are mixed up with those of the other processes, so you can ignore the irrelevant ones). I am using 64 processes with 64 threads. I'm not sure what's happening here, since it doesn't always happen even though the code is the same. Might it be an issue with my HPC's CPU scheduling (I am using a PBS cluster)?