[CLI] Socket-based distributed training fails #6117
Thanks for using LightGBM. I'm confused... what specific problem are you reporting? The issue title suggests that you're concerned about the warnings shown in machine 2's logs, but what I see there is that it did eventually find the other worker and successfully form the network. Did you omit some logs for machine 1? Or are you saying that something failed after that point? We need more details to help you.
Thanks for the feedback @jameslamb. This is the full log; machine 2 exits the process and no training takes place. As I recall it was working a year ago, so I looked at my backup and found an old lightgbm.exe version. I copied it over and it worked right away. I'm using this example to simplify my testing: https://github.com/microsoft/lightgbm/tree/master/examples/parallel_learning
Old code from 2022 (note: I waited 1 min before starting the second LightGBM on the 2nd machine)
New code 2023
It seems there is something fishy with the latest code concerning distributed training with sockets. Thanks for your help, Wil
Note: I also ran with `tree_learner = serial` using both versions of LightGBM (v2022_10 and latest 2023), and they both completed and gave the same result, which is great!
Hi @jameslamb, I have the same issue. @wil70, have you found a solution other than using the old version?
Linking #5159, since it changed the socket configuration on Windows. That's the only change to this area that I remember. @shiyu1994, do you think that could cause the problem here?
@jmoralez I'm using a macOS machine and a Linux one for training.
@jacopocecch Right now I'm using the old version; I will pull the latest code and try again.
@jacopocecch I compiled the latest code from today on Windows and it seems to work well. I also checked the model by comparing the result to the Oct 2022 one, and it is the same, so all good! :)) Some numbers with the parallel_learning_example from git:
for serial
for feature
for data
for voting
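For reference, those four modes correspond to the `tree_learner` values documented in the LightGBM parameters guide. A sketch of the relevant config line (the actual config files used in these runs were not included in the thread):

```
# train.conf excerpt: tree_learner selects the parallelization strategy
tree_learner = serial    # single machine, no network
# tree_learner = feature  # feature-parallel distributed training
# tree_learner = data     # data-parallel distributed training
# tree_learner = voting   # voting-based data-parallel training
```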
Note: take those results with a grain of salt, as my machines were already (consistently) busy, so I took the slowest score for 'serial'; I might redo this test once the machines are done with their current work.
Note: for info, the MPI version doesn't compile with the latest code on Windows; it is missing the mpi.h file. I'm not using MPI as of now, since I'm using the socket version. Would the MPI version be much faster than sockets?
Thanks @jameslamb @jmoralez @jacopocecch and @shiyu1994! We can close this thread.
Not necessarily. The MPI version exists specifically for users who are working in an environment where MPI programs are the main supported pattern for analytical workloads, e.g. for those using Slurm as a workload manager (docs). So it is there to allow LightGBM training to look and feel like users' other workloads... "be faster than other approaches" is not one of its design goals.
We'd welcome a separate bug report with a specific, minimal, reproducible example describing exactly what you're talking about. If you choose to open that, please provide evidence that it's a bug in LightGBM and not just that you've installed the MPI headers in a non-standard location or failed to install them at all.
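A side note on the missing mpi.h: per the LightGBM installation guide, the MPI build requires an MPI implementation (e.g. MS-MPI on Windows, including its SDK) to be installed before configuring; the header is not shipped with LightGBM itself. The documented build flag is roughly the following (a sketch based on the installation docs, not the exact commands from this thread):

```
# sketch: building the MPI variant of the CLI on Windows (requires MS-MPI + SDK)
mkdir build
cd build
cmake -A x64 -DUSE_MPI=ON ..
cmake --build . --target ALL_BUILD --config Release
```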
Great, we will close this.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues, including a reference to this one.
Hello,
I'm trying to run distributed training with 2 machines.
I'm using the LightGBM CLI with the socket-based distributed option.
I compiled the latest C++ code as of today.
The config on both machines looks like this:
The mlist.txt looks like this:
I opened the firewall for private and domain networks on both machines.
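(The actual config and machine list were not captured in this copy of the issue. For context, a minimal socket-based setup in the style of the parallel_learning example would look roughly like this; all values below are illustrative, not the reporter's:)

```
# train.conf sketch (adjust data paths, ports, and IPs to your setup)
task = train
objective = binary
data = binary.train
tree_learner = data          # data-parallel training over sockets
num_machines = 2
machine_list_file = mlist.txt
local_listen_port = 12400
```

```
# mlist.txt sketch: one "ip port" pair per machine
192.168.1.10 12400
192.168.1.11 12400
```

Each machine then runs the same binary against its config, e.g. `lightgbm.exe config=train.conf`.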
Machine 1 output
Machine 2 output
Any idea what I should check?
Thanks for your help,
Wil