
[CLI] Socket-based distributed training fails #6117

Closed

wil70 opened this issue Sep 29, 2023 · 10 comments

wil70 commented Sep 29, 2023

Hello,

I'm trying to run distributed training with 2 machines.
I'm using the LightGBM CLI with the socket-based distributed option.
I compiled the latest C++ code as of today.

The config on both machines looks like this:

...
tree_learner=voting

# number of machines in parallel training, alias: num_machine
num_machines = 2

# local listening port in parallel training, alias: local_port
local_listen_port = 12400

# machines list file for parallel training, alias: mlist
machine_list_file = mlist.txt

The mlist.txt looks like this:

machine_1_ip 12400
machine_2_ip 12400
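
Each machine then starts training the same way, following the parallel learning example (train.conf here stands for the distributed config above):

lightgbm.exe config=train.conf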

I opened the firewall for the private and domain profiles on both machines.
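
(As a quick sanity check that the port is actually reachable between the machines — Test-NetConnection is the stock PowerShell cmdlet; the host name is a placeholder:)

# run on machine 1; reports TcpTestSucceeded only if machine 2 is listening on 12400
Test-NetConnection machine_2_ip -Port 12400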

Machine 1 output

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Trying to bind port 12400...
[LightGBM] [Info] Binding port 12400 succeeded
[LightGBM] [Info] Listening...

Machine 2 output


[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Trying to bind port 12400...
[LightGBM] [Info] Binding port 12400 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 260 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 338 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 439 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 570 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 741 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 963 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1251 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1626 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2113 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2746 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 3569 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 4639 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 6030 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 7838 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 10189 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 13245 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 17218 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 22383 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 29097 milliseconds
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Finished initializing network

Any idea what I should check?

thanks for your help

Wil

jameslamb (Collaborator)

Thanks for using LightGBM.

I'm confused... what specific problem are you reporting? The issue title suggests that you're concerned about the warnings shown in machine 2's logs, but what I see there is that it did eventually find the other worker and successfully form the network.

Did you omit some logs for machine 1? Or are you saying that after "Listening...", the process on machine 1 crashed or timed out?

We need more details to help you.

wil70 changed the title "[Warning] Connecting to rank 1 failed, waiting for 260 milliseconds" → "[LightGBM CLI] Can't get Distributed with socket working anymore!" Sep 29, 2023
wil70 (Author) commented Sep 29, 2023

Thanks for the feedback @jameslamb
Let me rename the title and describe the issue I think I'm seeing in a bit more detail.

That was the full log: the process on machine 2 exited and no training was done.
I downloaded the full latest code again, compiled it again, and got the same result.

As I recall it was working a year ago, so I looked at my backups and found an old lightgbm.exe; I copied it over and it worked right away. I'm using this example to simplify my testing: https://github.com/microsoft/lightgbm/tree/master/examples/parallel_learning

Old code from 2022 (note: I waited 1 min before starting the second lightgbm on the 2nd machine)

[screenshot: training log]
It completed the training.

New code 2023

[screenshot: training log]
It does not start the training, and the process seems to have crashed at the end on the second machine.

It seems there is something fishy in the latest code's socket-based distributed training.
Let me know if you want me to run some extra tests.

Thanks for your help

Wil

wil70 (Author) commented Sep 29, 2023

Note: I also ran with "tree_learner = serial" using both versions of LightGBM (the 2022-10 build and the latest 2023 build); both completed and gave the same result, which is great!
So the only concern is distributed training over sockets.

jameslamb changed the title "[LightGBM CLI] Can't get Distributed with socket working anymore!" → "[CLI] Socket-based distributed training fails" Sep 29, 2023
jacopocecch commented Oct 26, 2023

Hi @jameslamb, I also have the same issue. @wil70, have you found a solution other than using the old version?
Also, do you remember the number of the version that works?

jmoralez (Collaborator)

Linking #5159, since it changed the socket configuration on Windows. That's the only change to this area that I remember. @shiyu1994, do you think it could cause the problem here?

jacopocecch

@jmoralez I'm using a macOS machine and a Linux one for training.

wil70 (Author) commented Nov 12, 2023

@jacopocecch Right now I'm using the old build.
For info, I'm on Windows.

I will pull the latest code and try again.

wil70 (Author) commented Nov 14, 2023

@jacopocecch I compiled the latest code from today on Windows and it seems to work well. I also checked the model by comparing the result to the Oct 2022 build, and it is the same - so all good! :))

Some numbers with the parallel_learning example from the repo.
I changed it to 1000 iterations/num_trees at a 0.01 learning_rate.
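
In the example's train.conf that corresponds to (settings as stated above):

num_trees = 1000
learning_rate = 0.01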

for serial

[LightGBM] [Info] Iteration:1000, training binary_logloss : 0.217513
[LightGBM] [Info] Iteration:1000, training auc : 0.999187
[LightGBM] [Info] Iteration:1000, valid_1 binary_logloss : 0.494112
[LightGBM] [Info] Iteration:1000, valid_1 auc : 0.840815
[LightGBM] [Info] 16.303107 seconds elapsed, finished iteration 1000
[LightGBM] [Info] Finished training

for feature

[LightGBM] [Info] Iteration:1000, training binary_logloss : 0.217513
[LightGBM] [Info] Iteration:1000, training auc : 0.999187
[LightGBM] [Info] Iteration:1000, valid_1 binary_logloss : 0.494112
[LightGBM] [Info] Iteration:1000, valid_1 auc : 0.840815
[LightGBM] [Info] 43.695520 seconds elapsed, finished iteration 1000
[LightGBM] [Info] Finished training
[LightGBM] [Info] Finished linking network in 40.289819 seconds

for data

[LightGBM] [Info] Iteration:1000, training binary_logloss : 0.214978
[LightGBM] [Info] Iteration:1000, training auc : 0.999072
[LightGBM] [Info] Iteration:1000, valid_1 binary_logloss : 0.491479
[LightGBM] [Info] Iteration:1000, valid_1 auc : 0.841767
[LightGBM] [Info] 189.375081 seconds elapsed, finished iteration 1000
[LightGBM] [Info] Finished training
[LightGBM] [Info] Finished linking network in 184.785459 seconds

for voting

[LightGBM] [Info] Iteration:1000, training binary_logloss : 0.215433
[LightGBM] [Info] Iteration:1000, training auc : 0.999089
[LightGBM] [Info] Iteration:1000, valid_1 binary_logloss : 0.491557
[LightGBM] [Info] Iteration:1000, valid_1 auc : 0.840541
[LightGBM] [Info] 256.181240 seconds elapsed, finished iteration 1000
[LightGBM] [Info] Finished training
[LightGBM] [Info] Finished linking network in 245.827818 seconds

Note: take those results with a grain of salt, as my machines were already (consistently) busy, so I took the slowest score for the 'serial' run; I might redo this test once the machines are done with their current work.

Note: For info, the MPI version doesn't compile with the latest code on Windows; it is missing the mpi.h file. I'm not using MPI as of now, since I'm using the socket version. Would the MPI version be way faster than socket?

Thanks @jameslamb @jmoralez @jacopocecch and @shiyu1994 !

We can close this thread.

jameslamb (Collaborator)

Would the MPI version be way faster than socket?

Not necessarily. The MPI version exists specifically for users who are working in an environment where MPI programs are the main supported pattern for analytical workloads, e.g. for those using Slurm as a workload manager (docs).

So it is there to allow LightGBM training to look and feel like users' other workloads... "be faster than other approaches" is not one of its design goals.

For info, the MPI version doesn't compile with the latest code on Windows; it is missing the mpi.h file

We'd welcome a separate bug report with a specific, minimal, reproducible example describing exactly what you're talking about. If you choose to open that, please provide evidence that it's a bug in LightGBM and not just that you've installed the MPI headers in a non-standard location or failed to install them at all.
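
(For reference, the Windows MPI build normally requires MS-MPI to be installed — its SDK is what provides mpi.h — and is enabled with the USE_MPI CMake option, roughly per the installation guide:)

cmake -A x64 -DUSE_MPI=ON ..
cmake --build . --target ALL_BUILD --config Release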

We can close this thread.

Great, we will close this.

github-actions (bot)

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Nov 20, 2024