Details of the Data Parallel Tree Learner #5798
I'm working to understand the data parallel tree learner included in LightGBM. I'm using dask on lightgbm==3.3.2. I have a few open questions:

The example from my first point:
Hey @adfea9c0, thanks for using LightGBM. I'll try to answer your questions.
Please let us know if you have further doubts.
Thank you for the response! 1./2. I understand how the data is partitioned -- but I didn't really understand how the actual algorithm works. From your description it sounds like the workers do binning locally, which results in non-deterministic behaviour? Otherwise I still don't understand the source of non-determinism. One more clarification -- do I understand correctly that, generally speaking, you want the number of partitions to equal the number of workers?
1./2. Yes, exactly. Even if the partitions are the same, the feature bins for the same feature might be computed using a different partition, and thus may differ. Yes, that's a good value, because as you say the first step is combining all the partitions a worker holds into a single one, so if each worker holds a single partition that step will be faster.
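For concreteness, here is a minimal sketch (my own illustration, not from this thread) of lining up the number of Dask partitions with the number of workers before training. The file name `train.parquet` is a placeholder, and note that `repartition` only controls the partition count, not which worker each partition lands on:

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # connect to a local or existing cluster
n_workers = len(client.scheduler_info()["workers"])

# Hypothetical input; repartition so the partition count matches the worker
# count, so that ideally each worker holds exactly one partition and the
# "combine local partitions" step has nothing left to do.
df = dd.read_parquet("train.parquet")
df = df.repartition(npartitions=n_workers)
```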
Thanks for your help!
Hey, sorry, I had one follow-up question about this:
How is this information merged to determine the best global split? So each worker gets its own subset of the data, determines the cost of splitting at each bin boundary, and then communicates this for central processing? But then how is this info combined? E.g. if Worker A has bin boundaries for some feature at 1.0 and 3.0, but Worker B has a boundary at 2.0, how can we use histogram information to find the gain from splitting at 2.0 for Worker A? I find the idea of local bins very confusing. EDIT: This would also imply that for the data parallel tree learner it is important that each worker receives a representative subset of the data, correct? Otherwise my bins might be completely different from the ones on some other worker.
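A toy illustration of exactly the problem being asked about here (my own example in plain NumPy, not anything from LightGBM): histograms built against different bin edges cannot be combined bin-by-bin.

```python
import numpy as np

values = np.array([0.5, 1.5, 2.5, 3.5])

edges_a = [0.0, 1.0, 3.0, 4.0]  # worker A: boundaries at 1.0 and 3.0
edges_b = [0.0, 2.0, 4.0]       # worker B: boundary at 2.0

hist_a, _ = np.histogram(values, bins=edges_a)  # 3 bins: [1, 2, 1]
hist_b, _ = np.histogram(values, bins=edges_b)  # 2 bins: [2, 2]

# The histograms have different lengths and their bins cover different value
# ranges, so element-wise addition is meaningless; workers must agree on a
# single set of boundaries before their histograms can be merged.
```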
I believe the process is:
Yes, that's a very important assumption. Please take this with a grain of salt; I may be wrong in some step. @shiyu1994 is the person who knows this for sure.
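For intuition, here is a hedged sketch of how the merge can work once every worker shares the same bin boundaries (my reading of the discussion, not LightGBM's actual code; the gain formula is the standard gradient-boosting split gain with regularization `lam` and constant factors omitted, and LightGBM performs the summation with a Reduce-Scatter rather than the plain sum shown here):

```python
import numpy as np

def local_histogram(grad, hess, bin_idx, n_bins):
    """One worker's per-bin gradient/hessian sums over its own rows,
    using bin ids derived from the SHARED global boundaries."""
    g = np.bincount(bin_idx, weights=grad, minlength=n_bins)
    h = np.bincount(bin_idx, weights=hess, minlength=n_bins)
    return g, h

def best_split(g, h, lam=1.0):
    """Scan the bin boundaries of a merged histogram for the best gain."""
    g_l, h_l = np.cumsum(g)[:-1], np.cumsum(h)[:-1]
    g_r, h_r = g.sum() - g_l, h.sum() - h_l
    gain = g_l**2 / (h_l + lam) + g_r**2 / (h_r + lam) - g.sum()**2 / (h.sum() + lam)
    return int(np.argmax(gain)), float(gain.max())

rng = np.random.default_rng(0)
n_bins = 4

# Two workers build aligned local histograms from their own rows ...
local_hists = [
    local_histogram(rng.normal(size=100), np.ones(100),
                    rng.integers(0, n_bins, size=100), n_bins)
    for _ in range(2)
]
# ... and merging is a plain element-wise sum (LightGBM: Reduce-Scatter).
g_total = sum(g for g, _ in local_hists)
h_total = sum(h for _, h in local_hists)
print(best_split(g_total, h_total))
```

The key point is that with shared boundaries the local histograms align bin-for-bin, so merging is a cheap element-wise sum and the best split can be scanned once on the merged histogram.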
And these bins are computed based on the local sample of data? I feel like it should still be possible to implement this in a deterministic way, i.e. if the data is partitioned the same way between runs, and features are distributed across workers in a consistent way (assuming a fixed seed), then bin boundaries should be consistent across runs, no?
It is deterministic in the CLI because you manually specify the order of the machines; in the Dask case, however, the order of the workers isn't deterministic, so the same worker may get assigned a different subset of features each time.
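If a fixed worker ordering is the missing piece, here is a hedged workaround sketch. It assumes the Dask estimators pass `machines` through as a LightGBM network parameter, that there is one worker per host, and that port 12400 is free everywhere; check the Dask docs for your LightGBM version before relying on any of this:

```python
import lightgbm as lgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

# Sort worker addresses so the ordering is stable across runs.
addresses = sorted(client.scheduler_info()["workers"])  # e.g. "tcp://10.0.0.5:40011"
hosts = [addr.split("://")[-1].rsplit(":", 1)[0] for addr in addresses]

port = 12400  # assumed free on every worker host; one worker per host assumed
machines = ",".join(f"{host}:{port}" for host in hosts)

model = lgb.DaskLGBMRegressor(machines=machines, seed=42)
```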