-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python] [dask] add initial dask integration #3515
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SfinxCZ ! I left a small suggestion for setup.py
.
Please also fix the the linting issues (https://travis-ci.org/github/microsoft/LightGBM/jobs/740698092). You can replicate them by running this from the root of the repo:
pycodestyle --ignore=E501,W503 --exclude=./compute,./.nuget .
pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^compute|test|example).*" --match="(?!^test_|setup).*\.py" .
@SfinxCZ I want to be sure to respect your time...would you like me to open a pull request into this branch that tries to fix the other CI jobs? Once CI is working, then we can have reviews that discuss the API, documentation, etc. |
As you look at CI, you can safely ignore these:
We have open PRs trying to fix some issues with them right now. |
ok, these PRs I referenced have been merged. So if you merge the latest |
One thing I noticed that I want to bring to your attention @SfinxCZ ....I notice that all of your commits look anonymous (not connected to your GitHub account) This probably means you are pushing from a place where you git config has an email that is not linked to your GitHub. I want to be sure that you get credit in the contributors list when we merge this, so I think you should overwrite these commits and be sure to configure git with an email tied to your GitHub account. I've documented how to do this here, linking in case it helps: uptake/pkgnet#284 (review) |
c3eef74
to
9c1db8c
Compare
@SfinxCZ how can I help with this? I see the Dask tests are failing on Windows. I think it's absolutely ok to skip the Dask tests on Windows for the sake of this PR. When we add
|
I've fixed the CI scripts, the only problem is in https://travis-ci.org/github/microsoft/LightGBM/jobs/743647169 which I assume is some random fail, however I cannot rerun the pipeline. @jameslamb, can you please rerun the pipeline to check whether the fail was indeed some random fluctuation? The code is ready for review. There are some restrictions that I needed to set. First, only linux is supported for now (can be extended in different PR, as suggested by @jameslamb). And second is that only python >= 3.6 is supported. The reason to drop support for python 2.7 is that dask does not support python 2.7 either since version 2.0. If you really need the support I can try to add it but I believe that supporting python 2.7 is not necessary these days. |
We are going to drop Python 2.7 support in a next few days with |
Yep, no problem! I just restarted it. That task (Travis, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for all the efforts and for being open to this, @SfinxCZ !
I left a few suggestions. I tried to only focus my review on things like the public API of this new module. We can consider other changes to the internals of these functions / classes through normal maintenance in later PRs.
I'd like to make two additional requests, related to governance over this code after this PR:
-
Can you please add a line to
CODEOWNERS
(https://github.com/microsoft/LightGBM/blob/master/.github/CODEOWNERS#L32) that makes sure I'm added as a reviewer on any PRs that touch the Dask module? I committed to the other LightGBM maintainers that if we bringdask-lightgbm
into this project, I'll take on the primary maintenance burden ([Discussion] efficiency improvements #2791 (comment)) -
Can you please let me know how involved you want to be after this PR is merged? I want to be respectful of all the awesome work you and the other
dask-lightgbm
maintainers have done, but I also want to give you the opportunity to eventually step away from it if you'd like. It sounds like @striajan is ok not being involved (improving LightGBM, XGBoost experience with Dask dask/community#104 (comment)).
Notes for other LightGBM maintainers
I think we should not merge this until 3.1.0 is released (#3484 ), and then try to target including it in an eventual 3.2.0 release. That way, we can make breaking changes if necessary, add documentation with examples, try to add support for Mac and Windows, etc.
Sorry, got busy with other things today. Will add more Dask issues to the backlog tomorrow. |
FYI: Fix for early stopping: h2oai/dask-lightgbm@8cc8e83 Early stopping would fail and I noticed dask-lightgbm package basically had no real dask support for it. So the above adds. It also fixes bug in pred_contribs and a poor setting for num_threads. These are pretty critical changes to dask-lightgbm. |
@pseudotensor this is the first time I've seen I'd prefer that you create pull requests here instead of making changes like that on a fork, so that LightGBM users have a single place to go for LightGBM-on-Dask, and so we don't duplicate efforts (the goal of dask/community#104). The changes you linked above look general enough that they could be contributed directly to LightGBM. |
@jameslamb Purpose is to actually use dask-lightgbm. I wasn't getting nearly any responses for bugs I was reporting to the dask-lightgbm repo, so I didn't have much hope for making PRs. I only right now became aware that dask-lightgbm is being put into lightgbm, which is excellent, so I'm posting those changes so someone can make a PR to lightgbm directly. |
Ya, apart from the 3rd commit that has some import line, yes, all 3 should be applicable to lightgbm. I can make a PR in a few days if that is requested. |
Thanks very much!
I'd be happy to take these changes and PR them, if you'd like! Then you don't have to deal with tests and CI. But if you'd rather contribute them I'm also happy to let you do it.
You should expect much more active development on the integration now that the code is here in LightGBM. So if you're using it and run into problems or feature gaps, we'd very much appreciate you creating issues and pull requests so we know what's missing. Thanks for helping us improve LightGBM! |
Yes please make the relevant PRs if you can, thanks! I maybe didn't do the cleanest/shorted code, but I didn't like the indexing by raw index that wouldn't work in general for the parts. |
Hey @pseudotensor heads up, I've opened #3708, noticed your changes to support validation sets. Happy to take these over for you so that we both get what we want - support for validation sets and |
Ya, go for it. Just coordinate with @jameslamb since he also mentioned he would do some stuff. |
I've added the suggestions from #3515 (comment) as individual issues, so let's continue the conversation there if either of you have further thoughts.
I'll start working on these this week and will review @ffineis 's ranker PR |
@jameslamb apologies for the ping on the PR, not sure where a better location would be I'm trying to build a distributed example on Azure Machine Learning using But more importantly, an internal Bing team is looking to move a large training workload from an internal tool to the public LightGBM package. We are prototyping the move now, and hoping to use an official release of When can we expect the release? Is there a place this is documented? |
@lostmygithubaccount no problem! That's so exciting! Short answer: there is no place where we've documented when the next release (3.2.0) will come out. Long answer: We've recently been trying to do LightGBM releases at least once every two months. This time of year, with people out for various holidays, you can expect it to be a little longer. The most recent release (3.1.1) was in early December, so I think it's reasonable to expect that 3.2.0 will be out in February. But that's only a best guess, not a commitment...it depends on the availability of maintainers and on discussions about what we want to be in that release, which have not been had yet. I know it's not ideal, but if you want to start testing the new Dask integration you can try to adopt my testing code from #3515 (review). I'd actually really really appreciate any feedback and bug reports your team could offer on it. I'm putting a lot of attention into it over the next few weeks so we can feel confident in it when we release next. If you have additional questions, please open a new issue and ask there. Other maintainers and contributors will see it better in an issue than on a closed PR. |
@lostmygithubaccount in preparation for some meetup talks, I recently put together an example for using LightGBM with Dask that I think is a lot easier than the comment I linked you to above. https://github.com/jameslamb/talks/tree/main/recent-developments-in-lightgbm I'd love to hear more about what your team is working on and what features really matter for you in the new Dask module. Feel free to contact me at any of the places documented in https://github.com/jameslamb, and maybe we could set up some time to talk. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
As agreed with @jameslamb in dask/community#104 this pull request moves all functionality from https://github.com/dask/dask-lightgbm repo directly into main LightGBM repo.
Note that this is WIP and most of the code should be changed to match LightGBM standards.