-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package] require scikit-learn>=0.24.2
, make scikit-learn estimators compatible with scikit-learn>=1.6.0dev
#6651
[python-package] require scikit-learn>=0.24.2
, make scikit-learn estimators compatible with scikit-learn>=1.6.0dev
#6651
Conversation
Update: The change introduced in scikit-learn/scikit-learn#29677 makes it hard to subclass a sklearn estimator in a codebase while being compatible with sklearn < 1.6.0 and sklearn >= 1.6.0. Essentially the former looks up The issue is discussed here: and it looks like a relaxation of the impossibility of having both |
@vnherdeiro note that it's possible already to support both with this method (scikit-learn/scikit-learn#29677 (comment)), however, the version check and |
Correct I am waiting for that PR to go in to bring back _more_tags
Using @available_if would require another sklearn import and make the code
less readable I reckon
…On Thu, 12 Sept 2024, 11:34 am Adrin Jalali, ***@***.***> wrote:
@vnherdeiro <https://github.com/vnherdeiro> note that it's possible
already to support both with this method (scikit-learn/scikit-learn#29677
(comment)
<scikit-learn/scikit-learn#29677 (comment)>),
however, the version check and @available_if are going to be unnecessary
once we merge scikit-learn/scikit-learn#29801
<scikit-learn/scikit-learn#29801>
—
Reply to this email directly, view it on GitHub
<#6651 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AE4CNVUURU6AMLDYUXKPFTTZWFU2TAVCNFSM6AAAAABOAVNTLSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBVHA4DCOJWGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks for starting on this @vnherdeiro . I've documented it in an issue: #6653 (and added that to the PR description). Note there that I intentionally put the exact errors messages in plain text instead of just referring to Note also that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for starting on this! Please see scikit-learn/scikit-learn#29801 (comment):
The story becomes "If you want to support multiple scikit-learn versions, define both."
I think we should leave _more_tags()
untouched and add __sklearn_tags__()
. And have self.__sklearn_tags__()
call self._more_tags()
to get its data, so we don't define things like _xfail_checks
twice.
Do you have time to do that in the next few days? We need to fix this to unblock CI here, so if you don't have time to fix it this week please let me know and I will work on this.
scikit-learn>=1.16
scikit-learn>=1.16
scikit-learn>=1.16
@jameslamb Have just pushe a sklearn_tags trying a conversion from _more_tags. I added a out of current argument scope warning to catch a change from the arguments in _more_tags (they don't seem to change much though). |
scikit-learn>=1.16
scikit-learn>=1.6.0dev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a maintainer here, but coming from sklearn side. Leaving thoughts hoping it'd help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this.
I've reviewed the dataclasses at https://github.com/scikit-learn/scikit-learn/blob/e2ee93156bd3692722a39130c011eea313628690/sklearn/utils/_tags.py and agree with the choices you've made about how to map the dictionary-formatted values from _more_tags()
to the dataclass attributes scikit-learn
now prefers.
Please see the other comments about simplifying this.
@@ -1067,6 +1137,21 @@ def n_features_in_(self) -> int: | |||
raise LGBMNotFittedError("No n_features_in found. Need to call fit beforehand.") | |||
return self._n_features_in | |||
|
|||
@n_features_in_.setter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you pass reset=True
to sklearn.utils.validation.validate_data()
, it will try to:
We want the "set estimator.n_features_in_
" behavior, because without it we have to manually set estimator.n_features_in_
in fit()
.
Doing that requires determining the number of features in X
, which requires either re-implementing something like sklearn.utils.validation._num_features()
(as I originally tried to do) or just calling that function directly. But that function can't safely be called directly before calling check_array()
, because it raises a TypeError
on 1-D inputs, which violates the check_fit1d
estimator check (code link).
So here, I'm proposing that we do the following:
- add a setter for
n_features_in_
and a deleter forfeature_names_in_
- pass
reset=True
atfit()
time tovalidate_data()
- modify the pre-1.6 implementation of
validate_data()
incompat.py
to match
scikit-learn>=1.6.0dev
scikit-learn>=0.24.2
, make scikit-learn estimators compatible with scikit-learn>=1.6.0dev
@StrikerRUS your comments were definitely not "minor", they really helped a lot! I've re-thought a lot of this PR based on trying to implement those suggestions. This is ready for another review. Thank you for all your reviewing effort here, I know this change has become quite complex and there are many competing constraints it's trying to satisfy. |
…rn_more_tags_deprecation
…eiro/LightGBM into fix_sklearn_more_tags_deprecation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hope this time my review comments will be really minor 😄
python-package/lightgbm/compat.py
Outdated
|
||
# NOTE: check_X_y() calls check_array() internally, so only need to call one or the other of them here | ||
if no_val_y: | ||
X = check_array(X, accept_sparse=accept_sparse, force_all_finite=ensure_all_finite) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adds ensure_min_samples=ensure_min_samples,
X = check_array(X, accept_sparse=accept_sparse, force_all_finite=ensure_all_finite) | |
X = check_array( | |
X, | |
accept_sparse=accept_sparse, | |
force_all_finite=ensure_all_finite, | |
ensure_min_samples=ensure_min_samples, | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally omitted ensure_min_samples
. It's already not being passed in the one place it's used on master
:
LightGBM/python-package/lightgbm/sklearn.py
Line 1007 in 0643230
X = _LGBMCheckArray(X, accept_sparse=True, force_all_finite=False) |
This call to check_array()
only happens in predict()
, so I also think we should avoid any more validation than absolutely necessary to comply with the scikit-learn
API, since applications calling predict()
might care more about latency than those calling fit()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Fine with me. However, I think you should be aware that the default argument is 1
, not None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not realize that! I'm glad you mentioned it, just checked and it looks like omitting this argument from the call still results in that validation being performed.
if ensure_min_samples > 0:
n_samples = _num_samples(array)
if n_samples < ensure_min_samples:
raise ValueError(
"Found array with %d sample(s) (shape=%s) while a"
" minimum of %d is required%s."
% (n_samples, array.shape, ensure_min_samples, context)
)
If that's the case, then my point about avoiding the overhead at predict()
time doesn't matter... we're getting that overhead anyway. I guess the scikit-learn
interface is probably not what you'd choose for low-latency predictions anyway... I will change this to pass through ensure_min_samples
and set that to 1
in the call in sklearn.py
, to make it explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made this change in 8ef1deb. Now ensure_min_samples=1
will be passed at predict()
time.
Thanks for the suggestion and talking through it with me.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
That was really challenging!
@vnherdeiro Thank you for starting the work!
@adrinjalali Thanks for your help here!
@jameslamb Thanks a ton for the huge work done here!
Thanks everyone for the help, and especially @StrikerRUS for a thorough review of a very complex change! |
Thanks for all the work @jameslamb Feeling glad this went in! |
@jameslamb I think it's more important to set |
Sure, that is ok with me. I just made that change, it should be reflected the next time release-drafter regenerates the release notes. |
Fixes #6653
(edit: taken over by @jameslamb, description re-written below)
scikit-learn>=0.24.2
__sklearn_tags__()
(replacement for_more_tags()
) forscikit-learn
estimatorssklearn.utils.validation.validate_data()
infit()
andpredict()
scikit-learn
estimators reject inputs with the wrong number of featuresNotes for Reviewers
see https://scikit-learn.org/dev/whats_new/v1.6.html and scikit-learn/scikit-learn#29677