Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0 #858

mmccarty · 2021-09-29T13:14:25Z

What happened:
Test failures when fitting sklearn estimators with dataframes. As of scikit-learn=1.0, all estimators store feature_names_in_ when fitted on dataframes and column name consistency checks issue a FutureWarning when column names are not consistent with the X columns used to fit. dask-ml's pytest configuration fails tests with sklearn warnings.

What you expected to happen:
Tests to pass with scikit-learn=1.0 and dask-ml should be updated to hand dataframes contently with scikit-learn>=1.0

Minimal Complete Verifiable Example:
From tests/test_partial.py, one of the failing tests.

df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
ddf = dd.from_pandas(df, npartitions=2)

with dask.config.set(scheduler="single-threaded"):
    sgd = SGDClassifier(max_iter=5, tol=1e-3)

    sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])

    sol = sgd.predict(df[["x"]])
    result = predict(sgd, ddf[["x"]])

Should result in the following

_________________________________________________________________________________________ test_dataframes _________________________________________________________________________________________

    def test_dataframes():
        df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
        ddf = dd.from_pandas(df, npartitions=2)
    
        with dask.config.set(scheduler="single-threaded"):
            sgd = SGDClassifier(max_iter=5, tol=1e-3)
    
            sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])
    
            sol = sgd.predict(df[["x"]])
>           result = predict(sgd, ddf[["x"]])

tests/test_partial.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask_ml/_partial.py:183: in predict
    dt = model.predict(xx).dtype
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:425: in predict
    scores = self.decision_function(X)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:407: in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/base.py:543: in _validate_data
    self._check_feature_names(X, reset=reset)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = SGDClassifier(max_iter=5), X = array([[0]])

    def _check_feature_names(self, X, *, reset):
        """Set or check the `feature_names_in_` attribute.
    
        .. versionadded:: 1.0
    
        Parameters
        ----------
        X : {ndarray, dataframe} of shape (n_samples, n_features)
            The input samples.
    
        reset : bool
            Whether to reset the `feature_names_in_` attribute.
            If False, the input will be checked for consistency with
            feature names of data provided when reset was last True.
            .. note::
               It is recommended to call `reset=True` in `fit` and in the first
               call to `partial_fit`. All other methods that validate `X`
               should set `reset=False`.
        """
    
        if reset:
            feature_names_in = _get_feature_names(X)
            if feature_names_in is not None:
                self.feature_names_in_ = feature_names_in
            return
    
        fitted_feature_names = getattr(self, "feature_names_in_", None)
        X_feature_names = _get_feature_names(X)
    
        if fitted_feature_names is None and X_feature_names is None:
            # no feature names seen in fit and in X
            return
    
        if X_feature_names is not None and fitted_feature_names is None:
            warnings.warn(
                f"X has feature names, but {self.__class__.__name__} was fitted without"
                " feature names"
            )
            return
    
        if X_feature_names is None and fitted_feature_names is not None:
>           warnings.warn(
                "X does not have valid feature names, but"
                f" {self.__class__.__name__} was fitted with feature names"
            )
E           UserWarning: X does not have valid feature names, but SGDClassifier was fitted with feature names

Anything else we need to know?:
In this case, the problem comes from the way dataframes are coerced to arrays in partial.predict. In scikit-learn=1.0, the model is expecting X to be a dataframe.

Environment:

Dask version: 2021.9.1
Dask-ML version: latest from main
Python version: 3.8.12
Operating System: Ubuntu 21.04
Install method (conda, pip, source): conda

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2021-09-29T14:06:40Z

Thanks for the report.

Since we don't really have much maintenance bandwidth here, I think we should drop support for scikit-learn<1.0, get the tests passing with scikit-learn 1.0, and get a release out.

I should have a bit of time to work on this in ~2 weeks.

mmccarty · 2021-09-29T16:41:40Z

Hi Tom - I could probably spent some time working on this, but wanted to check first. Do you think the general approach of dropping array coercion is correct?

TomAugspurger · 2021-09-29T17:40:29Z

Yep. Anywhere scikit-learn preserves dataframes, we should too.

We have a few places where we explicitly preserve dataframes where scikit-learn didn't (search preserve_dataframe, e.g.

dask-ml/dask_ml/compose/_column_transformer.py

Line 88 in 80eb0d8

preserve_dataframe : bool, (default=True)

). If scikit-learn is now preserving dataframes in those cases we can perhaps deprecate our keywords.

TomAugspurger · 2021-10-16T16:49:48Z

Done in #863. I'm going to make a release now.

2021.10.17 is up on PyPI now if you want to try.

mmccarty · 2021-10-18T12:41:38Z

Thanks Tom!

…

________________________________ From: Tom Augspurger ***@***.***> Sent: Saturday, October 16, 2021 12:49:59 PM To: dask/dask-ml ***@***.***> Cc: Mike McCarty ***@***.***>; Author ***@***.***> Subject: Re: [dask/dask-ml] Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0 (#858) Done in #863<#863>. I'm going to make a release now. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub<#858 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAEY2GU6ROHUGW6QGWBJOA3UHGULPANCNFSM5E73NIUA>.

TomAugspurger mentioned this issue Oct 16, 2021

Sklearn 1.0 #863

Merged

TomAugspurger closed this as completed in #863 Oct 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0 #858

Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0 #858

mmccarty commented Sep 29, 2021 •

edited

Loading

TomAugspurger commented Sep 29, 2021

mmccarty commented Sep 29, 2021

TomAugspurger commented Sep 29, 2021

TomAugspurger commented Oct 16, 2021 •

edited

Loading

mmccarty commented Oct 18, 2021 via email

Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0 #858

Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0 #858

Comments

mmccarty commented Sep 29, 2021 • edited Loading

TomAugspurger commented Sep 29, 2021

mmccarty commented Sep 29, 2021

TomAugspurger commented Sep 29, 2021

TomAugspurger commented Oct 16, 2021 • edited Loading

mmccarty commented Oct 18, 2021 via email

mmccarty commented Sep 29, 2021 •

edited

Loading

TomAugspurger commented Oct 16, 2021 •

edited

Loading