Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0 #858

Closed
mmccarty opened this issue Sep 29, 2021 · 5 comments · Fixed by #863
Closed

Estimators fit with dataframes cause UserWarnings on scikit-learn 1.0 #858

mmccarty opened this issue Sep 29, 2021 · 5 comments · Fixed by #863

Comments

@mmccarty
Copy link
Member

mmccarty commented Sep 29, 2021

What happened:
Test failures when fitting sklearn estimators with dataframes. As of scikit-learn=1.0, all estimators store feature_names_in_ when fitted on dataframes and column name consistency checks issue a FutureWarning when column names are not consistent with the X columns used to fit. dask-ml's pytest configuration fails tests with sklearn warnings.

What you expected to happen:
Tests to pass with scikit-learn=1.0 and dask-ml should be updated to hand dataframes contently with scikit-learn>=1.0

Minimal Complete Verifiable Example:
From tests/test_partial.py, one of the failing tests.

df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
ddf = dd.from_pandas(df, npartitions=2)

with dask.config.set(scheduler="single-threaded"):
    sgd = SGDClassifier(max_iter=5, tol=1e-3)

    sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])

    sol = sgd.predict(df[["x"]])
    result = predict(sgd, ddf[["x"]])

Should result in the following

_________________________________________________________________________________________ test_dataframes _________________________________________________________________________________________

    def test_dataframes():
        df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
        ddf = dd.from_pandas(df, npartitions=2)
    
        with dask.config.set(scheduler="single-threaded"):
            sgd = SGDClassifier(max_iter=5, tol=1e-3)
    
            sgd = fit(sgd, ddf[["x"]], ddf.y, classes=[0, 1])
    
            sol = sgd.predict(df[["x"]])
>           result = predict(sgd, ddf[["x"]])

tests/test_partial.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask_ml/_partial.py:183: in predict
    dt = model.predict(xx).dtype
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:425: in predict
    scores = self.decision_function(X)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/linear_model/_base.py:407: in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
../../../miniconda3/envs/dask-ml-dev/lib/python3.8/site-packages/sklearn/base.py:543: in _validate_data
    self._check_feature_names(X, reset=reset)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = SGDClassifier(max_iter=5), X = array([[0]])

    def _check_feature_names(self, X, *, reset):
        """Set or check the `feature_names_in_` attribute.
    
        .. versionadded:: 1.0
    
        Parameters
        ----------
        X : {ndarray, dataframe} of shape (n_samples, n_features)
            The input samples.
    
        reset : bool
            Whether to reset the `feature_names_in_` attribute.
            If False, the input will be checked for consistency with
            feature names of data provided when reset was last True.
            .. note::
               It is recommended to call `reset=True` in `fit` and in the first
               call to `partial_fit`. All other methods that validate `X`
               should set `reset=False`.
        """
    
        if reset:
            feature_names_in = _get_feature_names(X)
            if feature_names_in is not None:
                self.feature_names_in_ = feature_names_in
            return
    
        fitted_feature_names = getattr(self, "feature_names_in_", None)
        X_feature_names = _get_feature_names(X)
    
        if fitted_feature_names is None and X_feature_names is None:
            # no feature names seen in fit and in X
            return
    
        if X_feature_names is not None and fitted_feature_names is None:
            warnings.warn(
                f"X has feature names, but {self.__class__.__name__} was fitted without"
                " feature names"
            )
            return
    
        if X_feature_names is None and fitted_feature_names is not None:
>           warnings.warn(
                "X does not have valid feature names, but"
                f" {self.__class__.__name__} was fitted with feature names"
            )
E           UserWarning: X does not have valid feature names, but SGDClassifier was fitted with feature names

Anything else we need to know?:
In this case, the problem comes from the way dataframes are coerced to arrays in partial.predict. In scikit-learn=1.0, the model is expecting X to be a dataframe.

Environment:

  • Dask version: 2021.9.1
  • Dask-ML version: latest from main
  • Python version: 3.8.12
  • Operating System: Ubuntu 21.04
  • Install method (conda, pip, source): conda
@TomAugspurger
Copy link
Member

Thanks for the report.

Since we don't really have much maintenance bandwidth here, I think we should drop support for scikit-learn<1.0, get the tests passing with scikit-learn 1.0, and get a release out.

I should have a bit of time to work on this in ~2 weeks.

@mmccarty
Copy link
Member Author

Hi Tom - I could probably spent some time working on this, but wanted to check first. Do you think the general approach of dropping array coercion is correct?

@TomAugspurger
Copy link
Member

Yep. Anywhere scikit-learn preserves dataframes, we should too.

We have a few places where we explicitly preserve dataframes where scikit-learn didn't (search preserve_dataframe, e.g.

preserve_dataframe : bool, (default=True)
). If scikit-learn is now preserving dataframes in those cases we can perhaps deprecate our keywords.

@TomAugspurger
Copy link
Member

TomAugspurger commented Oct 16, 2021

Done in #863. I'm going to make a release now.

2021.10.17 is up on PyPI now if you want to try.

@mmccarty
Copy link
Member Author

mmccarty commented Oct 18, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants