Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full type hints for the module base #1668

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

e10e3
Copy link
Contributor

@e10e3 e10e3 commented Jan 26, 2025

Continuing #1592, this PR adds type hints for the basemodule.

I am not entirely familiar with the code, so we may have things to adjust. Hence, I start with a draft.

Because the PR is quite big, I tried to have clean commits, you can use them to help the review.

Notable changes:

  • I changed the regression target to float, because Python's type pyramid recommends not to use the numbers module for type annotations. And MyPy doesn't support it anyway.
    Also, numers.Number also includes complex numbers, but I am confident saying all the code expects floats. Integers are converted automatically to floats if needed.
  • Where transformers are used, the annotation is now BaseTransformer to allow all transformers, not just the supervised ones.

A few things I would like to discuss before leaving the "draft" status:

  • The Estimator Problem: many functions expect a generic model that has the predict_* and learn_* methods, but there is no single type representing this capability.
    I tagged them as requiring an Estimator, but this is not satisfactory.
    Should we:

    • Enumerate all possible types every time (Classifier, Regressor, Clusterer, Pipeline)
    • Create a Protocol (a Python interface) for models that can learn, predict, and do both
    • Create a new abstract class that introduces learn_* and predict_*
  • The Ensemble class inherits from UserList. Because it is an implementation of the list type, it is mutable, leading it to be type-invariant. Since Ensemble is now typed with Estimator, all models are bare estimators. An ensemble of Hoeffding trees would not be able to access the HT-specific methods (even things like learn_one!) without violating the typing -- runtime behaviour is not impacted, Python is dynamic.

e10e3 added 20 commits December 18, 2024 18:24
We use Autotyping to add annotations where they are obvious from the
context.
Arbitrary argument lists should be annotated with the type of the
arguments and not as a tuple containing the type; unless the argument
should all be tuples.
typing.Iterator is deprecated since Python 3.9.
The wrapped model can be any type of model, but only the classifiers
have a _multiclass property. Before acessing it, we must make sure the
attribute exists.
base.Transformer corresponds only to unbatched unsupervied transformers.
Other transformers should also be accepted for grouping.
The Number interface matches all Python numbers and more. This includes
complex numbers. Most formulas are intended to deal with real numbers
only, and will error out if applied on ccomplex numbers.
Moreover, MyPy does not recognise the interfaces for the 'number'
module, and recommends to use Python's default numeric tower as
proposed in PEP 484.
These error were indirectly caused by changes in `base`, where the
stricter typing invalidates the previously ambiguous (but valid) type
checks.
These errors can be treated in their own modules.
@e10e3 e10e3 marked this pull request as ready for review January 27, 2025 13:11
@MaxHalford
Copy link
Member

Thanks for the info about numbers.Number, makes sense 👍

The Estimator Problem: many functions expect a generic model that has the predict_* and learn_* methods, but there is no single type representing this capability.

I like the typing.Protocol solution, it's modern and probably the right solution here.

The Ensemble class inherits from UserList (but only partially implements its methods). Because it is an implementation of the list type, it is mutable, leading it to be type-invariant. Since Ensemble is now typed with Estimator, all models are bare estimators. An ensemble of Hoeffding trees would not be able to access the HT-specific methods (even things like learn_one!) without violating the typing -- runtime behaviour is not impacted, Python is dynamic.

We could definitely simplify here: we don't have to inherit from UserList. We could instead implement the necessary methods manually. We could also define an option to make the list appendable or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants