[AIR] Add infrequent_categories_ attribute to OneHotEncoder and MultiHotEncoder #27357

Closed
bveeramani opened this issue Aug 2, 2022 · 2 comments
Labels
enhancement Request for new feature and/or capability stale The issue is stale. It will be closed within 7 days unless there are further conversation

Comments

@bveeramani
Member

bveeramani commented Aug 2, 2022

Description

Title.

class OneHotEncoder:
    ...
    @property
    def infrequent_categories_(self) -> list[str]:
        """Infrequent categories for each feature."""
        ...

See attributes section of https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.

Use case

If you specify max_categories, it isn't clear how to find out which categories were dropped.

For example, suppose you have a column named language that contains strings:

language: python, java, c, python, go, ...

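And you one-hot encode the dataset, keeping at most three categories:

from ray.data.preprocessors import OneHotEncoder

# max_categories maps each column to the number of categories to keep.
encoder = OneHotEncoder(["language"], max_categories={"language": 3})
transformed_ds = encoder.fit_transform(ds)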

It's not obvious which categories get dropped. One approach is to look at the column names of the transformed dataset and infer what is missing:

transformed_ds.schema().names  # ['language_python', 'language_java', 'language_c']
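
The column-name approach can be wrapped in a small helper. This is a minimal sketch in plain Python, not part of Ray: the name dropped_categories is hypothetical, and it assumes the encoded columns are named <column>_<category> as in the example above and that you already have the column's unique values.

```python
def dropped_categories(column, unique_values, encoded_names):
    """Return the categories of `column` that have no encoded output column."""
    prefix = f"{column}_"
    # Recover the kept categories by stripping the "<column>_" prefix.
    kept = {name[len(prefix):] for name in encoded_names if name.startswith(prefix)}
    return sorted(set(unique_values) - kept)


# Hypothetical inputs mirroring the example in this issue.
unique_languages = ["python", "java", "c", "go", "rust", "haskell"]
encoded_names = ["language_python", "language_java", "language_c"]

dropped = dropped_categories("language", unique_languages, encoded_names)
print(dropped)  # ['go', 'haskell', 'rust']
```

Note that prefix matching is ambiguous if another column's name also starts with "language_", which is one more reason a dedicated attribute would be cleaner.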

But it'd be nice if you could directly get which categories were dropped:

encoder.infrequent_categories_  # ["go", "rust", "haskell", ...]
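
To make the request concrete, here is a rough sketch of what the attribute could compute for a single column. It is plain Python rather than Ray code, the function name split_categories is hypothetical, and it assumes the encoder keeps the max_categories most frequent values and treats everything else as infrequent.

```python
from collections import Counter


def split_categories(values, max_categories):
    """Split a column's values into kept (frequent) and dropped (infrequent) categories."""
    counts = Counter(values)
    # Keep the `max_categories` most frequent values.
    frequent = [category for category, _ in counts.most_common(max_categories)]
    kept = set(frequent)
    # Everything that did not make the cut is reported as infrequent.
    infrequent = sorted(category for category in counts if category not in kept)
    return frequent, infrequent


# Hypothetical column contents for illustration.
languages = ["python", "java", "c", "python", "go", "java", "python", "c", "rust"]
frequent, infrequent = split_categories(languages, max_categories=3)
print(infrequent)  # ['go', 'rust']
```

An encoder covering several columns would presumably expose one such list per column, for example as a dict keyed by column name.
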
bveeramani added the enhancement (Request for new feature and/or capability) and air labels Aug 2, 2022
bveeramani changed the title from "[AIR] Add categories_ attribute to OneHotEncoder and MultiHotEncoder" to "[AIR] Add infrequent_categories_ attribute to OneHotEncoder and MultiHotEncoder" Aug 2, 2022
@stale

stale bot commented Dec 3, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label (The issue is stale. It will be closed within 7 days unless there are further conversation) Dec 3, 2022
@stale

stale bot commented Dec 17, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this as completed Dec 17, 2022