-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AIR] Improve preprocessor documentation #27215
[AIR] Improve preprocessor documentation #27215
Conversation
LabelEncoder
docstringSigned-off-by: Richard Liaw <rliaw@berkeley.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
>>> def fn(batch: pd.DataFrame) -> pd.DataFrame: | ||
... return batch.drop("Y", axis="columns") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit; ideally we choose an example that is actually not supported by Ray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you drop a column without BatchMapper
?
will be preserved. | ||
This preprocessor concatenates numeric columns and stores the result in a new | ||
column. The new column contains | ||
:class:`~ray.air.util.tensor_extensions.pandas.TensorArrayElement` objects of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like TensorArrayElement
is not a documented class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not. If we add TensorArrayElement
to the data reference in a future PR, this link will work.
0 Shaolin Soccer [1, 1, 1] | ||
1 Moana [1, 1, 0] | ||
2 The Smartest Guys in the Room [0, 0, 0] | ||
>>> encoder.stats_ # doctest: +SKIP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document this attribute explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we including Preprocessor
attributes in the public interface? Talked to @matthewdeng and he mentioned it's currently undefined.
I added stats_
to this example because, without it, there's no way to know which categories are used (max_categories
doesn't encode infrequent categories).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we're added an attribute to the public interface, we should add categories_
or infrequent_categories_
. stats_
looks weird. See #27357
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Let's merge #27610 into this to make cherrypicking easier. |
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Why are these changes needed?
User Guide Changes
We've implemented many preprocessors, and it's unclear which you should use. I've added a section that describes when you should use each preprocessor.
API Reference Changes
The preprocessor reference has problems:
This PR:
In addition, this PR also
seealso
sections.Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.