[AIR] Improve preprocessor documentation #27215

bveeramani · 2022-07-28T20:52:09Z

Why are these changes needed?

User Guide Changes
We've implemented many preprocessors, and it's unclear which you should use. I've added a section that describes when you should use each preprocessor.

API Reference Changes

The preprocessor reference has problems:

Examples are broken or non-existent.
Explanations are minimal or confusing.
Preprocessors aren't organized in a useful way.

This PR:

Reorganizes the preprocessor reference.
Adds tested examples for every built-in preprocessor.
Clarifies and elaborates on every preprocessor's description.

In addition, this PR:

Adds cross-references.
Adds seealso sections.
Adds LaTeX math (where appropriate).

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

…encoder-docstring

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

clarkzinzow

LGTM!

richardliaw · 2022-08-09T07:49:16Z

python/ray/data/preprocessors/batch_mapper.py

+        >>> def fn(batch: pd.DataFrame) -> pd.DataFrame:
+        ...     return batch.drop("Y", axis="columns")


Nit: ideally we'd choose an example of something that Ray doesn't already support.

How do you drop a column without BatchMapper?
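
For context, the function from the docstring can be exercised on its own with plain pandas. This is only a sketch: the sample batch is made up for illustration, and no Ray API is called here, so it shows what the function does to one batch rather than how BatchMapper applies it.

```python
import pandas as pd


def fn(batch: pd.DataFrame) -> pd.DataFrame:
    # Return a copy of the batch without the "Y" column.
    return batch.drop("Y", axis="columns")


# Hypothetical sample batch (not taken from the PR).
batch = pd.DataFrame({"X": [0, 1, 2], "Y": [3, 4, 5]})
result = fn(batch)
print(list(result.columns))
```

Because `drop` returns a new DataFrame, the input batch is left unchanged.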

richardliaw · 2022-08-09T07:54:35Z

python/ray/data/preprocessors/concatenator.py

-    will be preserved.
+    This preprocessor concatenates numeric columns and stores the result in a new
+    column. The new column contains
+    :class:`~ray.air.util.tensor_extensions.pandas.TensorArrayElement` objects of


Seems like TensorArrayElement is not a documented class?

It's not. If we add TensorArrayElement to the data reference in a future PR, this link will work.
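
As a rough illustration of the behavior the new docstring describes, here is a pandas/NumPy sketch that combines numeric columns into one new column. It is not the Concatenator implementation: it stores plain NumPy arrays rather than TensorArrayElement objects, the column names are hypothetical, and dropping the input columns is an assumption made for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns and one non-numeric column.
df = pd.DataFrame(
    {"X0": [0.0, 3.0, 1.0], "X1": [0.5, 0.2, 0.9], "label": ["a", "b", "a"]}
)

numeric_cols = ["X0", "X1"]

# One row of `stacked` per row of `df`, holding that row's numeric values.
stacked = df[numeric_cols].to_numpy()

# Assumption: the inputs are dropped and replaced by a single new column.
out = df.drop(columns=numeric_cols)
out["concat_out"] = list(stacked)  # hypothetical output column name
```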

richardliaw · 2022-08-09T07:58:07Z

python/ray/data/preprocessors/encoder.py

+        0                 Shaolin Soccer  [1, 1, 1]
+        1                          Moana  [1, 1, 0]
+        2  The Smartest Guys in the Room  [0, 0, 0]
+        >>> encoder.stats_  # doctest: +SKIP


Can we document this attribute explicitly?

Are we including Preprocessor attributes in the public interface? I talked to @matthewdeng, and he mentioned it's currently undefined.

I added stats_ to this example because, without it, there's no way to know which categories are used (max_categories doesn't encode infrequent categories).

Also, if we're adding an attribute to the public interface, we should add categories_ or infrequent_categories_. stats_ looks weird. See #27357.
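
To make the point about max_categories concrete, here is a pure-Python sketch of multi-hot encoding that keeps only the most frequent categories. It is not Ray's implementation: the data, the tie-breaking rule, and the variable names are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical rows; each row holds a list of category labels.
rows = [
    ["comedy", "action", "sports"],
    ["animation", "comedy", "action"],
    ["documentary"],
]
max_categories = 2

# Count how often each label appears across all rows.
counts = Counter(label for row in rows for label in row)

# Keep the most frequent labels. Ties are broken alphabetically, which is
# an assumption made here to keep the result deterministic.
kept = sorted(sorted(counts), key=lambda label: -counts[label])[:max_categories]

# Infrequent labels get no position in the encoding at all.
encoded = [[int(label in row) for label in kept] for row in rows]
```

The last row encodes to all zeros because its only label is infrequent. Without inspecting `kept` (the role stats_ plays in the docstring example), the encoded vectors alone don't reveal which categories they refer to.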

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw · 2022-08-09T15:43:22Z

Let's merge #27610 into this to make cherrypicking easier.

…coder-docstring

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>

bveeramani added 15 commits July 21, 2022 11:32

Improve MaxAbsScaler docstring

3a0525b

Appease lint

37634d8

Improve MinMaxScaler docstring

2a30cdb

Fix typo

a055dc6

Improve StandardScaler docstring and remove ddof parameter

0dfbf0f

Remove see-also section

0acf2b9

Improve Normalizer docstring

45f09db

Revert accidental commit

78c7a26

Improve RobustScaler docstring

6b8a845

Remove whitespace

2f6bbce

Improve SimpleImputer docstring

81dbd5d

Update docstring

b7fca15

[AIR] Improve Tokenizer docstring

07dfa3f

[AIR] Improve LabelEncoder docstring

62e071b

Shorten sentence

e4c072c

bveeramani requested review from ericl, scv119, clarkzinzow, jjyao and jianoaix as code owners July 28, 2022 20:52

bveeramani changed the title ~~[AIR] Improve LabelEncoder docstring~~ [AIR] Improve encoder docstrings Jul 28, 2022

bveeramani assigned matthewdeng Jul 28, 2022

bveeramani marked this pull request as draft July 28, 2022 20:54

bveeramani added 7 commits July 28, 2022 16:23

Update encoder.py

66287e2

Merge remote-tracking branch 'upstream/master' into bveeramani/label-…

49de066

…encoder-docstring

Add power transform and encoder docs

cba88e7

Update concatenator.py

93f0aed

Update encoder.py

0b00893

Update chain.py

4ee8d3b

Update batch_mapper.py

da37326

update-preprocessors

05d1e8f

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw approved these changes Aug 9, 2022

richardliaw added v2.0.0-pick and removed copyediting-required labels Aug 9, 2022

richardliaw self-assigned this Aug 9, 2022

clarkzinzow approved these changes Aug 9, 2022

richardliaw reviewed Aug 9, 2022

update-starter-text

fe2c9a3

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Update guide

9878fa4

bveeramani mentioned this pull request Aug 10, 2022

[AIR] [Docs] Add "Which preprocessor do I use?" section #27610

Closed

7 tasks

bveeramani added 3 commits August 10, 2022 11:10

Format preprocessors.py

52d36eb

Fix broken reference

df0a42a

Merge branch 'bveeramani/preprocessor-guide' into bveeramani/label-en…

39343c3

…coder-docstring

bveeramani changed the title ~~[AIR] Improve preprocessor reference~~ [AIR] Improve preprocessor documentation Aug 10, 2022

Naming consistency

a541ab0

richardliaw merged commit 7da7dbe into ray-project:master Aug 11, 2022

bveeramani deleted the bveeramani/label-encoder-docstring branch August 11, 2022 01:35

bveeramani added a commit to bveeramani/ray that referenced this pull request Aug 11, 2022

[AIR] Improve preprocessor documentation (ray-project#27215)

56f2390

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

bveeramani mentioned this pull request Aug 11, 2022

[Pick] [AIR] Improve preprocessor documentation #27809

Merged

7 tasks
