[AIR] Improve preprocessor documentation #27215

Merged
Commits
88 commits
3a0525b
Improve `MaxAbsScaler` docstring
bveeramani Jul 21, 2022
37634d8
Appease lint
bveeramani Jul 21, 2022
2a30cdb
Improve `MinMaxScaler` docstring
bveeramani Jul 21, 2022
a055dc6
Fix typo
bveeramani Jul 21, 2022
0dfbf0f
Improve `StandardScaler` docstring and remove `ddof` parameter
bveeramani Jul 21, 2022
0acf2b9
Remove see-also section
bveeramani Jul 21, 2022
45f09db
Improve `Normalizer` docstring
bveeramani Jul 21, 2022
78c7a26
Revert accidental commit
bveeramani Jul 21, 2022
6b8a845
Improve `RobustScaler` docstring
bveeramani Jul 21, 2022
2f6bbce
Remove whitespace
bveeramani Jul 21, 2022
81dbd5d
Improve `SimpleImputer` docstring
bveeramani Jul 22, 2022
b7fca15
Update docstring
bveeramani Jul 22, 2022
07dfa3f
[AIR] Improve `Tokenizer` docstring
bveeramani Jul 27, 2022
62e071b
[AIR] Improve `LabelEncoder` docstring
bveeramani Jul 28, 2022
e4c072c
Shorten sentence
bveeramani Jul 28, 2022
66287e2
Update encoder.py
bveeramani Jul 28, 2022
49de066
Merge remote-tracking branch 'upstream/master' into bveeramani/label-…
bveeramani Aug 3, 2022
cba88e7
Add power transform and encoder docs
bveeramani Aug 3, 2022
93f0aed
Update concatenator.py
bveeramani Aug 4, 2022
0b00893
Update encoder.py
bveeramani Aug 4, 2022
4ee8d3b
Update chain.py
bveeramani Aug 4, 2022
da37326
Update batch_mapper.py
bveeramani Aug 4, 2022
2074fec
Multihot
bveeramani Aug 4, 2022
5fe7937
Update vectorizer.py
bveeramani Aug 4, 2022
4cd6d92
Update python/ray/data/preprocessors/batch_mapper.py
bveeramani Aug 5, 2022
3507322
Update python/ray/data/preprocessors/encoder.py
bveeramani Aug 5, 2022
ae74ddf
Update python/ray/data/preprocessors/transformer.py
bveeramani Aug 5, 2022
acb6e01
Update python/ray/data/preprocessors/transformer.py
bveeramani Aug 5, 2022
d8f7c0f
Update python/ray/data/preprocessors/encoder.py
bveeramani Aug 5, 2022
c8581d4
Update python/ray/data/preprocessors/encoder.py
bveeramani Aug 5, 2022
28c495f
Update python/ray/data/preprocessors/encoder.py
bveeramani Aug 5, 2022
36901f1
Fix concatenator
bveeramani Aug 5, 2022
5f8a8cc
Update python/ray/data/preprocessors/encoder.py
bveeramani Aug 5, 2022
7a713d3
Fix indent
bveeramani Aug 5, 2022
d4b7ea5
Update encoder.py
bveeramani Aug 5, 2022
d94d727
Merge branch 'bveeramani/label-encoder-docstring' of https://github.c…
bveeramani Aug 5, 2022
23f2437
Update vectorizer.py
bveeramani Aug 5, 2022
a051c8b
Move examples
bveeramani Aug 5, 2022
3aabd44
Move examples
bveeramani Aug 5, 2022
eed0781
Update encoder.py
bveeramani Aug 5, 2022
ef39c2a
Update tokenizer.py
bveeramani Aug 5, 2022
b3a7cc9
Merge branch 'bveeramani/tokenizer-docstring' into bveeramani/label-e…
bveeramani Aug 5, 2022
e33edb2
Merge branch 'simplerimputer-doc' into bveeramani/label-encoder-docst…
bveeramani Aug 5, 2022
e13f86b
Update scaler.py
bveeramani Aug 5, 2022
3d50dd3
Merge branch 'robustscaler-doc' into bveeramani/label-encoder-docstring
bveeramani Aug 5, 2022
66d44b7
Update python/ray/data/preprocessors/normalizer.py
bveeramani Aug 5, 2022
29a88fc
Update python/ray/data/preprocessors/normalizer.py
bveeramani Aug 5, 2022
fdf2356
Update python/ray/data/preprocessors/normalizer.py
bveeramani Aug 5, 2022
e3fe45b
Update normalizer.py
bveeramani Aug 5, 2022
243a2d1
Update normalizer.py
bveeramani Aug 5, 2022
ec47f45
Merge branch 'noramlizer-doc' of https://github.com/bveeramani/ray in…
bveeramani Aug 5, 2022
dc321db
Update normalizer.py
bveeramani Aug 5, 2022
e27e04f
Merge branch 'noramlizer-doc' into bveeramani/label-encoder-docstring
bveeramani Aug 5, 2022
6983af6
Merge branch 'minmaxscaler-doc' into bveeramani/label-encoder-docstring
bveeramani Aug 5, 2022
62f9531
Update python/ray/data/preprocessors/scaler.py
bveeramani Aug 5, 2022
d2ccc3c
Update scaler.py
bveeramani Aug 5, 2022
8fee7f9
Update scaler.py
bveeramani Aug 5, 2022
0013e3f
Merge branch 'maxabsscaler-doc' into bveeramani/label-encoder-docstring
bveeramani Aug 5, 2022
92a47c5
Update scaler.py
bveeramani Aug 5, 2022
9fd4709
Merge branch 'standardscaler-doc' into bveeramani/label-encoder-docst…
bveeramani Aug 5, 2022
2f7bb3a
Update stuff
bveeramani Aug 5, 2022
18c37ad
Add toc
bveeramani Aug 5, 2022
5d12807
Fix doctests
bveeramani Aug 5, 2022
53d2a8b
Appease lint
bveeramani Aug 5, 2022
eba4e81
Merge remote-tracking branch 'upstream/master' into bveeramani/label-…
bveeramani Aug 5, 2022
79fa1c5
Fix stuff
bveeramani Aug 5, 2022
e628d87
Skip doctests
bveeramani Aug 5, 2022
d11c431
Update vectorizer.py
bveeramani Aug 5, 2022
801c4b5
Skip doctests
bveeramani Aug 5, 2022
17a3245
Initial commit
bveeramani Aug 7, 2022
6c82a1d
Update encoder.py
bveeramani Aug 8, 2022
f323b51
Update normalizer.py
bveeramani Aug 8, 2022
beecb73
Update scaler.py
bveeramani Aug 8, 2022
8af5a82
Update scaler.py
bveeramani Aug 8, 2022
1e5e8ee
Update scaler.py
bveeramani Aug 8, 2022
cd53a4a
Update vectorizer.py
bveeramani Aug 8, 2022
d913cc1
Update vectorizer.py
bveeramani Aug 8, 2022
c180903
Merge branch 'master' into bveeramani/label-encoder-docstring
bveeramani Aug 8, 2022
9e8544b
Merge remote-tracking branch 'upstream/master' into bveeramani/label-…
bveeramani Aug 8, 2022
0433797
Update hasher.py
bveeramani Aug 8, 2022
9e632ca
Rename sections
bveeramani Aug 8, 2022
05d1e8f
update-preprocessors
richardliaw Aug 9, 2022
fe2c9a3
update-starter-text
richardliaw Aug 9, 2022
9878fa4
Update guide
bveeramani Aug 10, 2022
52d36eb
Format `preprocessors.py`
bveeramani Aug 10, 2022
df0a42a
Fix broken reference
bveeramani Aug 10, 2022
39343c3
Merge branch 'bveeramani/preprocessor-guide' into bveeramani/label-en…
bveeramani Aug 10, 2022
a541ab0
Naming consistency
bveeramani Aug 10, 2022
70 changes: 66 additions & 4 deletions doc/source/ray-air/package-ref.rst
@@ -17,16 +17,78 @@ Preprocessor
.. autoclass:: ray.data.preprocessor.Preprocessor
:members:

Built-in Preprocessors
######################
General Preprocessors
#####################

.. automodule:: ray.data.preprocessors
:members:
.. autoclass:: ray.data.preprocessors.BatchMapper
:show-inheritance:

.. autoclass:: ray.data.preprocessors.Chain
:show-inheritance:

.. autoclass:: ray.data.preprocessors.Concatenator
:show-inheritance:

.. autoclass:: ray.data.preprocessors.SimpleImputer
:show-inheritance:

.. automethod:: ray.data.Dataset.train_test_split
:noindex:

Categorical Encoders
####################

.. autoclass:: ray.data.preprocessors.Categorizer
:show-inheritance:

.. autoclass:: ray.data.preprocessors.LabelEncoder
:show-inheritance:

.. autoclass:: ray.data.preprocessors.MultiHotEncoder
:show-inheritance:

.. autoclass:: ray.data.preprocessors.OneHotEncoder
:show-inheritance:

.. autoclass:: ray.data.preprocessors.OrdinalEncoder
:show-inheritance:

Feature Scalers
###############

.. autoclass:: ray.data.preprocessors.MaxAbsScaler
:show-inheritance:

.. autoclass:: ray.data.preprocessors.MinMaxScaler
:show-inheritance:

.. autoclass:: ray.data.preprocessors.Normalizer
:show-inheritance:

.. autoclass:: ray.data.preprocessors.PowerTransformer
:show-inheritance:

.. autoclass:: ray.data.preprocessors.RobustScaler
:show-inheritance:

.. autoclass:: ray.data.preprocessors.StandardScaler
:show-inheritance:

Text Encoders
#############

.. autoclass:: ray.data.preprocessors.CountVectorizer
:show-inheritance:

.. autoclass:: ray.data.preprocessors.FeatureHasher
:show-inheritance:

.. autoclass:: ray.data.preprocessors.HashingVectorizer
:show-inheritance:

.. autoclass:: ray.data.preprocessors.Tokenizer
:show-inheritance:

.. _air-abstract-trainer-ref:

Trainer
16 changes: 9 additions & 7 deletions python/ray/data/preprocessor.py
@@ -35,11 +35,12 @@ class Preprocessor(abc.ABC):

If you are implementing your own Preprocessor sub-class, you should override the
following:
* ``_fit`` - if your preprocessor is stateful. Otherwise, set
``_is_fittable=False``.
* ``_transform_pandas`` and/or ``_transform_arrow`` - for best performance,
implement both. Otherwise, the data will be converted to the match the
implemented method.

* ``_fit`` if your preprocessor is stateful. Otherwise, set
``_is_fittable=False``.
* ``_transform_pandas`` and/or ``_transform_arrow`` for best performance,
implement both. Otherwise, the data will be converted to match the
implemented method.
"""

class FitStatus(str, Enum):
@@ -129,7 +130,7 @@ def transform(self, dataset: Dataset) -> Dataset:
ray.data.Dataset: The transformed Dataset.

Raises:
PreprocessorNotFittedException, if ``fit`` is not called yet.
PreprocessorNotFittedException: if ``fit`` is not called yet.
"""
fit_status = self.fit_status()
if fit_status in (
@@ -154,7 +155,8 @@ def transform_batch(self, df: "DataBatchType") -> "DataBatchType":
df: Input data batch.

Returns:
DataBatchType: The transformed data batch. This may differ
DataBatchType:
The transformed data batch. This may differ
from the input type depending on which ``_transform_*`` method(s)
are implemented.
"""
36 changes: 30 additions & 6 deletions python/ray/data/preprocessors/batch_mapper.py
@@ -7,14 +7,38 @@


class BatchMapper(Preprocessor):
"""Apply ``fn`` to batches of records of given dataset.

This is meant to be generic and supports low level operation on records.
One could easily leverage this preprocessor to achieve operations like
adding a new column or modifying a column in place.
"""Apply an arbitrary operation to a dataset.

:class:`BatchMapper` applies a user-defined function to batches of a dataset. A
batch is a Pandas ``DataFrame`` that represents a small amount of data. By modifying
batches instead of individual records, this class can efficiently transform a
dataset with vectorized operations.

Use this preprocessor to apply stateless operations that aren't already built-in.

.. tip::
:class:`BatchMapper` doesn't need to be fit. You can call
``transform`` without calling ``fit``.

Examples:
Use :class:`BatchMapper` to apply arbitrary operations like dropping a column.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import BatchMapper
>>>
>>> df = pd.DataFrame({"X": [0, 1, 2], "Y": [3, 4, 5]})
>>> ds = ray.data.from_pandas(df) # doctest: +SKIP
>>>
>>> def fn(batch: pd.DataFrame) -> pd.DataFrame:
... return batch.drop("Y", axis="columns")
Comment on lines +33 to +34:
Contributor: nit; ideally we choose an example that is actually not supported by Ray
Member Author (bveeramani): How do you drop a column without BatchMapper?
>>>
>>> preprocessor = BatchMapper(fn)
>>> preprocessor.transform(ds) # doctest: +SKIP
Dataset(num_blocks=1, num_rows=3, schema={X: int64})

Args:
fn: The udf function for batch operation.
fn: The function to apply to data batches.
"""

_is_fittable = False
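The drop-a-column example in the docstring above boils down to mapping a vectorized function over DataFrame batches. A pandas-only sketch of that per-batch behavior follows; ``map_batches`` here is a hypothetical helper for illustration, while the real ``BatchMapper`` applies ``fn`` to Ray Dataset blocks:

```python
# Pandas-only sketch of BatchMapper's behavior: split the data into
# DataFrame chunks and apply the user function to each chunk.
# Hypothetical helper; the real class operates on Ray Dataset blocks.
from typing import Callable, List

import pandas as pd


def map_batches(
    df: pd.DataFrame,
    fn: Callable[[pd.DataFrame], pd.DataFrame],
    batch_size: int = 2,
) -> pd.DataFrame:
    batches: List[pd.DataFrame] = [
        df.iloc[i : i + batch_size] for i in range(0, len(df), batch_size)
    ]
    # fn sees whole batches, so vectorized pandas operations apply.
    return pd.concat([fn(batch) for batch in batches], ignore_index=True)


df = pd.DataFrame({"X": [0, 1, 2], "Y": [3, 4, 5]})
result = map_batches(df, lambda batch: batch.drop("Y", axis="columns"))
print(list(result.columns))  # the Y column is gone
```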
32 changes: 27 additions & 5 deletions python/ray/data/preprocessors/chain.py
@@ -7,14 +7,36 @@


class Chain(Preprocessor):
"""Chain multiple Preprocessors into a single Preprocessor.
"""Combine multiple preprocessors into a single :py:class:`Preprocessor`.

Calling ``fit`` will invoke ``fit_transform`` on the input preprocessors,
so that one preprocessor can ``fit`` based on columns/values produced by
the ``transform`` of a preceding preprocessor.
When you call ``fit``, each preprocessor is fit on the dataset produced by the
preceding preprocessor's ``fit_transform``.

Example:
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import *
>>>
>>> df = pd.DataFrame({
... "X0": [0, 1, 2],
... "X1": [3, 4, 5],
... "Y": ["orange", "blue", "orange"],
... })
>>> ds = ray.data.from_pandas(df) # doctest: +SKIP
>>>
>>> preprocessor = Chain(
... StandardScaler(columns=["X0", "X1"]),
... Concatenator(include=["X0", "X1"], output_column_name="X"),
... LabelEncoder(label_column="Y")
... )
>>> preprocessor.fit_transform(ds).to_pandas() # doctest: +SKIP
Y X
0 1 [-1.224744871391589, -1.224744871391589]
1 0 [0.0, 0.0]
2 1 [1.224744871391589, 1.224744871391589]

Args:
preprocessors: The preprocessors that should be executed sequentially.
preprocessors: The preprocessors to sequentially compose.
"""

def fit_status(self):
114 changes: 86 additions & 28 deletions python/ray/data/preprocessors/concatenator.py
@@ -6,43 +6,101 @@


class Concatenator(Preprocessor):
"""Creates a tensor column via concatenation.
"""Combine numeric columns into a column of type
:class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`.

A tensor column is a column consisting of ndarrays as elements.
The tensor column will be generated from the provided list
of columns and will take on the provided "output" label.
Columns that are included in the concatenation
will be dropped, while columns that are not included in concatenation
will be preserved.
This preprocessor concatenates numeric columns and stores the result in a new
column. The new column contains
:class:`~ray.air.util.tensor_extensions.pandas.TensorArrayElement` objects of
shape :math:`(m,)`, where :math:`m` is the number of columns concatenated.
The :math:`m` concatenated columns are dropped after concatenation.

Review comment:
Contributor: seems like TensorArrayElement is not a documented class?
Member Author (bveeramani): It's not. If we add TensorArrayElement to the data reference in a future PR, this link will work.

Example:
>>> import ray
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator
>>> df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 3, 4],})

:py:class:`Concatenator` combines numeric columns into a column of
:py:class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df) # doctest: +SKIP
>>> concatenator = Concatenator()
>>> concatenator.fit_transform(ds).to_pandas() # doctest: +SKIP
concat_out
0 [0.0, 0.5]
1 [3.0, 0.2]
2 [1.0, 0.9]

By default, the created column is called `"concat_out"`, but you can specify
a different name.

>>> concatenator = Concatenator(output_column_name="tensor")
>>> concatenator.fit_transform(ds).to_pandas() # doctest: +SKIP
tensor
0 [0.0, 0.5]
1 [3.0, 0.2]
2 [1.0, 0.9]

Sometimes, you might not want to concatenate all of the columns in your
dataset. In this case, you can exclude columns with the ``exclude`` parameter.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["blue", "orange", "blue"]})
>>> ds = ray.data.from_pandas(df) # doctest: +SKIP
>>> prep = Concatenator(output_column_name="c") # doctest: +SKIP
>>> new_ds = prep.transform(ds) # doctest: +SKIP
>>> assert set(new_ds.take(1)[0]) == {"c"} # doctest: +SKIP
>>> concatenator = Concatenator(exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas() # doctest: +SKIP
Y concat_out
0 blue [0.0, 0.5]
1 orange [3.0, 0.2]
2 blue [1.0, 0.9]

Alternatively, you can specify which columns to concatenate with the
``include`` parameter.

>>> concatenator = Concatenator(include=["X0", "X1"])
>>> concatenator.fit_transform(ds).to_pandas() # doctest: +SKIP
Y concat_out
0 blue [0.0, 0.5]
1 orange [3.0, 0.2]
2 blue [1.0, 0.9]

Note that if a column is in both ``include`` and ``exclude``, the column is
excluded.

>>> concatenator = Concatenator(include=["X0", "X1", "Y"], exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas() # doctest: +SKIP
Y concat_out
0 blue [0.0, 0.5]
1 orange [3.0, 0.2]
2 blue [1.0, 0.9]

By default, the concatenated tensor has a ``dtype`` common to the input columns.
However, you can also explicitly set the ``dtype`` with the ``dtype``
parameter.

>>> concatenator = Concatenator(include=["X0", "X1"], dtype=np.float32)
>>> concatenator.fit_transform(ds) # doctest: +SKIP
Dataset(num_blocks=1, num_rows=3, schema={Y: object, concat_out: TensorDtype(shape=(2,), dtype=float32)})

Args:
output_column_name: output_column_name is a string that represents the
name of the outputted, concatenated tensor column. Defaults to
"concat_out".
include: A list of column names to be included for
concatenation. If None, then all columns will be included.
Included columns will be dropped after concatenation.
exclude: List of column names to be excluded
from concatenation. Exclude takes precedence over include.
dtype: Optional. The dtype to convert the output column array to.
raise_if_missing: Optional. If True, an error will be raised if any
of the columns to in 'include' or 'exclude' are
not present in the dataset schema.
output_column_name: The desired name for the new column.
Defaults to ``"concat_out"``.
include: A list of columns to concatenate. If ``None``, all columns are
concatenated.
exclude: A list of columns to exclude from concatenation.
If a column is in both ``include`` and ``exclude``, the column is excluded
from concatenation.
dtype: The ``dtype`` to convert the output tensors to. If unspecified,
the ``dtype`` is determined by standard coercion rules.
raise_if_missing: If ``True``, an error is raised if any
of the columns in ``include`` or ``exclude`` don't exist.
Defaults to ``False``.

Raises:
ValueError if `raise_if_missing=True` and any column name in
`include` or `exclude` does not exist in the dataset columns.
"""
ValueError: if `raise_if_missing` is `True` and a column in `include` or
`exclude` doesn't exist in the dataset.
""" # noqa: E501

_is_fittable = False
