[docs] Reorganize the tensor data support docs; general editing (#26952) #27355

Merged · 1 commit · Aug 2, 2022
1 change: 0 additions & 1 deletion doc/source/data/consuming-datasets.rst
@@ -80,7 +80,6 @@ This is a common pattern useful for loading and sharding data between distributed

.. _saving_datasets:

-===============
Saving Datasets
===============

34 changes: 19 additions & 15 deletions doc/source/data/creating-datasets.rst
@@ -388,11 +388,8 @@ futures.
``Dataset`` backed by the distributed Pandas DataFrame partitions that underlie the
Dask DataFrame.

-.. note::
-
-    This conversion should have near-zero overhead: it involves zero data copying and
-    zero data movement. Datasets simply reinterprets the existing Dask DataFrame partitions
-    as Ray Datasets partitions without touching the underlying data.
+This conversion has near-zero overhead, since Datasets simply reinterprets existing
+Dask-in-Ray partition objects as Dataset blocks.

.. literalinclude:: ./doc_code/creating_datasets.py
    :language: python
@@ -418,11 +415,8 @@ futures.
Create a ``Dataset`` from a Modin DataFrame. This constructs a ``Dataset``
backed by the distributed Pandas DataFrame partitions that underlie the Modin DataFrame.

-.. note::
-
-    This conversion should have near-zero overhead: it involves zero data copying and
-    zero data movement. Datasets simply reinterprets the existing Modin DataFrame partitions
-    as Ray Datasets partitions without touching the underlying data.
+This conversion has near-zero overhead, since Datasets simply reinterprets existing
+Modin partition objects as Dataset blocks.

.. literalinclude:: ./doc_code/creating_datasets.py
    :language: python
@@ -434,11 +428,8 @@ futures.
Create a ``Dataset`` from a Mars DataFrame. This constructs a ``Dataset``
backed by the distributed Pandas DataFrame partitions that underlie the Mars DataFrame.

-.. note::
-
-    This conversion should have near-zero overhead: it involves zero data copying and
-    zero data movement. Datasets simply reinterprets the existing Mars DataFrame partitions
-    as Ray Datasets partitions without touching the underlying data.
+This conversion has near-zero overhead, since Datasets simply reinterprets existing
+Mars partition objects as Dataset blocks.

.. literalinclude:: ./doc_code/creating_datasets.py
    :language: python
@@ -527,6 +518,19 @@ converts it into a Ray Dataset directly.
ray_datasets["train"].take(2)
# [{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]

+.. _datasets_from_images:
+
+-------------------------------
+From Image Files (experimental)
+-------------------------------
+
+Load image data stored as individual files using :py:class:`~ray.data.datasource.ImageFolderDatasource`:
+
+.. literalinclude:: ./doc_code/tensor.py
+    :language: python
+    :start-after: __create_images_begin__
+    :end-before: __create_images_end__

.. _datasets_custom_datasource:

------------------
3 changes: 2 additions & 1 deletion doc/source/data/dataset-internals.rst
@@ -135,9 +135,10 @@ as either `Arrow Tables <https://arrow.apache.org/docs/python/generated/pyarrow.
or `Pandas DataFrames <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__.

Different ways of creating Datasets lead to a different starting internal format:
+
* Reading tabular files (Parquet, CSV, JSON) creates Arrow blocks initially.
* Converting from Pandas, Dask, Modin, and Mars creates Pandas blocks initially.
-* Reading NumPy files or converting from NumPy ndarrays creaates Arrow blocks.
+* Reading NumPy files or converting from NumPy ndarrays creates Arrow blocks.

However, this internal format is not exposed to the user. Datasets converts between formats
as needed internally depending on the specified ``batch_format`` of transformations.