[docs] Reorganize the tensor data support docs; general editing (#26952) #27355

Merged · 1 commit · Aug 2, 2022
1 change: 0 additions & 1 deletion doc/source/data/consuming-datasets.rst
@@ -80,7 +80,6 @@ This is a common pattern useful for loading and sharding data between distributed

.. _saving_datasets:

-===============
Saving Datasets
===============

34 changes: 19 additions & 15 deletions doc/source/data/creating-datasets.rst
@@ -388,11 +388,8 @@ futures.
``Dataset`` backed by the distributed Pandas DataFrame partitions that underlie the
Dask DataFrame.

-.. note::
-
-    This conversion should have near-zero overhead: it involves zero data copying and
-    zero data movement. Datasets simply reinterprets the existing Dask DataFrame partitions
-    as Ray Datasets partitions without touching the underlying data.
+This conversion has near-zero overhead, since Datasets simply reinterprets existing
+Dask-in-Ray partition objects as Dataset blocks.

.. literalinclude:: ./doc_code/creating_datasets.py
    :language: python
@@ -418,11 +415,8 @@ futures.
Create a ``Dataset`` from a Modin DataFrame. This constructs a ``Dataset``
backed by the distributed Pandas DataFrame partitions that underlie the Modin DataFrame.

-.. note::
-
-    This conversion should have near-zero overhead: it involves zero data copying and
-    zero data movement. Datasets simply reinterprets the existing Modin DataFrame partitions
-    as Ray Datasets partitions without touching the underlying data.
+This conversion has near-zero overhead, since Datasets simply reinterprets existing
+Modin partition objects as Dataset blocks.

.. literalinclude:: ./doc_code/creating_datasets.py
    :language: python
@@ -434,11 +428,8 @@ futures.
Create a ``Dataset`` from a Mars DataFrame. This constructs a ``Dataset``
backed by the distributed Pandas DataFrame partitions that underlie the Mars DataFrame.

-.. note::
-
-    This conversion should have near-zero overhead: it involves zero data copying and
-    zero data movement. Datasets simply reinterprets the existing Mars DataFrame partitions
-    as Ray Datasets partitions without touching the underlying data.
+This conversion has near-zero overhead, since Datasets simply reinterprets existing
+Mars partition objects as Dataset blocks.

.. literalinclude:: ./doc_code/creating_datasets.py
    :language: python
@@ -527,6 +518,19 @@ converts it into a Ray Dataset directly.
ray_datasets["train"].take(2)
# [{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]

+.. _datasets_from_images:
+
+-------------------------------
+From Image Files (experimental)
+-------------------------------
+
+Load image data stored as individual files using :py:class:`~ray.data.datasource.ImageFolderDatasource`:
+
+.. literalinclude:: ./doc_code/tensor.py
+    :language: python
+    :start-after: __create_images_begin__
+    :end-before: __create_images_end__

.. _datasets_custom_datasource:

------------------
3 changes: 2 additions & 1 deletion doc/source/data/dataset-internals.rst
@@ -135,9 +135,10 @@ as either `Arrow Tables <https://arrow.apache.org/docs/python/generated/pyarrow.
or `Pandas DataFrames <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__.

Different ways of creating Datasets lead to a different starting internal format:
+
* Reading tabular files (Parquet, CSV, JSON) creates Arrow blocks initially.
* Converting from Pandas, Dask, Modin, and Mars creates Pandas blocks initially.
-* Reading NumPy files or converting from NumPy ndarrays creaates Arrow blocks.
+* Reading NumPy files or converting from NumPy ndarrays creates Arrow blocks.

However, this internal format is not exposed to the user. Datasets converts between formats
as needed internally depending on the specified ``batch_format`` of transformations.