[train/docs] Restructure Ray Train docs with framework-specific guides (ray-project#37892)

This PR restructures the Ray Train docs to better mimic typical user journeys.

Primarily, we restructure the guides to be grouped by frameworks. Previously, we grouped by tasks (e.g. training, data loading, checkpointing) and had (tabbed) examples for some of the frameworks. Now, we group by framework on the first level and by task on the second level. The idea here is that users of e.g. PyTorch don't actually care about how things are done for XGBoost - they just want to be successful with training their PyTorch models.

This PR emphasizes support for PyTorch, which is guided by user feedback showing that PyTorch and related libraries are most commonly used.

Lastly, this PR also declutters the Ray Train documentation by removing duplicates (e.g. we had 4 different "quick start" examples for PyTorch before).

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
lmco · Aug 31, 2023 · f6015b7 · f6015b7
1 parent ff723c1
commit f6015b7
Showing 54 changed files with 1,810 additions and 2,221 deletions.
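
Many of the diffs in this commit retarget Sphinx `:ref:` labels that the restructuring removed (for example `air-ingest` and `train-datasets` become `data-ingest-torch`, and `train-getting-started` and `train-userguides` become `train-docs`). A minimal sketch of how such a relabeling could be scripted is shown here. The names `LABEL_MAP` and `retarget_refs` are hypothetical and not part of this PR, the mapping lists only labels visible in the diffs, and only the reST `:ref:` form with an explicit title is handled; the PR itself may well have been edited by hand.

```python
import re

# Old label -> new label, as observed in this commit's diffs.
# (Hypothetical helper data; not part of the PR.)
LABEL_MAP = {
    "air-ingest": "data-ingest-torch",
    "train-datasets": "data-ingest-torch",
    "train-getting-started": "train-docs",
    "train-userguides": "train-docs",
}

# Matches the reST form  :ref:`Some title <label>`  and captures the label.
_REF = re.compile(r"(:ref:`[^`<]*<)([^>`]+)(>`)")


def retarget_refs(text: str) -> str:
    """Rewrite known old labels inside :ref: roles; leave everything else alone."""

    def _swap(match: re.Match) -> str:
        label = match.group(2)
        return match.group(1) + LABEL_MAP.get(label, label) + match.group(3)

    return _REF.sub(_swap, text)
```

Labels without a replacement in the diffs (such as `air-predictors` or `train-config`, whose references were removed or converted to `:class:` roles) are deliberately left out of the mapping and would still need manual attention.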
diff --git a/doc/source/_static/js/custom.js b/doc/source/_static/js/custom.js
@@ -37,20 +37,33 @@ document.addEventListener("DOMContentLoaded", function() {
         let navItem = navItems[i];
         const stringList = [
             "User Guides", "Examples",
+            // Ray Core
             "Ray Core", "Ray Core API",
             "Ray Clusters", "Deploying on Kubernetes", "Deploying on VMs",
             "Applications Guide", "Ray Cluster Management API",
+            // Ray AIR
             "Ray AIR API",
+            // Ray Data
             "Ray Data", "Ray Data API", "Integrations",
+            // Ray Train
             "Ray Train", "Ray Train API",
+            "Distributed PyTorch", "Advanced Topics", "More Frameworks",
+            "Ray Train Internals",
+            // Ray Tune
             "Ray Tune", "Ray Tune Examples", "Ray Tune API",
+            // Ray Serve
             "Ray Serve", "Ray Serve API",
             "Production Guide", "Advanced Guides",
             "Deploy Many Models",
+            // Ray RLlib
             "Ray RLlib", "Ray RLlib API",
+            // More libraries
             "More Libraries", "Ray Workflows (Alpha)",
+            // Monitoring/debugging
             "Monitoring and Debugging",
+            // References
             "References", "Use Cases",
+            // Developer guides
             "Developer Guides", "Getting Involved / Contributing",
         ];
 

diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml
@@ -59,29 +59,35 @@ parts:
       - file: train/train
         title: Ray Train
         sections:
-          - file: train/getting-started
-            title: "Getting Started"
           - file: train/key-concepts
             title: "Key Concepts"
-          - file: train/user-guides
-            title: "User Guides"
+          - file: train/distributed-pytorch
+            sections:
+              - file: train/distributed-pytorch/converting-existing-training-loop
+              - file: train/distributed-pytorch/data-loading-preprocessing
+              - file: train/distributed-pytorch/using-gpus
+              - file: train/distributed-pytorch/persistent-storage
+                title: Configuring Persistent Storage
+              - file: train/distributed-pytorch/monitoring-logging
+              - file: train/distributed-pytorch/checkpoints
+              - file: train/distributed-pytorch/experiment-tracking
+              - file: train/distributed-pytorch/fault-tolerance
+              - file: train/distributed-pytorch/advanced
+                sections:
+                    - file: train/distributed-pytorch/reproducibility
+                    - file: train/distributed-pytorch/automatic-mixed-precision
+                    - file: train/distributed-pytorch/hyperparameter-optimization
+                      title: Hyperparameter optimization
+          - file: train/more-frameworks
+            sections:
+              - file: train/distributed-tensorflow-keras
+              - file: train/distributed-xgboost-lightgbm
+              - file: train/horovod
+          - file: train/internals/index
             sections:
-              - file: train/config_guide
-                title: "Configuring Ray Train"
-              - file: train/dl_guide
-                title: "Deep Learning Guide"
-              - file: train/hf_trainers
-                title: "Hugging Face Trainers"
-              - file: train/gbdt
-                title: "XGBoost/LightGBM guide"
-              - file: train/architecture
-                title: "Ray Train Architecture"
-              - file: train/train-with-tune
-                title: "Using Ray Train with Ray Tune"
-              - file: train/check-ingest
-                title: "Configuring Training Datasets"
-              - file: train/predictors
-              - file: train/benchmarks
+              - file: train/internals/architecture
+              - file: train/internals/benchmarks
+              - file: train/internals/environment-variables
           - file: train/examples
             title: "Examples"
             sections:

diff --git a/doc/source/data/batch_inference.rst b/doc/source/data/batch_inference.rst
@@ -462,7 +462,7 @@ Models that have been trained with :ref:`Ray Train <train-docs>` can then be use
 
     checkpoint = result.checkpoint
 
-**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the :ref:`framework-specific Checkpoint classes <train-framework-catalog>`.
+**Step 3:** Use Ray Data for batch inference. To load in the model from the :class:`Checkpoint <ray.air.checkpoint.Checkpoint>` inside the Python class, use one of the framework-specific Checkpoint classes.
 
 In this case, we use the :class:`XGBoostCheckpoint <ray.train.xgboost.XGBoostCheckpoint>` to load the model.
 

diff --git a/doc/source/data/iterating-over-data.rst b/doc/source/data/iterating-over-data.rst
@@ -273,7 +273,7 @@ into disjoint shards.
 
   If you're using :ref:`Ray Train <train-docs>`, you don't need to split the dataset.
   Ray Train automatically splits your dataset for you. To learn more, see
-  :ref:`Configuring training datasets <air-ingest>`.
+  :ref:`Configuring training datasets <data-ingest-torch>`.
 
 .. testcode::
 

diff --git a/doc/source/data/preprocessors.rst b/doc/source/data/preprocessors.rst
@@ -15,7 +15,7 @@ Ray AIR provides several common preprocessors out of the box and interfaces to d
 Overview
 --------
 
-The most common way of using a preprocessor is by passing it as an argument to the constructor of a Ray Train :ref:`Trainer <train-getting-started>` in conjunction with a :ref:`Ray Data dataset <data>`.
+The most common way of using a preprocessor is by passing it as an argument to the constructor of a Ray Train :ref:`Trainer <train-docs>` in conjunction with a :ref:`Ray Data dataset <data>`.
 For example, the following code trains a model with a preprocessor that normalizes the data.
 
 .. literalinclude:: doc_code/preprocessors.py

diff --git a/doc/source/data/working-with-pytorch.rst b/doc/source/data/working-with-pytorch.rst
@@ -82,7 +82,7 @@ Ray Data integrates with :ref:`Ray Train <train-docs>` for easy data ingest for
 
     ...
 
-For more details, see the :ref:`Ray Train user guide <train-datasets>`.
+For more details, see the :ref:`Ray Train user guide <data-ingest-torch>`.
 
 .. _transform_pytorch:
 

diff --git a/doc/source/ray-air/api/configs.rst b/doc/source/ray-air/api/configs.rst
@@ -4,10 +4,6 @@ Ray AIR Configurations
 
 .. TODO(ml-team): Add a general AIR configuration guide that covers all of these configs.
 
-.. seealso::
-
-    See :ref:`this Ray Train configuration user guide <train-config>` for more details.
-
 .. currentmodule:: ray
 
 .. autosummary::

diff --git a/doc/source/ray-air/api/dataset-ingest.rst b/doc/source/ray-air/api/dataset-ingest.rst
@@ -3,7 +3,7 @@ Ray Data Ingest into AIR Trainers
 
 .. seealso::
 
-    See this :ref:`AIR Data ingest guide <air-ingest>` for usage examples.
+    See this :ref:`AIR Data ingest guide <data-ingest-torch>` for usage examples.
 
 .. currentmodule:: ray
 

diff --git a/doc/source/ray-air/api/predictor.rst b/doc/source/ray-air/api/predictor.rst
@@ -1,11 +1,6 @@
 Predictor
 =========
 
-.. seealso::
-
-    See this :ref:`user guide on performing model inference <air-predictors>` in
-    AIR for usage examples.
-
 .. currentmodule:: ray.train
 
 Predictor Interface

diff --git a/doc/source/ray-air/computer-vision.rst b/doc/source/ray-air/computer-vision.rst
@@ -183,7 +183,7 @@ Training vision models
             :end-before: __torch_trainer_stop__
             :dedent:
 
-        For more in-depth examples, see :ref:`Using Trainers <train-getting-started>`.
+        For more in-depth examples, see :ref:`the Ray Train documentation <train-docs>`.
 
     .. tab-item:: TensorFlow
 
@@ -202,7 +202,7 @@ Training vision models
             :end-before: __tensorflow_trainer_stop__
             :dedent:
 
-        For more information, check out :ref:`the Ray Train documentation <train-getting-started>`.
+        For more information, check out :ref:`the Ray Train documentation <train-docs>`.
 
 Creating checkpoints
 --------------------
@@ -259,8 +259,6 @@ image datasets.
             :end-before: __torch_batch_predictor_stop__
             :dedent:
 
-        For more in-depth examples, read :ref:`Using Predictors for Inference <air-predictors>`.
-
     .. tab-item:: TensorFlow
 
         To create a :class:`~ray.train.batch_predictor.BatchPredictor`, call
@@ -272,8 +270,6 @@ image datasets.
             :end-before: __tensorflow_batch_predictor_stop__
             :dedent:
 
-        For more information, read :ref:`Using Predictors for Inference <air-predictors>`.
-
 Serving vision models
 ---------------------
 

diff --git a/doc/source/ray-air/examples/batch_forecasting.ipynb b/doc/source/ray-air/examples/batch_forecasting.ipynb
@@ -1167,7 +1167,7 @@
     "- We will restore a Prophet or ARIMA model directly from checkpoint, and demonstrate it can be used for prediction.\n",
     "\n",
     "```{tip}\n",
-    "[Ray AIR Predictors](air-predictors) make batch inference easy since they have internal logic to parallelize the inference.\n",
+    "Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
     "```\n"
    ]
   },

diff --git a/doc/source/ray-air/examples/batch_tuning.ipynb b/doc/source/ray-air/examples/batch_tuning.ipynb
@@ -984,7 +984,7 @@
    "metadata": {},
    "source": [
     "```{tip}\n",
-    "[Ray AIR Predictors](air-predictors) make batch inference easy since they have internal logic to parallelize the inference.\n",
+    "Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference.\n",
     "```\n",
     "\n",
     "Finally, we will restore the best and worst models from checkpoint and make predictions. \n",

diff --git a/doc/source/ray-air/examples/convert_existing_tf_code_to_ray_air.ipynb b/doc/source/ray-air/examples/convert_existing_tf_code_to_ray_air.ipynb
@@ -74,7 +74,7 @@
    "source": [
     "First, we load and preprocess the MNIST dataset.\n",
     "\n",
-    "Assumption for this tutorial: your existing code is using the `tf.data.Dataset` native to Tensorflow. This tutorial continues to use `tf.data.Dataset` to allow you to make as few code changes as possible. **Everything in this tutorial is also possible if you choose to use Ray Data, and you will also get the benefits of efficient preprocessing and multi-worker batch prediction.** See [here](train-datasets) for resources to get started with Ray Data."
+    "Assumption for this tutorial: your existing code is using the `tf.data.Dataset` native to Tensorflow. This tutorial continues to use `tf.data.Dataset` to allow you to make as few code changes as possible. **Everything in this tutorial is also possible if you choose to use Ray Data, and you will also get the benefits of efficient preprocessing and multi-worker batch prediction.** See [here](data-ingest-torch) for resources to get started with Ray Data."
    ]
   },
   {
@@ -519,9 +519,7 @@
     "\n",
     "A few notes on the configs set below:\n",
     "- `train_loop_config` sets the hyperparameters passed into the training loop as the `config` parameter\n",
-    "- `scaling_config` configures **how many parallel workers to use**, the **resources required per worker**, and whether we want to **enable GPU training** or not.\n",
-    "\n",
-    "See this [configuration guide](train-config) for more details on how to configure the trainer."
+    "- `scaling_config` configures **how many parallel workers to use**, the **resources required per worker**, and whether we want to **enable GPU training** or not."
    ]
   },
   {
@@ -617,8 +615,6 @@
     "\n",
     "In our [other examples](ref-ray-examples) you can learn how to do more things with Ray, such as **serving your model with Ray Serve** or **tune your hyperparameters with Ray Tune**. You can also learn how to perform {ref}`offline batch inference <batch_inference_home>` with Ray Data.\n",
     "\n",
-    "See [this table](train-framework-catalog) for a full catalog of frameworks that AIR supports out of the box.\n",
-    "\n",
     "We hope this tutorial gave you a good starting point to leverage Ray AIR. If you have any questions, suggestions, or run into any problems pelase reach out on [Discuss](https://discuss.ray.io/), [GitHub](https://github.com/ray-project/ray) or the [Ray Slack](https://forms.gle/9TSdDYUgxYs8SA9e8)!"
    ]
   }

diff --git a/doc/source/ray-air/examples/gptj_batch_prediction.ipynb b/doc/source/ray-air/examples/gptj_batch_prediction.ipynb
@@ -224,7 +224,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. This is because Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.train.Checkpoint>`, which we don't for this example. See {ref}`air-predictors` for more information and usage examples."
+    "You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. This is because Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.train.Checkpoint>`, which we don't for this example. See {class}`ray.train.predictor.Predictor` for more information and usage examples."
    ]
   }
  ],

diff --git a/doc/source/ray-air/examples/stablediffusion_batch_prediction.ipynb b/doc/source/ray-air/examples/stablediffusion_batch_prediction.ipynb
@@ -224,7 +224,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. This is because AIR does not implement an out of the box Predictor for Diffusers. We could implement it ourselves, but Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.air.checkpoint.Checkpoint>`, and those are not necessary for this example. See {ref}`air-predictors` for more information and usage examples."
+    "You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. This is because AIR does not implement an out of the box Predictor for Diffusers. We could implement it ourselves, but Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.air.checkpoint.Checkpoint>`, and those are not necessary for this example. See {class}`ray.train.predictor.Predictor` for more information and usage examples."
    ]
   }
  ],

diff --git a/doc/source/ray-overview/use-cases.rst b/doc/source/ray-overview/use-cases.rst
@@ -130,7 +130,7 @@ Learn more about the Tune library with the following talks and user guides.
 Distributed Training
 --------------------
 
-The :ref:`Ray Train <train-userguides>` library integrates many distributed training frameworks under a simple Trainer API,
+The :ref:`Ray Train <train-docs>` library integrates many distributed training frameworks under a simple Trainer API,
 providing distributed orchestration and management capabilities out of the box.
 
 In contrast to training many models, model parallelism partitions a large model across many machines for training. Ray Train has built-in abstractions for distributing shards of models and running training in parallel.

diff --git a/doc/source/ray-references/glossary.rst b/doc/source/ray-references/glossary.rst
@@ -99,7 +99,7 @@ documentation, sorted alphabetically.
         to compute and apply one gradient update to the model weights.
 
     Batch predictor
-        A :ref:`Ray AIR Batch Predictor<air-predictors>` builds on the Predictor class
+        A :class:`Ray AIR Batch Predictor<ray.train.predictor.Predictor>` builds on the Predictor class
         to parallelize inference on a large dataset. A Batch predictor shards the
         dataset to allow multiple workers to do inference on a smaller number of data
         points and then aggregating all the worker predictions at the end.
@@ -413,7 +413,7 @@ documentation, sorted alphabetically.
     .. TODO: Policy evaluation
 
     Predictor
-        :ref:`An interface for performing inference<air-predictors>` (prediction)
+        :class:`An interface for performing inference<ray.train.predictor.Predictor>` (prediction)
         on input data with a trained model.
 
     Preprocessor
@@ -603,7 +603,7 @@ documentation, sorted alphabetically.
         (e.g., for sharing computed gradients).
 
     Trainer configuration
-        :ref:`A Trainer can be configured in various ways<train-config>`. Some
+        A Trainer can be configured in various ways. Some
         configurations are shared across all trainers, like the RunConfig, which
         configures things like the experiment storage, and ScalingConfig, which
         configures the number of training workers as well as resources needed per