diff --git a/generative/maisi/maisi_diff_unet_training_tutorial.ipynb b/generative/maisi/maisi_diff_unet_training_tutorial.ipynb
index cf45070605..1072a617cc 100644
--- a/generative/maisi/maisi_diff_unet_training_tutorial.ipynb
+++ b/generative/maisi/maisi_diff_unet_training_tutorial.ipynb
@@ -22,7 +22,11 @@
    "id": "777b7dcb",
    "metadata": {},
    "source": [
-    "# Training a 3D Diffusion Model for Generating 3D Images with Various Sizes and Spacings"
+    "# Training a 3D Diffusion Model for Generating 3D Images with Various Sizes and Spacings\n",
+    "\n",
+    "![Generated image examples](https://developer-blogs.nvidia.com/wp-content/uploads/2024/06/image3.png)\n",
+    "\n",
+    "In this notebook, we detail the procedure for training a 3D latent diffusion model to generate high-dimensional 3D medical images. Because generating large images (e.g., 512 x 512 x 512 or greater) can exhaust the memory of most GPUs, we structure the training process into two primary steps: 1) generating image embeddings and 2) training the 3D latent diffusion model. The subsequent sections demonstrate the entire process using a simulated dataset."
    ]
   },
   {
@@ -41,7 +45,8 @@
    "outputs": [],
    "source": [
     "!python -c \"import monai\" || pip install -q \"monai-weekly[pillow, tqdm]\"\n",
-    "!python -c \"import xformers\" || pip install -q xformers --index-url https://download.pytorch.org/whl/cu121"
+    "!python -c \"import xformers\" || pip install -q xformers --index-url https://download.pytorch.org/whl/cu121\n",
+    "# The Python package \"xformers\" is essential for improving model training efficiency and reducing the GPU memory footprint."
    ]
   },
   {
@@ -127,9 +132,9 @@
    "id": "d8e29c23",
    "metadata": {},
    "source": [
-    "## Simulate a special dataset\n",
+    "### Simulate a special dataset\n",
     "\n",
-    "It is well known that AI takes time to train. We will simulate a small dataset and run training only for multiple epochs. Due to the nature of AI, the performance shouldn't be highly expected, but the entire pipeline will be completed within minutes!\n",
+    "It is widely recognized that training AI models is a time-intensive process. Here, we simulate a small dataset and train for only a few epochs. While performance will not reach optimal levels due to the abbreviated training, the entire pipeline completes within minutes.\n",
     "\n",
     "`sim_datalist` provides the information of the simulated datasets. It lists 2 training images. The size of the dimension is defined by the `sim_dim`."
    ]
   },
   {
@@ -151,9 +156,9 @@
    "id": "b9ac7677",
    "metadata": {},
    "source": [
-    "## Generate images\n",
+    "### Generate simulated images\n",
     "\n",
-    "Now we can use MONAI `create_test_image_3d` and `nib.Nifti1Image` functions to generate the 3D simulated images under the work_dir"
+    "Now we can use the MONAI `create_test_image_3d` and `nib.Nifti1Image` functions to generate the 3D simulated images under the `work_dir`."
    ]
   },
   {
@@ -198,7 +203,9 @@
    "id": "c2389853",
    "metadata": {},
    "source": [
-    "## Set up directories and configurations"
+    "### Set up directories and configurations\n",
+    "\n",
+    "To keep the demonstration time-efficient, we reduce the number of training epochs to 2. Additionally, we change the `num_splits` parameter in [AutoencoderKlMaisi](https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/generation/maisi/networks/autoencoderkl_maisi.py#L873) from its default value of 16 to 4. `num_splits` controls how finely feature maps are split along the spatial dimensions so that each convolution can run as a for-loop over smaller chunks, trading speed for a lower GPU memory footprint; the smaller value suits the small simulated inputs used here."
    ]
   },
   {
@@ -327,7 +334,11 @@
    "id": "1c904f52",
    "metadata": {},
    "source": [
-    "## Step 1: Create Training Data"
+    "## Step 1: Create Training Data\n",
+    "\n",
+    "To train the latent diffusion model, we first save the latent features produced by the autoencoder's encoder to local storage. The latent diffusion model can then read these features directly, saving both time and GPU memory during training. We also provide a multi-GPU script that saves the latent features of all training images, significantly accelerating the creation of the full training set.\n",
+    "\n",
+    "The diffusion model uses a U-shaped convolutional neural network, so its input and output dimensions must match. Because the U-shaped network repeatedly halves and then restores the spatial resolution, each dimension of the resampled input should be divisible by a sufficiently large power of 2; in this case, we choose dimensions that are multiples of 128."
    ]
   },
   {
@@ -391,7 +402,9 @@
    "id": "ec5c0c4a",
    "metadata": {},
    "source": [
-    "## Create .json files for embedding files"
+    "### Create .json files for embedding files\n",
+    "\n",
+    "The diffusion model requires additional input attributes: the output dimensions, the output spacing, and the top/bottom body regions covered by the image. The dimensions and spacing can be extracted from the header of each training image. The top and bottom body regions can be determined by manual inspection or from segmentation masks produced by tools such as [TotalSegmentator](https://github.com/wasserth/TotalSegmentator) or [MONAI VISTA](https://github.com/Project-MONAI/VISTA). Body regions are encoded as 4-dimensional one-hot vectors: the head and neck region is [1,0,0,0], the chest region is [0,1,0,0], the abdomen region is [0,0,1,0], and the lower body region (below the abdomen) is [0,0,0,1]. The additional attributes are saved in a separate .json file. In the following example, we assume that the images cover the chest and abdomen regions."
    ]
   },
   {
@@ -466,7 +479,11 @@
    "id": "e81a9e48",
    "metadata": {},
    "source": [
-    "## Step 2: Train the Model"
+    "## Step 2: Train the Model\n",
+    "\n",
+    "After all latent features have been created, we launch the multi-GPU script to train the latent diffusion model.\n",
+    "\n",
+    "Image generation uses the [DDPM scheduler](https://arxiv.org/pdf/2006.11239) with 1,000 iterative steps. The diffusion model is optimized with an L1 loss and a decaying learning-rate schedule. The batch size is set to 1."
    ]
   },
   {
@@ -533,7 +550,9 @@
    "id": "4bdf7b17",
    "metadata": {},
    "source": [
-    "## Step 3: Infer using the Trained Model"
+    "## Step 3: Infer using the Trained Model\n",
+    "\n",
+    "Once the latent diffusion model is trained, we can use the multi-GPU script to run inference. By combining the diffusion model with the autoencoder's decoder, this step generates 3D images with the specified top/bottom body regions, spacing, and dimensions."
    ]
   },
   {
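
The "Generate simulated images" hunk above leans on MONAI's `create_test_image_3d` together with nibabel. A minimal sketch of that step, assuming illustrative file names, paths, and sizes rather than the notebook's exact `sim_datalist`:

```python
import os

import nibabel as nib
import numpy as np
from monai.data import create_test_image_3d

work_dir = "./temp_work_dir"  # illustrative path; the notebook defines its own work_dir
os.makedirs(work_dir, exist_ok=True)

sim_dim = (128, 128, 128)  # simulated image size; the notebook's sim_dim may differ
for idx in range(2):  # the simulated datalist contains 2 training images
    # create_test_image_3d returns a synthetic volume and a matching segmentation
    img, _ = create_test_image_3d(
        *sim_dim, rad_max=10, num_seg_classes=1, random_state=np.random.RandomState(idx)
    )
    nib.save(
        nib.Nifti1Image(img.astype(np.float32), affine=np.eye(4)),
        os.path.join(work_dir, f"tr_image_{idx:03d}.nii.gz"),
    )
```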
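
Step 1 amounts to running each training image through the frozen autoencoder encoder once and caching the result. A schematic sketch, assuming an AutoencoderKL-style model whose `encode()` returns `(z_mu, z_sigma)` and a loader that yields `(1, 1, D, H, W)` tensors; the tutorial itself delegates this step to MAISI's multi-GPU script:

```python
from pathlib import Path

import torch


@torch.no_grad()
def cache_latents(autoencoder, image_paths, load_fn, out_dir):
    """Encode each training image once and cache its latent features on disk."""
    autoencoder.eval()
    for path in image_paths:
        image = load_fn(path)  # assumed to return a (1, 1, D, H, W) tensor
        z_mu, _ = autoencoder.encode(image)  # keep the mean of the latent distribution
        torch.save(z_mu.cpu(), Path(out_dir) / (Path(path).name.split(".")[0] + "_emb.pt"))
```

Caching `z_mu` once, rather than re-encoding every epoch, is what saves the time and GPU memory mentioned in the hunk.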
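
The per-image .json described in the "Create .json files" hunk can be assembled from the NIfTI header plus the one-hot region vectors. In this sketch the key names (`dim`, `spacing`, `top_region_index`, `bottom_region_index`) are assumptions about the expected schema; check the notebook cell for the authoritative names:

```python
import json

import nibabel as nib


def write_metadata(image_path, json_path):
    """Write the extra diffusion-model inputs for one training image."""
    img = nib.load(image_path)
    meta = {
        # spatial size and voxel spacing read from the NIfTI header
        "dim": [int(s) for s in img.shape],
        "spacing": [float(s) for s in img.header.get_zooms()],
        # one-hot body regions: [head/neck, chest, abdomen, lower body];
        # this example assumes the volume spans chest (top) to abdomen (bottom)
        "top_region_index": [0, 1, 0, 0],
        "bottom_region_index": [0, 0, 1, 0],
    }
    with open(json_path, "w") as f:
        json.dump(meta, f, indent=4)
```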
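
The Step 2 hunk names the recipe (DDPM with 1,000 steps, L1 loss, batch size 1). Written against the standard DDPM forward process `x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps` rather than any particular scheduler API, one training step looks roughly like this; the `model(noisy, t, cond)` signature is a placeholder:

```python
import torch
import torch.nn.functional as F

T = 1000  # diffusion timesteps, matching the tutorial's DDPM setting
betas = torch.linspace(1e-4, 0.02, T)           # standard DDPM linear beta schedule
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # cumulative product a_bar_t


def train_step(model, optimizer, latents, cond):
    """One DDPM step on a batch of cached latents; `model` predicts the noise."""
    t = torch.randint(0, T, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    a_bar = alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1, 1)
    noisy = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise  # q(x_t | x_0)
    loss = F.l1_loss(model(noisy, t, cond), noise)  # L1 loss, per the hunk above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```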
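
For Step 3, ancestral DDPM sampling runs the model backwards through the same schedule and hands the final latent to the autoencoder's decoder. A self-contained sketch under the same placeholder signatures as the training sketch above:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, 0)


@torch.no_grad()
def sample(model, autoencoder, cond, latent_shape):
    """Generate one batch of latents with ancestral DDPM sampling, then decode."""
    x = torch.randn(latent_shape)
    for t in reversed(range(T)):
        t_batch = torch.full((latent_shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch, cond)  # predicted noise at step t
        a, a_bar = alphas[t], alphas_cumprod[t]
        # posterior mean: (x - beta_t / sqrt(1 - a_bar_t) * eps) / sqrt(a_t)
        x = (x - (1 - a) / (1 - a_bar).sqrt() * eps) / a.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # sigma_t = sqrt(beta_t)
    return autoencoder.decode(x)  # decoder maps latents back to a full 3D volume
```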