Skip to content

Commit

Permalink
Updated Dockerfile and notes for Windows users using gutenberg project (
Browse files Browse the repository at this point in the history
#100)

* Gutenberg for Windows users (#99)

* Updated Dockerfile with rsync for Gutenberg repo

* Updated Windows note for gutenberg project

* Restructure

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
  • Loading branch information
d-kleine and rasbt authored Apr 2, 2024
1 parent 607242c commit e9584ac
Show file tree
Hide file tree
Showing 2 changed files with 27 additions and 4 deletions.
1 change: 1 addition & 0 deletions .devcontainer/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime

RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y rsync && \
apt-get install -y git && \
rm -rf /var/lib/apt/lists/*

Expand Down
30 changes: 26 additions & 4 deletions ch05/03_bonus_pretraining_on_gutenberg/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ As the Project Gutenberg website states, "the vast majority of Project Gutenberg
Please read the [Project Gutenberg Permissions, Licensing and other Common Requests](https://www.gutenberg.org/policy/permission.html) page for more information about using the resources provided by Project Gutenberg.

&nbsp;
## How to use this code
## How to Use This Code

&nbsp;

Expand All @@ -17,7 +17,11 @@ In this section, we download books from Project Gutenberg using code from the [`

As of this writing, this will require approximately 50 GB of disk space, but it may be more depending on how much Project Gutenberg grew since then.

Follow these steps to download the dataset:
&nbsp;
#### Download instructions for Linux and macOS users


Linux and macOS users can follow these steps to download the dataset (if you are a Windows user, please see the note below):


1. `git clone https://github.com/pgcorpus/gutenberg.git`
Expand All @@ -31,8 +35,26 @@ Follow these steps to download the dataset:
5. `cd ..`

&nbsp;
#### Special instructions for Windows users

The [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) code is compatible with both Linux and macOS. However, Windows users would have to make small adjustments, such as adding `shell=True` to the `subprocess` calls and replacing `rsync`.

Alternatively, an easier way to run this code on Windows is by using the "Windows Subsystem for Linux" (WSL) feature, which allows users to run a Linux environment using Ubuntu in Windows. For more information, please read [Microsoft's official installation instruction](https://learn.microsoft.com/en-us/windows/wsl/install) and [tutorial](https://learn.microsoft.com/en-us/training/modules/wsl-introduction/).

When using WSL, please make sure you have Python 3 installed (check via `python3 --version`, or install it for instance with `sudo apt-get install -y python3.10` for Python 3.10) and install following packages there:

```bash
sudo apt-get update && \
sudo apt-get upgrade -y && \
sudo apt-get install -y python3-pip && \
sudo apt-get install -y python-is-python3 && \
sudo apt-get install -y rsync && \
```

> [!NOTE]
> The [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) code is compatible with both Linux and macOS. However, Windows users would have to make small adjustments, such as adding `shell=True` to the `subprocess` calls and replacing `rsync`. Alternatively, an easier way to run this code on Windows is by using the "Windows Subsystem for Linux" feature, which allows users to run a Linux environment in Windows. For more information, please read [Microsoft's official installation instruction](https://learn.microsoft.com/en-us/windows/wsl/install) and [tutorial](https://learn.microsoft.com/en-us/training/modules/wsl-introduction/).
> Instructions about how to set up Python and installing packages can be found in [Appendix A: Optional Python Setup Preferences](../../appendix-A/01_optional-python-setup-preferences/README.md) and [Appendix A: Installing Python Libraries](../../appendix-A/02_installing-python-libraries/README.md).
>
> Optionally, a Docker image running Ubuntu is provided with this repository. When having cloned the [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) GitHub repository, copy the *.devcontainer* folder of this `LLMs-from-scratch` repository and paste it to the locally cloned `gutenberg` repository. Instructions about how to run a container with the provided Docker image can be found in [Appendix A: Optional Docker Environment](../../appendix-A/04_optional-docker-environment/README.md).
&nbsp;
### 2) Prepare the dataset
Expand Down Expand Up @@ -125,4 +147,4 @@ Note that this code focuses on keeping things simple and minimal for educational
6. Add distributed data parallelism (DDP) and train the model on multiple GPUs (see section *A.9.3 Training with multiple GPUs* in appendix A; [DDP-script.py](../../appendix-A/03_main-chapter-code/DDP-script.py)).
7. Swap the from scratch `MultiheadAttention` class in the `previous_chapter.py` script with the efficient `MHAPyTorchScaledDotProduct` class implemented in the [Efficient Multi-Head Attention Implementations](../../ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb) bonus section, which uses Flash Attention via PyTorch's `nn.functional.scaled_dot_product_attention` function.
8. Speeding up the training by optimizing the model via [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) (`model = torch.compile`) or [thunder](https://github.com/Lightning-AI/lightning-thunder) (`model = thunder.jit(model)`).
9. Implement Gradient Low-Rank Projection (GaLore) to further speed up the pretraining process. This can be achieved by just replacing the `AdamW` optimizer with the provided `GaLoreAdamW` provided in the [GaLore Python library](https://github.com/jiaweizzhao/GaLore).
9. Implement Gradient Low-Rank Projection (GaLore) to further speed up the pretraining process. This can be achieved by just replacing the `AdamW` optimizer with the provided `GaLoreAdamW` provided in the [GaLore Python library](https://github.com/jiaweizzhao/GaLore).

0 comments on commit e9584ac

Please sign in to comment.