
About endoftext in ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py #86

Closed
qibin0506 opened this issue Mar 26, 2024 · 14 comments · Fixed by #99

Comments

@qibin0506

qibin0506 commented Mar 26, 2024

In the file https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py#L95, the '<|endoftext|>' symbol always appears in the val_data_set, while the train_data_set never contains it.

@rasbt
Owner

rasbt commented Mar 26, 2024

Thanks for the note! I think text_to_token_ids should have included the <|endoftext|> token in allowed_special so it can handle the token if it's encountered. I just updated it, i.e.,

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # Add batch dimension
    return encoded_tensor
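
For context, here is a quick sanity check (a sketch using tiktoken's GPT-2 encoding, not code from the repo) of what allowed_special changes: without it, encode raises a ValueError when the special token appears in the text; with it, the token is encoded as ID 50256.

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "first book <|endoftext|> second book"

# Without allowed_special, tiktoken raises a ValueError because the special
# token appears in the text but has not been explicitly permitted:
# tokenizer.encode(text)  # ValueError

# With allowed_special, the special token is encoded as ID 50256:
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(50256 in ids)  # True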

@rasbt rasbt closed this as completed Mar 26, 2024
@qibin0506
Author

Thanks for the note! I think text_to_token_ids should have included the <|endoftext|> token in allowed_special so it can handle the token if it's encountered. I just updated it, i.e.,

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # Add batch dimension
    return encoded_tensor

Thanks for your reply. But I think the allowed_special keyword in text_to_token_ids only indicates that the tokenizer is allowed to encode the token if it appears; it doesn't insert it, and there is still nothing for it to handle in the train set.
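
To illustrate (a minimal sketch with tiktoken's GPT-2 encoding, not code from the repo): allowed_special only permits the token to be encoded when it already appears in the text; it never inserts it, so a training chunk without the marker still yields no <|endoftext|> token ID.

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# A chunk without the marker: allowed_special changes nothing here
train_chunk = "some training text without a separator"
ids = tokenizer.encode(train_chunk, allowed_special={"<|endoftext|>"})
print(50256 in ids)  # False; the token is only allowed, never inserted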

@rasbt
Owner

rasbt commented Apr 1, 2024

Oh I see, do you mean that the pretraining dataset does not already contain the '<|endoftext|>' tokens in your case?

This is weird, it should have been used when joining the books here:

target_file.write(separator.join(current_content))

Maybe it's due to a bug or gets stripped out later somehow 🤔, I'd have to investigate.

@rasbt rasbt reopened this Apr 1, 2024
@d-kleine
Contributor

d-kleine commented Apr 1, 2024

I am also not a big fan of the SPGC repo as it seems to not be maintained anymore:

  • the last commits are from 2 years ago, and PRs from 2022 and later have not been reviewed
  • Python code is outdated, needs some manual fixes (at least on Windows)
  • no specifications of package versions in the requirements.txt file, etc.

@rasbt
Owner

rasbt commented Apr 1, 2024

I am also not a big fan of the SPGC repo as it seems to not be maintained anymore:

To be honest, though, it worked quite well for me, and it also downloaded all the recent books that have been uploaded since then. No need to update it if it ain't broke, I'd say.

@rasbt
Owner

rasbt commented Apr 1, 2024

I was thinking of including the relevant code in this repo and updating it if necessary (with attribution, of course), but I am not sure that's allowed under the GNU license (please correct me if I'm wrong). In other words, it might force all the code in this repo to adopt the GNU license as well (instead of the current open-source license), and I think a GNU license is too restrictive for most readers. I.e., if a reader wants to use some code from this repo in a work project, it would require making that work project open source under the GNU license, which many companies can't or don't want to do for understandable reasons. So it would do readers more harm than good, in my opinion. In short, I want to give readers maximal freedom in how they can use the code in this repo, to make it maximally useful.

@qibin0506
Author

Oh I see, do you mean that the pretraining dataset does not already contain the '<|endoftext|>' tokens in your case?

This is weird, it should have been used when joining the books here:

target_file.write(separator.join(current_content))

Maybe it's due to a bug or gets stripped out later somehow 🤔, I'd have to investigate.

I see, thank you. The dataset used in pretraining_simple.py is preprocessed by prepare_dataset.py, which already adds the '<|endoftext|>' token.

@d-kleine
Contributor

d-kleine commented Apr 2, 2024

I was thinking of including the relevant code in this repo and updating it if necessary (with attribution, of course), but I am not sure that's allowed under the GNU license (please correct me if I'm wrong). In other words, it might force all the code in this repo to adopt the GNU license as well (instead of the current open-source license), and I think a GNU license is too restrictive for most readers. I.e., if a reader wants to use some code from this repo in a work project, it would require making that work project open source under the GNU license, which many companies can't or don't want to do for understandable reasons. So it would do readers more harm than good, in my opinion. In short, I want to give readers maximal freedom in how they can use the code in this repo, to make it maximally useful.

Yeah, that seems reasonable. The issue for me was that the scripts assume Linux: running Python on Windows, you need to update them (subprocess needs a shell=True argument, there is no rsync, etc.). I used a Docker container with an Ubuntu image instead, and that worked fine for me without needing to change any code in the scripts.
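
For reference, this is the kind of change I mean (a hypothetical sketch, not the actual script code; the command and paths are made up): adding shell=True to the subprocess call and replacing rsync with a pure-Python copy such as shutil.

import subprocess
import shutil

# The scripts' subprocess calls needed shell=True on Windows in my case
# (the command below is only a placeholder, not from the actual scripts):
subprocess.run("python get_data.py", shell=True, check=True)

# rsync is not available on Windows; shutil.copytree is a simple stand-in:
shutil.copytree("data/raw", "data/raw_copy", dirs_exist_ok=True)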

As an idea, you could add the SPGC repo as a subrepo and provide a Dockerfile with it (I don't know how this affects licensing). Or I could share my Dockerfile and add a setup guide to your README on how to use a Docker container running Ubuntu with SPGC.

@rasbt
Owner

rasbt commented Apr 2, 2024

No worries @qibin0506

I see, thank you. The dataset used in pretraining_simple.py is preprocessed by prepare_dataset.py, which already adds the '<|endoftext|>' token.

I double-checked, and everything looks ok: the text files contain the <|endoftext|> tokens. They are inserted when running the prepare_dataset.py script, which concatenates multiple books into larger files for efficiency during training. This way, you don't have 70,000 small files but only a few hundred larger files that need to be loaded during training. I.e., the prepare_dataset.py script creates files whose contents look like

<book 1 content> <|endoftext|> <book 2 content> <|endoftext|> <book 3 content> <|endoftext|> ...
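
A minimal sketch of that concatenation step (an illustration of the idea, not the actual prepare_dataset.py code; the directory and file names are made up):

import os

separator = " <|endoftext|> "

# Read all downloaded book files (illustrative directory name)
book_dir = "gutenberg_books"
current_content = []
for name in sorted(os.listdir(book_dir)):
    with open(os.path.join(book_dir, name), "r", encoding="utf-8") as f:
        current_content.append(f.read())

# Join the books with the end-of-text marker between them and write one combined file
with open("combined_001.txt", "w", encoding="utf-8") as target_file:
    target_file.write(separator.join(current_content))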


@d-kleine I see, thanks for the explanation!

I wish I could include their repo as a subrepo and make the respective changes. However, I think this could impose their restrictive licensing terms on this entire combined work, which I don't want to risk. I wish they used a more permissive open-source license like MIT, BSD, or Apache, so that there wouldn't be this problem.

Regarding Docker, would the current Docker container at https://github.com/rasbt/LLMs-from-scratch/tree/main/.devcontainer work for this?

I'd also be happy to write up a README section for Windows users covering what you mentioned, i.e., either using Docker or making the changes with shell=True and replacing rsync (what would you replace rsync with?).

But perhaps an easier solution is to run this in a WSL (Windows Subsystem for Linux) terminal? I heard that's what Windows users usually use for the command line. Would that work?

@d-kleine
Contributor

d-kleine commented Apr 2, 2024

Regarding Docker, would the current Docker container at https://github.com/rasbt/LLMs-from-scratch/tree/main/.devcontainer work for this?

Yes, I could update that and add instructions to your README on how to use the container with the gutenberg repo.

I'd also be happy to write up a README section for Windows users covering what you mentioned, i.e., either using Docker or making the changes with shell=True and replacing rsync (what would you replace rsync with?).

But perhaps an easier solution is to run this in a WSL (Windows Subsystem for Linux) terminal? I heard that's what Windows users usually use for the command line. Would that work?

Yeah, I agree, using WSL2 on Windows would be the most convenient option in this case. Windows users would then need to install Ubuntu to run WSL2 and install a few packages in Ubuntu. I could also add instructions to your README on how to set this up.

@rasbt
Owner

rasbt commented Apr 2, 2024

Thanks for offering. I took a crack at it and added a note about WSL in #99. Do you think it's sufficient to just link to Microsoft's official docs for WSL?

@d-kleine
Contributor

d-kleine commented Apr 2, 2024

Yeah, I think that's sufficient for now.
I will update the Dockerfile and create a PR then.

@rasbt rasbt closed this as completed in #99 Apr 2, 2024
@d-kleine
Contributor

d-kleine commented Apr 6, 2024

BTW, it took me 6+ hours to download only 50% of the data. I tried it on different days and at different times, and the download rate was always really slow (around 1,000 kbit/s; I monitored it in real time). At least you can stop the download via the terminal and resume it later: already downloaded files are checked, and only the missing files are retrieved. For learning purposes, this amount of data was sufficient for me, as I only trained on the first combined 500 MB file (otherwise the training would have taken too long). This was a small but really cool real-world project! 👍

@rasbt
Owner

rasbt commented Apr 9, 2024

Whoa! And yeah, downloading such large amounts of data is pretty expensive. Ha, and the pretraining would take even longer. For individuals, pretraining your own LLMs really isn't feasible, and there is no magic trick around that. But hey, at least we can use pretrained weights for the finetuning stages :)
