
About endoftext in ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py #86

Closed
qibin0506 opened this issue Mar 26, 2024 · 14 comments · Fixed by #99

Comments

@qibin0506

qibin0506 commented Mar 26, 2024

In the file https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py#L95, the '<|endoftext|>' symbol always appears in the val_data_set, while the train_data_set never contains it.

@rasbt
Owner

rasbt commented Mar 26, 2024

Thanks for the note! I think text_to_token_ids should have included the <|endoftext|> token in allowed_special so it can handle the token if it's encountered. I just updated it, i.e.,

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # Add batch dimension
    return encoded_tensor
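
For context, here is a quick sanity check (a sketch using tiktoken's GPT-2 encoding, not code from the repo) of what allowed_special changes: without it, encode raises a ValueError when the special token appears in the text; with it, the token is encoded as ID 50256.

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "first book <|endoftext|> second book"

# Without allowed_special, tiktoken raises a ValueError because the special
# token appears in the text but has not been explicitly permitted:
# tokenizer.encode(text)  # ValueError

# With allowed_special, the special token is encoded as ID 50256:
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(50256 in ids)  # True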

@rasbt rasbt closed this as completed Mar 26, 2024
@qibin0506
Author

Thanks for the note! I think text_to_token_ids should have included the <|endoftext|> token in allowed_special so it can handle the token if it's encountered. I just updated it, i.e.,

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # Add batch dimension
    return encoded_tensor

Thanks for your reply. But I think the allowed_special keyword in text_to_token_ids only indicates that the tokenizer is allowed to encode the token if it appears; it doesn't insert it, and there is still nothing for it to handle in the train set.
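
To illustrate (a minimal sketch with tiktoken's GPT-2 encoding, not code from the repo): allowed_special only permits the token to be encoded when it already appears in the text; it never inserts it, so a training chunk without the marker still yields no <|endoftext|> token ID.

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# A chunk without the marker: allowed_special changes nothing here
train_chunk = "some training text without a separator"
ids = tokenizer.encode(train_chunk, allowed_special={"<|endoftext|>"})
print(50256 in ids)  # False; the token is only allowed, never inserted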

@rasbt
Owner

rasbt commented Apr 1, 2024

Oh I see, do you mean that the pretraining dataset does not already contain the '<|endoftext|>' tokens in your case?

This is weird, it should have been used when joining the books here:

target_file.write(separator.join(current_content))

Maybe it's due to a bug or gets stripped out later somehow 🤔, I'd have to investigate.

@rasbt rasbt reopened this Apr 1, 2024
@d-kleine
Contributor

d-kleine commented Apr 1, 2024

I am also not a big fan of the SPGC repo as it seems to not be maintained anymore:

  • the last commits are from 2 years ago, and PRs from 2022 and later have not been reviewed
  • Python code is outdated, needs some manual fixes (at least on Windows)
  • no specifications of package versions in the requirements.txt file, etc.

@rasbt
Owner

rasbt commented Apr 1, 2024

I am also not a big fan of the SPGC repo as it seems to not be maintained anymore:

To be honest, though, it worked quite well for me, and it also downloaded all the recent books that have been uploaded since then. No need to update it if it ain't broke, I'd say.

@rasbt
Owner

rasbt commented Apr 1, 2024

I was thinking of including the relevant code in this repo and updating it if necessary (with attribution, of course), but I am not sure that's allowed under the GNU license (please correct me if I'm wrong). In other words, it might force all the code in this repo to adopt the GNU license as well (instead of the current open-source license), and I think a GNU license is too restrictive for most readers. I.e., if a reader wants to use some code from this repo in a work project, it would require making that work project open source under the GNU license, which many companies can't or don't want to do for understandable reasons. So it would do readers more harm than good, in my opinion. In short, I want to give readers maximal freedom in how they can use the code in this repo, to make it maximally useful.

@qibin0506
Author

Oh I see, do you mean that the pretraining dataset does not already contain the '<|endoftext|>' tokens in your case?

This is weird, it should have been used when joining the books here:

target_file.write(separator.join(current_content))

Maybe it's due to a bug or gets stripped out later somehow 🤔, I'd have to investigate.

I see, thank you. The dataset used in pretraining_simple.py is preprocessed by prepare_dataset.py, which already adds the '<|endoftext|>' token.

@d-kleine
Contributor

d-kleine commented Apr 2, 2024

I was thinking of including the relevant code in this repo and updating it if necessary (with attribution, of course), but I am not sure that's allowed under the GNU license (please correct me if I'm wrong). In other words, it might force all the code in this repo to adopt the GNU license as well (instead of the current open-source license), and I think a GNU license is too restrictive for most readers. I.e., if a reader wants to use some code from this repo in a work project, it would require making that work project open source under the GNU license, which many companies can't or don't want to do for understandable reasons. So it would do readers more harm than good, in my opinion. In short, I want to give readers maximal freedom in how they can use the code in this repo, to make it maximally useful.

Yeah, that seems reasonable. The issue for me was that the scripts assume Linux: running Python on Windows, you need to update them (subprocess needs a shell=True argument, there is no rsync, etc.). I used a Docker container with an Ubuntu image instead, and that worked fine for me without needing to change any code in the scripts.
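
For reference, this is the kind of change I mean (a hypothetical sketch, not the actual script code; the command and paths are made up): adding shell=True to the subprocess call and replacing rsync with a pure-Python copy such as shutil.

import subprocess
import shutil

# The scripts' subprocess calls needed shell=True on Windows in my case
# (the command below is only a placeholder, not from the actual scripts):
subprocess.run("python get_data.py", shell=True, check=True)

# rsync is not available on Windows; shutil.copytree is a simple stand-in:
shutil.copytree("data/raw", "data/raw_copy", dirs_exist_ok=True)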

As an idea, you could add the SPGC repo as a subrepo and provide a Dockerfile with it (I don't know how this affects licensing). Or I could share my Dockerfile and add a setup guide to your README on how to use a Docker container running Ubuntu with SPGC.

@rasbt
Owner

rasbt commented Apr 2, 2024

No worries @qibin0506

I see, thank you. The dataset used in pretraining_simple.py is preprocessed by prepare_dataset.py, which already adds the '<|endoftext|>' token.

I double-checked, and everything looks ok: the text files contain the <|endoftext|> tokens. They are inserted when running the prepare_dataset.py script, which concatenates multiple books into larger files for efficiency during training. This way, you don't have 70,000 small files but only a few hundred larger files that need to be loaded during training. I.e., the prepare_dataset.py script creates files whose contents look like

<book 1 content> <|endoftext|> <book 2 content> <|endoftext|> <book 3 content> <|endoftext|> ...
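
A minimal sketch of that concatenation step (an illustration of the idea, not the actual prepare_dataset.py code; the directory and file names are made up):

import os

separator = " <|endoftext|> "

# Read all downloaded book files (illustrative directory name)
book_dir = "gutenberg_books"
current_content = []
for name in sorted(os.listdir(book_dir)):
    with open(os.path.join(book_dir, name), "r", encoding="utf-8") as f:
        current_content.append(f.read())

# Join the books with the end-of-text marker between them and write one combined file
with open("combined_001.txt", "w", encoding="utf-8") as target_file:
    target_file.write(separator.join(current_content))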


@d-kleine I see, thanks for the explanation!

I wish I could include their repo as a subrepo and make the respective changes. However, I think this could impose their restrictive licensing terms on this entire combined work, which I don't want to risk. I wish they used a more permissive open-source license like MIT, BSD, or Apache, so that there wouldn't be this problem.

Regarding Docker, would the current Docker container at https://github.com/rasbt/LLMs-from-scratch/tree/main/.devcontainer work for this?

I'd also be happy to write up a README section for Windows users covering what you mentioned, i.e., either using Docker or making the changes with shell=True and replacing rsync (what would you replace rsync with?).

But perhaps an easier solution is to run this in a WSL (Windows Subsystem for Linux) terminal? I heard that's what Windows users usually use for the command line. Would that work?

@d-kleine
Contributor

d-kleine commented Apr 2, 2024

Regarding Docker, would the current Docker container at https://github.com/rasbt/LLMs-from-scratch/tree/main/.devcontainer work for this?

Yes, I could update that and add instructions to your README on how to use the container with the gutenberg repo.

I'd also be happy to write up a README section for Windows users covering what you mentioned, i.e., either using Docker or making the changes with shell=True and replacing rsync (what would you replace rsync with?).

But perhaps an easier solution is to run this in a WSL (Windows Subsystem for Linux) terminal? I heard that's what Windows users usually use for the command line. Would that work?

Yeah, I agree, using WSL2 on Windows would be the most convenient option in this case. Windows users would then need to install Ubuntu to run WSL2 and install a few packages in Ubuntu. I could also add instructions to your README on how to set this up.

@rasbt
Owner

rasbt commented Apr 2, 2024

Thanks for offering. I took a crack at it and added a note about WSL in #99. Do you think it's sufficient to just link to Microsoft's official docs for WSL?

@d-kleine
Contributor

d-kleine commented Apr 2, 2024

Yeah, I think that's sufficient for now.
I will update the Dockerfile and create a PR then.

@rasbt rasbt closed this as completed in #99 Apr 2, 2024
@d-kleine
Contributor

d-kleine commented Apr 6, 2024

BTW, it took me 6+ hours to download only 50% of the data. I tried it on different days and at different times, and the download rate was always really slow (around 1,000 kbit/s; I monitored it in real time). At least you can stop the download via the terminal and resume it later: already downloaded files are checked, and only the missing files are retrieved. For learning purposes, this amount of data was sufficient for me, as I only trained on the first combined 500 MB file (otherwise the training would have taken too long). This was a small but really cool real-world project! 👍

@rasbt
Owner

rasbt commented Apr 9, 2024

Whoa! And yeah, downloading such large amounts of data is pretty expensive. Ha, and the pretraining would take even longer. For individuals, pretraining your own LLMs really isn't feasible, and there is no magic trick around that. But hey, at least we can use pretrained weights for the finetuning stages :)
