About endoftext in ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py #86
Thanks for the note! I think the `text_to_token_ids` function takes care of the special token:

```python
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # Add batch dimension
    return encoded_tensor
```
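For reference, here is a minimal usage sketch (assuming the `tiktoken` and `torch` packages are installed) showing that, with `allowed_special`, the marker is encoded as a single token id (50256 for the GPT-2 encoding) rather than being split into pieces:

```python
# Minimal usage sketch; assumes tiktoken and torch are installed.
import tiktoken
import torch

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # Add batch dimension
    return encoded_tensor

tokenizer = tiktoken.get_encoding("gpt2")
batch = text_to_token_ids("First book ends. <|endoftext|> Second book begins.", tokenizer)

print(batch.shape)        # one row (batch dimension), seq_len columns
print(50256 in batch[0])  # True: the special token survives as a single id
```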
Thanks for your reply. But I think the training data still does not contain the `<|endoftext|>` token.
Oh I see, do you mean that the pretraining dataset does not already contain the `<|endoftext|>` token? This is weird; it should have been inserted when joining the books here:
Maybe it's due to a bug, or it gets stripped out later somehow 🤔. I'd have to investigate.
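For context, here is a rough sketch of what that joining step would look like (illustrative only, not necessarily the repo's exact code; file names and paths are placeholders):

```python
# Illustrative sketch of joining downloaded books with the <|endoftext|> separator.
# Not the repo's exact code; paths and names below are placeholders.
from pathlib import Path

SEPARATOR = " <|endoftext|> "

def combine_books(book_paths, output_path):
    texts = [Path(p).read_text(encoding="utf-8", errors="ignore").strip() for p in book_paths]
    combined = SEPARATOR.join(texts)  # separator appears between each pair of books
    Path(output_path).write_text(combined, encoding="utf-8")

# combine_books(["book1.txt", "book2.txt"], "combined_1.txt")
```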
I am also not a big fan of the SPGC repo, as it seems to not be maintained anymore.
It worked quite fine for me, to be honest, and it also downloaded all the recent books that have been uploaded since then. No need to update it if it ain't broke, I'd say.
I was thinking of including the relevant code in this repo and updating it if necessary (with attribution, of course), but I am not sure that's allowed under the GPL (please correct me if I'm wrong). In other words, it might force all the code in this repo to adopt the GPL as well (instead of the current open source license), and I think the GPL is too restrictive for most readers. That is, if a reader wants to use some code from this repo in their work project, it would require making that work project open source under the GPL, which many companies can't or don't want to do for understandable reasons. So it would do the readers more harm than good, in my opinion. In short, I want to give readers maximal freedom in how they can use the code in this repo, to make it maximally useful.
I see. Thank you. The dataset used in ...
Yeah, seems reasonable. The issue for me was that the scripts are Linux-based; running Python on Windows, you need to update the scripts (the `subprocess` calls need adjusting). As an idea, you could add that repo as a subrepo and provide a Dockerfile with it (I don't know how this affects licensing). Or I could share my Dockerfile and add a setup guide to your README on how to use a Docker container running Ubuntu with SPGC.
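As a hypothetical illustration of the kind of adjustment meant here (the function name and commands below are assumptions, not the SPGC scripts' actual code):

```python
# Hypothetical sketch: dispatch the download command differently on Windows,
# e.g. by running it inside WSL. Command and function names are placeholders.
import platform
import subprocess

def run_mirror_command(cmd):
    if platform.system() == "Windows":
        # wsl.exe runs the given command in the default WSL distribution
        subprocess.run(["wsl"] + cmd, check=True)
    else:
        subprocess.run(cmd, check=True)

# run_mirror_command(["rsync", "-av", "<mirror-url>", "data/raw/"])
```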
No worries @qibin0506. I double-checked, and everything looks okay: the text files do contain the `<|endoftext|>` token.
@d-kleine I see, thanks for the explanation! I wish I could include their repo as a subrepo and make the respective changes. However, I think this could impose their restrictive licensing terms on this entire combined work, which I don't want to risk. I wish they used a more permissive open source license like MIT, BSD, or Apache, so that there wouldn't be this problem. Regarding Docker, would the current Docker container at https://github.com/rasbt/LLMs-from-scratch/tree/main/.devcontainer work for this? I'd also be happy to write up a README section for Windows users mentioning what you said, regarding either using Docker or making the changes you described. But perhaps an easier solution is to run this in a WSL (Windows Subsystem for Linux) terminal? I heard that's what Windows users usually use for the command line. Would that work?
Yes, I could update that and write instructions for your README on how to use the container with the gutenberg repo.
Yeah, I agree, using WSL2 on Windows would be the most convenient way in this case. Windows users would then need to install Ubuntu in order to run WSL2 and install a few packages in Ubuntu. I could write instructions on how to set this up and add them to your README as well.
Thanks for offering. I took a crack at it and added a note about WSL in #99. Do you think it's sufficient to just link Microsoft's official docs for WSL?
Yeah, I think that's sufficient for now.
BTW it took me 6+ hours to download only 50% of the data. I tried on different days and at different times, and the download rate was always really slow (around 1,000 kbit/s; I monitored it in real time). At least you can stop the download via the terminal and resume it later: the already downloaded files are checked, and only missing files are retrieved. And for learning purposes, this amount of data was sufficient for me, as I only trained on the first combined 500 MB file (otherwise the training would have taken too long). This was a small but really cool real-world project! 👍
Whoa! And yeah, downloading such large amounts of data is pretty expensive. Ha, and the pretraining would have taken even much longer. For individuals, pretraining your own LLM really isn't feasible, and there is no magic trick around that. But hey, at least we can use pretrained weights for the finetuning stages :)
In the https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py#L95 file, the '<|endoftext|>' symbol always appears in val_data_set, while train_data_set never contains it.
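For illustration, here is a toy sketch of how a plain character-ratio split (which the linked line appears to use) can leave a marker near the end of a file in the validation portion only; this is an assumption about the cause, not a confirmed diagnosis:

```python
# Toy example: a 90/10 character split puts a marker that sits at the very end
# of the combined file into the validation portion only.
text_data = "some book text " * 12 + "<|endoftext|>"  # marker only at the end

train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

print("<|endoftext|>" in train_data)  # False: the marker lies past the split point
print("<|endoftext|>" in val_data)    # True
```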