
Add MicroLlama training support #1457

Merged: 7 commits into Lightning-AI:main on Jun 4, 2024

Conversation

keeeeenw (Contributor) commented on Jun 1, 2024

Hello,

I was previously contacted by your team regarding the addition of MicroLlama support to the latest litgpt codebase. Please review and accept this pull request if you agree with my approach. Otherwise, I welcome any questions and comments.

Additionally, I have not yet tested this change with Lightning AI Studios. I will work on any necessary code adjustments to support the studio and update the documentation as needed.

I made the following changes:

  1. Added microllama.yaml for the training configuration
  2. Modified config.py to add the MicroLlama model config (see the sketch after this list)
  3. Modified pretrain.py to import and handle the MicroLlama data
  4. Added microllama.py for data loading during training
  5. Fixed a bug in prepare_slimpajama.py by adding the is_generator attribute; data processing with the latest litdata crashes if this attribute is missing
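
For reference, model definitions in litgpt's config.py are plain dicts appended to the module-level configs list. Below is a rough sketch of what the MicroLlama entry looks like; the hyperparameter values are illustrative for a ~300M Llama-style model and the hf_config org/name is an assumption, so please check the merged config for the exact values.

```python
# Sketch only, not the literal merged diff: shape of a model entry in litgpt/config.py.
# Hyperparameters below are illustrative values for a ~300M Llama-style model.
micro_llama_300m = dict(
    name="micro-llama-300M",                            # model_name used by `litgpt pretrain`
    hf_config=dict(org="keeeeenw", name="MicroLlama"),  # assumed Hugging Face repo (see the tokenizer fix below)
    block_size=2048,
    vocab_size=32000,
    n_layer=12,
    n_head=16,
    n_embd=1024,
    rotary_percentage=1.0,
    parallel_residual=False,
    bias=False,
    norm_class_name="RMSNorm",
    mlp_class_name="LLaMAMLP",
    intermediate_size=5632,
)
# In config.py this dict is appended to `configs`, which in turn populates `name_to_config`.
```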

Data Processing:
Same as TinyLlama, but you don't need the Starcoder data.

Training:
litgpt pretrain --config config_hub/pretrain/microllama.yaml
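
As a quick sanity check that the prepared shards load with the current litdata, something like the following can be used; the input_dir path here is just an example and should point at wherever the prepare script wrote its output.

```python
# Hypothetical sanity check for the optimized SlimPajama output; the path is an
# example and should point at the directory produced by prepare_slimpajama.py.
from litdata.streaming import StreamingDataset, TokensLoader

dataset = StreamingDataset(
    input_dir="data/slimpajama/train",
    item_loader=TokensLoader(block_size=2048 + 1),  # block_size + 1 for the input/target shift
    shuffle=True,
)
print(f"{len(dataset)} samples, first sample shape: {dataset[0].shape}")
```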

keeeeenw (Contributor, Author) commented on Jun 2, 2024

It looks like the unit test failed because I did not link hf_config to my Hugging Face repo correctly, so the tokenizer cannot be fetched:

 hf_config=dict(org="MicroLlama", name="MicroLlama-300M{}"),

Let me make an update.
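
One quick way to check the corrected mapping locally (assuming the model lives at keeeeenw/MicroLlama on the Hub; this check is my assumption and not part of the PR):

```python
# Assumed check, not part of the PR: confirm the org/name pair in hf_config
# resolves to a real Hugging Face repo the tests can fetch the tokenizer from.
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="keeeeenw/MicroLlama", filename="config.json")
print(path)  # downloading any known file proves the repo id resolves
```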

rasbt (Contributor) commented on Jun 2, 2024

Awesome, this is great! Really appreciate it! And that's a really clean PR. I will test this either tonight or tomorrow before merging.

rasbt (Contributor) left a comment


This all looks good to me. One minor suggestion: instead of duplicating most of the code in the MicroLlama data module, what do you think about making a LlamaDataModule base class with a Starcoder toggle, from which the TinyLlama and MicroLlama subclasses can then be derived? Basically just a minor refactoring to avoid code duplication; a rough sketch is below. Let me know in case you want my help with this.
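
Roughly what I have in mind (just a simplified sketch; the real data modules also handle tokenizer setup and the litdata streaming dataloaders):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LlamaDataModule:
    """Shared SlimPajama (+ optional Starcoder) pretraining data logic."""

    data_path: Path = Path("data/")
    seed: int = 42
    num_workers: int = 8
    use_starcoder: bool = True  # the "Starcoder toggle"

    def prepare_data(self) -> None:
        # Only verify that the prepared datasets exist; the loading logic stays as-is.
        required = ["slimpajama/train", "slimpajama/val"]
        if self.use_starcoder:
            required.append("starcoder")
        for subdir in required:
            if not (self.data_path / subdir).is_dir():
                raise FileNotFoundError(f"Expected prepared data at {self.data_path / subdir}")


@dataclass
class TinyLlama(LlamaDataModule):
    use_starcoder: bool = True   # SlimPajama + Starcoder


@dataclass
class MicroLlama(LlamaDataModule):
    use_starcoder: bool = False  # SlimPajama only
```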

rasbt requested a review from williamFalcon as a code owner on June 2, 2024 22:21
rasbt removed the request for review from williamFalcon on June 2, 2024 22:22
keeeeenw (Contributor, Author) commented on Jun 2, 2024

Thanks for the review and for updating the doc! Let me refactor the code today! I can run some basic checks on my side, but I will need your help to regression-test TinyLlama.

(Resolved review comments on litgpt/config.py and litgpt/pretrain.py)
Andrei-Aksionov (Contributor) commented

Thanks @keeeeenw for the PR 👍

rasbt (Contributor) commented on Jun 3, 2024

> Let me refactor the code today! I can run some basic checks on my side, but I will need your help to regression-test TinyLlama.

Awesome! Thanks for the quick update! I will test this today.

morphpiece commented

Point no. 5 of the OP brought me here. Both prepare_slimpajama() and prepare_starcoder() crash because the is_generator attribute is missing from their DataRecipe classes.

keeeeenw (Contributor, Author) commented on Jun 3, 2024

> Point no. 5 of the OP brought me here. Both prepare_slimpajama() and prepare_starcoder() crash because the is_generator attribute is missing from their DataRecipe classes.

Yeah, I suspected prepare_starcoder could have the same issue, but I don't have the Starcoder dataset locally to test any fixes. Let me add the same change to prepare_starcoder; the sketch below shows the shape of the fix. Please let me know if it works for you.
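
Roughly, the fix just adds the flag that newer litdata versions expect on every recipe class. This is a sketch only, not the literal diff; the import path and the True/False choice depend on how prepare_item is written, and the same flag goes on the Starcoder recipe in prepare_starcoder.py.

```python
# Sketch of the fix: newer litdata expects every DataRecipe subclass to expose
# an `is_generator` attribute, and the prepare_* scripts crash with an
# AttributeError when it is missing.
from litdata.processing.data_processor import DataChunkRecipe


class SlimPajamaDataRecipe(DataChunkRecipe):
    # prepare_item below is written as a generator (it yields items), hence True;
    # a recipe whose prepare_item returns a single item would use False instead.
    is_generator = True

    def prepare_structure(self, input_dir):
        ...  # unchanged: list the .jsonl.zst shards under input_dir

    def prepare_item(self, filepath):
        # unchanged logic: read each JSONL line, tokenize it, and yield the token ids
        yield from ()
```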

(Resolved review comments on litgpt/data/llama_data.py and litgpt/pretrain.py)
awaelchli (Contributor) commented

Thanks for contributing the model and training config!

rasbt (Contributor) commented on Jun 4, 2024

LGTM, big thanks for this PR!

keeeeenw (Contributor, Author) commented on Jun 4, 2024

Great! Thank you all for the review and thoughtful comments. According to https://lightning.ai/blog/contribute-to-lightning/, you will merge my change, right? Please let me know if there is anything else you need from my side!

rasbt (Contributor) commented on Jun 4, 2024

Yes, this is great and ready to merge! In fact, I will merge it now! Thanks again, we really appreciate the effort!

rasbt merged commit fa88952 into Lightning-AI:main on Jun 4, 2024
9 checks passed