Support Gemma #940
Comments
Based on the paper:
|
There doesn't seem to be an official GeGLU implementation in PyTorch, yet, but this looks good: https://github.com/pfnet-research/deep-table/blob/237c8be8a405349ce6ab78075234c60d9bfe60b7/deep_table/nn/layers/activation.py#L22 |
Can you be specific? It seems to be exactly what we implement: https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/model.py#L154-L166 and seems to match what's in HF: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py#L621-L640 |
BTW this is up for grabs if one of you wants to add this quickly |
Hm, dunno what they mean. Even GPT-2 had a LayerNorm before and after each multihead attention module, so I thought they meant something different since they specifically highlighted that. From the paper:
Maybe they mean that they added an additional normalization after the feedforward module. |
Maybe something wrong with the HF implementation? I think it's better to check how it's implemented in Keras. Update:
Oh, I thought that GeGlu was some smart variant that improves Gelu, but it's just that weird thing as in Olmo. Update 2: In theory, the code should work with just an update to the config file. |
When I read it I imagined something like this:
|
Yes, same. It's weird. I agree we should check the Keras code. |
Ok, here is their TransformerBlock from KerasNLP.
|
Keras does implement It's also interesting that they use In conclusion, Gemma needs a new MLP class that is a mix of both |
Thanks for checking. In that case, let me submit a PR. Almost done. |
geglu is gelu but only applied on half of the input. I agree that the HF impl doesn't look equal to that in Keras |
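To make the "gelu on half of the input" description concrete, here is a minimal pure-Python sketch of that reading. It is not the Keras, HF, or lit-gpt code: the function names are made up for illustration, it uses the exact erf-based GELU rather than the tanh approximation, and the choice to gate with the second half (rather than the first) is an assumption.

```python
import math


def gelu(x: float) -> float:
    """Exact (erf-based) GELU for a single value."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def geglu(values: list[float]) -> list[float]:
    """Split the input in half and gate one half with gelu of the other.

    Assumption: the second half is the gate. The output has half the
    length of the input.
    """
    if len(values) % 2 != 0:
        raise ValueError("geglu expects an even number of features")
    half = len(values) // 2
    left, gate = values[:half], values[half:]
    # Element-wise product of the untouched half and the gelu-activated half.
    return [v * gelu(g) for v, g in zip(left, gate)]


if __name__ == "__main__":
    # Four input features produce two output features.
    print(geglu([1.0, 2.0, 3.0, 4.0]))
```

Since the output is half as wide as the input, a layer feeding into this needs to produce twice as many features as the layer after it consumes, which may be why the intermediate size gets halved.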
Didn't know that this thing is called geglu. I seriously expected something more math heavy. What confuses me is why you need to specify an intermediate size, which is used only in DecoderBlock, just to halve it in the process. Maybe scaling factor
Yep. |
The weird thing, though, is that they don't seem to have a 3rd layernorm weight in the HF checkpoint. That's where the paper and the implementation seem to differ. |
Announcement: https://blog.google/technology/developers/gemma-open-models/
Technical report: https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
HF Hub weights: https://huggingface.co/google/gemma-7b
HF Transformers PR: huggingface/transformers#29167 with https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py as the model implementation
From a brief skim, I think it just needs to add geglu as the activation.