
Enable flash attention for gemma #1454

Merged (1 commit, Nov 15, 2024)

Conversation

@atakaha (Contributor) commented Oct 23, 2024

Add missing flash attention flags to gemma

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
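
For context, a minimal sketch of how these flags are typically consumed in an optimum-habana attention forward pass. It follows the pattern used elsewhere in the repo (e.g. the llama model) around Habana's FusedSDPA kernel; the actual gemma wiring in this PR may differ, so treat the helper name and signatures below as assumptions:

```python
# Sketch only (assumed API): FusedSDPA and the sdp_kernel context manager
# come from habana_frameworks; the helper name gemma_sdpa is hypothetical.
import habana_frameworks.torch.hpu as ht
from habana_frameworks.torch.hpex.kernels import FusedSDPA


def gemma_sdpa(query, key, value, attention_mask, q_len,
               use_flash_attention=False,
               flash_attention_recompute=False,
               flash_attention_causal_mask=False):
    assert use_flash_attention, "caller falls back to eager attention otherwise"
    if q_len == 1:
        # Decode phase: a single query token, recompute brings no benefit.
        with ht.sdp_kernel(enable_recompute=False):
            return FusedSDPA.apply(query, key, value, attention_mask, 0.0, False, None)
    with ht.sdp_kernel(enable_recompute=flash_attention_recompute):
        if flash_attention_causal_mask:
            # Prompt phase: let the kernel apply causal masking internally
            # and drop the explicit attention mask.
            return FusedSDPA.apply(query, key, value, None, 0.0, True, None)
        return FusedSDPA.apply(query, key, value, attention_mask, 0.0, False, None)
```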

@atakaha requested a review from regisss as a code owner October 23, 2024 18:52
@atakaha (Contributor, Author) commented Oct 23, 2024

@tthakkal, @libinta, @mandy-li, please review this PR.

@tthakkal (Contributor) commented:

> @tthakkal, @libinta, @mandy-li, please review this PR.

@atakaha Have you verified accuracy and performance with these commands added?

  • bf16: single card & multi card
  • fp8: single card & multi card

@atakaha (Contributor, Author) commented Oct 24, 2024

> @atakaha Have you verified accuracy and performance with these commands added?
>
> • bf16: single card & multi card
> • fp8: single card & multi card

All flash-attention-related flag combinations passed with batch size 1. But batch size 8 with flash_attention + causal_mask generates junk output; we need to investigate why this happens in the multi-batch scenario.
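
For intuition, one plausible failure mode (an assumption, not confirmed in this thread): if the causal-mask flag makes the kernel rely on built-in causal masking and drop the explicit attention mask, left-padded entries in a batch attend to their pad tokens, which would only surface at batch size > 1. A tiny PyTorch illustration:

```python
import torch

# Illustration of an assumed failure mode, not confirmed in this thread:
# pure causal masking ignores left-padding, so padded batch entries
# attend to pad tokens once batch size > 1.
torch.manual_seed(0)
B, T, D = 2, 4, 8
q, k = torch.randn(B, T, D), torch.randn(B, T, D)

# Sequence 0 is full length; sequence 1 has two left-pad tokens.
real = torch.tensor([[1, 1, 1, 1],
                     [0, 0, 1, 1]], dtype=torch.bool)

scores = q @ k.transpose(-1, -2) / D**0.5
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
weights = scores.masked_fill(~causal, float("-inf")).softmax(-1)

# Attention the last real token of sequence 1 puts on its pad positions:
# non-zero under causal-only masking, zero if the padding mask were applied.
print(weights[1, -1][~real[1]].sum().item())
```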

@atakaha force-pushed the gemma_flash_attention branch 4 times, most recently from 29d4814 to 14c442e on October 30, 2024 20:33
@atakaha marked this pull request as draft October 31, 2024 01:01
Add missing flag handling to gemma
   --reuse_cache
   --use_flash_attention
   --flash_attention_recompute
   --flash_attention_causal_mask
@atakaha force-pushed the gemma_flash_attention branch from 14c442e to 5a2ee0e on November 2, 2024 01:12
@vidyasiv (Contributor) commented Nov 4, 2024

@atakaha, is the PR ready for review yet, or waiting on something?
Update: I found the internal ticket and will track that. Thanks.

@atakaha (Contributor, Author) commented Nov 4, 2024

> @atakaha, is the PR ready for review yet, or waiting on something?

Regarding the missing flags: the interface is fixed, and I confirmed output quality and a small memory-usage improvement for BF16 on single and multiple cards with the flags.
I'm observing an FP8 output-quality issue with the original code (without this change). I'm not sure whether this is expected behavior; if it is not, we need to investigate and fix it.

@atakaha marked this pull request as ready for review November 4, 2024 17:58
@atakaha (Contributor, Author) commented Nov 5, 2024

@tthakkal, @vidyasiv, please review this PR.

@atakaha force-pushed the gemma_flash_attention branch from f9bad35 to 5a2ee0e on November 5, 2024 01:45
@vidyasiv (Contributor) commented Nov 5, 2024

@atakaha, can you paste commands and outputs (throughput, text) for 1 and 8 HPUs w/ bf16 and fp8 with these changes, as Thanaji requested?
As mentioned on the ticket, perhaps you can file new tickets for the issues you discovered.

@vidyasiv (Contributor) commented Nov 5, 2024

1 HPU sanity testing at my end:

  • bf16 (works)
  • bf16 w/ flash attention (works, improves throughput)
  • fp8 w/ flash attention (inaccurate text outputs)
  • bf16 w/ reuse_cache (works)
python run_generation.py --model_name_or_path google/gemma-7b \
--attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --max_new_tokens 64 \
--bf16 --batch_size 8
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework that enables training of large-scale models on commodity hardware. It is designed to be a drop-in replacement for PyTorch, and it is compatible with the existing PyTorch ecosystem. DeepSpeed is designed to be easy to use, and it provides a number of features that make it easy to train large-scale models',)

input 2: ('He is working on',)
output 1: ('He is working on a new project, which is a sequel to his 2016 film, <em>The Legend of Michael Mishra</em>.\n\n“I am working on a sequel to <em>The Legend of Michael Mishra</em>. It is a comedy film. I am writing the script and I will start shooting for it in the',)

input 3: ('He has a',)
output 1: ('He has a very good knowledge of the market and is very professional. He is very helpful and always available to answer any questions.\n\nI would highly recommend him to anyone looking to buy or sell a property.\n\nWe were very happy with the service provided by the team at Ray White. They were very professional and knowledgeable, and they',)

input 4: ('He got all',)
output 1: ('He got all the way to the final of the 2019 edition of the show, but this year he’s back with a bang.\n\nThe 26-year-old from the Isle of Wight is a professional dancer and choreographer who has worked with the likes of Little Mix, Olly Murs and Fleur',)

input 5: ('Everyone is happy and I can',)
output 1: ('Everyone is happy and I can’t wait to see what the future holds for us.\n\nI’m so happy to have found a place that I can call home.\n\nI’m so happy to have found a place that I can call home.\n\nI’m so happy to have found a place that I can call home.\n\n',)

input 6: ('The new movie that got Oscar this year',)
output 1: ('The new movie that got Oscar this year is a movie that is based on a true story. The movie is called “The Imitation Game”. The movie is about a man named Alan Turing who was a mathematician and a code breaker. He was a very smart man and he was able to break the code that the Germans were using to communicate with each other. He',)

input 7: ('In the far far distance from our galaxy,',)
output 1: ('In the far far distance from our galaxy, there is a planet called Earth. On this planet, there are many different species of animals. One of them is the human.\n\nThe human is a very special species. They have a very high intelligence and they can create many things. They can create a lot of things that can help them to survive.\n\nOne',)

input 8: ('Peace is the only way',)
output 1: ('Peace is the only way to solve the conflict in South Sudan, the country’s President Salva Kiir has said.\n\nKiir made the remarks on Wednesday during the 10th anniversary of the Comprehensive Peace Agreement (CPA) in Juba.\n\n“The only way to solve the conflict in South Sudan is through peace. We have',)


Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 788.1560337240456 tokens/second
Memory allocated                    = 18.53 GB
Max memory allocated                = 18.66 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 2.8224462040000162 seconds
----------------------------------------------------------------------------------


python run_generation.py --model_name_or_path google/gemma-7b \
 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --max_new_tokens 64 \
 --bf16 --batch_size 8 --use_flash_attention

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework that enables the training of large-scale models on commodity hardware. It is designed to be flexible and extensible, allowing researchers to easily add new algorithms and optimizations to the framework. DeepSpeed is also designed to be efficient, using techniques such as data parallelism and mixed-precision training to reduce the amount of time and resources required',)

input 2: ('He is working on',)
output 1: ('He is working on a new project, which is a sequel to his 2016 film, <em>The Legend of Michael Mishra</em>.\n\n“I am working on a sequel to <em>The Legend of Michael Mishra</em>. It is a comedy film. I am writing the script and I will start shooting for it in the',)

input 3: ('He has a',)
output 1: ('He has a very good knowledge of the market and is very professional. He is very helpful and always available to answer any questions.\n\nI would highly recommend him to anyone looking to buy or sell a property.\n\nHe is very professional and knowledgeable. He was always available to answer any questions we had and made the process of buying a',)

input 4: ('He got all',)
output 1: ('He got all the way to the final of the 2019 edition of the show, but this year he’s back with a bang.\n\nThe 26-year-old from the Isle of Wight is a professional dancer and choreographer who has worked with the likes of Little Mix, Olly Murs and Fleur',)

input 5: ('Everyone is happy and I can',)
output 1: ('Everyone is happy and I can’t wait to see what the future holds for us.\n\nI’m so happy to have found a place that I can call home.\n\nI’m so happy to have found a place that I can call home.\n\nI’m so happy to have found a place that I can call home.\n\n',)

input 6: ('The new movie that got Oscar this year',)
output 1: ('The new movie that got Oscar this year is a movie that is based on a true story. The movie is called “The Imitation Game”. The movie is about a man named Alan Turing who was a mathematician and a code breaker. He was a very smart man and he was able to break the code that the Germans were using to communicate with each other. He',)

input 7: ('In the far far distance from our galaxy,',)
output 1: ('In the far far distance from our galaxy, there is a planet called Earth. On this planet, there are many different species of animals. One of them is the human.\n\nThe human is a very special species. They have a very high intelligence and they can create many things. They can create a lot of things that can help them to survive.\n\nOne',)

input 8: ('Peace is the only way',)
output 1: ('Peace is the only way to end the war in Ukraine, the Russian president, Vladimir Putin, has said, as he accused the west of trying to “dismember” his country.\n\nIn a speech to mark the 80th anniversary of the Soviet victory over Nazi Germany in the second world war, Putin said the west was trying to',)


Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 817.3428140417304 tokens/second
Memory allocated                    = 18.55 GB
Max memory allocated                = 18.72 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 2.686410730000034 seconds
----------------------------------------------------------------------------------

QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path google/gemma-7b \
--attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --max_new_tokens 64 \
--bf16 --batch_size 1

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework that enables training of large-scale models on commodity hardware. It is designed to be a drop-in replacement for PyTorch, and it is compatible with the existing PyTorch ecosystem. DeepSpeed is designed to be easy to use, and it provides a number of features that make it easy to train large-scale models',)


Stats:
-----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 107.65460798022197 tokens/second
Memory allocated                    = 19.16 GB
Max memory allocated                = 20.52 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 2.4955870979999872 seconds
-----------------------------------------------------------------------------------

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path google/gemma-7b \
--attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --max_new_tokens 64 \
--bf16 --batch_size 8 --use_flash_attention

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework that suspic suspic suspicispecially unifore unif unifore enthusi unif unif enthusi enthusi enthusi infinites enthusi enthusi infinites enthusi infinites enthusi infinites enthusi infinites infinites premia enthusi infinites enthusi infinites enthusi infinites premia enthusi infinites infinites premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia premia',)

input 2: ('He is working on',)
output 1: ('He is working on my imago. He has my imago, my imago antem Idem, my imago. My imago imago imago imago imago imago. fepdhdhd madonna my imago. My imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago imago',)

input 3: ('He has a',)
output 1: ('He has a mysterical past',)

input 4: ('He got all',)
output 1: ('He got all the upvotes, upvotes, and upvotes, but when it came time to get his upvotes upvotes ① upvotes, ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ① ①',)

input 5: ('Everyone is happy and I can',)
output 1: ('Everyone is happy and I can’t wait for my niece’s first birthday party, my daughter’s first day of kindergarten or my son’s first day of exorbitantly profanely alphabe smartypants mef alphabe alphabe alphabe alphabe alphabe smartypants alphabe alphabe alphabe smartypants alphabe smartypants alphabe alphabe smartypants smartypants alphabe',)

input 6: ('The new movie that got Oscar this year',)
output 1: ('The new movie that got Oscar this year, The indestructibles, has alre manikul than the alphabe disadpecially disespecially alphabe alphabe alphabe alphabe alphabe encre alphabe alphabe alphabe alphabe alphabe alphabe alphabe alphabe encre alphabe encre disespecially alphabe alphabe encre dises manikul alphabe alphabe alphabe encre dises alphabe alphabe dises encre dises alphabe manikul alphabe manufact alphabe alphabe alphabe alphabe encre dises alphabe alphabe',)

input 7: ('In the far far distance from our galaxy,',)
output 1: ('In the far far distance from our galaxy, we can see that the milky way has a prominant bump',)

input 8: ('Peace is the only way',)
output 1: ('Peace is the only way, my friend,\n',)


Stats:
-----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 1336.5589212865011 tokens/second
Memory allocated                    = 10.46 GB
Max memory allocated                = 11.18 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 8.916372483999794 seconds
-----------------------------------------------------------------------------------

python run_generation.py --model_name_or_path google/gemma-7b \
 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --max_new_tokens 64 \
 --bf16 --batch_size 8 --reuse_cache

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework that enables training of large-scale models on commodity hardware. It is designed to be a drop-in replacement for PyTorch, and it is compatible with the existing PyTorch ecosystem. DeepSpeed is designed to be easy to use, and it provides a number of features that make it easy to train large-scale models',)

input 2: ('He is working on',)
output 1: ('He is working on a new project, which is a sequel to his 2016 film, <em>The Legend of Michael Mishra</em>.\n\n“I am working on a sequel to <em>The Legend of Michael Mishra</em>. It is a comedy film. I am writing the script and I will start shooting for it in the',)

input 3: ('He has a',)
output 1: ('He has a very good knowledge of the market and is very professional. He is very helpful and always available to answer any questions.\n\nI would highly recommend him to anyone looking to buy or sell a property.\n\nWe were very happy with the service provided by the team at Ray White. They were very professional and knowledgeable, and they',)

input 4: ('He got all',)
output 1: ('He got all the way to the final of the 2019 edition of the show, but this year he’s back with a bang.\n\nThe 26-year-old from the Isle of Wight is a professional dancer and choreographer who has worked with the likes of Little Mix, Olly Murs and Fleur',)

input 5: ('Everyone is happy and I can',)
output 1: ('Everyone is happy and I can’t wait to see what the future holds for us.\n\nI’m so happy to have found a place that I can call home.\n\nI’m so happy to have found a place that I can call home.\n\nI’m so happy to have found a place that I can call home.\n\n',)

input 6: ('The new movie that got Oscar this year',)
output 1: ('The new movie that got Oscar this year is a movie that is based on a true story. The movie is called “The Imitation Game”. The movie is about a man named Alan Turing who was a mathematician and a code breaker. He was a very smart man and he was able to break the code that the Germans were using to communicate with each other. He',)

input 7: ('In the far far distance from our galaxy,',)
output 1: ('In the far far distance from our galaxy, there is a planet called Earth. On this planet, there are many different species of animals. One of them is the human.\n\nThe human is a very special species. They have a very high intelligence and they can create many things. They can create a lot of things that can help them to survive.\n\nOne',)

input 8: ('Peace is the only way',)
output 1: ('Peace is the only way to solve the conflict in South Sudan, the country’s President Salva Kiir has said.\n\nKiir made the remarks on Wednesday during the 10th anniversary of the Comprehensive Peace Agreement (CPA) in Juba.\n\n“The only way to solve the conflict in South Sudan is through peace. We have',)


Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 786.3954218674475 tokens/second
Memory allocated                    = 18.53 GB
Max memory allocated                = 18.66 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 2.820120850999956 seconds
----------------------------------------------------------------------------------

@atakaha (Contributor, Author) commented Nov 5, 2024

FP8 shows the same quality on my side. And FP8 with flash attention drops throughput.

  • BF16 base command line
    python run_generation.py --model_name_or_path google/gemma-7b --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --max_input_tokens 128 --max_new_tokens 128 --bf16 --batch_size 128
| quantize | batch_size | max_input_tokens | max_new_tokens | use_flash_attention | flash_attention_recompute | flash_attention_causal_mask | attn_softmax_bf16 | Throughput (tokens/s) | Memory allocated (GB) | Max memory allocated (GB) |
|---|---|---|---|---|---|---|---|---|---|---|
| bf16 | 128 | 128 | 128 | | | | | 4515.519 | 79 | 80.97 |
| bf16 | 128 | 128 | 128 | | | | | 4514.927 | 79.02 | 81 |
| bf16 | 128 | 128 | 128 | | | | | 4540.38 | 78.99 | 80.97 |
| bf16 | 128 | 128 | 128 | | | | | 4535.465 | 78.98 | 80.97 |
  • FP8 measurements with/without flash attention are done separately, since the script path is different and mixing them causes an error.

    • without flash attention
      QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path google/gemma-7b --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --max_input_tokens 128 --max_new_tokens 128 --bf16 --batch_size 1
    • with flash attention
      QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path google/gemma-7b --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --max_input_tokens 128 --max_new_tokens 128 --bf16 --batch_size 1 --use_flash_attention
  • FP8 base command line:
    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path google/gemma-7b --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --max_input_tokens 128 --max_new_tokens 128 --bf16 --batch_size 128

| quantize | batch_size | max_input_tokens | max_new_tokens | use_flash_attention | flash_attention_recompute | flash_attention_causal_mask | attn_softmax_bf16 | Throughput (tokens/s) | Memory allocated (GB) | Max memory allocated (GB) |
|---|---|---|---|---|---|---|---|---|---|---|
| fp8 | 128 | 128 | 128 | | | | | 8029.392 | 64.04 | 65.9 |
| fp8 | 128 | 128 | 128 | | | | | 8043.598 | 64.04 | 65.9 |
| fp8 | 128 | 128 | 128 | | | | | 3596.054 | 64.03 | 65.88 |
| fp8 | 128 | 128 | 128 | | | | | 3593.654 | 64.03 | 65.88 |
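
(Reading the fp8 table: the last two rows, presumably the flash-attention configurations per the comment above, fall from about 8030-8044 to about 3594-3596 tokens/second, a drop of roughly 55%, while memory usage is essentially unchanged.)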

@atakaha (Contributor, Author) commented Nov 5, 2024

> @atakaha, can you paste commands and outputs (throughput, text) for 1 and 8 HPUs w/ bf16 and fp8 with these changes, as Thanaji requested? As mentioned on the ticket, perhaps you can file new tickets for the issues you discovered.

@vidyasiv, tickets are created.

@vidyasiv (Contributor) commented Nov 6, 2024

@regisss, could you take a look? The pending issue (FP8 with flash attention drops throughput) has a ticket filed.

@vidyasiv (Contributor) left a review:

lgtm

@atakaha (Contributor, Author) commented Nov 6, 2024

For FP8, we need to use quantization_config/maxabs_quant_gemma.json for measurement. Then we get accurate output for FP8.
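
Presumably (flags inferred from the earlier commands in this thread; the exact invocation is not shown in this comment) the gemma-specific config slots into the same quantization flow:

QUANT_CONFIG=./quantization_config/maxabs_quant_gemma.json python run_generation.py --model_name_or_path google/gemma-7b \
--use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --max_input_tokens 128 --max_new_tokens 128 \
--bf16 --batch_size 128 --use_flash_attention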

@atakaha (Contributor, Author) commented Nov 14, 2024

@regisss, please review this PR.

@libinta added the run-test (Run CI for PRs from external contributors) label Nov 14, 2024
@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss merged commit ef83544 into huggingface:main Nov 15, 2024 (3 of 5 checks passed)
Luca-Calabria pushed a commit to Luca-Calabria/optimum-habana that referenced this pull request Nov 25, 2024
Liangyx2 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Jan 20, 2025
xinyu-intel pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Mar 4, 2025
* Add flag to run inference with partial dataset (huggingface#1420)

* Add peft generation example (huggingface#1427)

* Upgrade to SynapseAI 1.18.0 (huggingface#1418)

* Simplify HQT config files (huggingface#1219)

* unify_measurements.py script support to unify PCQ 70B 8x (huggingface#1322)

* Add misc. training args (huggingface#1346)

* Add quantization config for low bs case (huggingface#1377)

* Remove HQT from OHF (huggingface#1257)

Co-authored-by: Adam Stachowicz <astachowicz@habana.ai>
Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
Co-authored-by: Yeonsil Yoon <yyoon@habana.ai>

* Load INC GPTQ checkpoint & rename params (huggingface#1364)

Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>
Co-authored-by: Harish Subramony <81822986+hsubramony@users.noreply.github.com>
Co-authored-by: Yeonsil Yoon <yyoon@habana.ai>

* Enable FusedSDPA fp8 in Llama FT (huggingface#1388)

Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>
Co-authored-by: Harish Subramony <81822986+hsubramony@users.noreply.github.com>

* Valid sequence length for sdpa (huggingface#1183)

Co-authored-by: Harish <hsubramony@habana.ai>
Co-authored-by: Libin Tang <litang@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Multiple fixes (dynamo graph break, qwen-moe, multicard) (huggingface#1410)

* datasets downgrade version to 2.21.0 (huggingface#1413)

* Update ci sentence_transformer.sh (huggingface#1424)

* Fix load INC load weights compile error due to Transformer 4.45 upgrade.  (huggingface#1421)

* Update language-modeling README.md, add trust_remote_code for flan-t5-xl (huggingface#1422)

* Update unify_measurements.py support info (huggingface#1425)

* GPT2 torch.compile fix (huggingface#1434)

* Added missing allocate_kv_cache() call in CausalLM class (huggingface#1431)

* Fix merge error and update text-to-speech readme (huggingface#1436)

* Fix OOM error for code llama (huggingface#1437)

* Fix error on 4bit checkpoint load with run_lm_eval on TF4.45.2 (huggingface#1439)

* Fix scoped linear all-reduce for starcoder model (huggingface#1432)

* Fixed recursion error in SentenceTransformer (huggingface#1428)

* Fix Llama 3.1 generation (huggingface#1444)

* Update text-gen README.md to add auto-gptq fork install steps (huggingface#1442)

* Added gemma specific fp8 quantization file (huggingface#1445)

* Remove cache folder from image data folder (huggingface#1446)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Bump dev version

* Enable DeepSpeed for image-to-text example (huggingface#1455)

* Fix bug when loading 4bit checkpoint quantized in INC (huggingface#1447)

* Fixes 'Tokenizer does not have padding token' introduced by  huggingface#1444 for Llama3.1 (huggingface#1457)

* Fix facebook/hf-seamless-m4t-medium crash (huggingface#1433)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Fix bias update in scoped all reduce (huggingface#1456)

* Added skip for unsuported tests for mistral/mixtral (huggingface#1462)

* Update sentence transformer to v3.2.1 (huggingface#1470)

* Optimized inference of Cohere model on HPU (huggingface#1329)

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* Idefics2 (huggingface#1270)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Remove deprecated Mixed precision flags (huggingface#1471)

Change-Id: I1c2e2460dc2072ba7b311f239441b304694918c8

* Optimized inference of XGLM model on HPU (huggingface#1323)

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* Add mllama support (huggingface#1419)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Enable flash attention for gemma (huggingface#1454)

* Readme: replace tabs with spaces (huggingface#1485)

* Move fast tests to Gaudi2 (huggingface#1498)

* Support loading 4 bit Qwen2 (huggingface#1476)

Signed-off-by: Mengni Wang <mengni.wang@intel.com>

* Add textual inversion XL for Gaudi (huggingface#868)

Signed-off-by: Daniel Socek <daniel.socek@intel.com>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>

* Remove torch req from LM example (huggingface#1491)

* Remove keep_input_mutations (huggingface#1492)

* Fix trust_remote_code (huggingface#1493)

* Upgrade ViT README with torch.compile (huggingface#1494)

* Tests for text gen output text (huggingface#1411)

* Corrected Throughput measure for GaudiDDPMPipeline (huggingface#1460)

* Fix text generation test

* Add G3 in T5-L README (huggingface#1523)

* Fix tuple object error (huggingface#1354)

* Add warmup time and compile time log for the eval/prediction.  (huggingface#1489)

* Fix style

* Enable `paligemma` model for image-to-text example (huggingface#1407)

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Add support for MLPERF optimized pipeline from example (huggingface#1465)

Co-authored-by: sushil dubey <sdubey@habana.ai>

* Enable Gemma2 Inference on Gaudi (huggingface#1504)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>
Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Signed-off-by: Daniel Socek <daniel.socek@intel.com>
Co-authored-by: billishyahao <yahao.he@intel.com>
Co-authored-by: Harish Subramony <81822986+hsubramony@users.noreply.github.com>
Co-authored-by: Yeonsil Yoon <yyoon@habana.ai>
Co-authored-by: Seunghyuk Park (shepark) <separk@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Co-authored-by: Sun Choi <schoi@habana.ai>
Co-authored-by: xinhe <xin3.he@intel.com>
Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: Soila Kavulya <soila.p.kavulya@intel.com>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
Co-authored-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
Co-authored-by: Vivek Goel <vgoel@habana.ai>
Co-authored-by: Akihiro Takahashi <akihiro.takahashi@intel.com>
Co-authored-by: Miroslav Goncharenko <miroslav.goncharenko@intel.com>
Co-authored-by: Wang, Mengni <mengni.wang@intel.com>
Co-authored-by: Daniel Socek <daniel.socek@intel.com>
Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
Co-authored-by: Vidya Galli <vidya.s.galli@intel.com>
Co-authored-by: deepak-gowda-narayana <140652370+deepak-gowda-narayana@users.noreply.github.com>

* Add check_neural_compressor_min_version for 4 bit behavior (huggingface#1500)

Signed-off-by: Xin <xin3.he@intel.com>
Signed-off-by: xinhe3 <xinhe3@habana.ai>
Co-authored-by: xinhe3 <xinhe3@habana.ai>

* Fixed Gemma FP8 flash_attention lower throughput issue (huggingface#1510)

* Pass "lazy_mode" arg to GaudiLlamaModel GaudiTrainer (huggingface#1515)

Co-authored-by: Marcin Łapiński <mlapinskix@habana.ai>

* Removed workaround for NaN bug causing graph break. (huggingface#1516)

Co-authored-by: Marcin Łapiński <mlapinskix@habana.ai>

* Disable default sdpa in Albert (#22) (huggingface#1517)

Co-authored-by: Urszula Golowicz <urszula.golowicz@intel.com>

* Implement fused sdpa for wav2vec2 (#18) (huggingface#1520)

* Memory optimization for gpt_bitcode (#4) (huggingface#1513)

Co-authored-by: Urszula Golowicz <urszula.golowicz@intel.com>

* text_generation: improve parameters check (huggingface#1527)

* transformers: fixed some typos (huggingface#1528)

* Update DeepSpeed CI baselines

* Update FSDP CI baseline

* Optimum-Habana docs re-org (huggingface#1488)

Signed-off-by: Daniel Socek <daniel.socek@intel.com>
Co-authored-by: Greg Serochi <greg.serochi@intel.com>
Co-authored-by: Kiangpeng Lau <kiangpeng.lau@intel.com>
Co-authored-by: Seethong Vang <seethong.vang@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Co-authored-by: Anastasia Uvarova <anastasia.uvarova@intel.com>
Co-authored-by: Mohit Deopujari <mohit.deopujari@intel.com>
Co-authored-by: Chen Levkovich <chen.levkovich@intel.com>
Co-authored-by: Libin Tang <libin.tang@intel.com>

* Makes the with_stack of the profiler changeable (huggingface#1497)

* FLUX with diffusers 0.31.0 (huggingface#1450)

Signed-off-by: Daniel Socek <daniel.socek@intel.com>
Co-authored-by: Baochen Yang <baochen.yang@intel.com>
Co-authored-by: Huijuan Zhou <huijuan.zhou@intel.com>
Co-authored-by: Sergey Plotnikov <sergey.plotnikov@intel.com>
Co-authored-by: Deepak Narayana <deepak.narayana@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Fix some CI baselines

* Add split runners to CI (2 devices per runner for fast tests)

* Fix fast CI to work with split runners (huggingface#1534)

* Fix dtype issue with valid sequence length in torch.compile bs=1 (huggingface#1532)

* Support beam search with reuse_cache and bucket_internal (huggingface#1472)

* Add mixtral trl sft (huggingface#1349)

* Enable tiiuae/falcon-11B-vlm in image_to_text example (huggingface#1490)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Add Llama 3.1 ft to CI (huggingface#1529)

* Migrate OH CLIP (roberta-clip) training to torch.compile (huggingface#1507)

* test_text_generation: fix non-Gaudi2 case (huggingface#1530)

* text-generation: improve output printing (huggingface#1486)

* Text-generation, model set-up: torch.compile for attributes instead of models' types (huggingface#1452)

* FLUX Fine-Tuning for Gaudi (huggingface#1482)

Signed-off-by: Daniel Socek <daniel.socek@intel.com>

* Enable fusedsdpa kernel for vision part of mllama (huggingface#1531)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Minicpm enabling (huggingface#1342)

Signed-off-by: Daniel Huang <daniel1.huang@intel.com>

* Fix bridgetower example (#312) (huggingface#1481)

* Migrate OH Wave2Vec-AC training to torch.compile - README update (huggingface#1537)

Co-authored-by: Chaojun Zhang <chzhang@habana.ai>

* Flux Image-To-Image pipeline (huggingface#1524)

Signed-off-by: Daniel Socek <daniel.socek@intel.com>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>

* Enable Falcon-mamba (huggingface#1480)

Signed-off-by: yuanwu <yuan.wu@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Enable dynamic compile for mpi(training) (huggingface#1509)

* Migrate OH T5-large training to torch.compile (huggingface#1506)

* Add support for Baichuan2 (huggingface#1479)

Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
Co-authored-by: Jianqian Zhou <jianqian.zhou@intel.com>
Co-authored-by: Wei Lin <wei2.lin@intel.com>

* trainer: fixed spelling (huggingface#1538)

* Create CI Eager/Lazy for Language Modeling (huggingface#1448)

* Fixes for llava-next test failures in 1.19 (huggingface#1535)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Enable DeepSeek-V2 (huggingface#1475)

Signed-off-by: Matrix YAO <matrix.yao@intel.com>

* Refactor Qwen2 Family (huggingface#1541)

* Add support for optimized SDXL pipeline (huggingface#1519)

* Make style

* Add the checkout parameters of falcon-mamba pytest (huggingface#1540)

Signed-off-by: yuanwu <yuan.wu@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Avoid negative values in eval metrics (huggingface#1533)

* Fixes in unify_measurements (huggingface#1496)

Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
Co-authored-by: Eran Geva <egeva@habana.ai>

* Fix lm_eval script for starcoder and gemma (huggingface#1463)

* Add option to use bf16 in PT sdp (#5) (huggingface#1514)

Co-authored-by: Urszula Golowicz <urszula.golowicz@intel.com>

* Fix tests.test_peft_inference failure (huggingface#1543)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* [wav2vec2] Remove tensor.item and dynamic slicing operations in the loop that cause graph break (huggingface#1508)

* Update lm_eval version (huggingface#1473)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Fix bad import in Baichuan code (huggingface#1547)

* Restore performance in generate (huggingface#1546)

Signed-off-by: Urszula Golowicz <urszula.golowicz@intel.com>
Co-authored-by: Marcin Łapiński <mlapinskix@habana.ai>
Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>

* Enable pyTorch-IMage-Models (TIMM) with HPUs (huggingface#1459)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Add HF login for 8x Gaudi2 CI

* Adding support for Context Parallelism using Deepseed's DistributedAttention (huggingface#1501)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Fix Llama CI

* Fix Llama CI

* Add DynamicMoE support for Mixtral (huggingface#1511)

Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>

* Fix for llava models not generating text with test failures in 1.19 (huggingface#1548)

* Refactor KV cache, Rope  , reduce common code  (huggingface#1148)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Adjust Qwen2-7B test case (huggingface#1551)

* [run_lm_eval.py] Fixed too many print dump json info (huggingface#1553)

Signed-off-by: Focus Luo <focus.luo@intel.com>

* Fix for single_card llama7b and falcon40b CI errors (huggingface#1549)

* Implemented fusedSDPA for stable diffusion (#36) (huggingface#1545)

Co-authored-by: Yixiu Chen <yixiu.chen@intel.com>
Co-authored-by: Libin Tang <litang@habana.ai>

* Apply --sdp_on_bf16 to image-to-text examples (huggingface#1557)

* Fix accuracy regression in Gemma (huggingface#1556)

* Fix FusedSDPA wrapper from TransformerEngine (huggingface#1562)

* Run albert-xxlarge-v1 CI as torch.compile mode (huggingface#1563)

* Update README commands for the models to use --sdp_on_bf16 (huggingface#1566)

* Minicpm patch (huggingface#1567)

Signed-off-by: Daniel Huang <daniel1.huang@intel.com>

* Updated gemma_2b_it CI (huggingface#1561)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Fixed Adalora Test for OH 1.15 (huggingface#1564)

* Fixed LORACP Test for OH 1.15 (huggingface#1568)

* Add requirements.txt

* Update the baseline for 1.18 to reflect performance in 1.19 (huggingface#1571)

* Fix prefix llama ci failure (huggingface#1570)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fusedsdpa for stable diffusion xl (huggingface#1565)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Add sdp_on_bf16 to tests,text-gen (huggingface#1559)

* Fix mllama test (huggingface#1569)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Fix lazy_mode assignment (huggingface#1558)

Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>

* Fix diffusers import (huggingface#1574)

* Update README commands for more models to use --sdp_on_bf16 (huggingface#1575)

Co-authored-by: Libin Tang <litang@habana.ai>

* Generation utils update (minor) (huggingface#1468)

* style: removed tabs (huggingface#1577)

* Add chatglm (huggingface#1478)

Co-authored-by: Wei Lin <wei2.lin@intel.com>
Co-authored-by: Jianqian Zhou <jianqian.zhou@intel.com>
Co-authored-by: Leo Zhao <leo.zhao@intel.com>

* Enable num_return_sequences in beam search (huggingface#1536)

* gpt_bigcode: added internal bucketing fix (huggingface#1526)

* Update the Gaudi trainer with transformers 4.45.2 (huggingface#1398)

* Revert "add check_neural_compressor_min_version for 4 bit behavior" (huggingface#1578)

* Revert PR huggingface#1473 (huggingface#1582)

* Remove deprecated env variables

* Add sdp_on_bf16 argument to CI for run_image2text_lora_finetune and a… (huggingface#1585)

* Remove unnecessary neural compressor fix for 1.19 release (huggingface#1584)

* Make style

* Fixed spelling (huggingface#1576)

* Update docs for baichuan2 training (huggingface#1586)

* Adjust bert and roberta targets (huggingface#1588)

* Update text-gen readme for autogptq (huggingface#1589)

* Update README to Include Information on Performance Degradation and Mitigation Options (huggingface#1555)

* Fix Accuracy Calculation Issue in GPT-NeoX (huggingface#1591)

* Readme update for llama-405B (huggingface#1587)

Co-authored-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Seunghyuk Park (shepark) <separk@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Add WA flag for falcon-180b to resolve text-gen critical reset error during tests (huggingface#1590)

* Add sdp_on_bf16 option to diffusers and image/audio classicifation tests (huggingface#1592)

* Update transformers tests generation util v4.45.2 (huggingface#1441)

Co-authored-by: Gustavo <gustavo.malkomes>
Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Update README.md (huggingface#1595)

* Limit position embeddings in inference (huggingface#1598)

Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>

* Verify model output is provided when check_output is enabled (huggingface#1597)

* Fix scikit-learn to 1.5.2 to fix f1 evaluation crash in 1.6.0 (huggingface#1596)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Revert common KVCache not to check token_idx (huggingface#1594)

* Update language-modeling README file (huggingface#1599)

Co-authored-by: Libin Tang <litang@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Update readme for audio-classification example (huggingface#1602)

* SDPA flag update - static code analysis (huggingface#1601)

* Remove unwanted merged changes in SD pipeline

* Revert LlamaKVCache due to memory increase (huggingface#1605)

* Check rope_scaling attr (huggingface#1609)

* skip certain tests for G1 with empty param list (huggingface#1613)

* Revert "Update transformers tests generation util v4.45.2 (huggingface#1441)" (huggingface#1614)

This reverts commit 2ba520a.

* audio classification readme update (huggingface#1604)

* fix readme cmds for clip-roberta (huggingface#1603)

* fix readme cmds for clip-roberta

* comments and cleanup

* Fix run_generation test commands for TRL out usage example (huggingface#1624)

Fix run_generation example

* Add arbitrary scales (#15) (huggingface#1625)

Co-authored-by: Linoy Buchnik <linoybu@gmail.com>

* Modify Qwen2 TRL command to avoid OOM.  (huggingface#1630)

Add --use_flash_attention to avoid OOM for Qwen2

* Replace the UNET custom attention processors (huggingface#1608)

Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>

* Falcon Model Support (huggingface#1612)

Co-authored-by: leopck <sckphoong@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Update sdp_on_bf16 option for ST example (huggingface#1615)

* Update save lora weights for diffusers with text_encoder_2 layers (huggingface#1626)

* Fix `save_lora_weights` in `pipeline_utils.py` (huggingface#1643)

* Refactor mixtral moe block. (huggingface#1635)

* speech-recognition: downgrade datasets version (huggingface#1646)

* add sdp_on_bf16 to controlnet (huggingface#1631)

* add sdp_on_bf16 to controlnet

* Update pipeline_controlnet.py

pass sdp_on_bf16 to controlnet_pipeline

* Update text_to_image_generation.py

* Update text_to_image_generation.py

* Quick fix for quantization/custom op list loading (huggingface#1657)

Signed-off-by: Daniel Socek <daniel.socek@intel.com>

* Update multi-node test dockerfile (huggingface#1662)

* Fixes on OH 1.15 pre release (huggingface#1661)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Fix distributed issue for ST Trainer (huggingface#1649)

* Fix distributed issue for timm (huggingface#1653)

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

* Added missing parameter for llama function call (huggingface#1663)

Co-authored-by: Libin Tang <litang@habana.ai>

* Add reuse_cache for llama3-405b measurement (huggingface#1664)

* Update EFA dockerfile to SynapseAI 1.19.0 (huggingface#1665)

Co-authored-by: Libin Tang <litang@habana.ai>

* Fix bug for GaudiMixtralAttentionLongSequence forward (huggingface#1650)

Signed-off-by: kaixuanliu <kaixuan.liu@intel.com>

* Update to SynapseAI v1.19

* Release: v1.15.0

* Fix style

* save_model - incorrect conflict resolution

* Fix style

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>
Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Signed-off-by: Daniel Socek <daniel.socek@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Xin <xin3.he@intel.com>
Signed-off-by: xinhe3 <xinhe3@habana.ai>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: yuanwu <yuan.wu@intel.com>
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
Signed-off-by: Matrix YAO <matrix.yao@intel.com>
Signed-off-by: Urszula Golowicz <urszula.golowicz@intel.com>
Signed-off-by: Focus Luo <focus.luo@intel.com>
Signed-off-by: kaixuanliu <kaixuan.liu@intel.com>
Co-authored-by: Pramod Kumar <144990617+pramodkumar-habanalabs@users.noreply.github.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Co-authored-by: Roi Tiefenbrunn <roi.tief97@gmail.com>
Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com>
Co-authored-by: Konrad Drozd <konrad.drozd@intel.com>
Co-authored-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Yeonsil Yoon <yyoon@habana.ai>
Co-authored-by: Danny Semiat <dsemiat@habana.ai>
Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>
Co-authored-by: Harish Subramony <81822986+hsubramony@users.noreply.github.com>
Co-authored-by: Piotr Bielak <pbielak@users.noreply.github.com>
Co-authored-by: Sayantan Sarkar <supersarkar@gmail.com>
Co-authored-by: Harish <hsubramony@habana.ai>
Co-authored-by: Libin Tang <litang@habana.ai>
Co-authored-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: Jimin Ha <jha@habana.ai>
Co-authored-by: Seunghyuk Park (shepark) <separk@habana.ai>
Co-authored-by: Dmitry <dmitry.smertin@intel.com>
Co-authored-by: Soila Kavulya <soila.p.kavulya@intel.com>
Co-authored-by: Sun Choi <schoi@habana.ai>
Co-authored-by: xinhe <xin3.he@intel.com>
Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
Co-authored-by: Vivek Goel <vgoel@habana.ai>
Co-authored-by: Akihiro Takahashi <akihiro.takahashi@intel.com>
Co-authored-by: Miroslav Goncharenko <miroslav.goncharenko@intel.com>
Co-authored-by: Wang, Mengni <mengni.wang@intel.com>
Co-authored-by: Daniel Socek <daniel.socek@intel.com>
Co-authored-by: Vidya Galli <vidya.s.galli@intel.com>
Co-authored-by: deepak-gowda-narayana <140652370+deepak-gowda-narayana@users.noreply.github.com>
Co-authored-by: Supreet Singh <100715017+SupreetSinghPalne@users.noreply.github.com>
Co-authored-by: kaixuanliu <kaixuan.liu@intel.com>
Co-authored-by: ANSHUMAN TRIPATHY <a.tripathy87@gmail.com>
Co-authored-by: sushil dubey <sdubey@habana.ai>
Co-authored-by: Luca Calabria <luca.calabria@intel.com>
Co-authored-by: billishyahao <yahao.he@intel.com>
Co-authored-by: xinhe3 <xinhe3@habana.ai>
Co-authored-by: KP (Edwin) Lau <kiangpeng.lau@intel.com>
Co-authored-by: Marcin Łapiński <mlapinskix@habana.ai>
Co-authored-by: Urszula Golowicz <urszula.golowicz@intel.com>
Co-authored-by: Greg Serochi <greg.serochi@intel.com>
Co-authored-by: Seethong Vang <seethong.vang@intel.com>
Co-authored-by: Anastasia Uvarova <anastasia.uvarova@intel.com>
Co-authored-by: Mohit Deopujari <mohit.deopujari@intel.com>
Co-authored-by: Chen Levkovich <chen.levkovich@intel.com>
Co-authored-by: Libin Tang <libin.tang@intel.com>
Co-authored-by: ranzhejiang <zhejiang.ran@intel.com>
Co-authored-by: Baochen Yang <baochen.yang@intel.com>
Co-authored-by: Huijuan Zhou <huijuan.zhou@intel.com>
Co-authored-by: Sergey Plotnikov <sergey.plotnikov@intel.com>
Co-authored-by: Deepak Narayana <deepak.narayana@intel.com>
Co-authored-by: Witold Szczurek <152967125+wszczurekhabana@users.noreply.github.com>
Co-authored-by: Wei Lin <forever871001@163.com>
Co-authored-by: lkk <33276950+lkk12014402@users.noreply.github.com>
Co-authored-by: Chaojun Zhang <chzhang@habana.ai>
Co-authored-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: Yuan Wu <yuan.wu@intel.com>
Co-authored-by: Xiang, Haihao <haihao.xiang@intel.com>
Co-authored-by: Jianqian Zhou <jianqian.zhou@intel.com>
Co-authored-by: Wei Lin <wei2.lin@intel.com>
Co-authored-by: Thanaji Rao Thakkalapelli <tthakkalapelli@habana.ai>
Co-authored-by: Yao Matrix <yaoweifeng0301@126.com>
Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
Co-authored-by: Eran Geva <egeva@habana.ai>
Co-authored-by: Alexey Belyakov <alexey.belyakov@intel.com>
Co-authored-by: Bhargav <beede@habana.ai>
Co-authored-by: Krzysztof Wiśniewski <krzysztof2.wisniewski@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: FocusLuo <focus.luo@intel.com>
Co-authored-by: Yixiu Chen <yixiu.chen@intel.com>
Co-authored-by: Nariman Piroozan <87953329+npiroozan@users.noreply.github.com>
Co-authored-by: Edward Mascarenhas <edward.mascarenhas@intel.com>
Co-authored-by: Shiv Kaul <skaul@habana.ai>
Co-authored-by: bmengke <mengkejiergeli.ba@intel.com>
Co-authored-by: Leo Zhao <leo.zhao@intel.com>
Co-authored-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Harshvardhan Chauhan <hchauhan@habana.ai>
Co-authored-by: Gustavo Malkomes <gustavo.malkomes@intel.com>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Alexey Fadeev <alexey.fadeev@intel.com>
Co-authored-by: leopck <sckphoong@habana.ai>