The TinyLlama project aims to pretrain a 1.1B Llama model on 3T tokens, so it should be an ideal draft model for speculative inference.
https://github.com/jzhang38/TinyLlama
https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b

I encountered this error when I tried to use TinyLlama-1.1B-intermediate-step-240k-503b as the draft model:
/root/miniconda3/lib/python3.10/site-packages/torch/__init__.py:635: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:450.)
_C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision (if it doesn't exist)...
Loading 'model/chaoscodes/TinyLlama-1.1B-intermediate-step-240k-503b' model weights from the cache...
Loading weight file tok_embeddings_weight
Loading weight file layers_0_attention_norm_weight
Loading weight file layers_0_attention_wq_weight
Loading weight file layers_0_attention_wk_weight
load attention data error 1048576, 8388608, 1, /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision/layers_0_attention_wk_weight
python: /tmp/pip-install-ijvow1hh/flexflow_0192abbf2b1a40128377649dca2ea9f0/inference/file_loader.cc:252: void load_attention_weights_v2(DT*, int, int, size_t, size_t, std::string, std::string, size_t, int) [with DT = __half; size_t = long unsigned int; std::string = std::__cxx11::basic_string<char>]: Assertion `false && "data size mismatch"' failed.
Aborted (core dumped)
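For anyone trying to reproduce this, a quick sanity check (just a sketch; the path is copied from the cache log above, and I am assuming the first number in the assertion is the on-disk file size) is to look at how large the dumped wk file actually is:

import os

# Path taken from the FlexFlow cache log above
wk_path = ("/root/.cache/flexflow/weights/model/chaoscodes/"
           "tinyllama-1.1b-intermediate-step-240k-503b/half-precision/"
           "layers_0_attention_wk_weight")
# If the first number in the assertion is the file size, this prints 1048576,
# i.e. a 256 x 2048 fp16 tensor (4 KV heads x 64 head_dim x 2048 hidden).
print(os.path.getsize(wk_path))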
The code I use:
import flexflow.serve as ff

ff.init(
    num_gpus=4,
    memory_per_gpu=23000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=4,
    pipeline_parallelism_degree=1
)
# Specify the LLM
llm = ff.LLM("model/Llama-2-7b-hf")
# Specify a list of SSMs (just one in this case)
ssms = []
ssm = ff.SSM("model/TinyLlama-1.1B-intermediate-step-240k-503b")
ssms.append(ssm)
# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)
# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)
# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config, ssms=ssms)
result = llm.generate("Here are some travel tips for Tokyo:\n")
I believe this is probably not an issue with the TinyLlama weights; more likely it is a bug in how FlexFlow handles GQA weights / RoPE.
The reason I say this is that the TinyLlama weights work fine with HuggingFace and llama.cpp.
I previously spotted a similar bug in llama.cpp: ggerganov/llama.cpp#3364. In short, the conversion from HF weights to llama.cpp weights (from GPT-NeoX-style RoPE to GPT-J-style) was wrong for GQA. Nobody had noticed because earlier GQA models like Llama-2-70B happen to have num_heads = kv_heads ** 2 (64 heads, 8 KV heads), whereas TinyLlama has num_heads = 32 and kv_heads = 4.
The same bug has shown up in repos such as llama.cpp (now fixed), llama2.c, and llama2.mojo.
I am wondering whether something similar is happening here.
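As a rough sanity check on that hypothesis (my own arithmetic, assuming TinyLlama-1.1B's config of hidden_size=2048, 32 attention heads, 4 KV heads), the two sizes printed in the assertion above match a GQA-vs-MHA mismatch exactly:

hidden_size, num_heads, num_kv_heads = 2048, 32, 4   # TinyLlama-1.1B config (assumed)
head_dim = hidden_size // num_heads                  # 64
bytes_per_param = 2                                  # fp16

# wk as stored for a GQA model: (num_kv_heads * head_dim) x hidden_size
print(num_kv_heads * head_dim * hidden_size * bytes_per_param)   # 1048576
# wk if the loader assumes every attention head has its own KV projection (MHA)
print(num_heads * head_dim * hidden_size * bytes_per_param)      # 8388608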
Right now I am wondering whether this line in FlexFlow is correct:
https://github.com/flexflow/FlexFlow/blob/1d5e0c593a956b7fcc789a1b034e6ff920aad1d4/python/flexflow/serve/serve.py#L265
(The above is just a hypothesis on my part and may not be correct. My point is that it would be nice if someone could make FlexFlow work with TinyLlama :) ).
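For context, the llama.cpp bug I linked was in the rotary reordering applied to the q/k projections during weight conversion (GPT-NeoX-style vs GPT-J-style layouts). The sketch below is purely illustrative and is not FlexFlow's actual code; the point is that for a GQA model the K projection must be permuted with num_kv_heads, not num_heads:

import numpy as np

hidden_size, num_heads, num_kv_heads = 2048, 32, 4   # TinyLlama-1.1B config (assumed)
head_dim = hidden_size // num_heads

def permute_rotary(w, n_heads):
    # Reorder the rows of a q/k projection between the interleaved (GPT-J style)
    # and half-split (GPT-NeoX style) rotary layouts; illustrative only.
    dim_out, dim_in = w.shape
    return (w.reshape(n_heads, 2, dim_out // n_heads // 2, dim_in)
             .swapaxes(1, 2)
             .reshape(dim_out, dim_in))

wq = np.zeros((num_heads * head_dim, hidden_size), dtype=np.float16)
wk = np.zeros((num_kv_heads * head_dim, hidden_size), dtype=np.float16)

wq = permute_rotary(wq, num_heads)      # 32 query heads
wk = permute_rotary(wk, num_kv_heads)   # 4 KV heads; the llama.cpp bug was effectively using num_heads here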