The TinyLlama project aims to pretrain a 1.1B Llama model on 3T tokens, so it should be an ideal draft model for speculative inference.
https://github.com/jzhang38/TinyLlama
https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b

I encountered this error when I tried to use TinyLlama-1.1B-intermediate-step-240k-503b as the draft model:
/root/miniconda3/lib/python3.10/site-packages/torch/__init__.py:635: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:450.)
_C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision (if it doesn't exist)...
Loading 'model/chaoscodes/TinyLlama-1.1B-intermediate-step-240k-503b' model weights from the cache...
Loading weight file tok_embeddings_weight
Loading weight file layers_0_attention_norm_weight
Loading weight file layers_0_attention_wq_weight
Loading weight file layers_0_attention_wk_weight
load attention data error 1048576, 8388608, 1, /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision/layers_0_attention_wk_weight
python: /tmp/pip-install-ijvow1hh/flexflow_0192abbf2b1a40128377649dca2ea9f0/inference/file_loader.cc:252: void load_attention_weights_v2(DT*, int, int, size_t, size_t, std::string, std::string, size_t, int) [with DT = __half; size_t = long unsigned int; std::string = std::__cxx11::basic_string<char>]: Assertion `false && "data size mismatch"' failed.
Aborted (core dumped)
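For anyone trying to reproduce this, a quick sanity check (just a sketch; the path is copied from the cache log above, and I am assuming the first number in the assertion is the on-disk file size) is to look at how large the dumped wk file actually is:

import os

# Path taken from the FlexFlow cache log above
wk_path = ("/root/.cache/flexflow/weights/model/chaoscodes/"
           "tinyllama-1.1b-intermediate-step-240k-503b/half-precision/"
           "layers_0_attention_wk_weight")
# If the first number in the assertion is the file size, this prints 1048576,
# i.e. a 256 x 2048 fp16 tensor (4 KV heads x 64 head_dim x 2048 hidden).
print(os.path.getsize(wk_path))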
The code I use:
import flexflow.serve as ff

ff.init(
    num_gpus=4,
    memory_per_gpu=23000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=4,
    pipeline_parallelism_degree=1
)
# Specify the LLM
llm = ff.LLM("model/Llama-2-7b-hf")
# Specify a list of SSMs (just one in this case)
ssms = []
ssm = ff.SSM("model/TinyLlama-1.1B-intermediate-step-240k-503b")
ssms.append(ssm)
# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)
# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)
# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config, ssms=ssms)
result = llm.generate("Here are some travel tips for Tokyo:\n")
I believe this is probably not an issue with the TinyLlama weights; more likely it is a bug in how FlexFlow handles GQA weights / RoPE.
The reason I say this is that the TinyLlama weights work fine with HuggingFace and llama.cpp.
I previously spotted a similar bug in llama.cpp: ggerganov/llama.cpp#3364. In short, the conversion from HF weights to llama.cpp weights (from GPT-NeoX-style RoPE to GPT-J-style) was wrong for GQA. Nobody had noticed because earlier GQA models like Llama-2-70B happen to have num_heads = kv_heads ** 2 (64 heads, 8 KV heads), whereas TinyLlama has num_heads = 32 and kv_heads = 4.
The same bug has shown up in repos such as llama.cpp (now fixed), llama2.c, and llama2.mojo.
I am wondering whether something similar is happening here.
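As a rough sanity check on that hypothesis (my own arithmetic, assuming TinyLlama-1.1B's config of hidden_size=2048, 32 attention heads, 4 KV heads), the two sizes printed in the assertion above match a GQA-vs-MHA mismatch exactly:

hidden_size, num_heads, num_kv_heads = 2048, 32, 4   # TinyLlama-1.1B config (assumed)
head_dim = hidden_size // num_heads                  # 64
bytes_per_param = 2                                  # fp16

# wk as stored for a GQA model: (num_kv_heads * head_dim) x hidden_size
print(num_kv_heads * head_dim * hidden_size * bytes_per_param)   # 1048576
# wk if the loader assumes every attention head has its own KV projection (MHA)
print(num_heads * head_dim * hidden_size * bytes_per_param)      # 8388608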
Right now I am wondering whether this line in FlexFlow is correct:
https://github.com/flexflow/FlexFlow/blob/1d5e0c593a956b7fcc789a1b034e6ff920aad1d4/python/flexflow/serve/serve.py#L265
(The above is just a hypothesis on my part and may not be correct. My point is that it would be nice if someone could make FlexFlow work with TinyLlama :) ).
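For context, the llama.cpp bug I linked was in the rotary reordering applied to the q/k projections during weight conversion (GPT-NeoX-style vs GPT-J-style layouts). The sketch below is purely illustrative and is not FlexFlow's actual code; the point is that for a GQA model the K projection must be permuted with num_kv_heads, not num_heads:

import numpy as np

hidden_size, num_heads, num_kv_heads = 2048, 32, 4   # TinyLlama-1.1B config (assumed)
head_dim = hidden_size // num_heads

def permute_rotary(w, n_heads):
    # Reorder the rows of a q/k projection between the interleaved (GPT-J style)
    # and half-split (GPT-NeoX style) rotary layouts; illustrative only.
    dim_out, dim_in = w.shape
    return (w.reshape(n_heads, 2, dim_out // n_heads // 2, dim_in)
             .swapaxes(1, 2)
             .reshape(dim_out, dim_in))

wq = np.zeros((num_heads * head_dim, hidden_size), dtype=np.float16)
wk = np.zeros((num_kv_heads * head_dim, hidden_size), dtype=np.float16)

wq = permute_rotary(wq, num_heads)      # 32 query heads
wk = permute_rotary(wk, num_kv_heads)   # 4 KV heads; the llama.cpp bug was effectively using num_heads here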