Add llm #2
Conversation
@apaniukov is there a C++ API for tokenization and detokenization?
Yes, there is. Right now it is on the openvino_contrib branch. Here is an instruction for building and installation. It does not support llama tokenizers from […]. Then you can convert the tokenizer to OV models:

```python
from transformers import AutoTokenizer
from openvino import save_model
from ov_tokenizer import init_extension, convert_tokenizer

init_extension("path/to/libuser_ov_extensions.so")

hf_tokenizer = AutoTokenizer.from_pretrained("microsoft/Llama2-7b-WhoIsHarryPotter")
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_decoder=True)
save_model(ov_tokenizer, "tokenizer.xml")
save_model(ov_detokenizer, "detokenizer.xml")
```

From here you can work with them like with any other OV model. You can also add similar postprocessing to the llama model to get […]. Models cannot work with strings at this point, so you need to convert the input to a uint8 tensor with a predefined format, see pack_strings. To get strings from the detokenizer's uint8 output, see unpack_strings.
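For the C++ side this PR targets, consuming the saved models might look roughly like the sketch below. The extension path, device, and file names mirror the Python snippet above and are assumptions, and the packed-u8 string I/O described by pack_strings/unpack_strings is left out:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Load the same custom-ops extension that init_extension() loads in Python.
    core.add_extension("path/to/libuser_ov_extensions.so");
    ov::InferRequest tokenizer =
        core.compile_model("tokenizer.xml", "CPU").create_infer_request();
    ov::InferRequest detokenizer =
        core.compile_model("detokenizer.xml", "CPU").create_infer_request();
    // Inputs and outputs are packed u8 string tensors; pack the prompt per
    // pack_strings before calling tokenizer.infer(), and unpack the
    // detokenizer output per unpack_strings.
}
```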
llm/llm.cpp (outdated)

```cpp
        throw std::runtime_error("Model and vocab number of tokens don't match");
    }
    float* logits = ireq.get_tensor("logits").data<float>() + (prompt.size() - 1) * n_vocab;
    ptrdiff_t out_token = std::max_element(logits, logits + n_vocab) - logits;
```
@yury-gorbachev, I was told that you requested adding beam search. Should it be a separate application, or should I provide only a beam search implementation, given that beam search with a beam size of 1 is greedy sampling?
I'm going to merge greedy sampling for now to unblock others.
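For context, greedy sampling amounts to the loop sketched below. It is a fragment reusing the names from the llm.cpp snippet above (ireq, prompt, n_vocab), assumes int32 input_ids as in the PR code and a LLaMA EOS id of 2 (an assumption), and omits the attention-mask/position bookkeeping of the real code:

```cpp
// Sketch of a greedy decoding loop (equivalent to beam search with beam size 1).
constexpr int32_t EOS_TOKEN = 2;  // assumption: LLaMA </s> id
ireq.infer();  // prefill on the whole prompt
float* logits = ireq.get_tensor("logits").data<float>()
    + (prompt.size() - 1) * n_vocab;  // logits for the last prompt position
int32_t out_token = int32_t(std::max_element(logits, logits + n_vocab) - logits);
while (out_token != EOS_TOKEN) {
    // Feed the chosen token back as a single-token step; the KV cache keeps the history.
    ireq.get_tensor("input_ids").set_shape({1, 1});
    ireq.get_tensor("input_ids").data<int32_t>()[0] = out_token;
    ireq.infer();
    logits = ireq.get_tensor("logits").data<float>();  // only one position now
    out_token = int32_t(std::max_element(logits, logits + n_vocab) - logits);
}
```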
llm/README.md (outdated)

```md
## Supported models

1. [LLaMA 2](https://huggingface.co/meta-llama/Llama-2-13b-hf)
```
Probably any model of these families?
```cpp
}
ireq.get_tensor("input_ids").set_shape(tokenizer.get_tensor("input_ids").get_shape());  // TODO: replace with ireq.set_tensor("input_ids", tokenizer.get_tensor("input_ids")); after it's fixed
ireq.get_tensor("attention_mask").set_shape(tokenizer.get_tensor("input_ids").get_shape());
std::copy_n(tokenizer.get_tensor("input_ids").data<int32_t>(), tokenizer.get_tensor("input_ids").get_size(), ireq.get_tensor("input_ids").data<int32_t>());
```
We have Tensor::copy_to, which can also allocate the output tensor.
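Applied to the snippet above, the suggestion would look roughly like this (a sketch; it relies on copy_to reshaping the destination to match the source, as the comment notes):

```cpp
// copy_to resizes/allocates the destination tensor to match the source,
// replacing the manual set_shape + std::copy_n pair for input_ids.
tokenizer.get_tensor("input_ids").copy_to(ireq.get_tensor("input_ids"));
ireq.get_tensor("attention_mask").set_shape(tokenizer.get_tensor("input_ids").get_shape());
```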
```cmake
)
else()
    target_compile_options(llm PRIVATE -Wall)  # Display all warnings
    target_compile_options(sentencepiece-static PRIVATE -Wno-stringop-overflow)  # Disable the warning from openvino_contrib
```
Let's move this code to contrib.